ERISXdl Getting Started

To use the ERISXdl platform, you should be familiar with Linux and the command-line interface. From within the Mass General Brigham network, or over VPN, connect with ssh:

$ ssh <username>@erisxdl.partners.org

Select and Prepare the Application Container 

There are several prepared Docker containers that have been optimized by NVIDIA for the ERISXdl environment. You can see the full list of available images with the command:
$ podman image ls 
Select an application container; for example:
  • rapidsai
  • tensorflow
  • pytorch
  • clara

Or bring your own application container. 
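
If you bring your own container, you can pull it into your local image store with podman and confirm that it appears in the image list. The registry path and tag below are only an illustration; use the registry and tag for your own image.

$ podman pull docker.io/library/python:3.11    # illustrative public image; substitute your own registry/tag
$ podman image ls                              # confirm the image is now listed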

You will need to prepare your container for deployment on ERISXdl and make the container image accessible to your jobs. (This step currently requires working with an administrator to prepare and deploy the Docker image; we are working to open Docker deployment to users.)
Once your application container or environment (such as Conda) is set up, you can prepare the bash script that runs it under Slurm. The resource manager is Slurm; the following sections will help you get familiar with this scheduler.

Queuing system (Slurm)

Slurm (Simple Linux Utility for Resource Management) is a scheduler that allocates resources to submitted jobs; therefore, all jobs should be submitted through the Slurm scheduler.

Partitions

Slurm’s partitions are similar to ‘queues’ in other job schedulers. Each partition has dedicated resource limits, such as the number of nodes, maximum run time, GPUs, CPUs, and memory.

To view the list of available partitions, execute the command:

$ sinfo
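
sinfo also accepts standard filtering options that are handy when checking a particular partition or the state of individual nodes; the flags below are standard Slurm options, not ERISXdl-specific:

$ sinfo -p Basic        # show only the Basic partition
$ sinfo -N -l           # long, node-oriented listing (one line per node)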

A summary of the partitions is given in the table below.
Please note that, except for the Basic partition, all partitions require group and fund-number registration before you can submit jobs to them.

Partition            GPUs   Max run time   Memory   Notes
Basic (Free tier)    1      10 minutes     30 GB    Interactive
Short                1      1 hour         60 GB
Medium               2      4 hours        100 GB
Long                 4      10 hours       100 GB
Mammut               8      2 weeks        400 GB
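
The Basic partition is intended for short interactive tests. As a rough sketch (assuming GPUs are requested with the generic --gpus flag used elsewhere in this guide), an interactive shell within the Basic limits could be requested with srun:

$ srun --partition=Basic --gpus=1 --mem=30G --time=00:10:00 --pty bash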

For additional information on a specific partition, execute the command:

$ spart <partition_name>

A Slurm job accepts the following flags to request resources:

Job name                     #SBATCH --job-name=My-Job_Name
Wall time                    #SBATCH --time=24:00:00   (or -t [days-hh:mm:ss])
Number of nodes              #SBATCH --nodes=1
Number of tasks per node     #SBATCH --ntasks-per-node=24
Number of cores per task     #SBATCH --cpus-per-task=24
Number of GPUs               #SBATCH --gpus=3
Send mail at end of job      #SBATCH --mail-type=end
User's email address         #SBATCH --mail-user=userid@mgb.edu
Working directory            #SBATCH --chdir=dir-name
Job restart                  #SBATCH --requeue
Share nodes                  #SBATCH --oversubscribe
Dedicated nodes              #SBATCH --exclusive
Memory size                  #SBATCH --mem=<size>[M|G|T]   (or --mem-per-cpu=<size>[M|G|T])
Account to charge            #SBATCH --account=[account]
Quality of service           #SBATCH --qos=[name]
Job arrays                   #SBATCH --array=[array_spec]
Use a specific resource      #SBATCH --constraint="XXX"

Simple example

Example job script, test1.sh:

#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --output=/PHShome/UserName/DGX_Slurm/test-job.log
#SBATCH --mail-type=end
#SBATCH --mail-user=user_email@mgb.org
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=100M
#SBATCH --partition=Long
#SBATCH --qos=Long

# Set the Docker image to run
export KUBE_IMAGE=registry.local:31500/slurm-job-test:latest

# Work directory, mounted to /workspace inside the container
export KUBE_WORK_VOLUME=/PHShome/UserName/DGX_Slurm

# Your code goes here:
hostname
uptime

# Required wrapper: launches the job using the container image set above
srun /data/erisxdl/kube-slurm/wrappers/kube-slurm-image-job.sh

Finally, submit the job script test1.sh:

$ sbatch test1.sh
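
sbatch responds with a line of the form "Submitted batch job <jobID>"; note the job ID, since the monitoring and cancellation commands below take it as an argument. If you submit from a script, the standard --parsable option makes sbatch print only the numeric job ID. A small generic sketch, not specific to ERISXdl:

$ JOBID=$(sbatch --parsable test1.sh)   # capture the numeric job ID
$ echo $JOBID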

After submitting your jobs, always check that they have been queued successfully.

Check job status:

$ squeue

Check a job in detail:

$ scontrol show job <job_ID>
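
By default squeue lists every job on the system. The standard filters below narrow the output to your own jobs, and sacct reports on jobs that have already finished (assuming job accounting is enabled on the cluster):

$ squeue -u $USER        # only your jobs
$ squeue -j <job_ID>     # a specific job
$ sacct -j <job_ID>      # accounting record, including completed jobs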

Slurm job status, code, and explanation

When you request status information for your job, you can get one of the following states:

Status       Code   Explanation
COMPLETED    CD     The job has completed successfully.
COMPLETING   CG     The job is finishing, but some processes are still active.
FAILED       F      The job terminated with a non-zero exit code and failed to execute.
PENDING      PD     The job is waiting for resource allocation. It will eventually run.
PREEMPTED    PR     The job was terminated because of preemption by another job.
RUNNING      R      The job is currently allocated to a node and is running.
SUSPENDED    S      A running job has been stopped, with its cores released to other jobs.
STOPPED      ST     A running job has been stopped, with its cores retained.

To cancel or kill a job, execute the command:

$ scancel <jobID>
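
scancel also accepts standard filters, for example to cancel all of your own jobs at once or to cancel by job name:

$ scancel -u $USER           # cancel all of your jobs
$ scancel --name=test-job    # cancel jobs with the given job name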

Common commands in Slurm vs. LSF

Slurm                        LSF                       Explanation
sbatch                       bsub                      Submit a job
sinfo                        bqueues                   List queues/partitions
spart <partition_name>       bqueues -l <queue_name>   View a queue in detail
squeue                       bjobs -u all              List the status of all jobs
scontrol show job <jobID>    bjobs -l <jobID>          Check a job in detail
scancel <jobID>              bkill <jobID>             Cancel or kill a job
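
As a quick illustration of the mapping, an LSF submission and a roughly equivalent Slurm submission might look as follows; the queue name and resource values are illustrative only:

$ bsub -q <queue> -n 4 -W 1:00 -o out.log < test1.sh                              # LSF
$ sbatch --partition=Short --ntasks=4 --time=01:00:00 --output=out.log test1.sh   # Slurm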

          

If you have any questions, please contact us at hpcsupport@partners.org.