ERISXdl Getting Started

To access the ERISXdl platform you need to be familiar with Linux and the command-line interface. From within the Mass General Brigham network, or over VPN, connect via SSH:

$ ssh <username>

Select and Prepare the Application Container 

There are several prepared Docker containers that have been optimized by NVIDIA for the ERISXdl environment. You can see the full list of available images with the command:
$ podman image ls 
Select an application container; examples include:
  • rapidsai
  • tensorflow
  • pytorch
  • clara

Or bring your own application container. 
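For example, to check whether a particular framework image is already available, you can filter the output of `podman image ls`. A sketch of the pattern — the repository names, tags, and IDs below are illustrative placeholders, not the actual ERISXdl registry contents:

```shell
# Sample `podman image ls` output (illustrative); on the cluster you
# would pipe the real command instead:
#   podman image ls | grep -i pytorch
podman_output='REPOSITORY                 TAG        IMAGE ID      CREATED      SIZE
nvcr.io/nvidia/pytorch     23.10-py3  1a2b3c4d5e6f  2 weeks ago  14.1 GB
nvcr.io/nvidia/tensorflow  23.10-tf2  6f5e4d3c2b1a  2 weeks ago  12.8 GB'

# Keep only the rows matching the framework you want.
printf '%s\n' "$podman_output" | grep -i pytorch
```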

You will need to prepare your container for deployment on ERISXdl and make the container image accessible at run time. (This step currently requires working with an admin to prepare and deploy the Docker image; we are working to open Docker deployment to users.)
Once your application container or environment (such as Conda) is set up, you can prepare a bash script to run it through Slurm. The resource manager is Slurm; the following sections will help you get familiar with this scheduler.

Queuing system (Slurm)

Slurm (Simple Linux Utility for Resource Management) is a scheduler that allocates resources to submitted jobs; therefore, all jobs should be submitted through the Slurm scheduler.


Slurm’s partitions are similar to ‘queues’ in other job schedulers. Each partition has its own dedicated resources and limits: number of nodes, maximum run time, GPUs, CPUs, memory, etc.

To view the list of available partitions, execute the command:

$ sinfo
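`sinfo` also accepts output-format flags (for instance, `sinfo -h -o '%P'` prints just the partition names). As a sketch of pulling the partition names out of the default tabular output — the sample output below is illustrative, not ERISXdl's real partition list:

```shell
# Sample `sinfo` output (illustrative); the first column is the partition,
# with "*" marking the default partition.
sinfo_output='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
Basic*       up      10:00      1   idle dgx01
Short        up    1:00:00      1   idle dgx01
Long         up   10:00:00      2   idle dgx[01-02]'

# Print the unique partition names, stripping the default marker "*".
printf '%s\n' "$sinfo_output" | awk 'NR>1 {gsub(/\*$/, "", $1); print $1}' | sort -u
```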

A summary of the partitions
Please remember that except for the Basic partition, all others require a group and fund number registration to be able to send jobs to them.

  • Basic (Free tier)

    • 1 GPU
    • 10 min
    • 30G Memory
    • Interactive
  • Short

    • 1 GPU
    • 1 hour
    • 60G Memory
  • Medium

    • 2 GPU
    • 4 hours
    • 100G Memory
  • Long

    • 4 GPU
    • 10 hours
    • 100G Memory
  • Mammut

    • 8 GPU
    • 2 weeks
    • 400G Memory

For additional information on a specific partition, execute:

$ spart <partition_name>

A Slurm job accepts the following flags to request resources:

  Job name                   #SBATCH --job-name=My-Job_Name
  Wall time                  #SBATCH --time=24:0:0  or  -t [days-hh:mm:ss]
  Number of nodes            #SBATCH --nodes=1
  Number of tasks per node   #SBATCH --ntasks-per-node=24
  Number of cores per task   #SBATCH --cpus-per-task=24
  Number of GPUs             #SBATCH --gpus=3
  Send mail at end of job    #SBATCH --mail-type=end
  User's email address       #SBATCH --mail-user=[address]
  Working directory          #SBATCH --workdir=dir-name
  Job restart                #SBATCH --requeue
  Share nodes                #SBATCH --shared
  Dedicated nodes            #SBATCH --exclusive
  Memory size                #SBATCH --mem=[mem|M|G|T]  or  --mem-per-cpu=[mem]
  Account to charge          #SBATCH --account=[account]
  Quality of service         #SBATCH --qos=[name]
  Job arrays                 #SBATCH --array=[array_spec]
  Use specific resource      #SBATCH --constraint="XXX"
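Job arrays deserve a closer look: with `--array`, Slurm launches one task per index and exposes the index to each task as `SLURM_ARRAY_TASK_ID`. A minimal sketch of the pattern, with the scheduler's fan-out simulated as a plain loop so the logic is visible (the `input_<N>.dat` naming scheme is hypothetical):

```shell
# In a real job script you would request the array with:
#   #SBATCH --array=1-4
# and Slurm would run four copies, each with its own SLURM_ARRAY_TASK_ID.
# Here the four array tasks are simulated in a loop.
for i in 1 2 3 4; do
  SLURM_ARRAY_TASK_ID=$i
  # Each array task processes its own input file.
  echo "task ${SLURM_ARRAY_TASK_ID} would process input_${SLURM_ARRAY_TASK_ID}.dat"
done
```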

Simple example

A minimal job script might look like:

#!/bin/bash
#SBATCH --job-name=test-job
#SBATCH --output=/PHShome/UserName/DGX_Slurm/test-job.log
#SBATCH --mail-type=end
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=100
#SBATCH --partition=Long 
#SBATCH --qos=Long

# Set Docker Image
export KUBE_IMAGE=registry.local:31500/slurm-job-test:latest

# mounted to /workspace inside the container
export KUBE_WORK_VOLUME=/PHShome/UserName/DGX_Slurm

# Your code goes here:

# Required wrapper
srun /data/erisxdl/kube-slurm/wrappers/

Finally, submit the job:

$ sbatch <job_script.sh>
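`sbatch` prints a confirmation line such as `Submitted batch job 12345`; when chaining jobs from a script it is handy to capture that job ID (`sbatch --parsable` prints the bare ID directly). A sketch using a sample confirmation line:

```shell
# Typical sbatch confirmation line (sample). On the cluster you could
# instead run:  jobid=$(sbatch --parsable my_job.sh)
sbatch_output='Submitted batch job 12345'

# Extract the job ID (4th field) for use in later commands,
# e.g. scontrol show job "$jobid" or scancel "$jobid".
jobid=$(printf '%s\n' "$sbatch_output" | awk '{print $4}')
echo "$jobid"
```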

After submitting your jobs, always check that your jobs have been submitted successfully.

Check job status:

$ squeue -u <username>
Check a job in detail:

$ scontrol show job <job_ID>
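The `scontrol show job` output is a long list of `Key=Value` pairs; the `JobState` field is usually the one you want. A sketch that extracts it from a sample of that output (the sample mimics scontrol's format but is not real cluster output):

```shell
# A few lines of sample `scontrol show job` output (illustrative).
scontrol_output='JobId=12345 JobName=test-job
   UserId=user(1001) GroupId=group(1001)
   JobState=RUNNING Reason=None Dependency=(null)'

# Extract just the JobState value.
printf '%s\n' "$scontrol_output" | grep -o 'JobState=[A-Z]*' | cut -d= -f2
```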

Slurm Job status, code, and explanation

When you request status information on your job, you can get one of the following states:

  COMPLETED (CD)    The job has completed successfully.
  COMPLETING (CG)   The job is finishing but some processes are still active.
  FAILED (F)        The job terminated with a non-zero exit code and failed to execute.
  PENDING (PD)      The job is waiting for resource allocation. It will eventually run.
  PREEMPTED (PR)    The job was terminated because of preemption by another job.
  RUNNING (R)       The job is currently allocated to a node and is running.
  SUSPENDED (S)     A running job has been stopped with its cores released to other jobs.
  STOPPED (ST)      A running job has been stopped with its cores retained.


A job can be canceled or killed with:

$ scancel <jobID>

Common commands in Slurm vs. LSF







  Action                  Slurm                        LSF

  Submit job              sbatch <script>              bsub < <script>
  List queues             sinfo                        bqueues
  View queue in details   spart <partition_name>       bqueues -l <queue_name>
  List all jobs status    squeue                       bjobs -u all
  Check job in details    scontrol show job <job_ID>   bjobs -l <jobID>
  Cancel or kill job      scancel <jobID>              bkill <jobID>

If you have any questions, please contact us at