Getting Started with Slurm on ERISTwo


Posted on December 12, 2022

Updated on May 5, 2025


Queuing system (Slurm)

SLURM (Simple Linux Utility for Resource Management) is a scheduler that allocates resources to submitted compute jobs. As with LSF on ERISTwo, it is a free service for all users.

Partitions

SLURM's partitions are similar to the ‘queues’ of the LSF job scheduler previously deployed on the Scientific Computing (SciC) Linux Clusters. Each partition has its own resources and limits, such as the number of nodes, maximum run time, GPUs, CPUs and memory.

To view the list of available partitions, execute the command:

$ sinfo -s

The main partitions and their limits are listed below.

Partition	Max Time	Max Cores	Memory
Short	3 h	80	< 360 GB
Normal	120 h	80	< 360 GB
Bigmem	192 h	88	360 GB - 1 TB
Long	384 h	80	< 360 GB
Filemove	-	8	< 130 GB

Some additional partitions belong to specific groups, and you will only see the ones you are authorized to use. For more information on a specific partition, run:

$ sinfo -p <partition_name>
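When choosing a --time value, it can help to express a partition's limit from the table above in Slurm's days-hh:mm:ss format. Below is a minimal bash sketch. The function name hours_to_slurm_time is a hypothetical helper, not a Slurm command.

```bash
#!/bin/bash
# hours_to_slurm_time: hypothetical helper (not part of Slurm) that converts a
# whole number of hours into the days-hh:mm:ss format accepted by --time.
hours_to_slurm_time() {
  local hours=$1
  local days=$(( hours / 24 ))
  local rest=$(( hours % 24 ))
  printf '%d-%02d:00:00\n' "$days" "$rest"
}

# Maximum run times of the short, normal, bigmem and long partitions
for h in 3 120 192 384; do
  echo "${h} h -> $(hours_to_slurm_time "$h")"
done
```

For example, a job for the normal partition (120 h) should stay within --time=5-00:00:00.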

Submitting jobs

Most commonly, SLURM jobs are submitted in batch mode with the sbatch command and a job script whose #SBATCH directives, placed at the top of the file, request resources. The most common directives are:

Setting	Directive
Job name	#SBATCH --job-name=My-Job_Name
Wall time	#SBATCH --time=24:00:00 or -t [days-hh:mm:ss]
Number of nodes	#SBATCH --nodes=1
Number of tasks (processes) per node	#SBATCH --ntasks-per-node=24
Number of CPU cores per task	#SBATCH --cpus-per-task=24
Number of GPUs	#SBATCH --gpus=3
Send mail at the end of the job	#SBATCH --mail-type=end
User's email address	#SBATCH --mail-user=<email_address>
Working directory	#SBATCH --chdir=dir-name
Memory size	#SBATCH --mem=<size>[K|M|G|T] (memory per node) or --mem-per-cpu=<size>[K|M|G|T]; the default unit is megabytes
Partition	#SBATCH --partition=[name]
Job Arrays	#SBATCH --array=[array_spec]

More details about each of these settings can be found here. In SLURM, each task represents a collection of resources. For example, the user can specify the number of CPUs per task and the memory per CPU. The main constraint is that each task must fit on a single node: a task cannot claim more CPUs than are available on one node.
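To illustrate how the directives above combine, the bash sketch below prints a batch script for a single task with several CPUs, so all of the job's CPUs sit on one node. The helper name write_job_script, the file name my_job.sh and the values passed in are hypothetical.

```bash
#!/bin/bash
# write_job_script: hypothetical helper that prints a batch script built from
# common #SBATCH directives. Arguments: name, partition, CPUs, memory per CPU,
# wall time, then the command to run. One task with N CPUs stays on one node.
write_job_script() {
  local name=$1 partition=$2 cpus=$3 mem_per_cpu=$4 walltime=$5
  shift 5
  cat <<EOF
#!/bin/bash
#SBATCH --job-name=${name}
#SBATCH --partition=${partition}
#SBATCH --time=${walltime}
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=${cpus}
#SBATCH --mem-per-cpu=${mem_per_cpu}
#SBATCH --output=${name}-%j.out
#SBATCH --error=${name}-%j.err

$*
EOF
}

write_job_script my_job short 4 1G 02:00:00 python hello.py > my_job.sh
cat my_job.sh
```

The generated my_job.sh can then be submitted with sbatch my_job.sh.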

To submit a job script:

$ sbatch <path-to-script>

After submitting a job, always check that it was accepted. On success, sbatch prints a message of the form "Submitted batch job <job_ID>".
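When submissions are scripted, the job ID can be captured for later squeue, sacct or scancel calls. On success, sbatch prints a line of the form "Submitted batch job 123456", and the bash sketch below extracts the number from it. extract_job_id is a hypothetical helper. The sample message and the script name my_job.sh are placeholders for real sbatch output and a real job script.

```bash
#!/bin/bash
# extract_job_id: hypothetical helper that pulls the numeric job ID out of
# sbatch's confirmation message "Submitted batch job <id>".
extract_job_id() {
  local msg=$1
  local re='^Submitted batch job ([0-9]+)'
  if [[ $msg =~ $re ]]; then
    echo "${BASH_REMATCH[1]}"
  else
    echo "extract_job_id: no job ID in: $msg" >&2
    return 1
  fi
}

# On the cluster this would be: msg=$(sbatch my_job.sh)
msg="Submitted batch job 123456"
job_id=$(extract_job_id "$msg") && echo "Job ID: $job_id"
```

Alternatively, sbatch --parsable prints only the job ID (followed by the cluster name when one is present), which avoids parsing the message.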

Check job queue status:

$ squeue

For a list of your recently submitted jobs, including finished ones:

$ sacct

Check the job in detail:

$ scontrol show job <job_ID>

Slurm Job status, code, and explanation

When you request status information on a job (for example with squeue or sacct), its state is reported as one of the following:

State	Code	Explanation
COMPLETED	CD	The job has completed successfully.
COMPLETING	CG	The job is finishing but some processes are still active.
FAILED	F	The job terminated with a non-zero exit code or another failure condition.
PENDING	PD	The job is waiting for resource allocation. It will eventually run.
PREEMPTED	PR	The job was terminated because of preemption by another job.
RUNNING	R	The job is currently allocated to a node and is running.
SUSPENDED	S	A running job has been stopped with its cores released to other jobs.
STOPPED	ST	A running job has been stopped with its cores retained.
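The codes in this table can also be decoded in a script. The bash sketch below maps each code to its full state name. It then counts a user's jobs per state from squeue output requested with -h (no header) and -o "%i %t" (job ID and compact state). The helper names are hypothetical, and the printf line stands in for real squeue output.

```bash
#!/bin/bash
# state_name: hypothetical helper mapping a compact state code to its full name.
state_name() {
  case $1 in
    CD) echo COMPLETED ;;
    CG) echo COMPLETING ;;
    F)  echo FAILED ;;
    PD) echo PENDING ;;
    PR) echo PREEMPTED ;;
    R)  echo RUNNING ;;
    S)  echo SUSPENDED ;;
    ST) echo STOPPED ;;
    *)  echo "UNKNOWN:$1" ;;
  esac
}

# summarize_states: hypothetical helper reading "<jobid> <code>" lines on stdin,
# such as the output of
#   squeue -u "$USER" -h -o "%i %t"
# and printing one "STATE count" line per state (needs bash 4+ for declare -A).
summarize_states() {
  local _job code name s
  declare -A counts=()
  while read -r _job code; do
    [ -n "$code" ] || continue
    name=$(state_name "$code")
    counts[$name]=$(( ${counts[$name]:-0} + 1 ))
  done
  for s in "${!counts[@]}"; do
    echo "$s ${counts[$s]}"
  done | sort
}

# Sample input standing in for real squeue output
printf '101 R\n102 PD\n103 PD\n' | summarize_states
```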

To cancel (kill) a pending or running job, run:

$ scancel <jobID>
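To cancel several jobs at once, the job IDs can be filtered by state before calling scancel. Below is a minimal bash sketch. cancel_by_state and its DRY_RUN switch are hypothetical. Dry-run mode is the default, so nothing is cancelled by accident.

```bash
#!/bin/bash
# cancel_by_state: hypothetical helper. Reads "<jobid> <code>" lines (e.g. from
#   squeue -u "$USER" -h -o "%i %t")
# and cancels every job whose compact state equals $1. With DRY_RUN=1 (the
# default) it only prints the scancel commands it would run.
cancel_by_state() {
  local want=$1 job code
  while read -r job code; do
    [ "$code" = "$want" ] || continue
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "scancel $job"
    else
      scancel "$job"
    fi
  done
}

# Sample input standing in for real squeue output
printf '201 R\n202 PD\n203 PD\n' | cancel_by_state PD
```

scancel can also filter on its own. For example, scancel -u "$USER" -t PENDING cancels all of your pending jobs.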

Handy LSF to Slurm Reference

The table below provides a convenient reference to assist users in the transition from LSF to SLURM. Please note that some specifications are not always 'one-to-one' (e.g. the specification of tasks as noted above).

Commands

LSF	Slurm	Description
bsub < script_file	sbatch script_file	Submit a job from script_file
bkill 123	scancel 123	Cancel job 123
bjobs	squeue	List user's pending and running jobs
bqueues	sinfo, sinfo -s	Cluster status with the partition (queue) list; with -s, a shorter summarised partition list that is simpler to interpret

Job Specification

LSF	Slurm	Description
#BSUB	#SBATCH	Scheduler directive
-q queue_name	-p queue_name	Submit to the queue (partition) 'queue_name'
-n 64	-n 64	Processor count of 64
-W [hh:]mm	-t [minutes] or -t [days-hh:mm:ss]	Max wall run time
-o file_name	-o file_name	STDOUT output file
-e file_name	-e file_name	STDERR output file
-J job_name	--job-name=job_name	Job name
-x	--exclusive	Exclusive node usage for this job - i.e. no other jobs on same nodes
-M 128	--mem-per-cpu=128M or --mem-per-cpu=1G	Memory requirement
-R "span[ptile=16]"	--ntasks-per-node=16	Processes per node
-P proj_code	--account=proj_code	Project account to charge job to
-J "job_name[array_spec]"	--array=array_spec	Job array declaration
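One difference in the time specifications is easy to miss. LSF's -W takes [hours:]minutes, so -W 2:30 means 2 h 30 min, but Slurm reads -t 2:30 as minutes:seconds. The bash sketch below converts an LSF limit into an unambiguous Slurm hh:mm:ss value. lsf_w_to_slurm_t is a hypothetical helper, not part of either scheduler.

```bash
#!/bin/bash
# lsf_w_to_slurm_t: hypothetical helper converting an LSF "-W [hh:]mm" wall
# time into Slurm's "hh:mm:ss" form. A bare number is minutes in both
# schedulers, but "2:30" means 2 h 30 min in LSF and 2 min 30 s in Slurm.
lsf_w_to_slurm_t() {
  local spec=$1 h m
  if [[ $spec == *:* ]]; then
    h=${spec%%:*}
    m=${spec#*:}
  else
    h=0
    m=$spec
  fi
  # 10# forces base 10 so values such as "08" are not read as octal
  local total=$(( 10#$h * 60 + 10#$m ))
  printf '%d:%02d:00\n' $(( total / 60 )) $(( total % 60 ))
}

for w in 90 2:30 48:00; do
  echo "bsub -W $w  ->  sbatch -t $(lsf_w_to_slurm_t "$w")"
done
```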

Job Environment Variables

LSF	Slurm	Description
$LSB_JOBID	$SLURM_JOB_ID (older name: $SLURM_JOBID)	Job ID
$LSB_SUBCWD	$SLURM_SUBMIT_DIR	Submit directory
$LSB_JOBID	$SLURM_ARRAY_JOB_ID	Job Array Parent
$LSB_JOBINDEX	$SLURM_ARRAY_TASK_ID	Job Array Index
$LSB_SUB_HOST	$SLURM_SUBMIT_HOST	Submission Host
$LSB_HOSTS, $LSB_MCPU_HOSTS	$SLURM_JOB_NODELIST	Allocated compute nodes
$LSB_DJOB_NUMPROC	$SLURM_NTASKS (mpirun picks this up from Slurm automatically, so it does not need to be specified)	Number of processors allocated
	$SLURM_JOB_PARTITION	Queue

Note that, as with %J in LSF, %j can be used to insert the job ID into the names of the output and error log files (see Example 2 below). For job arrays, %A inserts the parent array job ID and %a inserts the array task index.
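Inside a job script, these variables can be combined with shell defaults so that the same script also runs outside a Slurm job, for example when testing it interactively. Below is a minimal bash sketch. result_file and the fallback values are hypothetical.

```bash
#!/bin/bash
# result_file: hypothetical helper naming an output file after the job (or
# parent array job) and the array index. The ${VAR:-default} fallbacks apply
# when the script runs outside Slurm, where these variables are unset.
result_file() {
  local job=${SLURM_ARRAY_JOB_ID:-${SLURM_JOB_ID:-local}}
  local task=${SLURM_ARRAY_TASK_ID:-0}
  echo "results-${job}-${task}.txt"
}

echo "Job ${SLURM_JOB_ID:-none} in partition ${SLURM_JOB_PARTITION:-none}, submitted from ${SLURM_SUBMIT_DIR:-$PWD}"
echo "Writing results to $(result_file)"
```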

Types of SLURM Job Submissions

SLURM offers several modes of job submission, which can be summarised as follows:

i) sbatch – Submit a script for later execution (batch mode)

ii) salloc – Create job allocation and start a shell to use it (interactive mode)

iii) srun – Launch a job step, typically within an sbatch or interactive (salloc) job; if no allocation exists yet, srun first creates one.

● srun can use a subset or all of the job's resources.

iv) sattach – Connect stdin/out/err for an existing job step
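Scripts that are sometimes run inside a job and sometimes not can check for an allocation first. Inside an sbatch job or an salloc shell, Slurm sets SLURM_JOB_ID. Below is a minimal bash sketch. in_slurm_allocation and run_step are hypothetical helpers.

```bash
#!/bin/bash
# in_slurm_allocation: succeeds when SLURM_JOB_ID is set, as it is inside an
# sbatch job or an salloc shell.
in_slurm_allocation() {
  [ -n "${SLURM_JOB_ID:-}" ]
}

# run_step: hypothetical wrapper that launches the command as a job step with
# srun when an allocation exists (srun starts one copy per allocated task),
# and otherwise runs it directly in the current shell.
run_step() {
  if in_slurm_allocation && command -v srun >/dev/null 2>&1; then
    srun "$@"
  else
    "$@"
  fi
}

run_step echo "hello from a job step"
```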

Example SLURM Job Submission

EXAMPLE 1: Interactive JOB

An interactive session on one of the compute nodes can most conveniently be initiated by invoking the following:

srun --pty -p interactive /bin/bash

Additional resources (e.g. memory or CPUs) can be requested with the settings listed in the table above. Alternatively, an interactive session can be started with salloc.
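Resource flags from the table can be added to the srun command. The sketch below assembles such a command in a bash array. start_interactive, its default values and the DRY_RUN switch are hypothetical. The interactive partition name is taken from the example above.

```bash
#!/bin/bash
# start_interactive: hypothetical helper that requests an interactive shell on
# the interactive partition with the given CPUs, memory and wall time.
# With DRY_RUN=1 it prints the srun command instead of running it.
start_interactive() {
  local cpus=${1:-1} mem=${2:-2G} walltime=${3:-01:00:00}
  local -a cmd
  cmd=(srun --pty -p interactive
       --cpus-per-task="$cpus" --mem="$mem" --time="$walltime"
       /bin/bash)
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

DRY_RUN=1 start_interactive 4 8G 02:00:00
```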

EXAMPLE 2: SINGLE CPU JOB

In the example below, a simple Python program is submitted as a SLURM job to the ‘short’ partition using ‘sbatch <file_name>’ where 1 CPU and 1GB of memory are requested.

#!/bin/bash
#SBATCH --job-name=single_cpu_example
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --output=log%j.out
#SBATCH --error=log%j.err

# Load conda with the module of miniforge3 distribution
module load miniforge3

# Run a demo python program on a single CPU
python hello.py

EXAMPLE 3: MULTIPLE CPU JOB

In the example below, a Python program is submitted as a SLURM job to the ‘normal’ partition using ‘sbatch <file_name>’ where 4 CPUs and 1GB of memory per CPU are requested.

#!/bin/bash
#SBATCH --job-name=multi_cpu_example
#SBATCH --partition=normal
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=1G
#SBATCH --output=log%j.out
#SBATCH --error=log%j.err

# Load conda with the module of miniforge3 distribution
module load miniforge3

# Run a python program that splits the iterations of its loop across the 4 CPUs
# (one task with 4 CPUs keeps all of them on the same node)
python square.py
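Slurm sets SLURM_CPUS_PER_TASK in the job environment when --cpus-per-task is requested, so a job script can pass the CPU count on instead of hard-coding it. Below is a minimal bash sketch. worker_count is a hypothetical helper. Whether square.py accepts a worker count is an assumption, because its code is not shown here.

```bash
#!/bin/bash
# worker_count: number of CPUs Slurm granted to this task. SLURM_CPUS_PER_TASK
# is only set when --cpus-per-task is given, so fall back to 1 otherwise.
worker_count() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

# OpenMP-based code reads OMP_NUM_THREADS to decide how many threads to start.
OMP_NUM_THREADS=$(worker_count)
export OMP_NUM_THREADS
echo "Running with ${OMP_NUM_THREADS} worker(s)"
# Hypothetical: if square.py accepted a worker count, the script could run
#   python square.py "$OMP_NUM_THREADS"
```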

EXAMPLE 4: SINGULARITY JOB

In the example below, a Python program is run inside a Singularity container, and the job is submitted to the ‘short’ partition using ‘sbatch <file_name>’.

#!/bin/bash
#SBATCH --job-name=singularity_example
#SBATCH --partition=short
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=2G
#SBATCH --output=singularity-log%j.out
#SBATCH --error=singularity-log%j.err

# Check available versions of singularity with ‘module avail singularity’
module load singularity

# Run application using singularity
singularity exec conda-miniforge3.sif python3 pyTest.py

More details about Singularity on the ERIS cluster can be found here.

EXAMPLE 5: ARRAY JOB USING R

In the example below, an R script is submitted as a SLURM job array of five tasks to the ‘filemove’ partition using ‘sbatch <file_name>’.

#!/bin/bash
#SBATCH --job-name=hello-parallel-test
#SBATCH --partition=filemove
#SBATCH --array=1-5
#SBATCH --ntasks=2
#SBATCH --mem-per-cpu=1G
#SBATCH --output=hello-%j-%a.out
#SBATCH --error=hello-%j-%a.err

# Check available versions of R using ‘module avail R’
module load R/3.5.1-foss-2018b

# Run the R script, passing the array task index ($SLURM_ARRAY_TASK_ID) as a command-line argument
Rscript hello-parallel.R "$SLURM_ARRAY_TASK_ID"

The simple R script used in Example 5, hello-parallel.R, is shown below.

#!/usr/bin/env Rscript
# Read the array task index passed on the command line
args = commandArgs(trailingOnly=TRUE)
print(paste0('Hello! I am task number: ', args[1]))

vector = c(1, 10, 100, 1000, 10000)
multiply = function(x, y) {
  return(x*y)
}

# Each array task picks a different element of the vector
num = as.integer(args[1])
res = multiply(vector[num], num)
print(paste0('Result: ', res))
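A common variant of this pattern uses the array index to select an input from a list file, one line per task. Below is a minimal bash sketch. input_for_task and the file names written to inputs.txt are hypothetical.

```bash
#!/bin/bash
# input_for_task: hypothetical helper printing line N of a list file, where N
# is the array index given as $2 or taken from SLURM_ARRAY_TASK_ID.
# With --array=1-5, task k therefore processes line k.
input_for_task() {
  local list=$1
  local idx=${2:-${SLURM_ARRAY_TASK_ID:-}}
  if [ -z "$idx" ]; then
    echo "input_for_task: no array index given" >&2
    return 1
  fi
  sed -n "${idx}p" "$list"
}

printf '%s\n' sample_a.csv sample_b.csv sample_c.csv > inputs.txt
input_for_task inputs.txt 2
```

In an array job script, input_for_task inputs.txt picks up $SLURM_ARRAY_TASK_ID automatically.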

Go to KB0039912 in the IS Service Desk
