Slurm
Note
This documentation is a work in progress. Any comment or suggestion is welcome at sinfin@iac.es.
To stay informed about updates to "Slurm", tips, etc., or to ask any questions regarding its use, please use the following IAC-Zulip channels: #computing/burros (if you are using Slurm in the "Burros") or #computing/hpc (if you are using Slurm in LaPalma or in TeideHPC).
At the IAC we use the Slurm workload manager (widely used in many research institutes and supercomputing centres) in the "burros", LaPalma and TeideHPC.
Here, we present a general guide on how to use Slurm at IAC machines and some examples of basic usage. For more detailed information, please refer to the official documentation.
Note
Some of the systems may have different configurations or extra commands. Those will be specified in the corresponding section regarding that machine.
Checking the queue
In order to check the list of jobs currently in the queue, you use the command:
squeue
It will print the list of jobs that are executing or waiting, who is running them, and how many resources they are using.
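For illustration only (the job ids, names, users and partitions below are made up, and the exact columns depend on the Slurm configuration), the default output looks similar to:
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 245418     batch  YourJob    user1  R    1:02:13      1 node01
 245419     batch OtherJob    user2 PD       0:00      1 (Priority)
Here ST is the job state (R = running, PD = pending) and NODELIST(REASON) shows either the allocated nodes or the reason why the job is still waiting.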
Tip
Use squeue --me to see only the jobs that you have submitted.
Note
In LaPalma, the squeue command only shows your own jobs (i.e. the same as squeue --me in other machines). To see the list of all jobs in the queue you can use the command squeue-all. See the LaPalma section for details.
With squeue you can get an idea of the usage, but sometimes its output contains too much information.
If you simply want to check how many CPUs are currently available in a machine:
$ sinfo -o %C
CPUS(A/I/O/T)
136/56/0/192
In this case, 136 CPUs are Allocated and 56 are Idle, out of a Total of 192 (the "O" field counts CPUs in any Other state).
Batch jobs
To run scripts in Slurm you must provide a "batch script", which is submitted using sbatch.
The script starts with information about the allocation, followed by a simple bash script that can perform any action you like (compile, run, analyse data, ...). To submit a job defined in a file, e.g. slurm.job, run:
$ sbatch slurm.job
Submitted batch job 245418 #<--- this is the <jobid>
When a job is submitted, it enters the queue. It will then be executed when there are enough computational resources available and there is no other job with higher priority waiting for them. After it finishes, or the allocated time ends, the resources will be freed and other jobs in the queue may use them.
To increase the chances of your job being executed promptly, and to improve the overall use of the machine, please set a sensible runtime (--time=hh:mm:ss). If you know that your application takes a few hours, do not ask to reserve the system for a whole day!
GPUs
Some of the burros have GPUs installed. To use them within Slurm, you must request them explicitly using the --gres=gpu:1 option, where "1" is the number of GPUs requested.
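As an illustrative sketch (the module and script names are placeholders, and the resource values should be adapted to the burro you are using), a batch script requesting one GPU could look like:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=02:00:00
#SBATCH --job-name=YourGPUJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load python             # load whatever your application needs
python3 your_gpu_script.py [args...]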
Tip
Please take a look at the default configuration for burros with GPUs, which is described here.
Sample batch scripts
Note
The scripts below try to be as generic as possible, but remember that you should adapt them for the system you are using. For example, IDL is not installed in LaPalma, and some software modules might have different names in LaPalma and in the "Burros", etc.
Attention
Remember to load all the needed modules in your batch scripts!
Python examples
Single-core Python script.
The standard error and output will be written to the files <jobid>.err and <jobid>.out, respectively. The <jobid> will be shown when you submit your job or when running squeue.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=05:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load python
python3 your_script.py [args...]
Python multiprocessing script.
Note
In the batch script below we set the variable OMP_NUM_THREADS to 1. While in some corner cases this might not be what you need, in general you should include it when submitting multiprocessing Python jobs with Slurm. The reason is the following: Numpy (and also other libraries) is generally configured to be multithreaded, so a program using it will, by default, execute the Numpy routines in parallel using as many threads as there are cores in the machine. Thus, in the example below, without setting OMP_NUM_THREADS the multiprocessing library would generate 20 processes, and if using Numpy each process would in turn generate 20 threads when executing Numpy routines. This would most likely harm the overall performance, since there would be 400 active threads while Slurm restricts the job to only 20 cores. When OMP_NUM_THREADS is set to 1, each of the 20 processes generated by the multiprocessing library has a full core to itself, generating less load on the machine and improving the overall performance of the job.
If your code uses the Python multiprocessing module for parallelization, you need to provide the size of the allocation to the script. We provide a simple example:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --time=05:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load python
export OMP_NUM_THREADS=1
python3 my_parallel_app.py $SLURM_CPUS_ON_NODE [other args...]
where my_parallel_app.py would be similar to:
import sys
from multiprocessing import Pool
import numpy as np

def function(it):
    import time
    print('Process = ', it, flush=True)
    # ====> Here you implement your iterations or anything you like <=====
    time.sleep(3)
    return it
# =========

if __name__ == '__main__':
    ncpus = int(sys.argv[1])   # available cpus for the job
    niter = 1000               # total number of iterations
    print("Number of CPU cores:", ncpus)
    p = Pool(ncpus)
    out = p.map(function, range(0, niter, 1))
    p.close()
    p.join()
    print(out)
In the case that you do not care about the order of the results (e.g., analysis of independent images), you can use this example:
import sys
from multiprocessing import Pool

def callback(result):
    print(f"The file {result} has been saved!", flush=True)

def function(it):
    import time
    #print('Process = ', it)
    # NOTE: the files can not be shared among iterations unless you have
    # extreme care
    file_in = f"my_input_file_{it}"
    file_out = f"my_output_file_{it}"
    # ====> Here you implement your iterations or anything you like <=====
    # np.load(file_in)
    # ....
    # np.save(file_out)
    time.sleep(3)
    return file_out
# =========

if __name__ == '__main__':
    ncpus = int(sys.argv[1])   # available cpus for the job
    niter = 100                # total number of iterations
    print("Number of CPU cores:", ncpus)
    p = Pool(ncpus)
    for i in range(niter):
        p.apply_async(function, args=(i,), callback=callback)
    p.close()
    p.join()
IDL jobs
Running IDL jobs that don't require user interaction is very simple, as shown with the following "Hello World" example. For this example, we have the sayhello.pro and main.pro source files:
; sayhello.pro
pro sayhello,what
print,'HELLO ',what
end
; main.pro
pro main
sayhello,'WORLD'
end
And the Slurm batch script can be as simple as:
#!/bin/bash
#SBATCH --job-name=sayhello
#SBATCH --output=hello-%j.out
#SBATCH --error=hello-%j.err
idl -e main
Generating plots is not a problem either, as long as you have functions that automate everything for you, so that user interaction is not required (a popular pair is plopen and plclose, developed at Goddard).
Multi-core (MPI) applications
#!/bin/bash
#SBATCH --ntasks=4 # 4 mpi processes
#SBATCH --time=15:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load [...]
srun ./your_mpi_application
Note
To start an MPI application, use srun rather than mpirun. Depending on how the system is configured, you might also need to indicate which PMI to use (i.e. something like srun --mpi=pmi2 ./your_mpi_application). Details for each system are given in the corresponding section regarding that machine.
Multi-core (OpenMP) applications
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load [...]
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK srun ./your_parallel_application
Note
In some versions of Slurm, the variable OMP_NUM_THREADS would be defined according to the --cpus-per-task option in the batch script, and it would also be inherited by the srun command. Starting with version 22.05, srun does not inherit the value of --cpus-per-task, and (at least in some versions of Slurm) the variable OMP_NUM_THREADS is not defined according to the --cpus-per-task value. To avoid problems, we recommend running OpenMP applications as in the example above, explicitly specifying the value of OMP_NUM_THREADS and using the srun command.
Multi-core (MPI+OpenMP) applications
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --time=15:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load [...]
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK srun ./pi
Note
Read the notes for MPI and OpenMP applications above to understand the options chosen in this example for MPI+OpenMP apps.
Array of single-core jobs
Using Job Arrays
If a script needs to be executed over a series of input data (e.g., files of the form file0.dat, file1.dat, ...), a job array can be used (more information about this can be found in the official documentation). In the following example, a total of 10 jobs will be queued, and each one will process a different file.
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1    # Each job uses only one core
#SBATCH --time=05:00:00      # And lasts for max 5 hours
#SBATCH --job-name=YourJob
#SBATCH --output=%j_%a.out   # <jobid>_<arrayid>.out
#SBATCH --error=%j_%a.err
#SBATCH --array=0-9          # 10 different jobs
inputfile=file${SLURM_ARRAY_TASK_ID}.dat
python3 your_script.py $inputfile
The example above works fine if we want to submit a fixed number of jobs (in this case 10), but it is not ideal if the number of jobs to submit is not known in advance. In this case we have to find a workaround to modify the --array directive programmatically. We can do it as follows: imagine we have a file stars.txt with a number of star IDs that we need to process, one per line. Then we can have a submit file named stars_array.batch like the following:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00
#SBATCH --job-name=stars
#SBATCH --output=star_logs/%j_%a.out
#SBATCH --error=star_logs/%j_%a.err
#SBATCH --array=0-98%10
#SBATCH -D .
module load python
export OMP_NUM_THREADS=1
readarray -t stars < stars.txt
python RV_measurement.py ${stars[$SLURM_ARRAY_TASK_ID]}
By using the readarray command we can run the RV_measurement.py script for each line in the stars.txt file (passing it as an argument), but the --array directive would be wrong, as it contains a fixed number. To avoid this, we can set this directive depending on the number of lines in stars.txt using a script named, for example, stars.submit.sh, as follows:
#!/usr/bin/bash
end=`awk 'END { print NR - 1 }' stars.txt`
sbatch --array=0-$end%20 stars_array.batch
With this, if we run ./stars.submit.sh, the script will calculate the number of lines in the stars.txt file and submit a job using the stars_array.batch submission script, but modifying the --array directive to create the correct number of jobs.
Tip
Take into account that you can submit a large number of jobs, but they may not fit in the machine all at once! If you require large job arrays, consider looking into HTCondor.
Using GNU Parallel
GNU Parallel is a shell tool for executing jobs in parallel using one or more computers. It is useful on its own, but we can also use it together with Slurm to submit a large array of jobs. With it, we can submit a single job where we allocate a given number of nodes, and GNU Parallel can take care of launching tasks in this allocation as slots become available. A useful reference cheat-sheet can be found here, and a more detailed tutorial here.
As an example, we could have the following Slurm batch script, where we ask for 64 CPUs in one single job. Then we instruct GNU Parallel to run 64 tasks in parallel (-j64, to fill the whole allocation with jobs), running on the machines listed in the file $temp_file (obtained by running the command scontrol show hostnames, to translate from Slurm to GNU Parallel syntax), and running the script $script, passing as first argument the current directory ($PWD) and as second argument the sequence 1-100 (so in total there will be 100 tasks, where the second argument will be different for each, taken from the list 1-100). The option --tag tells GNU Parallel to prefix each line of output with the arguments of the task, so as to easily identify where the output comes from.
#!/bin/bash
##########################################################
#SBATCH -J sequential_PA
#SBATCH -n 64
#SBATCH -t 00:04:00
#SBATCH -o sequential_PA-%j.out
#SBATCH -e sequential_PA-%j.err
#SBATCH -D .
##########################################################
module purge
module load gnu gnuparallel
# write list of allocated nodes into a temporary file
temp_file=$(mktemp -q)
scontrol show hostnames > $temp_file
script=$PWD/sleep.sh
parallel -j64 --tag --slf $temp_file $script ::: $PWD ::: `seq 1 100`
The sleep.sh script could be just like the toy example shown below, illustrating how the current directory is passed as the first argument ($1), and a number from the sequence 1-100 as the second one.
#!/bin/bash
echo "EXECUTING TASK ID: $2. About to sleep $2"
sleep $2
cat <<EOF > $1/output/outtest$2.out
`uname -a`
current path: "$PWD"
EOF
Jobs with dependencies
If we need to run a number of jobs that depend on each other, we use the --dependency option of the sbatch command. With it, we can specify different types of dependencies: for example, start a job a number of minutes after another job has started; start a job after another job has failed its execution; etc. (see the details in the official Slurm documentation).
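For instance (the script names below are just placeholders), a job that should only start after job 245418 has finished successfully, and another that should start when it ends regardless of its exit status, could be submitted with:
$ sbatch --dependency=afterok:245418 postprocess.sh
$ sbatch --dependency=afterany:245418 cleanup.sh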
While setting job dependencies manually is straightforward, we sometimes need to programmatically submit a number of jobs with dependencies. Direct support for these workflows is not provided by Slurm, but we can easily create some scripts to help us in these situations. Below we provide examples for pipelines and DAGs.
Pipelines
The following example shows how to submit 20 jobs (nruns=20), where each job will start only if the previous job has finished successfully (dependency=afterok:$id). The job_submit.sh script is a regular Slurm batch script, and in it we can specify which task to perform by using the variable $nt (which we pass to the script via --export=nt="$nrun").
#!/bin/bash
nruns=20
id=$(sbatch --parsable --export=nt='1' job_submit.sh)
echo "Submitted job $id"
for nrun in `seq 2 $nruns`; do
    id=$(sbatch --parsable --dependency=afterok:$id --export=nt="$nrun" job_submit.sh);
    echo "Submitted job $id"
done
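The contents of job_submit.sh are not shown here; a minimal sketch (the processing script name is hypothetical, and the requested resources should be adapted to each step) could be:
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err
# $nt is set by the driver script above via --export=nt="$nrun"
echo "Running pipeline step $nt"
python3 process_step.py "$nt"    # hypothetical processing script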
DAG
More complicated pipelines, where a job depends on a number of previous jobs, can be modelled with Directed Acyclic Graphs (DAGs). This can also be done with a bash script, using the --dependency option of the sbatch command. In the following example, we create a simple DAG where jobs A and B can execute in parallel, job C can only start after A and B have finished successfully, and jobs D and E can also execute in parallel, but only after a successful execution of job C. As you can see below, each of the individual jobs A-E can have its own Slurm batch script, so each job could be any type of Slurm job (sequential, job array, parallel, etc.).
#!/bin/bash
jAid=$(sbatch jarrayA.sh | sed 's/Submitted batch job //')
jBid=$(sbatch jarrayB.sh | sed 's/Submitted batch job //')
jCid=$(sbatch --dependency=afterok:$jAid:$jBid jarrayC.sh | sed 's/Submitted batch job //')
jDid=$(sbatch --dependency=afterok:$jCid jarrayD.sh | sed 's/Submitted batch job //')
jEid=$(sbatch --dependency=afterok:$jCid jarrayE.sh | sed 's/Submitted batch job //')
Tip
Remember that you can check the status of your job with squeue, and follow its output (if any) with tail -f <outputfile>.
Cancelling jobs
After getting the id of your job, either from sbatch or squeue, run:
scancel <jobid>
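For example, using the job id returned by sbatch earlier:
$ scancel 245418       # cancel a single job
$ scancel -u $USER     # cancel all of your queued and running jobs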
Advanced Topics
CPU Management
Slurm can provide a very fine control of where each process of your job should run, but in their own words, "the interactions between different CPU management options are complex and often difficult to predict" (https://slurm.schedmd.com/cpu_management.html).
In this section we just provide some background information and some of the Slurm commands and options which we believe are the most useful given the IAC resources and users. Many other options are available, and this can get confusing pretty quickly, but hopefully it will give you an idea of the fine-grained control that you can have if your application needs it. Those seeking further details can follow the Slurm documentation sections regarding CPU Management, Multi-core/multi-thread support, the sbatch command, the srun command, FAQs, etc.
First of all, we need to understand that, for Slurm, a CPU is the smallest
processing unit of a node, and this can be either a "core" or a "hardware
thread" (for those systems that have them enabled). If you are unsure whether HT
(hardware threads) are enabled in a system, you can issue the following sinfo
command, which will return the number of available "sockets", "cores" and
"threads". If, as in the example below, the number of threads is not one, then
HT are enabled.
sinfo -O SocketCoreThread
S:C:T
2:22:2
In a system like the one in the example above, it can make a big difference for parallel applications whether the processes all run in the same socket, whether hardware threads are used, which processes run in each socket, etc. (this is outside the scope of these notes, but do get in touch with us if you want further information).
Assuming that you know which CPUs you would like your job to use and how to distribute the tasks amongst those CPUs, the following notes will help you understand how to instruct Slurm to do so. The following applies to the IAC "burros"; other systems might be configured differently, and thus the Slurm behaviour might not always be the same.
CPU allocation
CPU allocation refers to which CPUs Slurm is going to allocate for your job (either via a batch script or an interactive session). In the IAC "burros", you just need to understand the options -n, --ntasks-per-socket and --ntasks-per-core.
-n is the total number of tasks you will run for this job. This almost always equates to the number of processes you want to run. If your application is multi-threaded, -n will be the total number of threads to use (this can be done in other ways, but we believe this is the simplest to understand).
--ntasks-per-socket and --ntasks-per-core (the latter only useful in systems with HT enabled) help to specify how to distribute the requested number of tasks amongst sockets and cores.
The options --ntasks-per-socket and --ntasks-per-core are optional, to be used when you need to give Slurm more information about how you want to allocate the requested CPUs. In many cases, just using the option -n will be sufficient, but then the allocation will follow the Slurm defaults and its configuration at the IAC.
Some examples will help to better understand the options above. Below we use the command sinter, available in the IAC "burros", but the same options could be used in a batch script (see the sketch after the examples below). Remember that these commands are used only to specify which CPUs will be allocated; which tasks run in which CPUs will be specified later with the srun command. For all Slurm jobs you can see exactly which CPUs have been allocated to your job with the following command, looking at the field CPU_IDs:
scontrol show job -d <SLURM_JOBID>
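The relevant part of the output would look something like this (illustrative values only; the node name and CPU ids will depend on your allocation):
$ scontrol show job -d 245418
...
   Nodes=node01 CPU_IDs=0-7 ...
...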
a) If we want to run an MPI application with 8 processes, all in the same socket, in a system with no HT, we would request those CPUs with:
sinter -n 8 --ntasks-per-socket 8
b) As above, but allocating 4 CPUs in each socket. (This same command could be used, for example, for an MPI+OpenMP application where we want to run two MPI tasks in each socket, each task using two OpenMP threads.)
sinter -n 8 --ntasks-per-socket 4
c) As above, but in a system with HT enabled, requesting to use two threads per core (i.e. using hardware threads).
sinter -n 8 --ntasks-per-socket 4 --ntasks-per-core 2
If instead of the above command you run sinter -n 6 --ntasks-per-socket 3 --ntasks-per-core 2, the allocation given by Slurm might surprise you, but we won't go into the details here. Please get in touch with us if interested.
d) As above, but requesting to use only one thread per core (i.e. not using hardware threads). In essence we are asking Slurm to allocate eight full cores, but "discarding" one of the hardware threads of each core. Thus, Slurm will bill you for 16 CPUs.
sinter -n 8 --ntasks-per-socket 4 --ntasks-per-core 1
Note
While you might think that this is wasteful and that option c) above should be better, this is likely not the case, for example, for CPU-bound MPI applications: hardware threads are not "real" processing units, so for CPU-bound applications using them gives you basically the same computational power as not using them. At the same time, using hardware threads forces the operating system to perform extra work switching context between more threads, and the memory requested will be much larger than when hardware threads are not used, which will likely cause more memory cache misses and worse performance.
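For completeness, the allocation request of example a) above could equally be made from a batch script, using the same options as #SBATCH directives (a sketch only; adapt the application and resources to your job):
#!/bin/bash
#SBATCH -n 8
#SBATCH --ntasks-per-socket=8
srun ./your_mpi_application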
CPU distribution and binding
CPU binding refers to where each process/thread of your job will run. In the IAC "burros", you just need to understand the options -m, --cpu-bind and --cpus-per-task.
-m tells Slurm how to distribute the tasks amongst different nodes, sockets and cores. In the IAC "burros", only the last two are of interest, since Slurm operates within an individual node.
--cpu-bind is used to specify whether a process/thread is "bound" to a particular set of CPUs.
--cpus-per-task is used for multi-threaded applications, in order to specify how many threads each process will spawn.
As in the "CPU allocation" section, some examples will help to clarify the options above. If you want to see exactly where each process/thread is going to run, you can add the option --cpu-bind=verbose to the srun commands below (just add verbose to the other --cpu-bind options if you are already using them). Check the page https://slurm.schedmd.com/srun.html or get in touch with us to understand how to interpret the "masks" provided.
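For instance, a command like the following prints one binding line per task to the standard error (the mask values shown here are only illustrative and will differ on your system):
$ srun -n 2 --cpu-bind=verbose ./your_parallel_application
cpu-bind=MASK - node01, task  0  0 [12345]: mask 0x1 set
cpu-bind=MASK - node01, task  1  1 [12346]: mask 0x2 set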
Imagine that you run example b) above and you get an allocation of eight CPUs in two different sockets. If you want to run an MPI application with eight tasks, you might be interested in distributing the tasks to the sockets in blocks, or with a cyclic distribution.
If using "block distribution", one socket will run tasks [0,1,2,3] and the other socket tasks [4,5,6,7]. The
srun
command to use in this case would look like:srun -n 8 -m *:block <application>
If using "cyclic distribution", one socket will run tasks [0,2,4,6] and the other socket tasks [1,3,5,7]. The
srun
command to use in this case would look like:srun -n 8 -m *:cyclic <application>
Following on from example b) above, imagine your application is hybrid MPI+OpenMP and you want to run two MPI tasks in each socket, each task using two OpenMP threads. The option --cpus-per-task is very useful in this situation, because we can use the following command:
srun -n 4 -m *:cyclic --cpus-per-task=2 <application>
This way, one socket will run MPI tasks [0,2] and the other socket MPI tasks [1,3], while the corresponding OpenMP threads for each task will run in the same socket.
In the example above, we might want to force each thread to always run on the same hardware thread, or perhaps we want to allow it to run on any hardware thread as long as it remains within the same core, or within the same socket, or even without any restrictions within the node. This is possible with the --cpu-bind option.
Luckily, the default binding option is "autobind", which will be the right option in most cases. As per the Slurm documentation, "if the job step allocation includes an allocation with a number of sockets, cores, or threads equal to the number of tasks times cpus-per-task, then the tasks will by default be bound to the appropriate resources (auto binding)." Thus, in the example above, since we are requesting to use 8 CPUs with srun and the allocation has 8 hardware threads, Slurm will bind to "threads", meaning that each thread will not be moved to run on a different hardware thread by the operating system.
One common scenario for a pure MPI application in a system with HT enabled is to make sure that each MPI task uses a full core, as explained in example d) above. In this case, if we are not worried about how each individual task is distributed (by default it will be a block distribution to sockets), auto-binding will do the right thing, and we would simply need to issue an srun command as follows:
srun -n 8 <application>
Since the allocation was created with the option --ntasks-per-core 1, each task will be scheduled to run in a different core, and thanks to "autobind" (we have an allocation of 8 cores and we are running 8 tasks) each task will be bound to a core (so it will be able to use either of its hardware threads, but always within the same core).