Slurm

Note

This documentation is a work in progress. Any comment or suggestion is welcome at sinfin@iac.es.

To stay informed about updates to "Slurm", tips, etc., or to ask any questions regarding its use, please use the following IAC-Zulip channels: #computing/burros (if you are using Slurm in the "Burros") or #computing/hpc (if you are using Slurm in LaPalma or in TeideHPC).

At the IAC we use the Slurm workload manager (widely used in many research institutes and supercomputing centres) in the "burros", LaPalma and TeideHPC.

Here, we present a general guide on how to use Slurm at IAC machines and some examples of basic usage. For more detailed information, please refer to the official documentation.

Note

Some of the systems may have different configurations or extra commands. Those will be specified in the corresponding section regarding that machine.

Checking the queue

In order to check the list of jobs currently in the queue, you use the command:

squeue

It will print the list of jobs that are executing or waiting, who is running them and how many resources they are using.
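
For illustration only, the output typically looks like the following (the job, user, partition and node names here are made up):

$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 245418     batch  YourJob    user1  R      10:03      1 node01
 245420     batch OtherJob    user2 PD       0:00      1 (Priority)

The ST column shows the job state: R means running and PD pending.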

Tip

Use squeue --me to only see the jobs that you have submitted.

Note

In LaPalma, the squeue command only shows your jobs (i.e. the same as squeue --me in other machines). To see the list of all jobs in the queue you can use the command squeue-all. See the LaPalma section for details.

With squeue you can get an idea of the usage, but sometimes it shows too much information. If you simply want to check how many CPUs are currently available in a machine:

$ sinfo -o %C
CPUS(A/I/O/T)
136/56/0/192

In this case, 136 cores are Allocated and 56 are Idle, out of a Total of 192 (the third number counts cores in any Other state, e.g. down or drained; here it is 0).

Batch jobs

To run scripts in Slurm you must provide a "batch script", which is submitted using sbatch.

The script starts with a set of #SBATCH directives describing the requested allocation, followed by a regular bash script that can perform any action you like (compile, run, analyse data, ...). To submit a job defined in a file, e.g., slurm.job, run

$ sbatch slurm.job
Submitted batch job 245418  #<--- this is the <jobid>
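
For reference, a minimal slurm.job could be a sketch like the one below (the resource values and program name are just placeholders; complete examples are given in the "Sample batch scripts" section):

#!/bin/bash
#SBATCH --ntasks=1              # number of tasks (processes)
#SBATCH --cpus-per-task=1       # cores per task
#SBATCH --time=01:00:00         # maximum runtime (hh:mm:ss)
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out         # standard output goes to <jobid>.out
#SBATCH --error=%j.err          # standard error goes to <jobid>.err

./your_program [args...]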

When a job is submitted, it enters the queue. It will then be executed when there are enough computational resources available and there is no other job with higher priority waiting for them. After it finishes or the allocated time ends, the resources will be freed and other jobs in the queue may use them.

To increase the possibility of your job being executed, and to improve the use of the machine, please set a sensible runtime (--time=hh:mm:ss). If you know that your application takes a few hours, do not ask to reserve the system for a whole day!

GPUs

Some of the burros have GPUs installed. To use them within Slurm, you must request them explicitly using the --gres=gpu:1 option, where "1" is the number of GPUs requested.
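
As an illustrative sketch (the module name and the application are placeholders that depend on the system), a batch script requesting one GPU could look like:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1            # request one GPU
#SBATCH --time=02:00:00
#SBATCH --job-name=YourGPUJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load cuda                # hypothetical module name; check what is installed on your system

nvidia-smi                      # list the GPU(s) visible to the job
./your_gpu_application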

Tip

Please take a look at the default configuration for burros with GPUs, which is described here.

Sample batch scripts

Note

The scripts below try to be as generic as possible, but remember that you should adapt them for the system you are using. For example, IDL is not installed in LaPalma, and some software modules might have different names in LaPalma and in the "Burros", etc.

Attention

Remember to load all the needed modules in your batch scripts!

Python examples

  • Single-core Python script.

    The standard error and output will be written to the files <jobid>.err and <jobid>.out, respectively. The <jobid> will be shown when you submit your job or when doing squeue.

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --time=05:00:00
    #SBATCH --job-name=YourJob
    #SBATCH --output=%j.out
    #SBATCH --error=%j.err
    module load python
    
    python3 your_script.py [args...]
    
  • Python multiprocessing script.

    Note

    In the batch script below we set the variable OMP_NUM_THREADS to 1. While in some corner cases this might not be what you need, in general you should include it when submitting multiprocessing Python jobs with Slurm. The reason is the following: Numpy (like some other libraries) is generally built to be multithreaded, so a program using it will, by default, execute the Numpy routines in parallel with as many threads as there are cores in the machine. Thus, in the example below, without setting OMP_NUM_THREADS, the multiprocessing library would create 20 processes, and if using Numpy each of those processes would in turn spawn 20 threads when executing Numpy routines. This would most likely harm the overall performance, since there would be 400 active threads while Slurm restricts the job to only 20 cores. By setting OMP_NUM_THREADS to 1, each of the 20 processes created by the multiprocessing library has a full core to itself, generating less load on the machine and improving the overall performance of the job.

    If the code uses the Python multiprocessing module for parallelization, you need to provide the size of the allocation to the script. We provide a simple example:

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=20
    #SBATCH --time=05:00:00
    #SBATCH --job-name=YourJob
    #SBATCH --output=%j.out
    #SBATCH --error=%j.err
    module load python
    
    export OMP_NUM_THREADS=1
    python3 my_parallel_app.py $SLURM_CPUS_ON_NODE [other args...]
    

    where my_parallel_app.py would be similar to:

    import sys
    import time
    from multiprocessing import Pool
    import numpy as np

    def function(it):
        print('Process = ', it, flush=True)
        # ====> Here you implement your iterations or anything you like <=====
        time.sleep(3)
        # =========
        return it
    
    if __name__=='__main__':
        ncpus = int(sys.argv[1]) # available cpus for the job
        niter = 1000             # total number of iterations
        print("Number of CPU cores:", ncpus)
        p = Pool(ncpus)
        out = p.map(function, range(0,niter,1))
        p.close()
        p.join()
        print(out)
    

    In the case that you do not care about the order of the results (e.g., analysis of independent images), you can use this example:

    import sys
    import time
    from multiprocessing import Pool

    def callback(result):
        print(f"The file {result} has been saved!", flush=True)

    def function(it):
        #print('Process = ', it)
        # NOTE: the files cannot be shared among iterations unless you take
        #       extreme care
        file_in = f"my_input_file_{it}"
        file_out = f"my_output_file_{it}"
        # ====> Here you implement your iterations or anything you like <=====
        # np.load(file_in)
        # ....
        # np.save(file_out)
        time.sleep(3)
        # =========
        return file_out
    
    if __name__=='__main__':
        ncpus = int(sys.argv[1]) # available cpus for the job
        niter = 100              # total number of iterations
        print("Number of CPU cores:", ncpus)
        p = Pool(ncpus)
        for i in range(niter):
            p.apply_async(function, args=(i,), callback=callback)
        p.close()
        p.join()
    

IDL jobs

Running IDL jobs that don't require user interaction is very simple, as shown with the following "Hello World" example. For this example, we have the sayhello.pro and main.pro source files:

; sayhello.pro
pro sayhello,what
  print,'HELLO ',what
end
; main.pro
pro main
  sayhello,'WORLD'
end

And the Slurm batch script can be as simple as:

#!/bin/bash
#SBATCH --job-name=sayhello
#SBATCH --output=hello-%j.out
#SBATCH --error=hello-%j.err

idl -e main

Generating plots is not a problem either, as long as you have functions that automate everything for you, so that user interaction is not required (a popular pair is plopen and plclose, developed at Goddard).

Multi-core (MPI) applications

#!/bin/bash
#SBATCH --ntasks=4                # 4 mpi processes
#SBATCH --time=15:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load [...]

srun ./your_mpi_application

Note

To start an MPI application, use srun rather than mpirun. Depending on how the system is configured you might also need to indicate which PMI to use (i.e. something like srun --mpi=pmi2 ./your_mpi_application). Details for each system are given in the corresponding section regarding that machine.

Multi-core (OpenMP) applications

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=15:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load [...]

OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK srun ./your_parallel_application

Note

In some versions of Slurm, the variable OMP_NUM_THREADS would be defined according to the --cpus-per-task option in the batch script and it would be inherited as well by the srun command. Starting with version 22.05, srun does not inherit the value of --cpus-per-task and (at least in some versions of Slurm) the variable OMP_NUM_THREADS is not defined according to the --cpus-per-task value. In order to avoid problems, we recommend to run OpenMP applications as in the example above, explicitly specifying the value of OMP_NUM_THREADS and using the srun command.

Multi-core (MPI+OpenMP) applications

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --time=15:00:00
#SBATCH --job-name=YourJob
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load [...]

OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK srun ./pi

Note

Read the notes for MPI and OpenMP applications above to understand the options chosen in this example for MPI+OpenMP apps.

Array of single-core jobs

  • Using Job Arrays

    If a script needs to be executed over a series of input data (e.g., files of the form file0.dat, file1.dat, ...), a job array can be used (more information can be found in the official documentation).

    In the following example, a total of 10 jobs will be queued, and each one will process a different file.

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1   # Each job uses only one core
    #SBATCH --time=05:00:00     # And lasts for max 5 hours
    #SBATCH --job-name=YourJob
    #SBATCH --output=%j_%a.out  # <jobid>_<arrayid>.out
    #SBATCH --error=%j_%a.err
    #SBATCH --array=0-9         # 10 different jobs
    
    inputfile=file${SLURM_ARRAY_TASK_ID}.dat
    python3 your_script.py $inputfile
    

    The example above works fine if we want to submit a fixed number of jobs (in this case 10), but it is not ideal if the number of jobs to submit is not known in advance. In that case we have to find a workaround to modify programmatically the range given in the --array directive. We can do it as follows: imagine we have a file stars.txt with a number of star IDs that we need to process, one per line. Then we can have a submit file named stars_array.batch like the following:

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --time=02:00:00
    #SBATCH --job-name=stars
    #SBATCH --output=star_logs/%j_%a.out
    #SBATCH --error=star_logs/%j_%a.err
    #SBATCH --array=0-98%10
    #SBATCH -D .
    
    module load python
    export OMP_NUM_THREADS=1
    
    readarray -t stars < stars.txt
    
    python RV_measurement.py ${stars[$SLURM_ARRAY_TASK_ID]}
    

    By using the command readarray we can run the RV_measurement.py script for each line in the stars.txt file (passing it as an argument), but the range given in the --array directive would still be wrong, since it is a fixed number (the %10 suffix only limits how many array tasks may run simultaneously). To avoid this, we can adjust this directive depending on the number of lines in the stars.txt file, using a script named, for example, stars.submit.sh as follows:

    #!/usr/bin/bash
    
    end=`awk 'END { print NR - 1 }' stars.txt`
    sbatch  --array=0-$end%20 stars_array.batch
    

    With this, if we run ./stars.submit.sh, the script will calculate the number of lines in the stars.txt file and submit a job using the stars_array.batch submission script, but modifying the --array directive to create the correct number of jobs.

    Tip

    Take into account that you can submit a large number of jobs, but they may not fit in the machine all at once! If you require large job arrays, consider looking into HTCondor.

  • Using GNU Parallel

    GNU Parallel is a shell tool for executing jobs in parallel using one or more computers. It is useful on its own, but we can also use it together with Slurm to submit a large array of jobs. With it, we can submit a single job where we allocate a given number of CPUs, and GNU Parallel takes care of launching tasks within this allocation as slots become available. A useful reference cheat-sheet can be found here, and a more detailed tutorial here.

    As an example, we could have the following Slurm batch script, where we ask for 64 CPUs in one single job. Then we instruct GNU Parallel to run 64 tasks in parallel (-j64, to fill the whole allocation), running on the machines listed in the file $temp_file (obtained with the command scontrol show hostnames, which translates from Slurm to GNU Parallel syntax), and executing the script $script, passing as first argument the current directory ($PWD) and as second argument a number from the sequence 1-100 (so in total there will be 100 tasks, each with a different second argument taken from that sequence). [The option --tag tells GNU Parallel to prefix each line of output with the arguments of the task, so as to easily identify where the output comes from.]

    #!/bin/bash
    ##########################################################
    #SBATCH -J sequential_PA
    #SBATCH -n 64
    #SBATCH -t 00:04:00
    #SBATCH -o sequential_PA-%j.out
    #SBATCH -e sequential_PA-%j.err
    #SBATCH -D .
    ##########################################################
    
    module purge
    module load gnu gnuparallel
    
    # write list of allocated nodes into a temporary file
    temp_file=$(mktemp -q)
    scontrol show hostnames > $temp_file
    
    script=$PWD/sleep.sh
    parallel -j64 --tag --slf $temp_file $script ::: $PWD ::: `seq 1 100`
    

    The sleep.sh script could be as simple as the toy example shown below, which illustrates how the current directory is passed as the first argument ($1), and a number from the sequence 1-100 as the second one.

    #!/bin/bash
    
    echo "EXECUTING TASK ID: $2. About to sleep $2"
    sleep $2
    
    # write a small report (the output/ directory is assumed to exist below $1)
    cat <<EOF > $1/output/outtest$2.out
    `uname -a`
    current path: "$PWD"
    EOF
    

Jobs with dependencies

If we need to run a number of jobs that depend on each other, we use the --dependency option of the sbatch command. With it, we can specify different types of dependencies, for example: start a job a number of minutes after another job has started; start a job after another job has failed, etc. (see the details in the official Slurm documentation).
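
For example, assuming a previously submitted job with (hypothetical) id 12345, dependencies can be set manually like this:

# run second.job only if job 12345 finishes successfully
sbatch --dependency=afterok:12345 second.job

# run cleanup.job once job 12345 terminates, regardless of its exit status
sbatch --dependency=afterany:12345 cleanup.job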

While setting job dependencies manually is straightforward, we sometimes need to programmatically submit a number of jobs with dependencies. Direct support for these workflows is not provided by Slurm, but we can easily create some scripts to help us in these situations. Below we provide examples for pipelines and DAGs.

  • Pipelines

    The following example shows how to submit 20 jobs (nruns=20), where each job will start only if the previous job has finished successfully (dependency=afterok:$id). The job_submit.sh script is a regular Slurm batch script, and in it we can specify which task to perform by using the variable $nt (which we pass to the script via --export=nt="$nrun").

    #!/bin/bash
    
    nruns=20
    
    id=$(sbatch --parsable --export=nt='1' job_submit.sh)
    echo "Submitted job $id"
    for nrun in `seq 2 $nruns`; do
       id=$(sbatch --parsable --dependency=afterok:$id --export=nt="$nrun" job_submit.sh);
       echo "Submitted job $id"
    done
    
  • DAG

    More complicated pipelines, where a job depends on a number of previous jobs, can be modelled with Directed Acyclic Graphs (DAGs). This can also be done with a bash script, using the --dependency option of the sbatch command. In the following example, we create a simple DAG where jobs A and B can execute in parallel, job C can only start after A and B have finished successfully, and jobs D and E can also execute in parallel, but only after a successful execution of job C. As you can see below, each of the individual jobs A-E can have its own Slurm batch script, so each job could be any type of Slurm job (sequential, job array, parallel, etc.).

    #!/bin/bash
    
    jAid=$(sbatch jarrayA.sh | sed 's/Submitted batch job //')
    jBid=$(sbatch jarrayB.sh | sed 's/Submitted batch job //')
    
    jCid=$(sbatch --dependency=afterok:$jAid:$jBid jarrayC.sh | sed 's/Submitted batch job //')
    
    jDid=$(sbatch --dependency=afterok:$jCid jarrayD.sh | sed 's/Submitted batch job //')
    jEid=$(sbatch --dependency=afterok:$jCid jarrayE.sh | sed 's/Submitted batch job //')
    

Tip

Remember that you can check the status of your job with squeue, and follow the output (if any), with tail -f <outputfile>.

Cancelling jobs

After getting the id of your job either from sbatch or squeue, run scancel <jobid>.
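
For instance, using the job id shown in the sbatch example above:

$ scancel 245418          # cancel a specific job
$ scancel -u $USER        # cancel all of your jobs at once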

Advanced Topics

CPU Management

Slurm can provide a very fine control of where each process of your job should run, but in their own words, "the interactions between different CPU management options are complex and often difficult to predict" (https://slurm.schedmd.com/cpu_management.html).

In this section we just provide some background information and some of the Slurm commands and options which we believe are the most useful given the IAC resources and users. Many other options are available, and this can get confusing pretty quickly, but hopefully it will give you an idea of the fine-grained control that you can have if your application needs it. Those seeking further details can follow the Slurm documentation sections regarding CPU Management, Multi-core/multi-thread support, the sbatch command, the srun command, FAQs, etc.

First of all, we need to understand that, for Slurm, a CPU is the smallest processing unit of a node, and this can be either a "core" or a "hardware thread" (for those systems that have them enabled). If you are unsure whether HT (hardware threads) are enabled in a system, you can issue the following sinfo command, which will return the number of available "sockets", "cores" and "threads". If, as in the example below, the number of threads is not one, then HT are enabled.

sinfo -O SocketCoreThread
S:C:T
2:22:2

In a system like the one in the example above it can make a big difference for parallel applications whether the processes are all running in the same socket; whether hardware threads are used; which processes are running in each socket, etc. (this is out of the scope of these notes, but do get in touch with us if you want further information).

Assuming that you know which CPUs you would like your job to use and how to distribute the tasks amongst those CPUs, the following notes will help you understand how to instruct Slurm to do so. The following applies to the IAC "burros". Other systems might be configured differently and thus the Slurm behaviour might not always be the same.

CPU allocation

CPU allocation refers to which CPUs Slurm is going to allocate for your job (either via a batch script or an interactive session). In the IAC "burros", you just need to understand the options -n, --ntasks-per-socket and --ntasks-per-core.

  • -n is the total number of tasks you will run for this job. This almost always equates to the number of processes you want to run. If your application is multi-threaded, -n will be the total number of threads to use (this can be done in other ways, but we believe this is the simplest to understand).

  • --ntasks-per-socket and --ntasks-per-core (only useful in systems with HT enabled) help to specify how to distribute the requested number of tasks amongst sockets and cores.

Options --ntasks-per-socket and --ntasks-per-core are optional, to be used when you need to give Slurm more information about how you want to allocate the requested CPUs. In many cases, just using the option -n will be sufficient, but then the allocation will follow the Slurm defaults and its configuration at the IAC.

Some examples will help to better understand the options above. Below we use the command sinter available in the IAC "burros", but the same options could be used in a batch script. Remember that these commands are used only to specify which CPUs will be allocated. Which tasks run in which CPUs will be specified later with the srun command. For all Slurm jobs you can see exactly which CPUs have been allocated to your job with the following command, and looking at the field CPU_IDs:

scontrol show job -d <SLURM_JOBID>
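
For illustration, the relevant part of the output could look like the following (the node name and CPU ids here are made up):

Nodes=node01 CPU_IDs=0-7 ...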

  • a) If we want to run an MPI application with 8 processes, all in the same socket of a system with no HT, we would request those CPUs with:

    sinter -n 8 --ntasks-per-socket 8
    
  • b) As above, but allocating 4 CPUs in each socket. (This same command could be used, for example, for an MPI+OpenMP application where we want to run two MPI tasks in each socket, each task using two OpenMP threads).

    sinter -n 8 --ntasks-per-socket 4
    
  • c) As above, but in a system with HT enabled, requesting to use two threads per core (i.e. using hardware threads).

    sinter -n 8 --ntasks-per-socket 4 --ntasks-per-core 2
    

    If instead of the above command you run sinter -n 6 --ntasks-per-socket 3 --ntasks-per-core 2, the allocation given by Slurm might surprise you, but we won't go into the details here. Please get in touch with us if interested.

  • d) As above, but requesting to use only one thread per core (i.e. not using hardware threads). In essence we are asking Slurm to allocate eight full cores, but "discarding" one of the hardware threads. Thus, Slurm will bill you for 16 CPUs.

    sinter -n 8 --ntasks-per-socket 4 --ntasks-per-core 1
    

    Note

    While you might think that this is wasteful and that option c) above should be better, this is likely not the case, for example, for CPU-bound MPI applications: hardware threads are not "real" processing units, so for CPU-bound applications using them will give you basically the same computational power as not using them. At the same time, using hardware threads forces the operating system to do extra work switching context between more threads, and the overall memory in use will be larger than when hardware threads are not used, which will likely cause more cache misses and worse performance.

CPU distribution and binding

CPU binding refers to where each process/thread of your job will run. In the IAC "burros", you just need to understand the options -m, --cpu-bind and --cpus-per-task.

  • -m tells Slurm how to distribute the tasks amongst different nodes, sockets and cores. In the IAC "burros", only the last two are of interest since Slurm operates only with an individual node.

  • --cpu-bind is used to specify whether a process/thread is "bound" to a particular set of CPUs.

  • --cpus-per-task is used for multi-threaded applications, in order to specify how many threads each process will spawn.

As in the "CPU allocation" section above, some examples will help to clarify these options. If you want to see exactly where each process/thread is going to run, you can add the option --cpu-bind=verbose to the srun commands below (just add verbose to the other --cpu-bind options if you are already using them). Check the page https://slurm.schedmd.com/srun.html or get in touch with us to understand how to interpret the "masks" that are reported.
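
For instance (the application name is just a placeholder):

srun -n 8 --cpu-bind=verbose ./your_mpi_application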

  • Imagine that you run example b) above and you get an allocation of eight CPUs in two different sockets. If you want to run an MPI application with eight tasks, you might be interested in distributing the tasks to the sockets in blocks, or with a cyclic distribution.

    If using "block distribution", one socket will run tasks [0,1,2,3] and the other socket tasks [4,5,6,7]. The srun command to use in this case would look like:

    srun -n 8 -m *:block <application>
    

    If using "cyclic distribution", one socket will run tasks [0,2,4,6] and the other socket tasks [1,3,5,7]. The srun command to use in this case would look like:

    srun -n 8 -m *:cyclic <application>
    
  • Continuing with example b) above, imagine your application is a hybrid MPI+OpenMP code and you want to run two MPI tasks in each socket, each task using two OpenMP threads. The option --cpus-per-task is very useful in this situation, because we can use the following command:

    srun -n 4 -m *:cyclic --cpus-per-task=2 <application>
    

    This way, one socket will run MPI tasks [0,2] and the other socket MPI tasks [1,3], while the corresponding OpenMP threads for each task will run in the same socket.

  • In the example above, we might want to force each thread to always run on the same hardware thread, or perhaps we want to allow threads to run on any hardware thread as long as they stay within the same core, or within the same socket, or even without any restrictions within the node. This is possible with the --cpu-bind option.

    Luckily, the default binding option is "autobind", which will be the right option in most cases. As per the Slurm documentation, "if the job step allocation includes an allocation with a number of sockets, cores, or threads equal to the number of tasks times cpus-per-task, then the tasks will by default be bound to the appropriate resources (auto binding)." Thus, in the example above, since we are requesting to use 8 CPUs with srun and the allocation has 8 hardware threads, Slurm will bind to "threads", meaning that each thread will not be moved to run on a different hardware thread by the operating system.

  • One common scenario for pure MPI applications in a system with HT enabled is to make sure that each MPI task uses a full core, as explained in example d) above. In this case, if we are not worried about how the individual tasks are distributed (by default it will be block distribution to sockets), auto-binding will do the right thing, and we would simply need to issue an srun command as follows:

    srun -n 8 <application>
    

    Since the allocation was created with the option --ntasks-per-core 1, each task will be scheduled to run in a different core, and thanks to "autobind" (we have an allocation of 8 cores and we are running 8 tasks) each task will be bound to a core (so it will be able to use either of the hardware threads, but always within the same core).