Submit script files

Introduction

SLURM is the batch system used at LaPalma, so all jobs must be run through it. This document provides the information needed to get started with job execution at LaPalma: it describes the most important options and gives some example submission script files, but we recommend you also check the SLURM Quick Start User Guide.

In order to keep the load on the login nodes under control, processes running interactively on these nodes are limited to 10 minutes of CPU time. Any execution that needs more than this must be carried out through the queue system (see this FAQ).

Queues (QOS)

Limits are assigned automatically to each particular user, depending on the resources granted by the Access Committee. In addition, every user may use the special debug queue to perform fast, short tests.

Queues        Max CPUs   Wall time limit
class_a       2400       72 hours
class_b       1200       48 hours
class_c       1200       24 hours
debug         64         30 min
interactive   1          1 hour

The specific limits assigned to each user depend on the priority granted by the Access Committee. Users granted high-priority hours have access to a maximum of 2400 CPUs and a maximum wall clock limit of 72 hours; for users with low-priority hours the limits are 1200 CPUs and 24 hours. If you need to increase these limits, please contact the support group.

  • class_a, class_b and class_c: Queues assigned by the Access Committee, where normal jobs are executed. No special directive is needed to use these queues; they are assigned automatically.

  • debug: This queue is reserved for testing applications before submitting them to the production queues. Only one job per user may run in this queue at a time, the execution time is limited to 30 minutes, and the maximum number of CPUs per application is 64. In addition, only a limited number of jobs may be running in this queue simultaneously. To use this queue, add a directive to your script file, or specify the queue when submitting without changing the script:

    #SBATCH --qos=debug
     - or -
    [lapalma1]$ sbatch --qos=debug script.sub
    
  • interactive: Jobs submitted to this queue will run on the interactive (login) node. It is intended for GUI applications that may exceed the interactive CPU time limit. Note that only sequential jobs are allowed. To use this queue, launch the following command from login1 (see this FAQ):

    [lapalma1]$ salloc -p interactive
    

Submission directives

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:

#SBATCH --directive=<value>

Some common directives also have a shorter version; you can use either form:

#SBATCH -d <value>

Additionally, the job script may contain a set of commands to execute. If not, an external script must be provided with the 'executable' directive. The most common directives are listed below (complete list here):

  • -J ...: Name of the job

  • --qos <queue_name>: The queue where the job is to be submitted. Leave this field empty unless you need to use the debug queue

  • -t ...: walltime (use format hh:mm:ss or days-hh:mm:ss)

  • -n ...: Number of tasks; this is the normal way to specify how many cores you want to use

  • -o /path/to/file_out: Redirect standard output (stdout) to file_out (use /dev/null to ignore this output)

  • -e /path/to/file_err: Redirect error output (stderr) to file_err (use /dev/null to ignore this output)

  • -D <directory>: Execution will be performed in the specified directory (if it is not set, current directory will be used)

Note

  • Walltime (the wall clock time limit) must be set, using the format HH:MM:SS or DD-HH:MM:SS (for example, -t 1-12:00:00 requests one day and 12 hours), to a value greater than the real execution time of your application; bear in mind that your job will be killed once the period you specified has elapsed. Shorter limits are likely to reduce the waiting time in the queue. If you do not specify any time limit, the maximum available in your assigned queue will be used.

  • To avoid overwriting the standard and error output files when you submit several jobs, add %j to the filenames so that the job ID is automatically included in them (see the examples below).

  • You can use the script idlenodes to see how many nodes are idle at that moment, which can be useful when deciding how many nodes to ask for in order to spend less time waiting in the queue (similarly, idlecores shows the number of idle cores, which is 16 times the number of idle nodes).

  • If you are running hybrid MPI+OpenMP applications, where each process spawns a number of threads, use --cpus-per-task=<number> to specify the number of CPUs allocated to each task (it must be an integer between 1 and 16, since each node has 16 cores), and then set the number of tasks per node accordingly with --ntasks-per-node=<ntasks> (and/or the number of tasks per core with --ntasks-per-core=<ntasks>, if needed). In this case it may also be useful to specify the total number of nodes with -N instead of the number of tasks with -n (see the sketch after this list).

  • Each node has 32 GB of memory, so when an application uses more than 1.7 GB of memory per process it is not possible to run 16 processes on the same node. In that case you can combine the --ntasks-per-node and --cpus-per-task directives to run fewer processes per node, so that each one has more memory available (some cores will stay idle, but they still count towards the total consumed time, so try to minimize the number of wasted cores).

  • Before submitting large jobs, please perform some short tests to make sure your program is running fine. When running your jobs, check outputs and logs from time to time, and cancel the job if the application fails.

  • There are many more options, such as specifying dependencies among jobs with the -d directive, giving a starting time with --begin, automatically requeueing a job if it fails with --requeue, etc. (see the complete list here and also this FAQ).
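
As an illustration of the hybrid and memory-related directives above, here is a minimal sketch. The program name myprogram_hybrid and the concrete numbers are only placeholders: the script asks for 4 nodes with 8 MPI tasks per node and 2 OpenMP threads per task, so each task has about 4 GB of memory available and all 16 cores of each node are used.

#!/bin/bash
#############################
#SBATCH -J test_hybrid
#SBATCH -N 4                    # total number of nodes
#SBATCH --ntasks-per-node=8     # 8 MPI tasks per node (about 4 GB of memory each)
#SBATCH --cpus-per-task=2       # 2 OpenMP threads per task (8 x 2 = 16 cores per node)
#SBATCH -t 02:00:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH -D .
#############################

module purge
module load gnu openmpi/gnu

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprogram_hybrid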

How to specify the submission options

You can specify these options in the command line:

[lapalma1]$ sbatch -J <job_name> -t <days-HH:MM:SS> <your_executable>

However, we highly recommend writing all the options and commands in a file (called a submission script file) so you can reuse it whenever needed. That file should contain the following sections (a minimal skeleton is shown after the list):

  1. The submission file must be an executable script (although no execute permission is needed) beginning with the line #!/bin/bash

  2. SLURM options (as many as needed): #SBATCH --directive=<value>

  3. Modules to be loaded (as many as needed). Your environment variables are stored when you submit your job and then used when the program is executed. This can be a problem if your environment at submission time is not the right one for executing your programs (for instance, the paths to executables or dynamic libraries are not set), so we recommend cleaning your environment first with module purge and then loading only the modules required by your program.

  4. Shell commands needed to run your application.
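
Putting these four parts together, a minimal skeleton could look like this (the job name, resources, modules and program name are only placeholders to be replaced by your own):

#!/bin/bash
# (1) the interpreter line above; (2) the SLURM options:
#SBATCH -J myjob
#SBATCH -n 32
#SBATCH -t 01:00:00

# (3) clean the environment and load only the modules your program needs
module purge
module load gnu openmpi/gnu

# (4) the commands that run your application
srun ./myprogram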

Once your script file is ready, you only need to use the following command to submit it to the queue:

[lapalma1]$ sbatch script_file

If you need more information about how to manage your jobs, check also the Useful Commands (executions) and the FAQs.

Environment variables

Although they are not needed in most situations, there are also some SLURM environment variables that you can use in your scripts:

Variable              Meaning
SLURM_JOBID           Specifies the job ID of the executing job
SLURM_NPROCS          Specifies the total number of processes in the job
SLURM_NNODES          The actual number of nodes assigned to run the job
SLURM_PROCID          Specifies the MPI rank (or relative process ID) of the current process; the range is from 0 to SLURM_NPROCS-1
SLURM_NODEID          Specifies the relative node ID of the current job; the range is from 0 to SLURM_NNODES-1
SLURM_LOCALID         Specifies the node-local task ID of the process within a job
SLURM_NODELIST        Specifies the list of nodes on which the job is actually running
SLURM_ARRAY_TASK_ID   Task ID inside the job array
SLURM_ARRAY_JOB_ID    Job array ID (the same for all the tasks in the array, equal to the SLURM_JOBID of the first task)
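
For instance, a job script could print some of these variables to the standard output to make debugging easier (just an illustrative fragment to place after the #SBATCH directives):

echo "Job $SLURM_JOBID: $SLURM_NPROCS tasks on $SLURM_NNODES nodes"
echo "Node list: $SLURM_NODELIST"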

Examples of submission script files

Attention

This section is obsolete, and we are moving the examples to the Slurm Sample batch scripts section.

Here you will find some example script files for different situations.

Note

If you copy and paste these examples, be careful: unwanted spaces may be added at the beginning of each line. Make sure that the lines containing parameters begin with #SBATCH and that there are no spaces before it.

Basic example (MPI)

Suppose you want to run your MPI program, called myprogram_mpi, using 64 cores (4 nodes), and that it should take about 5 hours (always add some extra time, because your application will be killed if it exceeds the wall time limit):

#!/bin/bash
#############################
#SBATCH -J test_mpi
#SBATCH -n 64
#SBATCH -t 05:30:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH -D .
#############################

module purge
module load gnu openmpi/gnu

mpirun ./myprogram_mpi

# Use these other options if your MPI program does not run properly
# srun  ./myprogram_mpi
# srun --mpi=pmi2 ./myprogram_mpi

Comments:

  • -n 64: this script will run the application myprogram_mpi on 64 cores (4 nodes)

  • -D .: The working directory will be the current one (where the submission was performed from)

  • -o and -e: Two output files will be created, one for the standard output (-o, extension .out) and another for the errors (-e, extension .err). Note that we have used %x, so these files are named after the job name specified with -J (test_mpi). We have also included the parameter %j in the file names, so the job ID is appended to them; this avoids overwriting the output files if the script is executed several times, since each execution has a different job ID (the ID is shown when you submit the script with sbatch, and you can also get it with squeue while the job has not finished yet). For instance, if your job name is test_mpi and the job ID is 1234, the files will be named test_mpi-1234.out and test_mpi-1234.err.

  • Remember that you cannot run MPI programs directly; you need to use srun or mpirun to execute them. If no extra arguments are added, the number of slots specified by the SBATCH parameters will be used, but you can also force a value using srun -n 20, srun -n $SLURM_NTASKS, etc. If you have problems running MPI programs (they do not initialize, or they are executed sequentially), change the command or options used to run your program: you can use one of mpirun, srun, srun --mpi=pmi2, etc.

  • Do not forget to load all the needed modules. For instance, if you want to execute VASP, you will need to run the following commands and then use mpirun to launch VASP:

    module purge
    module load intel mkl vasp
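    # A possible run line afterwards (the binary name vasp_std is only an
    # assumption; check the actual name provided by the vasp module):
    mpirun vasp_std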
    

Basic example (OpenMP)

Suppose you want to run your OpenMP program, called myprogram_omp (written in C or Fortran), using 16 slots (the maximum number of slots with shared memory available for OpenMP). This program should take about 30 minutes (always add some extra time, because your application will be killed if it exceeds the wall time limit):

#!/bin/bash
#############################
#SBATCH -J test_omp
#SBATCH -n 16
#SBATCH -t 00:45:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH -D .
#############################

module purge
module load gnu

export OMP_NUM_THREADS=16
./myprogram_omp

Comments:

Caution

Be sure to execute your OpenMP programs directly and do NOT use mpirun or srun (unless you have a hybrid MPI-OpenMP program), since using them would launch several identical instances of your OpenMP program.

  • Using this script, our application myprogram_omp will be executed with 16 slots on one node (setting OMP_NUM_THREADS to 16 is not strictly needed, since by default the number of tasks is used; if for any reason you want to run with a different number of threads, you can use this variable to set it).

  • The working directory will be the current one (where the submission was performed from).

  • -o and -e: Two output files will be created, one for the standard output (-o, extension .out) and another for the errors (-e, extension .err). Note that we have used %x, so these files are named after the job name specified with -J (test_omp). We have also included the parameter %j in the file names, so the job ID is appended to them; this avoids overwriting the output files if the script is executed several times, since each execution has a different job ID (the ID is shown when you submit the script with sbatch, and you can also get it with squeue while the job has not finished yet). For instance, if your job name is test_omp and the job ID is 1234, the files will be named test_omp-1234.out and test_omp-1234.err.

Job arrays

Job arrays and task generation can be used to run an application over different inputs.

Example of Job Array (parallel programs)

For instance, assume that you have 10 different input files (named input000.dat, input002.dat, input004.dat, ..., input018.dat) and you want to process each file with your MPI parallel program. Each execution will use 32 cores and should not take more than 1 hour to finish (we will add some extra time just to be sure). Then your script should be similar to the following one:

#!/bin/bash
##########################################################
#SBATCH -J test_MPI_jobsarray
#SBATCH -n 32
#SBATCH -t 0-1:10:00
#SBATCH --array=0-18:2
#SBATCH -o test_jobsarray-%A-%j-%a.out
#SBATCH -e test_jobsarray-%A-%j-%a.err
#SBATCH -D .
##########################################################

module purge
module load gnu openmpi/gnu

echo "#1 EXECUTING TASK ID: $SLURM_ARRAY_TASK_ID"
fmtID=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
srun ./mpi_program -i input$fmtID.dat

Let us explain the parameters that we have used:

  • -n 32 specifies that each task will be executed with 32 cores.

  • The --array parameter specifies that our job generates tasks. The task IDs are generated with the format --array=ini-end:step, so using 0-18:2 the IDs will be 0, 2, 4, ..., 18. Some more examples:

    • --array=1,4,5,8,12,65 will produce IDs 1, 4, 5, 8, 12, 65

    • --array=1-10 will produce IDs 1, 2, 3, ..., 10 (step can be omitted when its value is 1)

Important

Note that we are asking for 10 tasks x 32 cores = 320 cores. Only submit a large number of tasks when you are totally sure that everything is working fine. If you are still testing, submit only 2 or 3 short tasks to avoid wasting resources. If you need to submit a really large number of tasks, please consider limiting the maximum number of simultaneously running tasks. That can be done using the syntax --array=ini-end%limit or --array=ini-end:step%limit (for example, --array=0-18:2%3 runs at most 3 tasks at the same time).

  • We have used the environment variable $SLURM_ARRAY_TASK_ID to access the task ID. We converted the task ID assigned by SLURM (0, 2, 4, ..., 18) to the required format (000, 002, 004, ..., 018) using the bash printf builtin, storing the value in fmtID (note that the right syntax is fmtID=$(...), with no space before or after the "=" symbol).

  • We have named our files test_jobsarray-%A-%j-%a:

    • %A: The job array ID. It has a fixed value, the same one shown at job submission. You can access this value in your script through the environment variable $SLURM_ARRAY_JOB_ID.

    • %j: The job ID. The first task has the value given when submitting the job, and the following tasks have successive values incremented by one. You can access this value in your script through the environment variable $SLURM_JOBID.

    • %a: The task ID. It takes the values that you specified with the --array parameter. You can access this value in your script through the environment variable $SLURM_ARRAY_TASK_ID; this is the variable that you will typically use to select your input files or arguments.

For instance, if you submit the previous example and get the following message:

Submitted batch job 416

Then the generated files will be the following:

test_jobsarray-416-416-0.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 416, %a = $SLURM_ARRAY_TASK_ID = 0)
test_jobsarray-416-416-0.out
test_jobsarray-416-417-2.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 417, %a = $SLURM_ARRAY_TASK_ID = 2)
test_jobsarray-416-417-2.out
test_jobsarray-416-418-4.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 418, %a = $SLURM_ARRAY_TASK_ID = 4)
test_jobsarray-416-418-4.out
test_jobsarray-416-419-6.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 419, %a = $SLURM_ARRAY_TASK_ID = 6)
test_jobsarray-416-419-6.out
test_jobsarray-416-420-8.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 420, %a = $SLURM_ARRAY_TASK_ID = 8)
test_jobsarray-416-420-8.out
test_jobsarray-416-421-10.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 421, %a = $SLURM_ARRAY_TASK_ID = 10)
test_jobsarray-416-421-10.out
test_jobsarray-416-422-12.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 422, %a = $SLURM_ARRAY_TASK_ID = 12)
test_jobsarray-416-422-12.out
test_jobsarray-416-423-14.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 423, %a = $SLURM_ARRAY_TASK_ID = 14)
test_jobsarray-416-423-14.out
test_jobsarray-416-424-16.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 424, %a = $SLURM_ARRAY_TASK_ID = 16)
test_jobsarray-416-424-16.out
test_jobsarray-416-425-18.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 425, %a = $SLURM_ARRAY_TASK_ID = 18)
test_jobsarray-416-425-18.out

If you try the squeue command, you will see each task on a different line, and the job ID will be formed by two values, XX_YY, where XX is the job array ID and YY is the task ID. If needed, you can cancel all the tasks or just some of them. For instance, try:

[lapalma1]$ scancel 416
[lapalma1]$ scancel 416_2
[lapalma1]$ scancel 416_[6-8]
[lapalma1]$ scancel 416_8 416_16

For further information, check the job array documentation.