HOWTOs

LaPalma3 (4): Submit Script files

Please note that all the SIEpedia's articles address specific issues or questions raised by IAC users, so they do not attempt to be rigorous or exhaustive, and may or may not be useful or applicable in different or more general contexts.

IMPORTANT: This documentation is deprecated. It will not be further updated. The new documentation for LaPalma can be found here for external users or here if you are connected to IAC's internal network.

Submit script files

Introduction

SLURM is the utility used at LaPalma for batch processing support, so all jobs must be run through it. This document provides the information needed to get started with job execution at LaPalma. Here we describe the most important options and provide some examples of submission script files, but we also recommend that you check the SLURM Quick Start User Guide.

In order to keep the load on the login nodes reasonable, a 10-minute CPU time limit is set for processes running interactively on these nodes. Any execution taking longer than this limit should be carried out through the queue system (see this FAQ).

Queues (QOS)

Limits are assigned automatically to each particular user, depending on the resources granted by the Access Committee. In any case, you are allowed to use the special debug queue in order to perform some quick, short tests.

Queue        Max CPUs   Wall time limit
class_a      2400       72 hours
class_b      1200       48 hours
class_c      1200       24 hours
debug        64         30 min
interactive  1          1 hour


The specific limits assigned to each user depend on the priority granted by the access committee. Users granted high-priority hours will have access to a maximum of 2400 CPUs and a maximum wall clock limit of 72 hours. For users with low-priority hours the limits are 1200 CPUs and 24 hours. If you need to increase these limits, please contact the support group.

  • class_a, class_b and class_c: Queues assigned by the access committee, where normal jobs are executed. No special directive is needed to use these queues; they are assigned automatically.
  • debug: This queue is reserved for testing applications before submitting them to the production queues. Only one job per user is allowed to run simultaneously in this queue, the execution time is limited to 30 minutes, and the maximum number of CPUs per application is 64. Only a limited number of jobs may be running at the same time in this queue. To use it, add a directive to your script file, or specify the queue when submitting without changing the script:
   #SBATCH --qos=debug
    - or -
   [lapalma1]$ sbatch --qos=debug script.sub
  • interactive: Jobs submitted to this queue will run in the interactive (login) node. It is intended for GUI applications that may exceed the interactive CPU time limit. Note that only sequential jobs are allowed. To use this queue, launch the following command from login1 (see this FAQ):
   [lapalma1]$ salloc -p interactive

Submission directives

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:

  #SBATCH --directive=<value>

Some common directives also have a shorter version; you can use either form:

  #SBATCH -<short_directive> <value>
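
For instance, the following two lines are equivalent (-J is the short form of --job-name; the job name test_job is just an example):

  #SBATCH --job-name=test_job
  #SBATCH -J test_job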

Additionally, the job script may contain a set of commands to execute. If not, an external script must be provided with the 'executable' directive. Below you will find the most common directives (complete list here):

  • -J ...: Name of the job
  • --qos <queue_name>: The queue where the job is to be submitted. Leave this field empty unless you need to use the debug queue
  • -t ...: walltime (use format hh:mm:ss or days-hh:mm:ss)
  • -n ...: Number of tasks; this is the usual way to specify how many cores you want to use
  • -o /path/to/file_out: Redirect standard output (stdout) to file_out (use /dev/null to ignore this output)
  • -e /path/to/file_err: Redirect error output (stderr) to file_err (use /dev/null to ignore this output)
  • -D <directory>: Execution will be performed in the specified directory (if it is not set, current directory will be used)

NOTES:

  • Walltime (the limit of wall clock time) must be set, using the format HH:MM:SS or DD-HH:MM:SS, to a value greater than the real execution time of your application; bear in mind that your job will be killed once the period you specified has elapsed. Shorter limits are likely to reduce the waiting time in the queue. If you do not specify any time limit, the maximum available in your assigned queue will be used.
  • To avoid overwriting the standard and error output files when you submit several jobs, add %j to the filenames so the job ID is automatically included in them (see the examples below).
  • You can use the idlenodes script to find out how many nodes are idle at a given moment, which can be useful when deciding how many nodes to ask for in order to wait less time in the queue (you can also use idlecores to get the number of idle cores, which is 16 times the number of idle nodes).
  • If you are running hybrid MPI+OpenMP applications, where each process spawns a number of threads, use --cpus-per-task=<number> to specify the number of CPUs allocated to each task (an integer between 1 and 16, since each node has 16 cores), and then set the number of tasks per node accordingly with --ntasks-per-node=<ntasks> (and/or the number of tasks per core with --ntasks-per-core=<ntasks>, if needed). In this case it can also be useful to specify the total number of nodes with -N instead of the number of tasks with -n (see the sketch after these notes).
  • Each node has 32 GB of memory, so when an application uses more than 1.7 GB of memory per process, it is not possible to place 16 processes in the same node. In that case you can combine the --ntasks-per-node and --cpus-per-task directives to run fewer processes per node, so each of them has more available memory (some cores will stay idle, but they still count towards the total consumed time, so try to minimize the number of wasted cores).
  • Before submitting large jobs, please perform some short tests to make sure your program runs fine. While your jobs are running, check outputs and logs from time to time, and cancel the job if the application fails.
  • There are many more options, such as specifying dependencies among jobs with the -d directive, setting a start time with --begin, automatically requeueing a failed job with --requeue, etc. (see the complete list here and also this FAQ).
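
For the hybrid MPI+OpenMP case mentioned above, a minimal sketch of a job header could look like the following (the program name myprogram_hybrid and the requested sizes are only placeholders; adjust them to your application):

  #!/bin/bash
  #SBATCH -J test_hybrid
  #SBATCH -N 4                    # 4 nodes
  #SBATCH --ntasks-per-node=4     # 4 MPI tasks per node...
  #SBATCH --cpus-per-task=4       # ...each one running 4 OpenMP threads (4 x 4 = 16 cores per node)
  #SBATCH -t 10:00:00
  #SBATCH -o %x-%j.out
  #SBATCH -e %x-%j.err
  #SBATCH -D .

  module purge
  module load gnu openmpi/gnu

  # SLURM exports the value of --cpus-per-task in SLURM_CPUS_PER_TASK
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  srun ./myprogram_hybrid
  # As in the MPI example below, mpirun or srun --mpi=pmi2 may be needed instead of srun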



How to specify the submission options

You can specify these options in the command line:

   [lapalma1]$ sbatch -J <job_name> -t <days-HH:MM:SS> <your_executable>

But we highly recommend that you write all the commands in a file (called a submission script file) so you can reuse it when needed. That file should contain the following sections (a minimal skeleton is shown after this list):

  1. The submission file must be a shell script beginning with the line #!/bin/bash (no execute permission is needed on the file)
  2. SLURM options (as many as needed): #SBATCH --directive=<value>
  3. Modules to be loaded (as many as needed). Your environment variables are stored when you submit your job and then used when the program is executed. This can be a problem if your environment at submission time is not the proper one to run your programs (for instance, no path to executables or dynamic libraries is set), so we recommend that you begin by cleaning your environment with module purge and then load only the modules required by your program.
  4. Shell commands needed to run your application.
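
Putting those four parts together, a minimal skeleton could look like this (job name, resources and modules are placeholders; complete examples are given below):

   #!/bin/bash                   # 1. interpreter line
   #SBATCH -J my_job             # 2. SLURM options (as many as needed)
   #SBATCH -n 16
   #SBATCH -t 01:00:00

   module purge                  # 3. modules required by the program
   module load gnu

   ./my_program                  # 4. shell commands to run the application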


Once your script file is ready, you only need to use the following command to submit it to the queue:

   [lapalma1]$ sbatch script_file

If you need more information about how to manage your jobs, check also the Useful Commands (executions) and the FAQs.

Environment variables

Although they are not needed in most situations, there are also some SLURM environment variables that you can use in your scripts if you need them (a short usage sketch is shown after the table):

Variable             Meaning
SLURM_JOBID          Job ID of the executing job
SLURM_NPROCS         Total number of processes in the job
SLURM_NNODES         Actual number of nodes assigned to run the job
SLURM_PROCID         MPI rank (or relative process ID) of the current process, in the range 0 to SLURM_NPROCS-1
SLURM_NODEID         Relative node ID of the current job, in the range 0 to SLURM_NNODES-1
SLURM_LOCALID        Node-local task ID of the process within a job
SLURM_NODELIST       List of nodes on which the job is actually running
SLURM_ARRAY_TASK_ID  Task ID inside the job array
SLURM_ARRAY_JOB_ID   Job ID of the array (the same for all tasks of the array; equal to the SLURM_JOBID of the first task)
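
For instance, a minimal sketch of how some of these variables could be used inside a submission script (the program name myprogram_mpi is just a placeholder):

  # Print some information about the allocation at the beginning of the job
  echo "Job $SLURM_JOBID running $SLURM_NPROCS task(s) on $SLURM_NNODES node(s): $SLURM_NODELIST"

  # Explicitly pass the number of tasks to srun (normally not needed, since it is the default)
  srun -n $SLURM_NPROCS ./myprogram_mpi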



Examples of submission script files

Here you will find some examples of script files for different situations.

Note: If you copy and paste these examples, be careful because some unwanted spaces may be added at the beginning of each line: make sure that lines that contain parameters begin with #SBATCH and there are no spaces before these symbols.

Basic example (MPI)

You want to run your MPI program called myprogram_mpi using 64 cores (4 nodes), and it should take about 5 hours (always add some extra time, because your application will be killed if it exceeds this wall time limit):

  #!/bin/bash
  #############################
  #SBATCH -J test_mpi
  #SBATCH -n 64
  #SBATCH -t 05:30:00
  #SBATCH -o %x-%j.out
  #SBATCH -e %x-%j.err
  #SBATCH -D .
  #############################

  module purge
  module load gnu openmpi/gnu

  mpirun ./myprogram_mpi

  # Use these other options if your MPI program does not run properly
  # srun  ./myprogram_mpi
  # srun --mpi=pmi2 ./myprogram_mpi

Comments:

  • -n 64: this script will run the application myprogram_mpi on 64 cores (4 nodes)
  • -D . The working directory will be the current one (where the submission was performed from)
  • -o and -e: Two output files will be created, one for the standard output (-o, extension .out) and another one for the errors (-e, extension .err). Note that we have used %x, so those files will be named after the job name specified with -J (test_mpi). We have also included the parameter %j in the file names, so the job ID will be added to them; this avoids overwriting the output files if we execute this script several times, since each execution has a different job ID (this ID is shown when you submit the script using sbatch, and you can also get it using squeue while the job has not finished yet). For instance, if your job name was test_mpi and the job ID was 1234, the files will be named test_mpi-1234.out and test_mpi-1234.err.
  • Remember that you cannot run MPI programs directly; you need to use srun or mpirun to execute them. If no extra arguments are added, the number of slots specified by the #SBATCH parameters will be used, but you can also force a value using srun -n 20, srun -n $SLURM_NTASKS, etc. If you have problems running MPI programs (they do not initialize, or they are executed sequentially), change the command or options used to run your program: you can use one of mpirun, srun, srun --mpi=pmi2, etc.
  • Do not forget to load all the needed modules. For instance, if you want to execute VASP, you will need the following commands and then use mpirun to run VASP:
   module purge 
   module load intel mkl vasp 

Basic example (OpenMP)

You want to run your OpenMP program called myprogram_omp (written in C or Fortran) using 16 slots (this is the maximum number of slots available with shared memory to run OpenMP). This program should take about 30 minutes (always add some extra time, because your application will be killed if it exceeds the wall time limit):

  #!/bin/bash
  #############################
  #SBATCH -J test_omp
  #SBATCH -n 16
  #SBATCH -t 00:45:00
  #SBATCH -o %x-%j.out
  #SBATCH -e %x-%j.err
  #SBATCH -D .
  #############################

  module purge
  module load gnu

  export OMP_NUM_THREADS=16
  ./myprogram_omp

Comments:

  • CAUTION: Be sure to execute your OpenMP programs directly, and do NOT use mpirun or srun (unless you have a hybrid MPI-OpenMP application), since using them would launch several repeated instances of your OpenMP program.
  • Using this script, our application myprogram_omp will be executed with 16 slots in one node (setting OMP_NUM_THREADS to 16 is not really needed, since by default the number of tasks will be used; if for any reason you want to execute with a different number of threads, you can use this variable to set it).
  • The working directory will be the current one (where the submission was performed from).
  • -o and -e: Two output files will be created, one for the standard output (-o, extension .out) and another one for the errors (-e, extension .err). Note that we have used %x, so those files will be named after the job name specified with -J (test_omp). We have also included the parameter %j in the file names, so the job ID will be added to them; this avoids overwriting the output files if we execute this script several times, since each execution has a different job ID (this ID is shown when you submit the script using sbatch, and you can also get it using squeue while the job has not finished yet). For instance, if your job name was test_omp and the job ID was 1234, the files will be named test_omp-1234.out and test_omp-1234.err.

Jobs array

Job arrays and task generation can be used to run applications over different inputs, as you could do with GREASY in past versions of LaPalma.

VERY IMPORTANT: You have to be extremely cautious when using job arrays to run sequential programs. If you do not use a proper script file, you could end up running your sequential program on a complete node, using only one core while the remaining 15 cores are wasted; what is worse, the consumed time will be 16 times the real one.

Please choose the proper script depending on whether your programs are parallel or sequential.

Example of Job Array (parallel programs)

For instance, assume that you have 10 different input files (named input000.dat, input002.dat, input004.dat, ..., input018.dat) and you want to process each file with your parallel MPI program. Each execution will use 32 cores and should not take more than 1 hour to finish (we will add some extra time just to be sure). Then your script should be similar to the following one:

  #!/bin/bash
  ##########################################################
  #SBATCH -J test_MPI_jobsarray
  #SBATCH -n 32
  #SBATCH -t 0-1:10:00
  #SBATCH --array=0-18:2
  #SBATCH -o test_jobsarray-%A-%j-%a.out
  #SBATCH -e test_jobsarray-%A-%j-%a.err
  #SBATCH -D .
  ##########################################################

  module purge
  module load gnu openmpi/gnu

  echo "#1 EXECUTING TASK ID: $SLURM_ARRAY_TASK_ID"
  fmtID=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
  srun ./mpi_program -i input$fmtID.dat

Let us explain the parameters that we have used:

  • -n 32 specifies that each task will be executed with 32 cores.
  • With the --array parameter we specify that our job should generate tasks. The task IDs are generated with the format --array=ini-end:step, so using 0-18:2 the IDs will be 0, 2, 4, ..., 18. Some more examples:
    • --array=1,4,5,8,12,65 will produce IDs 1, 4, 5, 8, 12, 65
    • --array=1-10 will produce IDs 1, 2, 3, ..., 10 (step can be omitted when its value is 1)
    • IMPORTANT: Note that we are asking for 10 tasks x 32 cores = 320 cores. Only submit a large number of tasks when you are totally sure that everything is working fine. If you are still testing, submit only 2 or 3 short tasks to avoid wasting resources. If you need to submit a really large number of tasks, please consider limiting the maximum number of simultaneously running tasks. That can be done using the syntax --array=ini-end%limit or --array=ini-end:step%limit (see the example after this list).
  • We have used the environment variable $SLURM_ARRAY_TASK_ID to access the task ID. We have converted the original format of the task ID assigned by SLURM (0, 2, 4, ..., 18) to the required format (000, 002, 004, ..., 018) using the bash printf function, storing that value in fmtID (note that the right syntax is fmtID=$(...), with no space before or after the "=" symbol).
  • We have named our files test_jobsarray-%A-%j-%a:
    • %A: The job array ID: it has a fixed value, the one reported at job submission. You can access this value in your script through the environment variable $SLURM_ARRAY_JOB_ID.
    • %j: The job ID: the first task has the value given when submitting the job, and the following tasks have successive values incremented by one. You can access this value in your script through the environment variable $SLURM_JOBID.
    • %a: The task ID: it takes the values that you specified using the --array parameter. You can access this value in your script through the environment variable $SLURM_ARRAY_TASK_ID; this is the variable that you will typically use to select your input files or arguments.
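
As an illustration of the limit mentioned above, the array directive of this example could be written as follows (the limit of 3 is an arbitrary value chosen just for illustration):

  #SBATCH --array=0-18:2%3    # IDs 0, 2, 4, ..., 18, running at most 3 tasks at the same time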

For instance, if you submit the script above and get the following message:

  Submitted batch job 416

Then the generated files will be the following:

  test_jobsarray-416-416-0.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 416, %a = $SLURM_ARRAY_TASK_ID = 0)
  test_jobsarray-416-416-0.out
  test_jobsarray-416-417-2.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 417, %a = $SLURM_ARRAY_TASK_ID = 2)
  test_jobsarray-416-417-2.out
  test_jobsarray-416-418-4.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 418, %a = $SLURM_ARRAY_TASK_ID = 4)
  test_jobsarray-416-418-4.out
  test_jobsarray-416-419-6.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 419, %a = $SLURM_ARRAY_TASK_ID = 6)
  test_jobsarray-416-419-6.out
  test_jobsarray-416-420-8.err   #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 420, %a = $SLURM_ARRAY_TASK_ID = 8)
  test_jobsarray-416-420-8.out
  test_jobsarray-416-421-10.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 421, %a = $SLURM_ARRAY_TASK_ID = 10)
  test_jobsarray-416-421-10.out
  test_jobsarray-416-422-12.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 422, %a = $SLURM_ARRAY_TASK_ID = 12)
  test_jobsarray-416-422-12.out
  test_jobsarray-416-423-14.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 423, %a = $SLURM_ARRAY_TASK_ID = 14)
  test_jobsarray-416-423-14.out
  test_jobsarray-416-424-16.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 424, %a = $SLURM_ARRAY_TASK_ID = 16)
  test_jobsarray-416-424-16.out
  test_jobsarray-416-425-18.err  #(%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 425, %a = $SLURM_ARRAY_TASK_ID = 18)
  test_jobsarray-416-425-18.out



If you run the squeue command, you will see each task on a different line, and the job ID will be formed by two values, XX_YY, where XX is the job array ID and YY is the task ID. If needed, you can cancel all tasks or just some of them. For instance, try:

  [lapalma1]$ scancel 416
  [lapalma1]$ scancel 416_2
  [lapalma1]$ scancel 416_[6-8]
  [lapalma1]$ scancel 416_8 416_16

For further information, check the job array documentation.

Example of Jobs Array (sequential programs)

VERY IMPORTANT: You need to make sure that you are using all 16 cores of each node when using job arrays with sequential programs. If you have any doubts about this, please contact us before submitting your jobs, because a wrong submission file will execute only one job per node, so you may block a huge number of nodes (and use only one core in each of them, wasting the remaining 15).

For example, suppose we want to run our application "my_seq_program" with an integer argument from 1 to 64, each time also using an input file (input1.dat, input2.dat, ..., input64.dat). Then the script will be like the following one:

 #!/bin/bash 
 ##############################################################################
 #SBATCH -J test_SEQ_jobsarray
 #SBATCH -n 16
 #SBATCH -t 0-0:30:00
 #SBATCH -o test_jobsarray_seq-%A-%j-%a.out
 #SBATCH -e test_jobsarray_seq-%A-%j-%a.err
 #SBATCH -D .
 #SBATCH --array=0-3:1
 ##########################################################

  module purge
  module load gnu

 # Specify how many executions will be performed in each node
 # Use N=16 to run an execution per core
 N=16

 for ((i=$SLURM_ARRAY_TASK_ID*$N+1; i <= $SLURM_ARRAY_TASK_ID*$N+$N; i++))
 do
   # Run your program. Make sure you use the & symbol to run your executions
   # in background, distributing them among the cores.
   # Use $i to specify the params and arguments
   ./my_seq_program -var=$i -file=input$i.dat &

   # Wait for completion after last iteration ($i % $N == 0)
   if ! expr $i % $N > /dev/null
   then
     wait
   fi
 done

Notes:

  • Since we use N=16, each node will run 16 sequential executions, one per core. We therefore need 4 tasks to run the program 64 times; the 4 tasks are specified using the job array directive (--array=0-3:1).
  • We use the for loop to manage the executions using all the slots of each node; the value of each iteration is stored in $i, and we use it to specify arguments and input files:
    • Task 0 ($SLURM_ARRAY_TASK_ID=0) will execute iterations $i: from 1 to 16
    • Task 1 ($SLURM_ARRAY_TASK_ID=1) will execute iterations $i: from 17 to 32
    • Task 2 ($SLURM_ARRAY_TASK_ID=2) will execute iterations $i: from 33 to 48
    • Task 3 ($SLURM_ARRAY_TASK_ID=3) will execute iterations $i: from 49 to 64
  • Programs are run in the background (note the "&" symbol).
  • We use the wait command after the last execution of each task ($i % $N == 0) in order to wait for the completion of all background executions.
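
As a side note, the same wait condition can also be written with bash arithmetic, which may be easier to read (this is just an equivalent alternative to the expr test used in the script above):

 # Wait for all background executions once the last iteration of the batch is reached
 if (( i % N == 0 )); then
   wait
 fi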