LaPalma3 (4): Submit Script files
Please note that all the SIEpedia's articles address specific issues or questions raised by IAC users, so they do not attempt to be rigorous or exhaustive, and may or may not be useful or applicable in different or more general contexts.
IMPORTANT: This documentation is deprecated. It will not be further updated. The new documentation for LaPalma can be found here for external users or here if you are connected to IAC's internal network.
Submit script files
Introduction
SLURM is the utility used at LaPalma for batch processing support, so all jobs must be run through it. This document provides the information needed to get started with job execution at LaPalma. We describe the most important options and provide some example submission script files, but we also recommend that you check the SLURM Quick Start User Guide.
In order to keep the load on the login nodes reasonable, a 10-minute CPU time limit is set for processes running interactively on these nodes. Any execution taking longer than this limit should be carried out through the queue system (see this FAQ).
Queues (QOS)
Limits are assigned automatically to each particular user (depending on the resources granted by the Access Committee). In any case, you are allowed to use the special debug queue to perform some fast, short tests.
Queues      | Max CPUs | Wall time limit
class_a     | 2400     | 72 hours
class_b     | 1200     | 48 hours
class_c     | 1200     | 24 hours
debug       | 64       | 30 min
interactive | 1        | 1 hour
The specific limits assigned to each user depend on the priority granted by the Access Committee. Users granted high-priority hours have access to a maximum of 2400 CPUs and a maximum wall clock limit of 72 hours. For users with low-priority hours the limits are 1200 CPUs and 24 hours. If you need to increase these limits, please contact the support group.
- class_a, class_b and class_c: Queues assigned by the Access Committee, where normal jobs are executed. No special directive is needed to use these queues; they are assigned automatically.
- debug: This queue is reserved for testing applications before submitting them to the production queues. Only one job per user is allowed to run simultaneously in this queue, the execution time is limited to 30 minutes, and the maximum number of CPUs per application is 64. Only a limited number of jobs may be running at the same time in this queue. To use it, add a directive to your script file, or specify the queue when submitting without changing the script:
#SBATCH --qos=debug
- or -
[lapalma1]$ sbatch --qos=debug script.sub
- interactive: Jobs submitted to this queue run on the interactive (login) node. It is intended for running GUI applications that may exceed the interactive CPU time limit. Note that only sequential jobs are allowed. To use this queue, launch the following command from login1 (see this FAQ):
[lapalma1]$ salloc -p interactive
Submission directives
A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:
#SBATCH --directive=<value>
Some common directives have a shorter version; you can use either form:
#SBATCH -d <value>
Additionally, the job script may contain a set of commands to execute. If not, an external script must be provided with the 'executable' directive. Here are the most common directives (complete list here):
- -J ...: Name of the job
- --qos <queue_name>: The queue where the job is to be submitted. Leave this field empty unless you need to use the debug queue
- -t ...: Walltime (use format hh:mm:ss or days-hh:mm:ss)
- -n ...: Number of tasks; this is the normal way to specify how many cores you want to use
- -o /path/to/file_out: Redirect standard output (stdout) to file_out (use /dev/null to ignore this output)
- -e /path/to/file_err: Redirect error output (stderr) to file_err (use /dev/null to ignore this output)
- -D <directory>: Execution will be performed in the specified directory (if it is not set, the current directory will be used)
NOTES:
- Walltime (the wall clock time limit) must be set, using the format HH:MM:SS or DD-HH:MM:SS, to a value greater than the real execution time of your application; bear in mind that your job will be killed after the period you specified. Shorter limits are likely to reduce the waiting time in the queue. If you do not specify any time limit, the maximum available in your assigned queue will be used.
- To avoid overwriting the standard and error output files when you submit several jobs, add %j to the filenames so the job ID is automatically included in them (see examples).
- You can use the script idlenodes to know the number of idle nodes at a given moment, which can be useful when deciding how many nodes to ask for in order to wait less time in the queue (you can also use idlecores to know the number of idle cores, that is, 16 times the number of idle nodes).
- If you are running hybrid MPI+OpenMP applications, where each process spawns a number of threads, use --cpus-per-task=<number> to specify the number of CPUs allocated to each task (it must be an integer between 1 and 16, since each node has 16 cores), and then set the number of tasks per node accordingly with --ntasks-per-node=<ntasks> (and/or the number of tasks per core with --ntasks-per-core=<ntasks>, if needed). In this case it can also be useful to specify the total number of nodes with -N instead of the number of tasks with -n (see the sketch after these notes).
- Each node has 32 GB of memory, so when an application uses more than 1.7 GB of memory per process it is not possible to run 16 processes on the same node. You can then combine the --ntasks-per-node and --cpus-per-task directives to run fewer processes per node, so each of them has more available memory (in this case some cores will stay idle, but they still count towards the total consumed time, so try to minimize the wasted cores).
- Before submitting large jobs, please perform some short tests to make sure your program runs fine. While your jobs are running, check outputs and logs from time to time, and cancel the job if the application fails.
- There are many more options, such as specifying dependencies among jobs with the -d directive, giving a starting time with --begin, automatically requeueing a job if it fails with --requeue, etc. (see the complete list here and also this FAQ).
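As an illustration of the hybrid MPI+OpenMP note above, here is a minimal sketch of a job header; the executable name myprogram_hyb is a placeholder and the task/thread counts are only an assumption to be adapted to your own application:

#!/bin/bash
#SBATCH -J test_hybrid
#SBATCH -N 4                   # 4 nodes
#SBATCH --ntasks-per-node=4    # 4 MPI tasks per node...
#SBATCH --cpus-per-task=4      # ...each one running 4 OpenMP threads (4 x 4 = 16 cores per node)
#SBATCH -t 01:00:00

module purge
module load gnu openmpi/gnu

# One OpenMP thread per CPU allocated to each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./myprogram_hyb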
How to specify the submission options
You can specify these options in the command line:
[lapalma1]$ sbatch -J <job_name> -t <days-HH:MM:SS> <your_executable>
But we highly recommend that you write all the commands in a file (called a submission script file) so you can reuse it when needed. That file should have the following sections:
- The submission file must be an executable script (although no execute permission is needed) beginning with the line #!/bin/bash
- SLURM options (as many as needed): #SBATCH --directive=<value>
- Modules to be loaded (as many as needed). Your environment variables are stored when submitting your job and then used when executing the program. This can be a problem if your environment at submission time is not the proper one to execute your programs (for instance, no paths to executables or dynamic libraries are set), so we recommend that you begin by cleaning your environment with module purge and then load only the modules required by your program.
- Shell commands needed to run your application (see the skeleton below).
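For instance, a minimal skeleton following this structure could look like the following; the job name, the gnu module and the program name my_program are placeholders to be replaced with your own:

#!/bin/bash
#SBATCH -J my_job
#SBATCH -n 16
#SBATCH -t 01:00:00
#SBATCH -o my_job-%j.out
#SBATCH -e my_job-%j.err
#SBATCH -D .

# Clean the environment and load only the required modules
module purge
module load gnu

# Shell commands that run the application
./my_program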
Once your script file is ready, you only need to run the following command to submit it to the queue:
[lapalma1]$ sbatch script_file
If you need more information about how to manage your jobs, check also the Useful Commands (executions) and the FAQs.
Environment variables
Although this is not needed in most situations, there are also some SLURM environment variables that you can use in your scripts if you need them (a short usage sketch is given after the table):
Variable            | Meaning
SLURM_JOBID         | Job ID of the executing job
SLURM_NPROCS        | Total number of processes in the job
SLURM_NNODES        | Number of nodes actually assigned to run your job
SLURM_PROCID        | MPI rank (or relative process ID) of the current process, in the range 0 to (SLURM_NPROCS - 1)
SLURM_NODEID        | Relative node ID of the current job, in the range 0 to (SLURM_NNODES - 1)
SLURM_LOCALID       | Node-local task ID of the process within a job
SLURM_NODELIST      | List of nodes on which the job is actually running
SLURM_ARRAY_TASK_ID | Task ID inside the job array
SLURM_ARRAY_JOB_ID  | Job ID of the array (the same for all jobs in the array, equal to the SLURM_JOBID of the first task)
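For instance, a small sketch of how these variables might be used inside a submission script (the results directory name is just an illustration):

# Log basic information about the allocation
echo "Job $SLURM_JOBID running $SLURM_NPROCS process(es) on $SLURM_NNODES node(s): $SLURM_NODELIST"

# Use the job ID to build a per-job output directory, avoiding clashes between runs
outdir="results_$SLURM_JOBID"
mkdir -p "$outdir"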
Examples of submission script files
Here you will find some example script files for different situations.
Note: If you copy and paste these examples, be careful because some unwanted spaces may be added at the beginning of each line: make sure that the lines containing parameters begin with #SBATCH and that there are no spaces before it.
Basic example (MPI)
You want to run your MPI program called myprogram_mpi using 64 cores (4 nodes), and it should take about 5 hours (always add some extra time, because your application will be killed if it exceeds this wall time limit):
#!/bin/bash
#############################
#SBATCH -J test_mpi
#SBATCH -n 64
#SBATCH -t 05:30:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH -D .
#############################

module purge
module load gnu openmpi/gnu

mpirun ./myprogram_mpi

# Use these other options if your MPI program does not run properly
#srun ./myprogram_mpi
#srun --mpi=pmi2 ./myprogram_mpi
Comments:
- -n 64: this script will run the application myprogram_mpi on 64 cores (4 nodes).
- -D .: The working directory will be the current one (the one the submission was performed from).
- -o and -e: Two output files will be created, one for the standard output (-o, extension .out) and another for errors (-e, extension .err). Note that we have used %x, so those files are named using the job name specified with -J (test_mpi). We have also included the parameter %j in the filenames, so the job ID is added to them in order to avoid overwriting the output files if we run this script several times, since each execution has a different job ID (this ID is shown when you submit the script with sbatch; you can also get it with squeue while the job has not finished yet). For instance, if your job name was test_mpi and the job ID was 1234, the files will be named test_mpi-1234.out and test_mpi-1234.err.
- Remember that you cannot run MPI programs directly; you need to use srun or mpirun to execute them. If no further arguments are added, the number of slots specified by the SBATCH parameters will be used, but you can also force the value using srun -n 20, srun -n $SLURM_NTASKS, etc. If you have problems running MPI programs (they do not initialize, or they are executed sequentially), change the command or options used to run your program; you can use one of the following: mpirun, srun, srun --mpi=pmi2, etc.
- Do not forget to load all needed modules. For instance, if you want to execute VASP, you will need the following commands and then use mpirun to run VASP:
module purge
module load intel mkl vasp
Basic example (OpenMP)
You want to run your OpenMP program called myprogram_omp (written in C or Fortran) using 16 slots (the maximum number of slots available with shared memory to run OpenMP). This program should take about 30 minutes (always add some extra time, because your application will be killed if it exceeds this wall time limit):
#!/bin/bash
#############################
#SBATCH -J test_omp
#SBATCH -n 16
#SBATCH -t 00:45:00
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH -D .
#############################

module purge
module load gnu

export OMP_NUM_THREADS=16
./myprogram_omp
Comments:
- CAUTION: Be sure you execute your OpenMP programs directly, and do NOT use mpirun or srun (unless you have a hybrid MPI-OpenMP program), since using them will run several repeated instances of your OpenMP program.
- Using this script, our application myprogram_omp will be executed with 16 slots in one node (setting OMP_NUM_THREADS to 16 is not really needed, since by default the number of tasks is used; if for any reason you want to execute with a different number of threads, you can use this variable to set it).
- The working directory will be the current one (the one the submission was performed from).
- -o and -e: Two output files will be created, one for the standard output (-o, extension .out) and another for errors (-e, extension .err). Note that we have used %x, so those files are named using the job name specified with -J (test_omp). We have also included the parameter %j in the filenames, so the job ID is added to them in order to avoid overwriting the output files if we run this script several times, since each execution has a different job ID (this ID is shown when you submit the script with sbatch; you can also get it with squeue while the job has not finished yet). For instance, if your job name was test_omp and the job ID was 1234, the files will be named test_omp-1234.out and test_omp-1234.err.
Jobs array
Jobs arrays and task generation can be used to run applications over different inputs, as you could do with GREASY in previous versions of LaPalma.
VERY IMPORTANT: You have to be extremely cautious when using jobs arrays to run your sequential programs. If you do not use the proper script file, you could end up running your sequential program on a complete node, using only one core while the remaining 15 cores are wasted and, what is worse, the consumed time will be 16 times the real one.
Please, choose the proper script depending on your programs (parallel or sequential).
Example of Job Array (parallel programs)
For instance, assume that you have 10 different input files (named input000.dat, input002.dat, input004.dat, ..., input018.dat) and you want to process each file with your MPI parallel program. Each execution will use 32 cores and should not take more than 1 hour to finish (we will add some extra time just to be sure). Then your script should be similar to the following one:
#!/bin/bash
##########################################################
#SBATCH -J test_MPI_jobsarray
#SBATCH -n 32
#SBATCH -t 0-1:10:00
#SBATCH --array=0-18:2
#SBATCH -o test_jobsarray-%A-%j-%a.out
#SBATCH -e test_jobsarray-%A-%j-%a.err
#SBATCH -D .
##########################################################

module purge
module load gnu openmpi/gnu

echo "#1 EXECUTING TASK ID: $SLURM_ARRAY_TASK_ID"
fmtID=$(printf "%03d" $SLURM_ARRAY_TASK_ID)
srun ./mpi_program -i input$fmtID.dat
Let us explain the parameters that we have used:
- -n 32 is used to specify that each task will be executed with 32 cores.
- With the --array parameter we specify that our job generates tasks. The task IDs are generated with the format --array=ini-end:step, so using 0-18:2 the IDs will be 0, 2, 4, ..., 18. Some more examples:
  - --array=1,4,5,8,12,65 will produce IDs 1, 4, 5, 8, 12, 65
  - --array=1-10 will produce IDs 1, 2, 3, ..., 10 (step can be omitted when its value is 1)
  - IMPORTANT: Note that we are asking for 10 tasks x 32 cores = 320 cores. Only submit a large number of tasks when you are totally sure that everything works fine. If you are still testing, submit only 2 or 3 short tasks to avoid wasting resources. If you need to submit a really large number of tasks, please consider limiting the maximum number of simultaneously running tasks. That can be done using the syntax --array=ini-end%limit or --array=ini-end:step%limit.
- We have used the environment variable $SLURM_ARRAY_TASK_ID to access the task ID. We have converted the original format of the task ID assigned by SLURM (0, 2, 4, ..., 18) to the required format (000, 002, 004, ..., 018) using the bash printf function, storing the value in fmtID (note that the right syntax is fmtID=$(...), with no space before or after the "=" symbol).
- We have named our files test_jobsarray-%A-%j-%a:
  - %A: The job array ID: it has a fixed value, the same as the one given at job submission. You can access this value in your script through the environment variable $SLURM_ARRAY_JOB_ID.
  - %j: The job ID: the first task will have the value given when submitting the job, and the following tasks will have successive values incremented by one. You can access this value in your script through the environment variable $SLURM_JOBID.
  - %a: The task ID: it will have the values that you specified using the --array parameter. You can access this value in your script through the environment variable $SLURM_ARRAY_TASK_ID; this is the variable that you will typically use to specify your input files or arguments.
For instance, if you submit the previous example and get the following message:
Submitted batch job 416
Then the generated files will be the following ones:
test_jobsarray-416-416-0.err    # (%A = $SLURM_ARRAY_JOB_ID = 416, %j = $SLURM_JOBID = 416, %a = $SLURM_ARRAY_TASK_ID = 0)
test_jobsarray-416-416-0.out
test_jobsarray-416-417-2.err    # (%A = 416, %j = 417, %a = 2)
test_jobsarray-416-417-2.out
test_jobsarray-416-418-4.err    # (%A = 416, %j = 418, %a = 4)
test_jobsarray-416-418-4.out
test_jobsarray-416-419-6.err    # (%A = 416, %j = 419, %a = 6)
test_jobsarray-416-419-6.out
test_jobsarray-416-420-8.err    # (%A = 416, %j = 420, %a = 8)
test_jobsarray-416-420-8.out
test_jobsarray-416-421-10.err   # (%A = 416, %j = 421, %a = 10)
test_jobsarray-416-421-10.out
test_jobsarray-416-422-12.err   # (%A = 416, %j = 422, %a = 12)
test_jobsarray-416-422-12.out
test_jobsarray-416-423-14.err   # (%A = 416, %j = 423, %a = 14)
test_jobsarray-416-423-14.out
test_jobsarray-416-424-16.err   # (%A = 416, %j = 424, %a = 16)
test_jobsarray-416-424-16.out
test_jobsarray-416-425-18.err   # (%A = 416, %j = 425, %a = 18)
test_jobsarray-416-425-18.out
If you try the squeue command, you will see each task on a different line, and the job ID will be formed by two values, XX_YY, where XX is the job array ID and YY is the task ID. If needed, you can cancel all tasks or just some of them. For instance, try:
[lapalma1]$ scancel 416
[lapalma1]$ scancel 416_2
[lapalma1]$ scancel 416_[6-8]
[lapalma1]$ scancel 416_8 416_16
For further information, check the jobs array documentation.
Example of Jobs Array (sequential programs)
VERY IMPORTANT: You need to make sure that you are using all 16 cores of each node when using jobs arrays with sequential programs. If you have doubts about this, please contact us before submitting your jobs, because a wrong submission file will execute only one job per node, so you may block a huge number of nodes (and use only one core in each of them, wasting the remaining 15).
For example, we want to run our application "my_seq_program" with an integer argument from 1 to 64, also using an input file (input1.dat, input2.dat, ..., input64.dat). Then the script will be like the following one:
#!/bin/bash
##########################################################
#SBATCH -J test_SEQ_jobsarray
#SBATCH -n 16
#SBATCH -t 0-0:30:00
#SBATCH -o test_jobsarray_seq-%A-%j-%a.out
#SBATCH -e test_jobsarray_seq-%A-%j-%a.err
#SBATCH -D .
#SBATCH --array=0-3:1
##########################################################

module purge
module load gnu

# Specify how many executions will be performed in each node
# Use N=16 to run one execution per core
N=16

for (( i=$SLURM_ARRAY_TASK_ID*$N+1; i <= $SLURM_ARRAY_TASK_ID*$N+$N; i++ ))
do
  # Run your program. Make sure you use the & symbol to run your executions
  # in background, distributing them among the cores.
  # Use $i to specify the params and arguments
  ./my_seq_program -var=$i -file=input$i.dat &

  # Wait for completion after the last iteration ($i % $N == 0)
  if ! expr $i % $N > /dev/null
  then
    wait
  fi
done
Notes:
- Since we use N=16, each node runs 16 sequential executions, one per core, so we need 4 tasks to run the program 64 times. The 4 tasks are specified using the jobs array (--array=0-3:1).
- We use the for loop to manage the executions using all the slots of each node; the value of each iteration is stored in $i and we use it to specify arguments and input files:
  - Task 0 ($SLURM_ARRAY_TASK_ID=0) will execute iterations $i from 1 to 16
  - Task 1 ($SLURM_ARRAY_TASK_ID=1) will execute iterations $i from 17 to 32
  - Task 2 ($SLURM_ARRAY_TASK_ID=2) will execute iterations $i from 33 to 48
  - Task 3 ($SLURM_ARRAY_TASK_ID=3) will execute iterations $i from 49 to 64
- Programs are run in the background (note the "&" symbol).
- We use the wait command after the last execution of each task ($i % $N == 0) in order to wait for completion.