Useful Commands (executions)
Like other supercomputers, LaPalma3 uses a batch-queuing system (SLURM v21.08) that manages users' jobs. Therefore, to run your application you have to send it to the queue, specifying some parameters, and the system will execute it when possible. Here are the most useful commands to manage your jobs, but we also recommend you check the SLURM Quick Start User Guide.
We have gathered a list of useful commands when working with LaPalma3:
Useful commands that you may need before executing your application (connecting to LaPalma3, transferring files, compiling, etc.) are listed at Useful Commands (preparations).
Submitting jobs
Submission is performed using the sbatch command, specifying information such as the number of processors, the parallel environment, the location of the executable file, etc. Although it is possible to specify this information on the command line, the usual (and recommended) way is to write all parameters in a script file that is specified when submitting. To learn how to prepare your script files, please check these examples.
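As a minimal sketch of such a script (the job name, output file names, resource values and application name below are placeholders, not site defaults):

```shell
#!/bin/bash
#SBATCH -J mytest            # Job name (placeholder)
#SBATCH -N 2                 # Number of nodes (placeholder value)
#SBATCH -n 32                # Total number of tasks (placeholder value)
#SBATCH -t 02:00:00          # Walltime limit (HH:MM:SS)
#SBATCH -o mytest-%j.out     # Standard output file (%j expands to the job ID)
#SBATCH -e mytest-%j.err     # Standard error file

# Launch the parallel application through the queue system
srun ./my_parallel_app
```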
Once your script file is ready, try the following command:
[lapalma1]$ sbatch <script_file>
When you submit the script file, your parameters will be checked and you will be informed of any errors. If the job is accepted, you will receive some information, such as the job ID, which you may need later to get further details about the job or to cancel it.
[lapalma1]$ sbatch mytest.sub
Submitted batch job 1234 # Your job id is 1234
If you just want to check that your submission script has no errors, or get an estimate of when it will probably be executed, you can use the --test-only flag. That simulates the submission without actually performing it:
[lapalma1]$ sbatch --test-only mytest.sub
Important
All parallel programs must be executed through the queue system. Do NOT attempt to run your parallel applications interactively on the login nodes (see this FAQ).
There are a couple of scripts that show how many nodes (and cores) are idle at a given moment. You might want to use that information when asking for resources:
# Show number of idle nodes:
[lapalma1]$ idlenodes
# Show number of idle cores (basically 16 times the number of nodes):
[lapalma1]$ idlecores
Note
The information shown by these commands may lag the real current status of the queue by a few seconds.
Checking the status of jobs
You can check the status of your jobs using the following commands:
Check status of jobs (you will see ONLY your own jobs)
[lapalma1]$ squeue
Using this command you will get useful information:
- JOBID: the ID of each job; you will need this value to perform operations on a job (ask for further details, cancel it, etc.). If you are using array jobs, then JOBID will have the format XX_YY, where XX is the array job ID and YY is the task ID.
- PARTITION: the queue where the job is being or will be executed (usually express or batch).
- NAME: the name of the job, given by the -J parameter in the script file.
- USER: the owner of the job.
- ST: the status of the job. The most common are R (running) and PD (pending), but there are many more possible states; you can check the job state codes here.
- TIME: the running time.
- NODES: the number of nodes that are being or will be used (you can specify it with the -N parameter in your script file).
- NODELIST(REASON): if the job is running without problems, this shows the list of nodes being used. If the job is not running, it shows a short description of the reason (you can check the complete list of job reason codes); the most common ones are:
  - PartitionTimeLimit or PartitionNodeLimit: you are asking for more time or nodes than are available in the partition (queue). Your job will likely never run; change the walltime or the number of nodes, respectively.
  - Resources (or None): there are currently not enough free resources (nodes) to satisfy your job, so it will wait until the needed resources become available.
  - Dependency: this job depends on other job(s) that have not finished yet.
  - Priority: the system is running jobs with higher priority.
  - AssociationJobLimit: the global limit of hours may already have been reached.
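As an illustration, a squeue listing with the fields above might look like this (the job IDs, names and node names are hypothetical):

```
JOBID  PARTITION  NAME    USER   ST  TIME   NODES  NODELIST(REASON)
1234   batch      mytest  user1  R   12:34  2      lapalma[21-22]
1235   batch      mytest  user1  PD  0:00   4      (Resources)
```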
The squeue command has many useful options (use man squeue to see all of them):
[lapalma1]$ squeue -t RUNNING   # List only my running jobs
[lapalma1]$ squeue -t PENDING   # List only my pending jobs
[lapalma1]$ squeue -r           # When running array jobs, list one task per line
[lapalma1]$ squeue -o ...       # Specify the output format
[lapalma1]$ squeue -S ...       # Specify the listing order
You can also use the jobtimes script, which displays information about the times of your jobs, such as the estimated start and end times of pending jobs, or the total, used and remaining time of running jobs:
# Show times of your jobs:
[lapalma1]$ jobtimes
Show the status and information of the job with ID <job_id>:
[lapalma1]$ scontrol show job <job_id>
You will find detailed information about that job there. If your job is not being executed, search for the text Reason= to find out why it is still pending. Also take a look at StartTime=, where you can find an estimate of when your job could be executed. Adding -d or -dd will show more details when available.
[lapalma1]$ sstat -j <job_id>
With this command you can get a large set of information about the status of running jobs and the hardware resources consumed, such as CPU time, virtual memory size, I/O operation sizes, page faults, resident set size, etc. With sstat -e you get the complete list of parameters that can be displayed; you can then use the -o ... or --format=... options to specify which one(s) you want to show (see man sstat for more details).
[lapalma1]$ sacct -j <job_id>
[lapalma1]$ sacct --format=JobID,JobName,NNodes,NCPUs,AllocCPUs,MAXRSS,Elapsed,TotalCPU,State -j <job_id>
This command displays accounting information about running or completed jobs (you can specify a time range). With sacct -e you get the complete list of parameters that can be displayed; you can then use the -o ... or --format=... options to specify which one(s) you want to show (see man sacct for more details). This command is especially useful to monitor memory usage and to check that the job is not trying to use more than the available memory.
Deleting jobs
Remove running or waiting jobs (you need to be the owner of those jobs):
[lapalma1]$ scancel <job_id>
# Examples:
[lapalma1]$ scancel 1234        # Cancel job 1234
[lapalma1]$ scancel 123[4-6]    # Cancel jobs 1234, 1235 and 1236
[lapalma1]$ scancel 1234 1236   # Cancel jobs 1234 and 1236
Modifying jobs (updating/holding/suspending)
If after submitting a job you need to change some of the options you specified (TimeLimit, Partition, etc.), you can cancel the job, edit the script and re-submit it, or you can update the job with the scontrol command (use the same option names as those displayed when running scontrol show; they are case-insensitive. Not all options are adjustable after submission, and some of them also depend on the current state of the job):
[lapalma1]$ scontrol update JobID=<job_id> <option>=<value>
# Examples:
[lapalma1]$ scontrol update JobID=1234 TimeLimit=02:00:00   # Update job 1234: set new walltime to 2 hours
[lapalma1]$ scontrol update JobID=1234 Partition=express    # Update job 1234: set new queue to express
Sometimes you may be interested in holding or suspending some jobs (you need to be the owner of those jobs; <job_id> can be a single ID or a list of them):
[lapalma1]$ scontrol hold <job_id>          # Hold a pending job
[lapalma1]$ scontrol release <job_id>       # Release a previously held job
[lapalma1]$ scontrol suspend <job_id>       # Suspend a running job
[lapalma1]$ scontrol resume <job_id>        # Resume a previously suspended job
[lapalma1]$ scontrol requeue <job_id>       # Cancel a running job and queue it again
[lapalma1]$ scontrol requeuehold <job_id>   # Cancel a running job and hold it
Other useful commands (consumed resources, status of queues, etc.)
Resources
At the end of the output log of your jobs, a report about resource usage is automatically provided. It states how much memory and CPU time have been used compared to what has been allocated for the job. For example:
######################## JOB EFFICIENCY REPORT ########################
# JobID: <jobid>
# Cluster: lapalma3
# User/Group: xxxxxx/xxxxx
# Cores: 128
# Nodes: 8
# CPU Utilized: 1851.82 CPU-hours
# Wall-clock time: 0 days 14:28:53
# CPU Efficiency: 99.90 %
# Memory Utilized: 19.45 GB
# Memory Efficiency: 1.09 % of 1792.00 GB
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# If your job has low CPU Efficiency or you have doubts about setting
# up a job, do not hesitate and contact us:
# res_support@iac.es
#######################################################################
Typical jobs should have a very high CPU efficiency (>90%), while memory usage can vary depending on the application.
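The percentages in the report follow directly from its other fields: CPU efficiency is the CPU time used divided by cores × wall-clock time, and memory efficiency is memory used divided by memory allocated. A quick check with the numbers from the sample report above, using standard shell arithmetic and awk:

```shell
#!/bin/sh
# Recompute the two efficiency figures from the sample report above.
cores=128
cpu_hours=1851.82                          # "CPU Utilized" in CPU-hours
wall_seconds=$(( 14*3600 + 28*60 + 53 ))   # wall-clock time 0 days 14:28:53

# CPU Efficiency = CPU-hours used / (cores * wall-clock hours)
cpu_eff=$(awk -v c="$cores" -v u="$cpu_hours" -v w="$wall_seconds" \
    'BEGIN { printf "%.2f", 100 * u / (c * w / 3600) }')

# Memory Efficiency = memory used / memory allocated
mem_eff=$(awk 'BEGIN { printf "%.2f", 100 * 19.45 / 1792.00 }')

echo "CPU Efficiency: ${cpu_eff} %"      # 99.90, as in the report
echo "Memory Efficiency: ${mem_eff} %"   # 1.09, as in the report
```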
Warning
A low CPU usage (<10%) is likely caused by a misconfiguration of the submission script!
Note
Due to limitations in the way Slurm measures CPU time, when a job does not
finish gracefully (e.g., it is cancelled with scancel or evicted after the end
of the allocated time), the efficiency report is not trustworthy.
In that case, a warning message is shown instead.
In addition, you can use commands like sreport, sacct and sstat to see how many
resources have been consumed by your jobs (time, memory, etc.). Some examples:
See how many hours you have used in a given period (for example, in March 2018):
[lapalma1]$ sreport -t hour cluster UserUtilizationByAccount Start=2018-03-01T00:00:00 End=2018-03-31T23:59:59
See how much CPU time has been consumed by each of your jobs in a given period (for example, in March 2018):
[lapalma1]$ sacct -T -X -D -S 2018-03-01T00:00:00 -E 2018-03-31T23:59:59 -o JobID,JobName,NCPUs,Submit,Start,End,CPUTime -s running
Tip
There are many options to show and format the results; try man sacct to get more information, or sacct -e to display the complete list of fields. For instance, the last command shows times in HH:MM:SS format, but you can use CPUTimeRaw instead of CPUTime to get the time in seconds, which can be useful if you want to perform calculations with it. The sstat command also displays useful information, but it only works while jobs are running.

See the utilization and fairshare:
[lapalma1]$ sshare