LaPalma3 (3b): Useful commands
Please note that all the SIEpedia's articles address specific issues or questions raised by IAC users, so they do not attempt to be rigorous or exhaustive, and may or may not be useful or applicable in different or more general contexts.
IMPORTANT: This documentation is deprecated. It will not be further updated. The new documentation for LaPalma can be found here for external users or here if you are connected to IAC's internal network.
Executing your applications
Like other supercomputers, LaPalma3 uses a batch-queuing system (SLURM v21.08) to manage users' executions. To run your application you therefore have to submit it to the queue, specifying some parameters, and the system will execute it when resources allow. The most useful commands to manage your jobs are described below, but we also recommend checking the SLURM Quick Start User Guide.
We have gathered a list of useful commands when working with LaPalma3:
- Submitting jobs
- Checking the status of jobs
- Deleting jobs
- Modifying jobs (updating/holding/suspending)
- Other useful commands (consumed resources, status of queues, etc.)
Useful commands that you may need before executing your application (connecting to LaPalma3, transferring files, compiling, etc.) are listed at Useful Commands (preparations).
Submitting jobs
Submission is performed using the sbatch command, specifying information such as the number of processors, the parallel environment, the location of the executable file, etc. Although it is possible to give this information on the command line, the usual (and recommended) way is to write all parameters in a script file that is specified when submitting. To learn how to prepare your script files, please check these examples. Once your script file is ready, try the following command:
[lapalma1]$ sbatch <script_file>
When you submit the script file, your parameters will be checked and you will be informed if there are errors. If the job is accepted, you will receive some info, like the job id that you may need later to get further details about that job or even cancel it.
[lapalma1]$ sbatch mytest.sub
Submitted batch job 1234
# Your job id is 1234
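Because the job id is needed by most of the commands described later, it can be convenient to capture it at submission time. The following is only a minimal sketch: the function name extract_job_id is hypothetical, and it assumes the confirmation line has the form "Submitted batch job <id>" as in the example above.

```shell
# Extract the numeric job id from sbatch's confirmation line,
# e.g. "Submitted batch job 1234" -> "1234".
# (extract_job_id is an illustrative helper, not a LaPalma3 tool.)
extract_job_id() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Typical use on the login node (not executed here):
#   job_id=$(sbatch mytest.sub | extract_job_id)
#   echo "Submitted job ${job_id}"
```

The captured value can then be passed to commands such as scontrol show job or scancel.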
If you just want to check that your submission script has no errors, or to get an estimate of when the job would probably start, you can use the --test-only flag. It simulates the submission without actually performing it:
[lapalma1]$ sbatch --test-only mytest.sub
IMPORTANT: All parallel programs must be executed by the queue system. Do NOT attempt to run your parallel applications interactively on the login nodes. (see this FAQ)
There are a couple of scripts that show information about how many nodes (and cores) are idle at that moment. You might want to use that information when asking for resources:
# Show number of idle nodes:
[lapalma1]$ idlenodes
# Show number of idle cores (basically 16 times the number of nodes):
[lapalma1]$ idlecores
Note: Information shown by these commands could have a delay of some seconds in relation to the real current status of the queue
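Since the number of cores is 16 times the number of nodes, the number of nodes to request for a given number of tasks is a rounded-up division. The sketch below assumes one task per core; the names CORES_PER_NODE and nodes_needed are hypothetical helpers, not part of LaPalma3.

```shell
# Cores per node, taken from the "16 times the number of nodes" note above.
CORES_PER_NODE=16

# Print the smallest number of nodes that provides at least $1 cores
# (integer ceiling division; assumes one task per core).
nodes_needed() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Example: a job with 128 tasks.
nodes_needed 128
```

The result can be compared with the output of idlenodes to judge whether the job could start right away.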
Checking the status of jobs
You can check the status of your jobs using the following commands:
- Check status of jobs (you will see ONLY your own jobs)
[lapalma1]$ squeue
Using this command you will get useful information:
- JOBID: The ID of each job. This value is required if you want to perform operations on one of those jobs (ask for further details, cancel it, etc.). If you are using array jobs, JOBID will have the format XX_YY, where XX is the array job ID and YY is the task ID.
- PARTITION: The queue where a job is being or will be executed (usually express or batch).
- NAME: The name of the job, given by the -J parameter in the script file.
- USER: Owner of the job.
- ST: Status of the job. The most common are R for running and PD for pending, but there are many more possible states; you can check the job state codes here.
- TIME: Running time.
- NODES: Number of nodes that are being used or will be used (you can specify it using the -N parameter in your script file).
- NODELIST(REASON): If the job is running with no problems, it shows the list of nodes that are being used. If the job is not running, it shows a short description of the reason (you can check the complete list of job reason codes). The most common ones are:
  - PartitionTimeLimit or PartitionNodeLimit: you are asking for more time or nodes than are available in the partition (queue). Your job will most likely never run; change the walltime or the number of nodes, respectively.
  - Resources (or None): at the moment there are not enough free resources (nodes) to satisfy your job, so it will wait until the needed resources become available.
  - Dependency: this job depends on other job(s) that have not finished yet.
  - Priority: the system is running jobs with higher priority.
  - AssociationJobLimit: the global limit of hours might have already been reached.
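As an illustration of how these columns can be used, the sketch below counts jobs per state from squeue's output. It assumes the default column order listed above (so ST is the fifth column), a single header line, and job names without spaces; count_by_state is a hypothetical helper name.

```shell
# Count jobs per state (column 5, ST) from squeue's default output.
# Reads the listing on standard input and skips the header line.
# (Assumes job names contain no spaces, otherwise columns shift.)
count_by_state() {
    awk 'NR > 1 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use on the login node (not executed here):
#   squeue | count_by_state
```

Each output line contains a state code followed by the number of jobs in that state, for example one line for PD and one for R.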
The squeue command has many useful options (use man squeue to see all of them):
[lapalma1]$ squeue -t RUNNING    # List only my running jobs
[lapalma1]$ squeue -t PENDING    # List only my pending jobs
[lapalma1]$ squeue -r            # When running array jobs, list one per line
[lapalma1]$ squeue -o ...        # Specify the output format
[lapalma1]$ squeue -S ...        # Specify listing order
You can also use the jobtimes script, which displays timing information about your jobs, such as the estimated start and end times of pending jobs, or the total, used and remaining time of running jobs.
# Show times of your jobs:
[lapalma1]$ jobtimes
- Show status and info of the job with id <job_id>:
[lapalma1]$ scontrol show job <job_id>
There you will find detailed information about the job. If your job is not running, search for the text Reason= to find out why it is still pending. Also take a look at StartTime=, which gives an estimate of when your job could start. Adding -d or -dd will show more details when available.
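Because scontrol show job prints its information as whitespace-separated Key=Value pairs, single fields such as Reason or StartTime can be extracted with standard tools. This is only a sketch: job_field is a hypothetical helper, and it assumes the requested value contains no spaces.

```shell
# Print the value of one Key=Value field from "scontrol show job" output.
# $1 is the exact field name (e.g. Reason or StartTime); input is read
# from standard input. Only the first match is printed.
job_field() {
    tr -s ' ' '\n' |
        awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# Typical use on the login node (not executed here):
#   scontrol show job <job_id> | job_field Reason
#   scontrol show job <job_id> | job_field StartTime
```

Matching on the exact key avoids confusing similar names, for example StartTime and other fields that merely end in Time.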
[lapalma1]$ sstat -j <job_id>
With this command you can get a large set of information about the status of running jobs and the hardware resources they consume, such as CPU time, virtual memory size, I/O operations size, page faults, resident set size, etc. With sstat -e you get a complete list of parameters that can be displayed, and then you can use the -o ... or --format=... options to specify which one(s) you want to show (see man sstat for more details).
[lapalma1]$ sacct -j <job_id>
[lapalma1]$ sacct --format=JobID,JobName,NNodes,NCPUs,AllocCPUs,MAXRSS,Elapsed,TotalCPU,State -j <job_id>
This command displays accounting information about running or completed jobs (you can specify a time range). With sacct -e you get a complete list of parameters that can be displayed, and then you can use the -o ... or --format=... options to specify which one(s) you want to show (see man sacct for more details). This command is especially useful to monitor memory usage and to check that the job is not trying to use more memory than is available.
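When monitoring memory it can help to bring the MaxRSS values to a single unit. The sketch below converts values to MiB, assuming that sacct prints MaxRSS as a number followed by a K, M, G or T suffix based on powers of 1024 (and plain bytes when there is no suffix); please verify this against the actual output on LaPalma3. The name rss_to_mib is hypothetical.

```shell
# Convert memory values such as "2048K", "512M" or "1.5G" (one per line
# on standard input) to MiB. Empty lines are skipped.
# Assumption: suffixes are powers of 1024; no suffix means bytes.
rss_to_mib() {
    awk 'NF == 0 { next }
    {
        value = $1 + 0                      # leading numeric part
        unit = substr($1, length($1))       # last character
        if (unit == "K")      value /= 1024
        else if (unit == "M") value *= 1
        else if (unit == "G") value *= 1024
        else if (unit == "T") value *= 1024 * 1024
        else                  value /= 1024 * 1024
        printf "%.1f\n", value
    }'
}

# Typical use on the login node (not executed here):
#   sacct -j <job_id> -n -P -o MaxRSS | rss_to_mib
```

The -n and -P options of sacct remove the header and the column padding, so that each line holds only the value.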
Deleting jobs
- Remove running or waiting jobs (you need to be the owner of those jobs):
[lapalma1]$ scancel <job_id>
# Examples:
[lapalma1]$ scancel 1234         # Cancel job 1234
[lapalma1]$ scancel 123[4-6]     # Cancel jobs 1234, 1235 and 1236
[lapalma1]$ scancel 1234 1236    # Cancel jobs 1234 and 1236
Modifying jobs (updating/holding/suspending)
- If, after submitting a job, you need to change some of the options you specified (TimeLimit, Partition, etc.), you can cancel the job, edit the script and resubmit it, or you can update the job with the scontrol command. Use the same option names as those displayed when running scontrol show (they are case-insensitive). Not all options can be adjusted after submission, and some of them also depend on the current state of the job:
[lapalma1]$ scontrol update JobID=<job_id> <option>=<value>
# Examples:
[lapalma1]$ scontrol update JobID=1234 TimeLimit=02:00:00    # Update job 1234: set new walltime to 2 hours
[lapalma1]$ scontrol update JobID=1234 Partition=express     # Update job 1234: set new queue to express
- Sometimes you may be interested in holding or suspending some jobs (you need to be the owner of those jobs; <job_id> can be one id or a list of them):
[lapalma1]$ scontrol hold <job_id>           # Hold a pending job
[lapalma1]$ scontrol release <job_id>        # Release a previously held job
[lapalma1]$ scontrol suspend <job_id>        # Suspend a running job
[lapalma1]$ scontrol resume <job_id>         # Resume a previously suspended job
[lapalma1]$ scontrol requeue <job_id>        # Cancel a running job and queue it again
[lapalma1]$ scontrol requeuehold <job_id>    # Cancel a running job and hold it
Other useful commands (consumed resources, status of queues, etc.)
Resources
At the end of the output log of your jobs, a report about resource usage is automatically provided. It states how much memory and CPU time have been used compared to what has been allocated for the job. For example:
######################## JOB EFFICIENCY REPORT ########################
# JobID: <jobid>
# Cluster: lapalma3
# User/Group: xxxxxx/xxxxx
# Cores: 128
# Nodes: 8
# CPU Utilized: 1851.82 CPU-hours
# Wall-clock time: 0 days 14:28:53
# CPU Efficiency: 99.90 %
# Memory Utilized: 19.45 GB
# Memory Efficiency: 1.09 % of 1792.00 GB
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# If your job has low CPU Efficiency or you have doubts about setting
# up a job, do not hesitate and contact us:
# res_support@iac.es
#######################################################################
Typical jobs should have a very high CPU efficiency (>90%), and the memory usage can vary depending on the application.
NOTE: A low CPU usage (<10%) is likely caused by a misconfiguration of the submission script!
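The efficiency figures in the example report can be reproduced by hand, which may help when interpreting your own reports. The sketch below assumes that CPU efficiency is the CPU-hours used divided by (cores x wall-clock hours), and that memory efficiency is the memory used divided by the memory allocated; these definitions are an assumption that happens to match the numbers in the example, and the function names are hypothetical.

```shell
# CPU efficiency in percent.
# $1 = CPU-hours used, $2 = number of cores, $3 = wall-clock time HH:MM:SS
cpu_efficiency() {
    awk -v used="$1" -v cores="$2" -v wall="$3" 'BEGIN {
        split(wall, t, ":")
        hours = t[1] + t[2] / 60 + t[3] / 3600
        printf "%.2f\n", 100 * used / (cores * hours)
    }'
}

# Memory efficiency in percent.
# $1 = memory used (GB), $2 = memory allocated (GB)
mem_efficiency() {
    awk -v used="$1" -v total="$2" 'BEGIN {
        printf "%.2f\n", 100 * used / total
    }'
}

# Values taken from the example report above:
cpu_efficiency 1851.82 128 14:28:53    # prints 99.90
mem_efficiency 19.45 1792.00           # prints 1.09
```

Jobs that run longer than one day would need the days part of the wall-clock time added to the hours.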
In addition, you can use commands like sreport, sacct and sstat to see how many resources have been consumed by your jobs (time, memory, etc.). Some examples:
- See how many hours you have used in a given period (for example, in March 2018):
[lapalma1]$ sreport -t hour cluster UserUtilizationByAccount Start=2018-03-01T00:00:00 End=2018-03-31T23:59:59
- See how much CPU time has been consumed by each of your jobs in a given period (for example, in March 2018):
[lapalma1]$ sacct -T -X -D -S 2018-03-01T00:00:00 -E 2018-03-31T23:59:59 -o JobID,JobName,NCPUs,Submit,Start,End,CPUTime -s running
There are many options to show and format the results; try man sacct to get more information, or sacct -e to display the complete list of fields. For instance, the last command shows time in the format HH:MM:SS, but you can use CPUTimeRaw instead of CPUTime to get the time in seconds, which can be useful if you want to perform calculations with it. The sstat command also displays useful information, but it only works while jobs are running.
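As an example of such a calculation, the sketch below adds up CPUTimeRaw values (in seconds) and prints the total in hours. The function name sum_cpu_hours is hypothetical, and the input is assumed to be one number per line.

```shell
# Sum a column of CPU times in seconds (one value per line on standard
# input) and print the total in hours with two decimals.
sum_cpu_hours() {
    awk '{ total += $1 } END { printf "%.2f\n", total / 3600 }'
}

# Typical use on the login node (not executed here); -n drops the header
# and -P removes the column padding:
#   sacct -X -n -P -S 2018-03-01T00:00:00 -E 2018-03-31T23:59:59 -o CPUTimeRaw | sum_cpu_hours
```

The result can be compared with the hours reported by sreport for the same period.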
- See the utilization and fairshare:
[lapalma1]$ sshare