Useful Commands (executions)
Shutdown notice
LaPalma was shutdown on September 1st to upgrade its hardware. We expect the system to be back online in December.
Like other supercomputers, on LaPalma3 there is a Batch-queuing System (SLURM v21.08) that manages user's executions. Therefore, to run your application you have to send it to the queue specifying some parameters and the system will execute it when possible. Here there are the most useful commands to manage your jobs, but we recommend you check the SLURM Quick Start User Guide.
We have gathered a list of useful commands when working with LaPalma3:
Useful commands that you may need before executing your application (connecting to LaPalma3, transferring files, compiling, etc.) are listed at Useful Commands (preparations).
Submitting jobs
Submission is performed using sbatch command specifying some
information like number of processors, parallel environment, location of
the executable file, etc. Although it is possible to specify this
information in the command line, the usual (and recommended) way is to
write all parameters in a script file that will be specify when
submitting. To know how to prepare your script files, please, check
these examples.
Once your script file is ready, try next command:
[lapalma1]$ sbatch <script_file>
When you submit the script file, your parameters will be checked and you will be informed if there are errors. If the job is accepted, you will receive some info, like the job id that you may need later to get further details about that job or even cancel it.
[lapalma1]$ sbatch mytest.sub
Submitted batch job 1234 # Your job id is 1234
If you just want to check if your submit script has no errors, or know
an estimation about when it will be probably executed, you can use
--test-only flag. That will simulate the submission, but it will not
be performed:
[lapalma1]$ sbatch --test-only mytest.sub
Important
All parallel programs must be executed by the queue system. Do NOT attempt to run your parallel applications interactively on the login nodes. (see this FAQ)
There are a couple of scripts that show information about how many nodes (and cores) are idle at that moment. You might want to use that information when asking for resources:
# Show number of idle nodes:
[lapalma1]$ idlenodes
# Show number of idle cores (basically 16 times the number of nodes):
[lapalma1]$ idlecores
Note
Information shown by these commands could have a delay of some seconds in relation to the real current status of the queue
Checking the status of jobs
You can check the status of jobs using next commands:
Check status of jobs (you will see ONLY your own jobs)
[lapalma1]$ squeue
Using this command you will get useful information:
JOBID: The ID of each job, this value will be required if you want to do some operations with one of those jobs (ask for further details, cancel it, etc.). If you are using array jobs, thenJOBIDwill have formatXX_YY, whereXXis the array job ID andYYis the task IDPARTITION: The queue were a job is being or will be executed (it will be usuallyexpressorbatch)NAME: The name of the job, given by-Jparameter in the script fileUSER: Owner of that jobST: Status of the job, the most commons areRfor running andPDfor pending, but there are many more possible status, you can check job state codes hereTIME: running timeNODES: number of nodes that are being used or will be used (you can specify it using-Nparameter in your script file)NODELIST(REASON): if the job is being executed with no problems, it will show the list of nodes that are being. If the job is not running, it will show a short description of the reason (you can check the complete list of job reasons codes), but the most common are the following ones:PartitionTimeLimitorPartitionNodeLimit: you are asking for more time or nodes than the available in the partition (queue). It is likely your job will never run, change the walltime or the number of nodes, respectively.Resources(orNone): at this moment there are not enough free resources (nodes) to satisfy your job, so it will wait till the needed resources get availableDependency: this job depends on other job(s) that has not finished yetPriority: the system is running jobs with higher priorityAssociationJobLimit: Global limit of hours might have been already reached
squeuecommand has many useful options (useman squeueto see all of them):[lapalma1]$ squeue -t RUNNING # List only my running jobs [lapalma1]$ squeue -t PENDING # List only my pending jobs [lapalma1]$ squeue -r # When running arrayjobs, list one per line [lapalma1]$ squeue -o ... # Specify the output format [lapalma1]$ squeue -S ... # Specify listing order
You can also use
jobtimesscript that will display info about times of your jobs, like estimation about starting and ending time for pending jobs; or total, used and remaining time of the running jobs.# Show times of your jobs: [lapalma1]$ jobtimes
Show status and info of the job with id
<job_id>[lapalma1]$ scontrol show job <job_id>
You will find there detailed information about that job. If your job is not being executed, search for text
Reason=to know why it is still pending. Also take a look toStartTime=where you can find an estimation about when your job could be executed. Adding-dor-ddwill show more details when available.[lapalma1]$ sstat -j <job_id>
With this command you can get a large set of information about the status of running jobs and the consumed hardware resources, like: CPU time, Virtual Memory size, I/O operations size, page faults, Resident Set size, etc. With
sstat -eyou get a complete list of parameters that can be displayed, and then you can use-o ...or--format=...options to specify which one(s) you want to show (seeman sstatfor more details).[lapalma1]$ sacct -j <job_id> [lapalma1]$ sacct --format=JobID,JobName,NNodes,NCPUs,AllocCPUs,MAXRSS,Elapsed,TotalCPU,State -j <job_id>
This command displays information about the accounting of [running or complete jobs]{.underline} (you can specify a time range). With
sacct -eyou get a complete list of parameters that can be displayed, and then you can use-o ...or--format=...options to specify which one(s) you want to show (seeman sacctfor more details). This command is specially useful to monitor the Memory usage, and check whether the job is not trying to use more than the available memory.
Deleting jobs
Remove running or waiting jobs (you need to be the owner of those jobs):
[lapalma1]$ scancel <job_id> # Examples: [lapalma1]$ scancel 1234 # Cancel job 1234 [lapalma1]$ scancel 123[4-6] # Cancel jobs 1234, 1235 and 1236 [lapalma1]$ scancel 1234 1236 # Cancel jobs 1234 and 1236
Modifying jobs (updating/holding/suspending)
If after submitting a job you need to change some of the options you have specified (TimeLimit, Partition, etc.), you can cancel the job, edit the script and re-submit it again, or you can update the job with command
scontrol(use the same name of options as those displayed when runningscontrol show, case-insentive. Not all options are adjustable after submission, some of them also depends on the current state of the job):[lapalma1]$ scontrol update JobID=<job_id> <option>=<value> # Examples: [lapalma1]$ scontrol update JobID=1234 TimeLimit=02:00:00 # Update job 1234: set new walltime to 2 hours [lapalma1]$ scontrol update JobID=1234 Partition=express # Update job 1234: set new queue to express
Sometimes you may be interested in holding/suspending some jobs (you need to be the owner of those jobs,
<job_id>could be one id or a list of them):[lapalma1]$ scontrol hold <job_id> # Hold a pending job [lapalma1]$ scontrol release <job_id> # Release a previously held job [lapalma1]$ scontrol suspend <job_id> # Suspend a running job [lapalma1]$ scontrol resume <job_id> # Resume a previously suspended job [lapalma1]$ scontrol requeue <job_id> # Cancel a running job and queue it again [lapalma1]$ scontrol requeuehold <job_id> # Cancel a running job and hold it
Other useful commands (consumed resources, status of queues, etc.)
Resources
At the end of the output log of your jobs, a report about resource usage is automatically provided. It states how much memory and CPU time have been used compared to what has been allocated for the job. For example:
######################## JOB EFFICIENCY REPORT ########################
# JobID: <jobid>
# Cluster: lapalma3
# User/Group: xxxxxx/xxxxx
# Cores: 128
# Nodes: 8
# CPU Utilized: 1851.82 CPU-hours
# Wall-clock time: 0 days 14:28:53
# CPU Efficiency: 99.90 %
# Memory Utilized: 19.45 GB
# Memory Efficiency: 1.09 % of 1792.00 GB
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# If your job has low CPU Efficiency or you have doubts about setting
# up a job, do not hesitate and contact us:
# res_support@iac.es
#######################################################################
Typical jobs should have a very high CPU efficiency (>90%), and the memory usage can vary depending on the application.
Warning
A low CPU usage (<10%) is likely caused by a misconfiguration of the submission script!
Note
Due to limitations in the way slurm measures the CPU time, when a job does not
finish gracefully (e.g., scancel or evicted after the end of the allocated time)
the effiency report is not trustworthy.
Thus, a warning message is shown instead.
In addition, you can use commands like sreport, sacct and sstat to see how much
resources have been consumed by your jobs (time, memory, etc.). Some
examples:
See how many hours you have used in a given period (for example, in March 2018):
[lapalma1]$ sreport -t hour cluster UserUtilizationByAccount Start=2018-03-01T00:00:00 End=2018-03-31T23:59:59
See how much CPU time has been consumed by each of your jobs in a given period (for example, in March 2018):
[lapalma1]$ sacct -T -X -D -S 2018-03-01T00:00:00 -E 2018-03-31T23:59:59 -o JobID,JobName,NCPUs,Submit,Start,End,CPUTime -s running
Tip
There are many options to show and format the results, try
man sacctto get more information orsacct -eto display the complete list of fields (for instance, last command will show time with formatHH:MM:SS, but you can useCPUTimeRawinstead ofCPUTimeto get the time in seconds, it could be useful if you want to perform some operations). Commandsstatalso displays useful information, but it only works when jobs are running.See the utilization and fairshare:
[lapalma1]$ sshare