FAQs about LaPalma
General topics
Preparing your executions
Where should I install programs common to all the members of my project group? And temporary data?
My application needs a special software (library, tool, ...) to run, is it available?
Should I be careful with the Input / Output over the parallel filesystem (Lustre)?
Should I be concerned about the big/little endian problem like when executing on La Palma2?
Running your jobs
Is there any way to make my jobs wait less in queue before running?
My job is finished but I see no output or it is not complete...
My jobs have some special needs (dependencies, should start at a specific time, ...)
How can I know how much CPU time (or other resources) have been consumed by my jobs?
Software
Responses
Q1: How can I get an account on LaPalma?
A: LaPalma is one of the thirteen supercomputers that belong to the Spanish Supercomputing Network (RES). To get an account on this machine:
IAC staff: Please, send us an email to res_support@iac.es and we will inform you
non-IAC staff: check conditions and information at the RES webpage
Q2: When connecting I'm required to enter a password that I don't have...
A: LaPalma uses a key-based authentication mechanism (see more info in connecting to LaPalma), so this machine will never ask you for a password.
If you get a message like "Enter passphrase for key...", this is not a password to connect to LaPalma, but the passphrase needed to access the private key on your local machine: enter the same passphrase you used when creating the ssh key pair.
If a password is requested when connecting to LaPalma, something is failing with the key. Most probably you are connecting from a different machine whose key is not stored on LaPalma. Before you can connect to LaPalma for the first time, we will ask you to send us the public key of your machine and we will store it, so that you can connect whenever you want from that computer. If later you want to connect from other machines, you can either send us the public keys of those machines or store them on LaPalma yourself. To do so, connect from an already authorized computer and add the new key(s) to the file ~/.ssh/authorized_keys (do not delete old keys if you still want to connect from those machines).
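For instance, assuming you have copied the new computer's public key to LaPalma as new_machine.pub (a hypothetical filename), you could append it like this:

```shell
# Append the new public key to the list of authorized keys:
cat new_machine.pub >> ~/.ssh/authorized_keys
# Keep restrictive permissions, otherwise sshd may ignore the file:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```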
Q3: Should I add an Acknowledgement text?
A: Yes. Please, add the following text to your publications if they are based on results obtained on LaPalma:
Acknowledgement
The author thankfully acknowledges the technical expertise and assistance provided by the Spanish Supercomputing Network (Red Española de Supercomputación), as well as the computer resources used: the LaPalma Supercomputer, located at the Instituto de Astrofísica de Canarias.
For IAC users, the use of LaPalma should also be acknowledged when filling in the "Informe Anual" of the projects. When introducing a refereed publication in the section "Produccion Cientifica", add the following as a used resource: "Supercomputing: LaPalma".
Q4: How can I get some help?
A: We are continuously updating this website according to the most common issues that our users may have. Please, read the other sections, like Introduction, Connecting, Useful Commands (preparations), Useful Commands (executions) and the examples of Script files.
If your question is not explained in this website, do not hesitate to send us an email. Also contact us with any issue, doubt, suggestion, etc. you may have: res_support@iac.es
Q5: How can I see how much free disk space I have?
A: All members of your group share the same quota, for both total disk space and maximum number of files. To check it, please execute the following command (your_group should be your username without the last three digits):
[lapalma1]$ lfs quota -hg your_group /storage
Q6: Where should I install programs common to all the members of my project group? And temporary data?
A: You should install programs accessible to all your group members in the directory /storage/projects/your_group.
For temporary data, you can use the /storage/scratch/your_group/your_username/ directory. While your jobs are running, they can also use the local hard disk of each node to store temporary data (access there will be faster, but that space is only available to your jobs while they are running).
Q7: Which compilers are available on LaPalma?
A: On LaPalma you can use the GNU and Intel compilers for sequential/OpenMP codes, and the OpenMPI compiler wrappers for parallel MPI codes. See the compiling section for more details and optimization options for each compiler.
Tip
Please, contact us if you are using the Intel compilers and have any issue with the license. There is no license for Intel Parallel Studio; however, the Intel MPI libraries and MKL are available, so most applications compiled with Intel compilers on other compatible systems should run with no problems on LaPalma.
Q8: What version of MPI is currently available at LaPalma?
A: At least OpenMPI 3.0.1 is installed on LaPalma and it has full MPI-3.1 standards conformance, so you should be able to compile your MPI-1, MPI-2 and MPI-3 applications with no problems on LaPalma.
Q9: My application needs a special software (library, tool, ...) to run, is it available?
A: To see the updated list of installed compilers, programs, libraries, tools, etc. and their versions, please use the following command:
[lapalma1]$ module avail
If the required software is in that list, you only need to load it using the module load <module_name> command (check also the useful commands).
If the software you need is not in that list (or you need another version), and you cannot install it locally, please, contact us.
Q10: I got some warnings/errors when loading software modules
A: Some modules have prerequisites and/or conflicts. For instance, to load OpenMPI/gnu you previously need to load gnu. If a module of OpenMPI is loaded, you cannot load any other version of OpenMPI until you unload the current module or switch it. You will receive warnings and hints when loading is not possible due to prerequisites or conflicts. Examples:
WARNING: openmpi/gnu/3.0.1 cannot be loaded due to missing prereq.
HINT: at least one of the following modules must be loaded first: gnu
WARNING: openmpi/intel/3.0.1 cannot be loaded due to a conflict.
HINT: Might try "module unload openmpi" first.
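Both warnings above can be avoided by loading the modules in the right order; for example (module names taken from the messages shown, adjust the versions to what `module avail` reports on the machine):

```shell
# Load the compiler first, then the MPI build that matches it:
module load gnu
module load openmpi/gnu/3.0.1

# To switch to another MPI build, unload the current one first:
module unload openmpi
module load intel
module load openmpi/intel/3.0.1
```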
Although it is very uncommon, after some failed attempts to load or unload modules you may receive a warning about a corrupt environment. At this point it is much safer to close your current shell and begin a new one, in order to work with a clean session.
Q11: Should I be careful with the Input / Output over the parallel filesystem (Lustre)?
A: The parallel filesystem can be a bottleneck when different processes of one job write to Lustre throughout the execution. For this kind of job, one possible way to improve performance is to copy the data needed by each job to the local scratch at the beginning and copy the results back to Lustre at the end (with this scheme, most of the I/O will be performed locally). This scheme is also recommended for massive sets of sequential jobs.
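The staging scheme described above can be sketched as a submit script. This is only an orientation: the node-local path under /tmp, the file names and my_program are hypothetical placeholders you should adapt to your case:

```shell
#!/bin/sh
#SBATCH -J staging_example
#SBATCH -t 01:00:00
#SBATCH -o staging_%j.out
#SBATCH -e staging_%j.err

# 1) Copy the input data from Lustre to the node-local disk
LOCALDIR=/tmp/$USER/$SLURM_JOB_ID      # hypothetical local scratch path
mkdir -p $LOCALDIR
cp /storage/scratch/your_group/your_username/input.dat $LOCALDIR

# 2) Run the program against the local copy (local I/O is faster)
cd $LOCALDIR
/storage/projects/your_group/my_program input.dat > output.dat

# 3) Copy the results back to Lustre and clean up the local disk
cp output.dat /storage/scratch/your_group/your_username/
rm -rf $LOCALDIR
```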
Also bear in mind that Lustre is an open-source, parallel file system that offers much better performance than NFS when working with large files and parallel access, since your files can be split into smaller pieces and stored in different Object Storage Targets (OSTs) to improve performance under parallel access. You can check and set the striping options of your files and/or directories inside Lustre with the lfs command (by default no striping is done, so you may want to set your own options). For instance, some basic examples (use man lfs or check the documentation for further details and more options):
[lapalma1]$ lfs osts /storage                  # List all available OSTs
[lapalma1]$ lfs getstripe <file or dir>        # Get striping options of a file or dir
[lapalma1]$ lfs setstripe -c -1 <file or dir>  # Stripe file or dir across all OSTs (if you specify a directory,
                                               # striping will be applied to new files, but not to existing ones).
                                               # -c specifies the stripe count (how many OSTs will be used;
                                               # -1 means all of them). You can also change the stripe size (-S)
                                               # or the stripe offset (-o), but it is not recommended.
There are several tips you can follow to achieve better performance; check the official Lustre documentation (Tutorials, Manual, Wiki, etc.). Some other institutions also publish valuable information about using Lustre file systems, like the Lustre Basics and Best Practices pages available at NAS (NASA Advanced Supercomputing).
Q12: Should I be concerned about the big/little endian problem like when executing on La Palma2?
A: No, you should not be worried about endianness when working with LaPalma3. This machine is built from Intel processors (little-endian architecture), so it is almost certain that they have the same endianness as the processors of your laptop or desktop PC (usually Intel or AMD). Only if you are transferring binary files from other computers with big-endian architectures (like PowerPC processors) may you have some problems related to endianness; contact us to solve that.
Q13: How can I know the number of idle nodes?
A: It could be useful to know the number of idle nodes before submitting your scripts, in order to reduce the waiting time in the queue. There are a couple of scripts that show information about how many nodes (and cores) are currently idle:
# Show number of idle nodes:
[lapalma1]$ idlenodes
# Show number of idle cores (basically 16 times the number of nodes):
[lapalma1]$ idlecores
Note
Information shown by these commands may lag the real current status of the queue by a few seconds
Q14: Can I execute my programs interactively?
A: When you open an ssh session on LaPalma, you are connected to the login node, which is shared by all users. In order to keep it at a proper load level so everyone can work on it, it is forbidden to run any heavy/parallel process on the login node. The login node should only be used to prepare the executions (compile your code, decompress the data, etc., as long as those tasks take less than 10 minutes), while all executions and long tasks must be carried out on the computing nodes through the queue system. Therefore you will need to create a job to run your applications, and it will be executed according to your priority when all the resources you need are available. See the executing your applications section on the Useful commands page and also the examples of script files to learn how to submit your jobs.
If for any reason you need to work interactively on the login node for a long while (for instance, to work on a visualization), then you must submit a job to the interactive queue, and you will be granted 1 hour of working time on a single cpu. Use the following command for this purpose, and remember to exit the interactive session once you are done:
[lapalma1]$ salloc -p interactive
Example:
# 1) Submit a job to the interactive queue and wait until it is allocated
[lapalma1]$ salloc -p interactive
salloc: Pending job allocation 1234
salloc: job 1234 queued and waiting for resources
salloc: job 1234 has been allocated resources
salloc: Granted job allocation 1234
salloc: Waiting for resource configuration
salloc: Nodes login1 are ready for job
# 2) Now you can work up to 1 hour on login1 (only for sequential tasks!)
[login1]$ ...
# 3) Once you are done, exit the job
[login1]$ exit
exit
salloc: Relinquishing job allocation 1234
# 4) You are again using normal mode, executions longer than 10min are not allowed
[lapalma1]$
Q15: I get an error when submitting my jobs...
A: When submitting your jobs you have to specify some mandatory parameters (see these examples); if any of these parameters is missing, you will receive an error when submitting your job.
If you receive a message like this...
sbatch: error: This does not look like a batch script. The first
sbatch: error: line must start with #! followed by the path to an interpreter.
sbatch: error: For instance: #!/bin/sh
... check that your script begins with #!/bin/sh and that there is no whitespace before the #! symbol (remove it if there is any)
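When in doubt, start from a minimal script whose very first characters are the interpreter line; everything below it (job name, limits, file names) is just illustrative:

```shell
#!/bin/sh
#SBATCH -J test_job        # job name
#SBATCH -o test_%j.out     # standard output file
#SBATCH -e test_%j.err     # standard error file
#SBATCH -t 00:10:00        # run-time limit (hh:mm:ss)
#SBATCH -n 1               # number of tasks

echo "Submission works"
```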
If you receive a message like this:
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
... you may be asking for resources that exceed the present limits. Remember there is a maximum number of cores you can use, a run-time limit, etc.
Q16: How can I see the status of my jobs in queue?
A: The following command will provide information about your jobs in the queues:
[lapalma1]$ squeue
To obtain detailed information about a specific job:
[lapalma1]$ scontrol show job <job_ID>
If you want to get more info about your job times (estimation of starting and ending times, used and remaining time, etc):
[lapalma1]$ jobtimes
Q17: I have submitted some jobs, but they never run...
A: If you submitted some jobs and they have been waiting in the queue for a long while, try the following steps:
Check that there are no problems with those jobs: use the squeue command to see your queue status and check that waiting jobs show PD in the STATE column. If some of them show F, there are problems that you need to fix (you can check the job state codes here).
Run squeue and check the NODELIST(REASON) column, where you will find information about why your job is not running. If you want further details, run the command "scontrol show job <job_ID>" and search for the string "JobState=PENDING Reason=XXX Dependency=...". The Reason field should tell you why your job is still waiting. You can check the complete list of job reason codes, but the most common ones are the following:
PartitionTimeLimit or PartitionNodeLimit: you are asking for more time or nodes than are available in the partition (queue). It is likely your job will never run; change the walltime or the number of nodes, respectively. For instance, if you are trying to run an 80-hour job when the maximum is 72 hours, you need to change either the walltime or the queue and set a valid one (remember you can edit your job after submission, there is no need to cancel and re-submit it).
Resources (or None): at this moment there are not enough free resources (nodes) to satisfy your job, so it will wait until the needed resources become available.
Dependency: this job depends on other job(s) that has/have not finished yet.
Priority: there are jobs with higher priority than yours.
AssociationJobLimit: if your group has a limitation on executions, that limit has been reached.
Q18: I get an error when running my jobs...
A: If your job is submitted but there are errors when it is executed, you should find some information in the error file (the one specified with the -e parameter). Please check that file to find out where the problems are located; most times they are related to:
Your submit files: check that all the SLURM options are correct. If you receive errors about an executable not being found (like mpirun: command not found or error: execve(): vasp: No such file or directory), missing dynamic libraries (*.so), etc., make sure you are loading the proper modules (we recommend you clean the environment using module purge and then load only the needed modules). Also check that you are specifying the right paths and/or commands to execute your program and locate the input/output files.
Your code: there are bugs, invalid operations, conditions that have not been considered, incorrect paths, etc.
The problem you are solving: the size is too big and you are asking for more memory / disk than is available, etc.
The system: there is a one-time problem with the file system or the network, etc.
If you see errors like Fatal error in MPI_Init or your MPI programs are not running in parallel, maybe you are not using the right commands. Depending on how the MPI program was compiled, it should work with one of the following commands:
mpirun program
or
srun program
or
srun --mpi=pmi2 program
Q19: How do I know the position of my jobs in queue?
A: You can use the following command, which shows information about the estimated time for the specified job to be executed (check the value of the StartTime field):
[lapalma1]$ scontrol show job <job_ID>
You can also try the following script, which will give you information about the times related to your jobs:
[lapalma1]$ jobtimes
Q20: Is there any way to make my jobs wait less in queue before running?
A: You should tune the #SBATCH -t wall_clock_limit directive to the expected job duration. This value is used by the scheduler to decide when to run your job, so shorter values are likely to reduce the waiting time. However, notice that a job that exceeds its wall_clock_limit will be cancelled, so it is recommended to keep a small safety margin.
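For instance, if previous executions of a job took around three and a half hours, a request like the following keeps a small safety margin without inflating the value the scheduler works with (the figure is only an illustration):

```shell
#SBATCH -t 04:00:00    # ~3h30m expected run time plus a ~30m safety margin
```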
Q21: I want to stop a running job, how can I do that?
A: If you want to stop a job that is already running, or remove a job from the queue while it is waiting, simply delete that job with the following command:
[lapalma1]$ scancel <job id>
Check also other useful commands when executing your jobs.
Q22: My job is finished but I see no output or it is not complete...
A: Output that is normally printed on screen (stdout and stderr) will be saved to files by the batch-queuing system once your jobs are done. The names and locations of those output files have to be specified in the script file. If you do not see these files or they are truncated, please check the following steps:
Use squeue to make sure that the job has already finished (it should not appear in the list). Once the job is finished, the system will begin to copy the output files from the nodes where your jobs were executed. That process could take a while, so you may need to wait some seconds or a few minutes until all your files are copied, depending on the number and size of your output files.
If your job is finished but you do not see the output files, check the parameters in the script to see where the files should have been created. Make sure that you have used the -o parameter for standard output and -e for errors. You can use absolute or relative paths (relative to the working directory, the one you specified using the -D parameter). Check that all paths exist and are the expected ones; maybe there is an error in the paths and your files were created in a different location.
Check that your jobs create files with different names, so that one job cannot overwrite the output files of other job(s). The easiest way to achieve this is adding %j somewhere in your filenames when using the -o and -e parameters, so your files will include the job number, which is different in every submission. (If you are running a jobs array, you will also want to add other values like %a and %A.)
Check the run-time limit (set with the -t parameter). If your application exceeds that limit, your program will be terminated by the system and your output will probably be truncated.
Check that your disk quota is not exceeded.
Check whether your application crashed (internal error, not enough memory, etc.). For instance, using commands like sstat or sacct you can get information such as the maximum memory used (check that this value is not close to the available memory).
Contact us if none of these steps solved your problem.
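The naming advice above can be combined in the submit script like this (file names are only illustrative):

```shell
#SBATCH -D .                 # working directory for relative paths
#SBATCH -o myjob_%j.out      # %j expands to the job ID
#SBATCH -e myjob_%j.err
# For a jobs array, include the master job ID (%A) and the task index (%a):
#     #SBATCH -o myarray_%A_%a.out
```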
Q23: My jobs have some special needs (dependencies, should start at a specific time, ...)
Dependencies
If some of your jobs depend on other one(s), you can specify the dependencies in the script file that you submit to the queue. For example, suppose you submit two jobs with IDs 12301 and 12302, and you want to submit a third job that should only begin if the two previous jobs were successfully executed. Then you need to add the following line to the script file of the third job:
#SBATCH -d afterok:12301:12302
or
#SBATCH --dependency=afterok:12301:12302
You can also specify the dependencies on the command line when submitting the job; this is easier, since you do not need to change the script file to specify the IDs:
[lapalma1]$ sbatch -d afterok:12301:12302 script.sub
Other events can be used when specifying dependencies, like jobs that will begin only if other jobs fail (afternotok), or when other jobs finish whether successfully or not (afterany), or after other jobs begin (after), etc. You can also specify a list of jobs for the dependencies, indicating whether all of them or just any one has to satisfy the dependency (see more options). Also bear in mind that dependencies can be modified after submission with the scontrol command.
Dependencies can also be used for very long jobs that exceed the time limit of all available queues. If the application you are running is able to generate checkpoints and resume the execution from those checkpoints, then several jobs can be submitted, forcing each job to wait until the previous ones have terminated (use -d singleton for this; a job will only begin when all previous jobs with the same job name and user have finished). You also need to prepare your program to generate checkpoints before the walltime of the queue is reached and the job is killed, and to make them available so that the next starting job can resume the execution from the last checkpoint.
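As a sketch, a long computation with checkpointing could be split into three submissions of the same script under a single job name (the script name checkpoint_run.sub and the job name long_run are hypothetical):

```shell
# Each job starts only when no earlier job with the same name and user remains:
[lapalma1]$ sbatch -J long_run -d singleton checkpoint_run.sub
[lapalma1]$ sbatch -J long_run -d singleton checkpoint_run.sub
[lapalma1]$ sbatch -J long_run -d singleton checkpoint_run.sub
```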
Deferral time
If your jobs need to begin at a certain time (maybe after the data is generated and automatically copied to LaPalma3), then you can use the --begin option and specify the time at which the job should start (if there are enough resources). The time can be absolute (--begin=21:30, --begin=2016-10-02T17:23:30) or relative (--begin=now+2hour or --begin=now+7200 to begin two hours after submission).
Q24: Can I automatically run on LaPalma many instances of my (parallel or serial) program with different input data?
A: If you have any parallel (or serial) program that has to be run over a large set of different data, it is possible to automate the executions (like using GREASY on LaPalma2). This is possible using the jobs array feature available in SLURM (the batch-queuing system); you can check some examples in the script file section.
Please, contact us to study your problem and help you with these executions.
Important
If you are using jobs array to execute sequential programs, make sure you are doing things properly and all cores of each node are being used. If your submit script is not correct, it is relatively easy to end up executing just one sequential program on each node, so 15 cores will be wasted per node. If you run this on many nodes, you could very quickly consume your assigned hours, wasting 94% of the resources that have been granted to you. So, please, test and double check your submit script before submitting it to the queue, and contact us if you have any doubt (you can use the sequential jobs array example listed in the script file section as a template).
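As an orientation only, a jobs array for sequential executions could look like the following sketch, where the indices, the input file names and my_program are hypothetical; compare it against the sequential jobs array example in the script file section before adapting it:

```shell
#!/bin/sh
#SBATCH -J seq_array
#SBATCH -n 1                 # one core per array task
#SBATCH -t 01:00:00
#SBATCH -o array_%A_%a.out   # separate output file per task
#SBATCH -a 1-64              # 64 array tasks, indices 1..64

# Each task selects its own input file through the array index:
/storage/projects/your_group/my_program input_${SLURM_ARRAY_TASK_ID}.dat
```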
Q25: How can I know how much CPU time (or other resources) have been consumed by my jobs?
A: You can use commands like sreport, sacct and/or sstat to know the CPU time that your jobs have consumed in a given period, both in total and detailed per job. Please, visit the useful commands section, where there are several examples of using these commands and their options.
Q26: How do I access Jupyter?
A: You can use the standalone version of Jupyter. Log in to LaPalma and get the numerical ID of your user:
[lapalma1]$ id -u
<numerical ID>
Load the python module and prerequisites:
[lapalma1]$ module load gnu/7.2.0 python/3.7.4 uv/1.33.1/gnu/7.2.0 icu/64.2/gnu/7.2.0
Request an interactive session with 1 core for the amount of time you are going to use Jupyter Notebook:
[lapalma1]$ salloc -n 1 --time <hh:mm:ss>
salloc: Pending job allocation <job ID>
salloc: job <job ID> queued and waiting for resources
salloc: job <job ID> has been allocated resources
salloc: Granted job allocation <job ID>
salloc: Waiting for resource configuration
salloc: Nodes <compute node> are ready for job
Get the hostname of the compute node and write it down:
[<user>@<compute node>]$ hostname
<compute node name>
Run Jupyter and copy the URL generated:
[<user>@<compute node>]$ XDG_RUNTIME_DIR="" jupyter-lab --no-browser --ip=127.0.0.1 --port=$(id -u)
http://127.0.0.1:<numerical_ID>?token=<token_generated>
Launch a new ssh session to LaPalma from your local machine, forwarding the port through the login node to the compute node:
[your machine]$ ssh -4 -t <user>@lapalma1.iac.es -L <numerical ID>:localhost:<numerical ID> ssh -4 <compute node name> -L <numerical ID>:localhost:<numerical ID>
Finally, from your machine, access the URL previously generated.