TeideHPC

Introduction

TeideHPC (Teide High Performance Computing) is a supercomputer (like our machine LaPalma) located at the Instituto Tecnológico y de Energías Renovables (ITER, S.A.). It was the second most powerful supercomputer in Spain, appearing in 169th position (June 2014) on the Top500 list of the world's most powerful computers.

It is composed of 1,100 Fujitsu servers with a total of 17,800 computing cores (Intel Sandy Bridge processors, which provide excellent performance together with great energy efficiency), 36 TB of memory, a high-performance network and a NetApp parallel storage system. According to the Top500 it has a theoretical peak performance of 340.8 TFLOPs, with a maximal LINPACK performance of 274.0 TFLOPs.

TeideHPC has the following types of compute nodes:

Type  Quantity  Processors                    Cores  Memory      GPUs
----  --------  ----------------------------  -----  ----------  ---------------
CPU   > 500     2 x Intel Xeon E5-2670        16     32-64 GB    -
CPU   72        2 x Intel Xeon E5-2670v2      20     32-64 GB    -
CPU   72        2 x Intel Xeon E5-4620        32     128-256 GB  -
GPU   16        2 x Intel Xeon Gold 6338 32C  64     256 GB      4 Nvidia A100
GPU   1         2 x Intel Xeon Gold 6338 32C  64     256 GB      8 Nvidia A100
GPU   4         2 x Intel Xeon Gold 6338 32C  64     256 GB      8 Nvidia T4

For storage, it offers the following services:

  • NetApp storage with a net capacity of 2.6 petabytes, configured as a cluster with fully redundant elements to withstand possible hardware failures, and with global spare disks following best practices.

  • Lustre parallel storage is also available for applications requiring a high number of I/O operations.

Tip

Visit the TeideHPC website and documentation for further details.

Using TeideHPC as IAC researchers

To get an account on TeideHPC, you first need to fill in an application form with some general information about your project and the executions you plan to run. To get your account data, please copy and fill in the following form and send it to us by email (res_support@iac.es):

===========================================
=         TeideHPC Application Form       =
===========================================

1. Title of project to be carried out at the TeideHPC computer.

2. Brief description of the research project (recommended maximum: 300 words).

3. Outline the computational algorithms and codes to be used at the
TeideHPC computer. Explain the parallelization capabilities of the code(s)
and any scaling test performed on them.

4. Do you require any specific software and numerical libraries to be
installed in TeideHPC?

5. List the names and email of the members of your research team that
should have access to the TeideHPC computer. If they are external to
the IAC, indicate their affiliation.

6. Resources requested:
    a) estimated min. and max. number of processors needed for each job
    b) average job duration in hours
    c) estimated number of jobs to submit
    d) estimated total time needed (in CPU hours) (*)
    e) total "data" space (in Gigabytes)
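For item 6(d), a rough way to estimate the total time needed is cores per job x walltime x number of jobs. As an illustration with made-up numbers: 100 jobs, each using 32 cores for 12 hours, would need about 100 x 32 x 12 = 38,400 CPU hours.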

How can I connect to TeideHPC?

After requesting your account and receiving the connection data, follow the TeideHPC instructions to connect to their VPN and then to log in to the machines.
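As a minimal sketch, once the VPN is up you log in with SSH using the credentials you received (the hostname below is a hypothetical placeholder; use the one given in your connection data):

    # Log in to a TeideHPC login node over the VPN.
    # "login.teidehpc.example" is a hypothetical placeholder hostname.
    ssh your_username@login.teidehpc.example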

Which resources can I use?

There are some limitations imposed by the hardware and by the resources assigned to the project we use to run our applications (iac):

  • Time limit (walltime): You can use the express queue for executions of up to 3 hours, the batch queue for up to 24 hours, or the long queue for up to 72 hours (see the example job script after this list). There are also some other queues; please contact us if you need more time.

  • Max. number of cores: Since Oct 2015 there is a limit of 150 concurrent nodes (2,400 cores), shared among all IAC users who are running jobs at the same time. The number of free nodes may change depending on how many of them are switched on (some nodes may be off due to power-saving policies). You can check the maximum available cores at any moment with the command scontrol show part, where the PartitionName field shows the name of the queue and TotalCPUs the number of cores assigned to it.

  • Software: Use the command module avail to see which software is available and module load to load it (see the TeideHPC documentation).

  • Parallel environments: MPI (mpi) and OpenMP (omp).

  • Memory: Normal compute nodes have 32 GB of total memory, shared among all slots (normally 16 cores) and the OS. This means the usual amount of memory per process is slightly less than 2 GB (you can use fewer cores per node to get up to 32 GB per process, although the unused cores will still count towards the total consumed time). There are some nodes with 64 GB and a few with larger amounts of RAM, up to 256 GB (please contact us if your application requires a very large amount of memory).

  • Disk: By default you have the following storage capacity (per user):

    • home: 5 GB.

    • data: 1 TB (networked SATA); access it through the data link in your home directory (~/data).

    • lustre: 1 TB, to be used as scratch space during execution (do NOT use it for long-term storage); access it through the lustre link in your home directory (~/lustre).

    • scratch: not active by default; contact us if you need it.

    • local: 300 GB on the local HDD of each node (for temporary files if needed; all data is automatically erased when the execution finishes).

    • Note: there are NO backup policies.
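The queues are managed with SLURM (the scontrol command above is part of it), so the queue, core, memory and disk limits listed here map directly onto job script directives. The following is a minimal sketch only; the job name, module name and program are hypothetical examples, not TeideHPC specifics:

    #!/bin/bash
    #SBATCH --job-name=myjob          # hypothetical job name
    #SBATCH --partition=batch         # express (3 h), batch (24 h) or long (72 h)
    #SBATCH --time=24:00:00           # walltime, within the queue's limit
    #SBATCH --nodes=4                 # 4 nodes x 16 cores = 64 cores
    #SBATCH --ntasks-per-node=16      # use all 16 cores of a standard 32 GB node

    # Load the software you need ("module avail" lists what is installed;
    # "openmpi" is a hypothetical example name).
    module load openmpi

    # Run from the Lustre scratch link, not from your home directory.
    cd ~/lustre/myrun
    srun ./my_mpi_program

You can check the available partitions and their assigned cores beforehand with scontrol show part, as described above.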

Note

These limits are the default ones. If your programs need more resources (time, memory, disk space, etc.), contact us sending an email to res_support@iac.es.

Global time limit

Since Feb 2015 there is a global time limit of 87,500 hours per week that affects all IAC researchers' jobs. Once this limit is reached, no jobs are killed, but queued or new jobs will not start until the next Saturday at 20:00. Since Apr 2015 the unused hours from one week are added to the following week, but the limit is always reset to 87,500 h in the first week of each month.

Fair usage policy

Important

In the IAC TeideHPC users' meeting that took place on 7th October 2015, we all agreed to limit the total number of hours of each user's jobs, to prevent one or just a few users from using up all the available time. This means that each user can submit any number of jobs as long as they do not exceed 12,000 hours in total.

This limit of 12,000 hours per user was chosen based on the number of concurrently active users who have submitted jobs over the last year, and we will use it as a starting point; it may change in the future and we will inform you if so. The limit applies to the requested time of all jobs submitted to the queue by each user, regardless of their status (pending, running, etc.); a sketch for checking your current total is shown below.
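As a rough sketch (assuming SLURM's squeue and its default [D-]HH:MM:SS time-limit format; jobs with an UNLIMITED time limit are not handled), you can sum the requested CPU-hours of your pending and running jobs like this:

    # Sum the requested CPU-hours of all your queued and running jobs,
    # to compare against the 12,000 h limit. %C = CPUs, %l = time limit.
    squeue -h -u "$USER" -o '%C %l' | awk '
      {
        n = split($2, t, /[-:]/)          # time limit: [D-]HH:MM[:SS]
        d = 0; h = 0; m = 0; s = 0
        if (n == 4)      { d = t[1]; h = t[2]; m = t[3]; s = t[4] }
        else if (n == 3) { h = t[1]; m = t[2]; s = t[3] }
        else if (n == 2) { m = t[1]; s = t[2] }
        total += $1 * (d*24 + h + m/60 + s/3600)
      }
      END { printf "Total requested CPU-hours: %.0f\n", total }'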

What to do if you exceed the limit

If you exceed the 12,000-hour limit, please:

  1. Cancel some of your jobs

  2. If possible, resubmit your jobs using a lower walltime and/or number of nodes, or wait until some of your previous jobs finish and then resubmit the cancelled jobs if they no longer exceed the limit.

Acknowledgments

Please add the following text to your publications if they are based on results obtained on TeideHPC:

Acknowledgments

The author(s) wish to acknowledge the contribution of Teide High-Performance Computing facilities to the results of this research. TeideHPC facilities are provided by the Instituto Tecnológico y de Energías Renovables (ITER, SA). URL: http://teidehpc.iter.es

The use of TeideHPC should also be acknowledged when filling in the "Informe Anual" of your projects. When entering a refereed publication in the "Producción Científica" section, add the following as a used resource: "Supercomputing: TeideHPC".

Who can give me support if I have any issue?