|  hpc-node[3-9]  |  Dell R740   |  2 x Intel Xeon Gold 5220R @2,2 GHz (24c)       |  192 GB  |  -                           |
|  hpc-fat1       |  Dell R840   |  4 x Xeon Gold 6248 @ 2.50GHz (20c)             |  1 TB    |  -                           |
|  hpc-gpu[1-2]   |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB  |  2x Nvidia Tesla V100S       |
|  hpc-gpu3       |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)              |  256 GB  |  2x Nvidia Ampere A100 40GB  |
|  hpc-gpu4       |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)              |  256 GB  |  1x Nvidia Ampere A100 80GB  |
===== Accessing the cluster =====
Access to the cluster must be requested in advance via the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users who do not have access permission will receive an "incorrect password" message.
  
  * Python 3.6.8
  * Perl 5.26.3
GPU nodes additionally provide:
  * nVidia Driver 510.47.03
  * CUDA 11.6
  * libcudnn 8.7
To use any other software not installed on the system, or a different version of an installed package, there are three options:
  - Use Modules with the modules that are already installed, or request the installation of a new module if it is not available (see the example below).
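As a rough illustration of option 1, a Modules session could look like this (the uDocker module is only an example; module avail shows what is actually installed):
<code bash>
# List available modules:
hpc-login2 ~]$ module avail
# Load a module into the current environment (ml is the short form):
hpc-login2 ~]$ module load uDocker
# Show the modules currently loaded:
hpc-login2 ~]$ module list
# Unload it when finished:
hpc-login2 ~]$ module unload uDocker
</code>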
=== uDocker ===
[[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker manual ]] \\
udocker is installed as a module, so it needs to be loaded into the environment:
<code bash>
ml uDocker
</code>
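Once the module is loaded, containers can be pulled and run entirely in user space. A minimal sketch (the image and container names are only examples):
<code bash>
# Pull an image, create a container from it and run a command inside it:
hpc-login2 ~]$ udocker pull ubuntu:22.04
hpc-login2 ~]$ udocker create --name=myubuntu ubuntu:22.04
hpc-login2 ~]$ udocker run myubuntu cat /etc/os-release
</code>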
<code bash>
# Install
sh Miniconda3-py39_4.11.0-Linux-x86_64.sh
# Initialize for bash shell
~/miniconda3/bin/conda init bash
</code>
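After initialization, isolated environments can be created without administrator privileges. A minimal sketch (the environment name and package versions are only examples):
<code bash>
# Create an environment with a specific Python version and a package:
hpc-login2 ~]$ conda create -n myenv python=3.9 numpy
# Activate it, use it, and deactivate it when finished:
hpc-login2 ~]$ conda activate myenv
(myenv) hpc-login2 ~]$ python --version
(myenv) hpc-login2 ~]$ conda deactivate
</code>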
  
== Available resources ==
<code bash>
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
  NODO     ESTADO                        CORES EN USO                           USO MEM     GPUS(Uso/Total)
=============================================================================================================
 hpc-fat1    up   0%[--------------------------------------------------]( 0/80) RAM:  0%     ---
 hpc-gpu1    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
 hpc-gpu2    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
 hpc-gpu3    up   0%[--------------------------------------------------]( 0/64) RAM:  0%   A100_40 (0/2)
 hpc-gpu4    up   1%[|-------------------------------------------------]( 1/64) RAM: 35%   A100_80 (1/1)
 hpc-node1   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 hpc-node2   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 hpc-node3   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node4   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node5   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node6   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node7   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node8   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node9   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
=============================================================================================================
TOTALES: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]
hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
# There is an alias for that command:
</code>
<code bash>
# There is an alias that shows only the relevant info:
hpc-login2 ~]$ ver_colas
      Name    Priority                                  MaxTRES     MaxWall            MaxTRESPU MaxJobsPU MaxSubmitPU
----------  ---------- ---------------------------------------- ----------- -------------------- --------- -----------
   regular         100                cpu=200,gres/gpu=1,node=4  4-04:00:00       cpu=200,node=4        10          50
interactive        200                                   node=1    04:00:00               node=1                   1
    urgent         300                        gres/gpu=1,node=1    04:00:00               cpu=36                  15
      long         100                        gres/gpu=1,node=4  8-04:00:00
     large         100                       cpu=200,gres/gpu=2  4-04:00:00                                       10
     admin         500
     small         150                             cpu=6,node=2    04:00:00              cpu=400        40         100
</code>
# Priority: the relative priority of each queue. \\
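A job is sent to a given queue by selecting its QOS at submission time. A minimal sketch (job.sh is a placeholder script name):
<code bash>
# Submit a job to the urgent QOS instead of the default (regular):
hpc-login2 ~]$ sbatch --qos=urgent job.sh
# The same option works with salloc and srun:
hpc-login2 ~]$ salloc --qos=interactive -N1
</code>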
==== Sending a job to the queue system ====
== Requesting resources ==
By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it one node, one CPU and 4 GB of RAM. The time limit for job execution is that of the queue (4 days and 4 hours).
This is very inefficient; ideally, specify at least these three parameters when submitting jobs (see the example below):
  -  %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
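As a rough illustration of these options (job.sh is a placeholder script name):
<code bash>
# 1 node, 4 tasks, 1 CPU per task:
hpc-login2 ~]$ sbatch -N1 -n4 -c1 job.sh
# The long forms are equivalent:
hpc-login2 ~]$ sbatch --nodes=1 --ntasks=4 --cpus-per-task=1 job.sh
</code>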
  
== Job submission ==
  - sbatch
  - salloc
  - srun

1. SBATCH \\
Used to send a script to the queuing system. It is batch-processed and non-blocking.
<code bash>
hpc-login2 ~]$ sbatch test_job.sh
</code>
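As a rough sketch, a batch script like test_job.sh could contain something along these lines (all directive values are only examples):
<code bash>
#!/bin/bash
#SBATCH --job-name=test_job       # Job name
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks=4                # Number of tasks
#SBATCH --cpus-per-task=1         # CPUs per task
#SBATCH --time=01:00:00           # Time limit (HH:MM:SS)

srun hostname
</code>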
2. SALLOC \\
It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.
<code bash>
# Get 5 nodes and launch a job.
hpc-login2 ~]$ salloc -N5 myprogram
# Get interactive access to a node (Press Ctrl+D to exit):
hpc-login2 ~]$ salloc -N1
# Get interactive EXCLUSIVE access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
</code>
3. SRUN \\
It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
<code bash>
# Launch the hostname command on 2 nodes
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
</code>
  
==== GPU use ====
By default, these are the exit codes of the commands:
^  SLURM command  ^  Exit code  ^
|  salloc  |  0 on success, 1 if the user's command could not be executed  |
|  srun  |  The highest exit code among all executed tasks, or 253 for an out-of-memory error  |
|  sbatch  |  0 on success; otherwise, the exit code of the process that failed  |
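The state and exit code of a finished job can be checked afterwards with the accounting tools. A minimal sketch (the job id is only an example):
<code bash>
hpc-login2 ~]$ sacct -j 6547 --format=JobID,JobName,State,ExitCode
</code>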
  
== STDIN, STDOUT and STDERR ==
**SRUN:**\\
By default, stdout and stderr of all TASKS are redirected to srun's stdout and stderr, and srun's stdin is redirected to all TASKS. This can be changed with:
|  %%-i, --input=<option>%%    |
|  %%-o, --output=<option>%%   |
|  %%-e, --error=<option>%%   |
The options are:
  * //all//: the default option.
  * //none//: nothing is redirected.
  * //taskid//: redirects only to and/or from the specified TASK id.
  * //filename//: redirects everything to and/or from the specified file.
  * //filename pattern//: same as //filename//, but with a file name defined by a [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | pattern ]].
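As a rough illustration, each task's output can be sent to its own file with a filename pattern (%j is the job id, %t the task id; the file name is only an example):
<code bash>
hpc-login2 ~]$ srun -N2 --output=result-%j-%t.out hostname
</code>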
  
**SBATCH:**\\
By default, "/dev/null" is opened on the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with:
|  %%-i, --input=<filename_pattern>%%  |
|  %%-o, --output=<filename_pattern>%%  |
|  %%-e, --error=<filename_pattern>%%  |
The filename_pattern reference is [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | here ]].
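For example, a minimal sketch that names the output and error files after the job id (file names are only examples):
<code bash>
hpc-login2 ~]$ sbatch --output=job-%j.out --error=job-%j.err test_job.sh
</code>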
  
==== Sending mail ====
Jobs can be configured to send mail in certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
|  %%--mail-type=<type>%%  |  Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  |
|  %%--mail-user=<user>%%  |  The destination email address.  |
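A minimal sketch combining both parameters (the address is a placeholder):
<code bash>
hpc-login2 ~]$ sbatch --mail-type=END,FAIL --mail-user=user@example.com test_job.sh
</code>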
  
  
  
==== Status of jobs in the queuing system ====
<code bash>
hpc-login2 ~]# squeue -l
JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1

## Check status of queue use:
hpc-login2 ~]$ estado_colas.sh
JOBS PER USER:
--------------
       usuario.uno:  3
       usuario.dos:  1

JOBS PER QOS:
--------------
             regular:  3
                long:  1

JOBS PER STATE:
--------------
             RUNNING:  3
             PENDING:  1
==========================================
Total JOBS in cluster:  4
</code>
Most common job states (STATE):
  * R RUNNING Job currently has an allocation.
  * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
  * PD PENDING Job is awaiting resource allocation.
    
[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Full list of possible job states ]].\\
  
If a job is not running, a reason is shown in the REASON column: [[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | list of reasons ]] why a job may be waiting for execution.
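As a rough illustration, the state and reason of a specific job can be queried directly (the job id is only an example):
<code bash>
hpc-login2 ~]$ squeue -j 6547 -o "%.10i %.12T %.30R"
</code>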