--- en:centro:servizos:hpc [2022/06/30 10:58] – fernando.guillen
+++ en:centro:servizos:hpc [2024/03/13 10:37] (current) – [Sending a job to the queue system] fernando.guillen
@@ Line 1: / Line 1: @@
-====== Cluster de Computación de Altas Prestacións (HPC) ctcomp3  ======
+====== High Performance Computing (HPC) cluster ctcomp3  ======
-[[ https://web.microsoftstream.com/video/f5eba154-b597-4440-9307-3befd7597d78 | Video de la presentación del servicio (7/3/22) ]]
+[[ https://web.microsoftstream.com/video/f5eba154-b597-4440-9307-3befd7597d78 | Video of the presentation of the service (7/3/22) (Spanish only) ]]
-===== Descripción =====
+===== Description =====
-El clúster está compuesto en la parte de cómputo por:
+The computing part of the cluster is made up of:
-  *  9 servidores para cómputo general.
+  * 9 servers for general computing.
-  *  1 "fat node" para trabajos que requieran mucha memoria.
+  * 1 "fat node" for memory-intensive jobs.
-  *  4 servidores para computo con GPU.
+  * 4 servers for GPU computing.
-Los usuarios solo tienen acceso directo al nodo de login, de prestaciones más limitadas y que no debe usarse para computar. \\
+Users only have direct access to the login node, which has more limited resources and should not be used for computing. \\
-Todos los nodos están interconectados por una red a 10Gb. \\
+All nodes are interconnected by a 10Gb network. \\
-Hay un almacenamiento distribuido accesible desde todos los nodos con 220 TB de capacidad conectado mediante una doble red de fibra de 25Gb. \\
+There is distributed storage accessible from all nodes with 220 TB of capacity connected by a dual 25Gb fibre network. \\
 \\
-^  Nombre                    ^  Modelo      ^  Procesador                                     ^  Memoria  ^  GPU                         ^
+^  Name                    ^  Model      ^  Processor                                     ^  Memory  ^  GPU                         ^
 |  hpc-login2                |  Dell R440   |  1 x Intel Xeon Silver 4208 CPU @ 2.10GHz (8c)  |  16 GB    |  -                           |
 |  hpc-node[1-2]             |  Dell R740   |  2 x Intel Xeon Gold 5220 @2,2 GHz (18c)        |  192 GB   |  -                           |
 |  hpc-node[3-9]             |  Dell R740   |  2 x Intel Xeon Gold 5220R @2,2 GHz (24c)       |  192 GB   |  -                           |
 |  hpc-fat1                  |  Dell R840   |  4 x Xeon Gold 6248 @ 2.50GHz (20c)             |  1 TB     |  -                           |
-|  <del>hpc-gpu1</del>*  |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
+|  hpc-gpu[1-2]  |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
-|  hpc-gpu2  |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
 |  hpc-gpu3                  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)              |  256 GB   |  2x Nvidia Ampere A100 40GB  |
 |  hpc-gpu4                  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)              |  256 GB   |  1x Nvidia Ampere A100 80GB  |
-* Es ctgpgpu8. Se integrará próximamente en cluster.
-===== Conexión al sistema =====
-Para acceder al clúster, hay que solicitarlo previamente a través de [[https://citius.usc.es/uxitic/incidencias/add|formulario de incidencias]]. Los usuarios que no tengan permiso de acceso recibirán un mensaje de "contraseña incorrecta".
-El acceso se realiza mediante una conexión SSH al nodo de login:
+===== Accessing the cluster =====
+Access to the cluster must be requested in advance via the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will receive an "incorrect password" message.
+Access is through an SSH connection to the login node:
 <code bash>
 ssh <username>@hpc-login2.inv.usc.es
 </code>
-=====  Almacenamiento, directorios y sistemas de ficheros  =====
+=====  Storage, directories and filesystems  =====
-<note warning> No se hace copia de seguridad de ninguno de los sistemas de ficheros del cluster!!</note>
+<note warning> None of the file systems in the cluster are backed up!!!</note>
-El HOME de los usuarios en el cluster está en el sistema compartido de ficheros, por lo que es accesible desde todos los nodos del cluster. Ruta definida en la variable de entorno %%$HOME%%. \\
+Each user's HOME is on the shared file system, so it is accessible from all nodes in the cluster. Its path is defined in the environment variable %%$HOME%%. \\
-Cada nodo tiene una partición local de 1 TB para scratch, que se borra al terminar cada trabajo. Se puede acceder mediante la variable de entorno %%$LOCAL_SCRATCH%% en los scripts. \\
+Each node has a local 1 TB scratch partition, whose contents are deleted at the end of each job. It can be accessed in job scripts through the %%$LOCAL_SCRATCH%% environment variable. \\
-Para datos que deban ser compartidos por grupos de usuarios, hay que solicitar la creación de una carpeta en el almacenamiento compartido que solo será accesible por los miembros del grupo.\\
+For data to be shared by groups of users, you must request the creation of a folder in the shared storage that will only be accessible by members of the group.\\
-^  Directorio        ^  Variable               ^  Punto de montaje             ^  Capacidad  ^
+^  Directory        ^  Variable               ^  Mount point             ^  Capacity  ^
 |  Home              |  %%$HOME%%              |  /mnt/beegfs/home/<username>  |  220 TB*    |
-|  Scratch local     |  %%$LOCAL_SCRATCH%%     |  varía                        |  1 TB       |
+|  Local scratch     |  %%$LOCAL_SCRATCH%%     |  varies                       |  1 TB       |
-|  Carpeta de grupo  |  %% $GRUPOS/<nombre>%%  |  /mnt/beegfs/groups/<nombre>  |  220 TB*    |
+|  Group folder  |  %% $GRUPOS/<name>%%  |  /mnt/beegfs/groups/<name>  |  220 TB*    |
-%%* el almacenamiento es compartido%%
+%%* storage is shared %%
-=== AVISO IMPORTANTE ===
+=== WARNING ===
-El sistema compartido de archivos tiene un mal rendimiento cuando trabaja con muchos archivos de tamaño pequeño. Para mejorar el rendimiento en ese tipo de escenarios hay que crear un sistema de archivos en un fichero de imagen y montarlo para trabajar directamente sobre él. El procedimiento es el siguiente:
+The shared file system performs poorly when working with many small files. To improve performance in such scenarios, create a file system inside an image file and mount it so that you work directly on it. The procedure is as follows:
-  * Crear el fichero de imagen en tu home:
+  * Create the image file in your home folder:
 <code bash>
 ## truncate image.name -s SIZE_IN_BYTES
-truncate ejemplo.ext4 -s 20G
+truncate example.ext4 -s 20G
 </code>
-  *  Crear un sistema de archivos en el fichero de imagen:
+  *  Create a filesystem in the image file:
 <code bash>
 ## mkfs.ext4 -T small -m 0 image.name
-## -T small opciones optimizadas para archivos pequeños
+## -T small optimized options for small files
-## -m 0 No reservar espacio para root
+## -m 0 Do not reserve capacity for root user
-mkfs.ext4 -T small -m 0 ejemplo.ext4
+mkfs.ext4 -T small -m 0 example.ext4
 </code>
-  * Montar la imagen (usando SUDO) con el script  //mount_image.py// :
+  * Mount the image (using SUDO) with the script  //mount_image.py// :
 <code bash>
-## Por defecto queda montada en /mnt/imagenes/<username>/ en modo solo lectura.
+## By default it is mounted at /mnt/imagenes/<username>/ in read-only mode.
-sudo mount_image.py ejemplo.ext4
+sudo mount_image.py example.ext4
 </code>
-  * Para desmontar la imagen usar el script //umount_image.py// (usando SUDO)
+  * To unmount the image use the script //umount_image.py// (using SUDO)
-El script de montaje tiene estas opciones:
+The mount script has these options:
 <code>
---mount-point path   <-- (opcional)Con esta opción crea subdirectorios por debajo de /mnt/imagenes/<username>/<path>
+--mount-point path   <-- (optional) This option creates subdirectories under /mnt/imagenes/<username>/<path>
---rw                  <-- (opcional)Por defecto se monta readonly, con esta opción se monta readwrite.
+--rw                  <-- (optional) By default it is mounted readonly, with this option it is mounted readwrite.
 </code>
-El script de desmontaje tiene estas opciones:
+<note warning> Do not mount the image file readwrite from more than one node!!!</note>
-<code>solo admite como parámetro opcional el mismo path que hayas usado para el montaje con la opción
---mount-point  <-- (opcional)
+The unmount script has these options:
+<code>It only accepts, as an optional parameter, the same path you used when mounting with the option
+--mount-point  <-- (optional)
 </code>
-=====  Transferencia de ficheros y datos  =====
+=====  File and data transfer  =====
 === SCP ===
-Desde tu máquina local al cluster:
+From your local machine to the cluster:
 <code bash>
-scp filename <username>@hpc-login2:/<ruta>
+scp filename <username>@hpc-login2:/<path>
 </code>
-Desde el cluster a tu máquina local:
+From the cluster to your local machine:
 <code bash>
-scp filename <username>@<hostname>:/<ruta>
+scp filename <username>@<hostname>:/<path>
 </code>
-[[https://man7.org/linux/man-pages/man1/scp.1.html | Página del manual de SCP]]
+[[https://man7.org/linux/man-pages/man1/scp.1.html | SCP man page]]
 === SFTP ===
-Para transferir múltiples archivos o para navegar por el sistema de archivos.
+To transfer several files or to navigate through the filesystem.
 <code bash>
 <hostname>:~$ sftp <user_name>@hpc-login2
@@ Line 92: / Line 94: @@
 sftp> quit
 </code>
-[[https://www.unix.com/man-page/redhat/1/sftp/ | Página del manual de SFTP]]
+[[https://www.unix.com/man-page/redhat/1/sftp/ | SFTP man page]]
 === RSYNC ===
-[[ https://rsync.samba.org/documentation.html | Documentación de RSYNC ]]
+[[ https://rsync.samba.org/documentation.html | RSYNC documentation ]]
 === SSHFS ===
-Requiere la instalación del paquete sshfs.\\
+Requires local installation of the sshfs package.\\
-Permite por ejemplo montar el home del equipo del usuario en hpc-login2:
+For example, it lets you mount your local machine's home directory on hpc-login2:
 <code bash>
-## Montar
+## Mount
-sshfs  <username>@ctdeskxxx.inv.usc.es:/home/<username> <punto_de_montaje>
+sshfs  <username>@ctdeskxxx.inv.usc.es:/home/<username> <mount_point>
-## Desmontar
+## Unmount
-fusermount -u <punto_de_montaje>
+fusermount -u <mount_point>
 </code>
-[[https://linux.die.net/man/1/sshfs | Página del manual de SSHFS]]
+[[https://linux.die.net/man/1/sshfs | SSHFS man page]]
-===== Software disponible =====
+===== Available Software =====
-Todos los nodos tienen el software básico que se instala por defecto con AlmaLinux 8.4, particularmente:
+All nodes have the basic software that is installed by default in AlmaLinux 8.4, in particular:
   * GCC 8.5.0
   * Python 3.6.8
   * Perl 5.26.3
+GPU nodes, in addition:
+  * NVIDIA driver 510.47.03
+  * CUDA 11.6
+  * libcudnn 8.7
+To use software that is not installed on the system, or a different version of installed software, there are three options:
+  - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available).
+  - Use a container (uDocker or Apptainer/Singularity)
+  - Use Conda
+A module is the simplest solution for using software without modifications or hard-to-satisfy dependencies.\\
+A container is ideal when dependencies are complicated and/or the software is highly customised. It is also the best solution if you are looking for reproducibility, ease of distribution and teamwork.\\
+Conda is the best solution if you need the latest version of a library or program or packages not otherwise available.\\
-Para usar cualquier otro software no instalado en el sistema u otra versión del mismo hay tres opciones:
+==== Using Modules/Lmod ====
-  - Usar Modules con los módulos que ya están instalados (o solicitar la instalación de un nuevo módulo si no está disponible)
+[[ https://lmod.readthedocs.io/en/latest/010_user.html | Lmod documentation]]
-  - Usar un contenedor (uDocker o Apptainer/Singularity)
-  - Usar Conda
-Un módulo es la solución más sencilla para usar software sin modificaciones o dependencias difíciles de satisfacer.\\
-Un contenedor es ideal cuando las dependencias son complicadas y/o el software está muy personalizado. También es la mejor solución si lo que se busca es reproducibilidad, facilidad para su distribución y trabajo en equipo.\\
-Conda es la mejor solución si lo que se necesita es la última versión de una librería o programa o paquetes no disponibles de otra forma.\\
-==== Uso de modules/Lmod ====
-[[ https://lmod.readthedocs.io/en/latest/010_user.html | Documentación de Lmod ]]
 <code bash>
-# Ver los módulos disponibles:
+# See available modules:
 module avail
-# Cargar un módulo:
+# Load a module:
-module <nombre_modulo>
+module load <module_name>
-# Descargar un módulo:
+# Unload a module:
-module unload <nombre_modulo>
+module unload <module_name>
-# Ver módulos cargados en tu entorno:
+# List modules loaded in your environment:
 module list
-# Puede usarse ml como abreviatura del comando module:
+# ml can be used as shorthand for the module command:
 ml avail
-# Para obtener información sobre un módulo:
+# To get information about a module:
-ml spider <nombre_modulo>
+ml spider <module_name>
 </code>
+==== Running software containers ====
-==== Ejecución de contenedores de software ====
 === uDocker ===
-[[ https://indigo-dc.gitbook.io/udocker/user_manual | Manual de uDocker]] \\
+[[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker manual ]] \\
-uDocker está instalado como un módulo, así que es necesario cargarlo en el entorno:
+udocker is installed as a module, so it needs to be loaded into the environment:
 <code bash>
 ml uDocker
@@ Line 149: / Line 151: @@
 === Apptainer/Singularity ===
-[[ https://sylabs.io/guides/3.8/user-guide/ | Documentacion de Apptainer/Singularity ]] \\
+[[ https://sylabs.io/guides/3.8/user-guide/ | Apptainer/Singularity documentation]] \\
-Apptainer/Singularity está instalado en el sistema de cada nodo, por lo que no es necesario hacer nada para usarlo.
+Apptainer/Singularity is installed on each node's system, so you don't need to do anything to use it.
 ==== CONDA ====
-[[ https://docs.conda.io/en/latest/miniconda.html | Documentacion de Conda ]] \\
+[[ https://docs.conda.io/en/latest/miniconda.html | Conda Documentation ]] \\
-Miniconda es la versíon mínima de Anaconda y solo incluye el gestor de entornos conda, Python y unos pocos paquetes necesarios. A partir de ahí cada usuario solo descarga e instala los paquetes que necesita.
+Miniconda is the minimal version of Anaconda and only includes the conda environment manager, Python and a few necessary packages. From there on, each user only downloads and installs the packages they need.
 <code bash>
-# Obtener miniconda
+# Getting miniconda
 wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh
-# Instalarlo
+# Install
 sh Miniconda3-py39_4.11.0-Linux-x86_64.sh
+#  Initialize for bash shell
+~/miniconda3/bin/conda init bash
 </code>
-===== Uso de SLURM =====
+===== Using SLURM =====
-El gestor de colas en el cluster es [[ https://slurm.schedmd.com/documentation.html | SLURM ]]. \\
+The cluster queue manager is [[ https://slurm.schedmd.com/documentation.html | SLURM ]]. \\
-<note tip>El término CPU identifica a un core físico de un socket. El hyperthreading está desactivado, por lo que cada nodo tiene disponibles tantas CPU como (nº sockets) * (nº cores físico por socket) tenga.</note>
+<note tip>The term CPU refers to a physical core in a socket. Hyperthreading is disabled, so the number of CPUs available on each node is (number of sockets) * (number of physical cores per socket).</note>
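The rule in the note can be checked against the hardware table with a few lines of shell. This is only a sketch: ''cpus_per_node'' is a hypothetical helper, and the socket and core counts are copied from the table in the Description section.

```bash
#!/bin/bash
# Hypothetical helper: CPUs seen by SLURM = sockets * physical cores per socket
cpus_per_node() {
    local sockets=$1 cores_per_socket=$2
    echo $(( sockets * cores_per_socket ))
}

# Columns: node group, sockets, cores per socket (values from the hardware table)
while read -r name sockets cores; do
    printf '%-14s %3d CPUs\n' "$name" "$(cpus_per_node "$sockets" "$cores")"
done <<'EOF'
hpc-node[1-2] 2 18
hpc-node[3-9] 2 24
hpc-fat1 4 20
hpc-gpu[1-2] 2 18
hpc-gpu[3-4] 2 32
EOF
```

The per-node values (36, 48, 80, 36 and 64) are the CPU totals that SLURM reports for each node.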
-== Recursos disponibles ==
+== Available resources ==
 <code bash>
+hpc-login2 ~]# ver_estado.sh
+=============================================================================================================
+  NODO     ESTADO                        CORES EN USO                           USO MEM     GPUS(Uso/Total)
+=============================================================================================================
+ hpc-fat1    up   0%[--------------------------------------------------]( 0/80) RAM:  0%     ---
+ hpc-gpu1    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
+ hpc-gpu2    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
+ hpc-gpu3    up   0%[--------------------------------------------------]( 0/64) RAM:  0%   A100_40 (0/2)
+ hpc-gpu4    up   1%[|-------------------------------------------------]( 1/64) RAM: 35%   A100_80 (1/1)
+ hpc-node1   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
+ hpc-node2   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
+ hpc-node3   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node4   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node5   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node6   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node7   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node8   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node9   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+=============================================================================================================
+TOTALES: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]
 hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
-# Hay un alias para este comando:
+# There is an alias for that command:
 hpc-login2 ~]$ ver_recursos
 NODELIST                        CPUS                  MEMORY                AVAIL_FEATURES        GRES
@@ Line 179: / Line 203: @@
 hpc-node[3-9]                   48                    187645                cpu_intel             (null)
-# Para ver el uso actual de los recursos: (CPUS (Allocated/Idle/Other/Total))
+# To see current resource use: (CPUS (Allocated/Idle/Other/Total))
 hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed
-# Hay un alias para este comando:
+# There is an alias for that command:
 hpc-login2 ~]$ ver_uso
 NODELIST            CPUS(A/I/O/T)       MEMORY              FREE_MEM            GRES                GRES_USED
@@ Line 197: / Line 221: @@
 hpc-node9           36/12/0/48          187645              127312              (null)              gpu:0,mps:0
 </code>
-==== Nodos ====
+==== Nodes ====
-Un nodo es la unidad de computación de SLURM, y se corresponde con un servidor físico.
+A node is SLURM's computation unit and corresponds to a physical server.
 <code bash>
-# Mostrar la información de un nodo:
+# Show node info:
 hpc-login2 ~]$ scontrol show node hpc-node1
 NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18
@@ Line 220: / Line 244: @@
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
 </code>
-==== Particiones ====
+==== Partitions ====
-Las particiones en SLURM son grupos lógicos de nodos. En el cluster hay una única partición a la que pertenecen todos los nodos, por lo que no es necesario especificarla a la hora de enviar trabajos.
+Partitions in SLURM are logical groups of nodes. In the cluster there is a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.
 <code bash>
-# Mostrar la información de las particiones:
+# Show partition info:
 hpc-login2 ~]$ sinfo
-defaultPartition*    up   infinite     11   idle hpc-fat1,hpc-gpu[3-4],hpc-node[1-9]
+defaultPartition*    up   infinite     11   idle hpc-fat1,hpc-gpu[1-4],hpc-node[1-9]
-# Cuando se incorporen al cluster ctgpgpu7 y 8 apareceran como los nodos hpc-gpu1 y 2 respectivamente.
 </code>
-==== Trabajos ====
+==== Jobs ====
-Los trabajos en SLURM son asignaciones de recursos a un usuario durante un tiempo determinado. Los trabajos se identifican por un número correlativo o JOBID. \\
+Jobs in SLURM are resource allocations to a user for a given time. Jobs are identified by a sequential number or JOBID. \\
-Un trabajo (JOB) consiste en uno o más pasos (STEPS), cada uno consistente en una o más tareas (TASKS) que usan una o más CPU. Hay un STEP por cada programa que se ejecute de forma secuencial en un JOB y hay un TASK por cada programa que se ejecute en paralelo. Por lo tanto en el caso más simple como por ejemplo lanzar un trabajo consistente en ejecutar el comando hostname el JOB tiene un único STEP y una única TASK.
+A JOB consists of one or more STEPS, each consisting of one or more TASKS that use one or more CPUs. There is one STEP for each program that executes sequentially in a JOB and one TASK for each program that executes in parallel. Therefore, in the simplest case, such as a job that only runs the hostname command, the JOB has a single STEP and a single TASK.
-==== Sistema de colas (QOS) ====
+==== Queue system (QOS) ====
-La cola a la que se envíe cada trabajo define la prioridad,los límites y también el "coste" relativo para el usuario.
+The queue to which each job is submitted defines the priority, the limits and also the relative "cost" to the user.
 <code bash>
-# Mostrar las colas
+# Show queues
 hpc-login2 ~]$ sacctmgr show qos
-# Hay un alias que muestra solo la información más relevante:
+# There is an alias that shows only the most relevant information:
 hpc-login2 ~]$ ver_colas
-      Name   Priority           Flags UsageFactor                     MaxTRES     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU
+      Name    Priority                                  MaxTRES     MaxWall            MaxTRESPU MaxJobsPU MaxSubmitPU
----------- ---------- --------------- ----------- --------------------------- ----------- ------------- --------- -----------
+----------  ---------- ---------------------------------------- ----------- -------------------- --------- -----------
-   regular        100     DenyOnLimit    1.000000   cpu=200,gres/gpu=1,node=4  4-04:00:00                      10          50
+   regular         100                cpu=200,gres/gpu=1,node=4  4-04:00:00       cpu=200,node=4        10          50
-interactive       200     DenyOnLimit    1.000000                      node=1    04:00:00        node=1         1           1
+interactive        200                                   node=1    04:00:00               node=1         1           1
-    urgent        300     DenyOnLimit    2.000000           gres/gpu=1,node=1    04:00:00        cpu=36         5          15
+    urgent         300                        gres/gpu=1,node=1    04:00:00               cpu=36         5          15
-      long        100     DenyOnLimit    1.000000           gres/gpu=1,node=4  8-08:00:00
+      long         100                        gres/gpu=1,node=4  8-04:00:00                              1           5
-     large        100     DenyOnLimit    1.000000          cpu=200,gres/gpu=2  4-04:00:00                      10          25
+     large         100                       cpu=200,gres/gpu=2  4-04:00:00                              2          10
-     admin        500                    0.000000
+     admin         500
+     small         150                             cpu=6,node=2    04:00:00              cpu=400        40         100
 </code>
-# Priority: es la prioridad relativa de cada cola. \\
+# Priority: is the relative priority of each queue. \\
-# DenyonLimit: el trabajo no se ejecuta si no cumple los límites de la cola \\
+# DenyOnLimit: the job will not be executed if it does not comply with the queue limits \\
-# UsageFactor: el coste relativo para el usuario de ejecutar un trabajo en esa cola \\
+# UsageFactor: relative cost to the user of running a job in that queue \\
-# MaxTRES: límites por cada trabajo \\
+# MaxTRES: limits applied to each job \\
-# MaxWall: tiempo máximo que puede estar el trabajo en ejecución \\
+# MaxWall: maximum time the job can run \\
-# MaxTRESPU: límites globales por usuario \\
+# MaxTRESPU: global limits per user \\
-# MaxJobsPU: Número máximo de trabajos que un usuario puede tener en ejecución. \\
+# MaxJobsPU: Maximum number of jobs a user can have running simultaneously. \\
-# MaxSubmitPU: Número máximo de trabajos que un usuario puede tener en total encolados y en ejecucuón.\\
+# MaxSubmitPU: Maximum number of jobs that a user can have in total both queued and running.\\
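As an illustration of the MaxWall column, a requested time can be compared with a queue limit by converting both to seconds. The sketch below assumes only the ''D-HH:MM:SS'' and ''HH:MM:SS'' forms that appear in the table; ''walltime_to_seconds'' and ''fits_walltime'' are hypothetical helpers, not SLURM commands.

```bash
#!/bin/bash
# Convert a D-HH:MM:SS or HH:MM:SS time specification to seconds
walltime_to_seconds() {
    local spec=$1 days=0 h m s
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}      # part before the dash
        spec=${spec#*-}       # part after the dash
    fi
    IFS=: read -r h m s <<< "$spec"
    # 10# forces base 10 so that values such as 08 are not read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeeds when the requested time ($1) does not exceed the limit ($2)
fits_walltime() {
    [ "$(walltime_to_seconds "$1")" -le "$(walltime_to_seconds "$2")" ]
}

if fits_walltime "2-00:00:00" "4-04:00:00"; then
    echo "2 days fits within the regular queue limit"
fi
if ! fits_walltime "2-00:00:00" "04:00:00"; then
    echo "2 days exceeds the urgent queue limit"
fi
```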
-==== Envío de un trabajo al sistema de colas ====
+==== Sending a job to the queue system ====
-== Especificación de recursos ==
+== Requesting resources ==
-Por defecto, si se envía un trabajo sin especificar nada el sistema lo envia a la QOS por defecto (regular) y le asigna un nodo, una CPU y toda la memoria disponible. El límite de tiempo para la ejecución del trabajo es el de la cola (4 días y 4 horas).
+By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it a node, a CPU and 4 GB. The time limit for job execution is that of the queue (4 days and 4 hours).
-Esto es muy ineficiente, lo ideal es especificar en la medida de lo posible al menos tres parámetros a la hora de enviar los trabajos:
+This is very inefficient. Ideally, specify at least the following three parameters when submitting jobs:
-  -  %%El número de nodos (-N o --nodes), tareas (-n o --ntasks) y/o CPU por tarea (-c o --cpus-per-task).%%
+  -  %%Number of nodes (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
-  -  %%La memoria (--mem) por nodo o la memoria por cpu (--mem-per-cpu).%%
+  -  %%Memory (--mem) per node or memory per cpu (--mem-per-cpu).%%
-  -  %%El tiempo estimado de ejecución del trabajo ( --time )%%
+  -  %%Estimated job execution time ( --time )%%
-A mayores puede ser interesante añadir los siguientes parámetros:
+In addition, the following parameters may be useful:
-|  -J   |  %%--job-name%%  |Nombre para el trabajo. Por defecto: nombre del ejecutable  |
+|  -J   |  %%--job-name%%  |Job name. Default: executable name  |
-|  -q   |  %%--qos%%       |Nombre de la cola a la que se envía el trabajo. Por defecto: regular  |
+|  -q   |  %%--qos%%       |Name of the queue to which the job is sent. Default: regular  |
-|  -o   |  %%--output%%    |Fichero o patrón de fichero al que se redirige toda la salida estandar y de error.  |
+|  -o   |  %%--output%%    |File or file pattern to which all standard and error output is redirected.  |
-|       |  %%--gres%%      |Tipo y/o número de GPUs que se solicitan para el trabajo.  |
+|       |  %%--gres%%      |Type and/or number of GPUs requested for the job.   |
 |  -C   |  %%--constraint%%  |To specify that you want nodes with Intel or AMD processors (cpu_intel or cpu_amd)  |
-|    |  %%--exclusive%%  |Para solicitar que el trabajo no comparta nodos con otros trabajos.  |
+|    |  %%--exclusive%%  |To request that the job does not share nodes with other jobs.  |
-|  -w  |  %%--nodelist%%   |Lista de nodos en los que ejecutar el trabajo  |
+|  -w  |  %%--nodelist%%   |List of nodes to run the job on  |
-== Cómo se asignan los recursos ==
+== How resources are allocated ==
-Por defecto el método de asignación entre nodos es la asignación en bloque ( se asignan todos los cores disponibles en un nodo antes de usar otro). El método de asignación por defecto dentro de cada nodo es la asignación cíclica  (se van repartiendo por igual los cores requeridos entre los sockests disponibles en el nodo).
+The default allocation method between nodes is block allocation (all available cores on a node are allocated before using another node). The default allocation method within each node is cyclic allocation (the required cores are distributed equally among the available sockets in the node).
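The cyclic distribution within a node can be pictured with a toy model. This is an illustration of the round-robin idea only, not SLURM's actual allocation code; ''cyclic_distribution'' is a hypothetical function.

```bash
#!/bin/bash
# Toy model: hand out <cores> one at a time, cycling over <sockets>
cyclic_distribution() {
    local cores=$1 sockets=$2 i idx
    local -a count
    for (( i = 0; i < sockets; i++ )); do
        count[i]=0
    done
    for (( i = 0; i < cores; i++ )); do
        idx=$(( i % sockets ))               # next socket in round-robin order
        count[idx]=$(( count[idx] + 1 ))
    done
    echo "${count[*]}"                       # cores assigned to each socket
}

# 5 cores on a 2-socket node
cyclic_distribution 5 2
```

With 5 cores on a 2-socket node, the first socket receives 3 cores and the second receives 2.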
-== Calculo de la prioridad ==
+== Priority calculation ==
-Cuando se envía un trabajo al sistema de colas, lo primero que ocurre es que se comprueba si los recursos solicitados entran dentro de los límites fijados en la cola correspondiente. Si supera alguno se cancela el envío. \\
+When a job is submitted to the queuing system, the first thing that happens is that the requested resources are checked to see if they fall within the limits set in the corresponding queue. If it exceeds any of them, the submission is cancelled. \\
-Si hay recursos disponibles el trabajo se ejecuta directamente, pero si no es así se encola. Cada trabajo tiene asignada una prioridad que determina el orden en que se ejecutan los trabajos de la cola cuando quedan recursos disponibles. Para determinar la prioridad de cada trabajo se ponderan 3 factores: el tiempo que lleva esperando en la cola (25%), la prioridad fija que tiene la cola(25%) y el fairshare del usuario (50%). \\
+If resources are available, the job is executed directly, but if not, it is queued. Each job is assigned a priority that determines the order in which the jobs in the queue are executed when resources are available. To determine the priority of each job, 3 factors are weighted: the time it has been waiting in the queue (25%), the fixed priority of the queue (25%) and the user's fairshare (50%). \\
-El fairshare es un cálculo dinámico que hace SLURM para cada usuario y es la diferencia entre los recursos asignados y los recursos consumidos a lo largo de los últimos 14 días.
+The fairshare is a dynamic calculation made by SLURM for each user and is the difference between the resources allocated and the resources consumed over the last 14 days.
 <code bash>
 hpc-login2 ~]$ sshare -l
@@ Line 289: / Line 313: @@
 user_name         100    0.071429        4833    0.001726    0.246436
 </code>
-# RawShares: es la cantidad de recursos en términos absolutos asignada al usuario. Es igual para todos los usuarios.\\
+# RawShares: The amount of resources allocated to the user in absolute terms. It is the same for all users.\\
-# NormShares: Es la cantidad anterior normalizada a los recursos asignados en total.\\
+# NormShares: This is the above amount normalised to the total allocated resources.\\
-# RawUsage: Es la cantidad de segundos/cpu consumida por todos los trabajos del usuario.\\
+# RawUsage: The number of CPU-seconds consumed by all of the user's jobs.\\
-# NormUsage: Cantidad anterior normalizada al total de segundos/cpu consumidos en el cluster.\\
+# NormUsage: RawUsage normalised to the total CPU-seconds consumed in the cluster.\\
-# FairShare: El factor FairShare entre 0 y 1. Cuanto mayor uso del cluster, más se aproximará a 0 y menor será la prioridad.\\
+# FairShare: The FairShare factor between 0 and 1. The higher the cluster usage, the closer to 0 and the lower the priority.\\
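The 25% / 25% / 50% weighting described in this section can be sketched as a weighted sum. The sketch assumes that each factor is already normalised to the range 0 to 1; the age and queue values in the example are made up, and only the FairShare value is taken from the ''sshare'' output above. ''job_priority'' is a hypothetical helper, and the real value is computed internally by SLURM.

```bash
#!/bin/bash
# Weighted sum of the three factors, each assumed to be in [0, 1]:
#   $1 = time waiting in the queue, $2 = queue priority, $3 = user fairshare
job_priority() {
    awk -v age="$1" -v qos="$2" -v fs="$3" \
        'BEGIN { printf "%.4f\n", 0.25 * age + 0.25 * qos + 0.50 * fs }'
}

# Assumed age factor 0.5, assumed queue factor 0.2, FairShare 0.246436
job_priority 0.5 0.2 0.246436
```

Because fairshare carries half of the weight, a user with heavy recent usage (FairShare close to 0) loses up to half of the attainable priority regardless of the queue chosen.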
-== Envío de trabajos ==
+== Job submission ==
+  - sbatch
   - salloc
   - srun
-  - sbatch
-. SALLOC \\
+. SBATCH \\
-Sirve para obtener de forma inmediata una asignación de recursos (nodos). En cuanto se obtiene se ejecuta el comando especificado o una shell en su defecto.
+Used to send a script to the queuing system. It is batch-processing and non-blocking.
-<code bash>
-# Obtener 5 nodos y lanzar un trabajo.
-hpc-login2 ~]$ salloc -N5 myprogram
-# Obtener acceso interactivo a un nodo (Pulsar Ctrl+D para terminar el acceso):
-hpc-login2 ~]$ salloc -N1
-</code>
-. SRUN \\
-Sirve para lanzar un trabajo paralelo ( es preferible a usar mpirun ). Es interactivo y bloqueante.
-<code bash>
-# Lanzar un hostname en 2 nodos
-hpc-login2 ~]$ srun -N2 hostname
-hpc-node1
-hpc-node2
-</code>
-. SBATCH \\
-Sirve para enviar un script al sistema de colas. Es de procesamiento por lotes y no bloqueante.
 <code bash>
 # Create the script:
-hpc-login2 ~]$ vim trabajo_ejemplo.sh
+hpc-login2 ~]$ vim test_job.sh
     #!/bin/bash
-    #SBATCH --job-name=prueba            # Job name
+    #SBATCH --job-name=test              # Job name
     #SBATCH --nodes=1                    # -N Run all processes on a single node
     #SBATCH --ntasks=1                   # -n Run a single task
@@ Line 328: / Line 336: @@
     #SBATCH --mem=1gb                    # Job memory request
     #SBATCH --time=00:05:00              # Time limit hrs:min:sec
-    #SBATCH --qos=urgent                 # Cola
+    #SBATCH --qos=urgent                 # Queue
-    #SBATCH --output=prueba_%j.log       # Standard output and error log
+    #SBATCH --output=test%j.log          # Standard output and error log
     echo "Hello World!"
-hpc-login2 ~]$ sbatch trabajo_ejemplo.sh
+hpc-login2 ~]$ sbatch test_job.sh
 </code>
+. SALLOC \\
+It is used to immediately obtain an allocation of resources (nodes). As soon as the allocation is granted, the specified command is executed, or a shell if no command is given.
+<code bash>
+# Get 5 nodes and launch a job.
+hpc-login2 ~]$ salloc -N5 myprogram
+# Get interactive access to a node (Press Ctrl+D to exit):
+hpc-login2 ~]$ salloc -N1
+# Get interactive EXCLUSIVE access to a node
+hpc-login2 ~]$ salloc -N1 --exclusive
+</code>
+. SRUN \\
+It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
+<code bash>
+# Launch the hostname command on 2 nodes
+hpc-login2 ~]$ srun -N2 hostname
+hpc-node1
+hpc-node2
+</code>
-==== Uso de los nodos con GPU ====
+==== GPU use ====
-Para solicitar específicamente una asignación de GPUs para un trabajo hay que añadir a sbatch o srun las opciones:
+To specifically request a GPU allocation for a job, options must be added to sbatch or srun:
-|  %%--gres%%  |  Solicitud de gpus por NODE  |  %%--gres=gpu[[:type]:count],...%%  |
+|  %%--gres%%  |  Request gpus per NODE  |  %%--gres=gpu[[:type]:count],...%%  |
-|  %%--gpus o -G%%  |  Solicitud de gpus por JOB  |  %%--gpus=[type]:count,...%%  |
+|  %%--gpus or -G%%  |  Request gpus per JOB  |  %%--gpus=[type:]count,...%%  |
-También existen las opciones %% --gpus-per-socket,--gpus-per-node y --gpus-per-task%%,\\
+There are also the options %%--gpus-per-socket, --gpus-per-node and --gpus-per-task%%.\\
 Examples:
 <code bash>
-## Ver la lista de nodos y gpus:
+## See the list of nodes and gpus:
 hpc-login2 ~]$ ver_recursos
-## Solicitar 2 GPU cualesquiera para un JOB, añadir:
+## Request any 2 GPUs for a JOB, add:
 --gpus=2
-## Solicitar una A100 de 40G en un nodo y una A100 de 80G en otro, añadir:
+## Request a 40G A100 at one node and an 80G A100 at another node, add:
 --gres=gpu:A100_40:1,gpu:A100_80:1
 </code>
-==== Monitorización de los trabajos ====
+==== Job monitoring ====
 <code bash>
-## Listado de todos los trabajos en la cola
+## List all jobs in the queue
 hpc-login2 ~]$ squeue
-## Listado de los trabajos de un usuario
+## List a user's jobs:
 hpc-login2 ~]$ squeue -u <login>
-## Cancelar un trabajo:
+## Cancel a job:
 hpc-login2 ~]$ scancel <JOBID>
-## Lista de trabajos recientes
+## List of recent jobs:
 hpc-login2 ~]$ sacct -b
-## Información histórica detallada de un trabajo:
+## Detailed historical information for a job:
 hpc-login2 ~]$ sacct -l -j <JOBID>
-## Información de debug de un trabajo para troubleshooting:
+## Debug information of a job for troubleshooting:
 hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
-## Ver el uso de recursos de un trabajo en ejecución:
+## View the resource usage of a running job:
 hpc-login2 ~]$ sstat <JOBID>
 </code>
-==== Controlar la salida de los trabajos ====
+==== Configure job output ====
-== Códigos de salida ==
+== Exit codes ==
-Por defecto estos son los códigos de salida de los comandos:
+By default, these are the exit codes of the commands:
 ^  SLURM command  ^  Exit code  ^
-|  salloc  |  0 en caso de éxito, 1 si no se puedo ejecutar el comando del usuario  |
+|  salloc  |  0 on success, 1 if the user's command cannot be executed  |
-|  srun  |  El más alto de entre todas las tareas ejecutadas o 253 para un error out-of-mem  |
+|  srun  |  The highest exit code among all executed tasks, or 253 for an out-of-memory error  |
-|  sbatch  |  0 en caso de éxito, si no, el código de salida correspondiente del proceso que falló  |
+|  sbatch  |  0 on success; otherwise, the exit code of the process that failed  |
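The srun row of the table can be turned into a small check after a run. This is only a sketch: describe_srun_exit is a hypothetical helper, and the srun call is left as a comment so that the snippet also runs outside the cluster.

```bash
#!/usr/bin/env bash
# describe_srun_exit CODE: hypothetical helper that maps an srun exit code
# to a short message, following the table above.
describe_srun_exit() {
    case "$1" in
        0)   echo "success" ;;
        253) echo "out-of-memory error" ;;
        *)   echo "task failed with exit code $1" ;;
    esac
}

# Possible use on the cluster (commented out so the sketch runs anywhere):
#   srun -N1 myprogram
#   describe_srun_exit "$?"
describe_srun_exit 253
```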
 == STDIN, STDOUT and STDERR ==
 **SRUN:**\\
-Por defecto stdout y stderr se redirigen de todos los TASKS a el stdout y stderr de srun, y stdin se redirecciona desde el stdin de srun a todas las TASKS. Esto se puede cambiar con:
+By default stdout and stderr are redirected from all TASKS to srun's stdout and stderr, and stdin is redirected from srun's stdin to all TASKS. This can be changed with:
-|  %%-i, --input=<opcion>%%    |
+|  %%-i, --input=<option>%%    |
-|  %%-o, --output=<opcion>%%   |
+|  %%-o, --output=<option>%%   |
-|  %%-e, --error=<opcion>%%   |
+|  %%-e, --error=<option>%%   |
-Y las opciones son:
+The options are:
-  * //all//: opción por defecto.
+  * //all//: the default option.
-  * //none//: No se redirecciona nada.
+  * //none//: Nothing is redirected.
-  * //taskid//: Solo se redirecciona desde y/o al TASK id especificado.
+  * //taskid//: Redirects only to and/or from the specified TASK id.
-  * //filename//: Se redirecciona todo desde y/o al fichero especificado.
+  * //filename//: Redirects everything to and/or from the specified file.
-  * //filename pattern//: Igual que filename pero con un fichero definido por un [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | patrón ]]
+  * //filename pattern//: Same as the filename option but with a file defined by a [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | pattern ]].
 **SBATCH:**\\
-Por defecto "/dev/null" está abierto en el stdin del script y stdout y stderror se redirigen a un fichero de nombre "slurm-%j.out". Esto se puede cambiar con:
+By default, "/dev/null" is opened on the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with:
 |  %%-i, --input=<filename_pattern>%%  |
 |  %%-o, --output=<filename_pattern>%%  |
 |  %%-e, --error=<filename_pattern>%%  |
-La referencia de filename_pattern está [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | aquí ]].
+The filename_pattern reference is [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | here ]].
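To make the filename patterns more concrete, the sketch below imitates how %j (job ID) and %x (job name) are replaced. expand_pattern is a hypothetical helper written for illustration, the job ID 12345 is made up, and SLURM itself supports more specifiers than the two handled here.

```bash
#!/usr/bin/env bash
# expand_pattern PATTERN JOBID JOBNAME: hypothetical helper that imitates how
# SLURM replaces %j (job ID) and %x (job name) in a filename pattern.
# Only these two specifiers are handled.
expand_pattern() {
    printf '%s\n' "$1" | sed -e "s/%j/$2/g" -e "s/%x/$3/g"
}

# The default sbatch output name, for a made-up job ID:
expand_pattern "slurm-%j.out" 12345 test
# A pattern that combines job name and job ID:
expand_pattern "%x_%j.log" 12345 test
```

The two calls print slurm-12345.out and test_12345.log respectively.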
-==== Envío de correos ====
+==== Sending mail ====
-Se pueden configurar los JOBS para que envíen correos en determinadas circunstancias usando estos dos parámetros (**SON NECESARIOS AMBOS**):
+JOBS can be configured to send mail in certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
-|  %%--mail-type=<type>%%  |  Opciones: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  |
+|  %%--mail-type=<type>%%  |  Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  |
-|  %%--mail-user=<user>%%  |  La dirección de correo de destino.  |
+|  %%--mail-user=<user>%%  |  The destination email address.  |
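Since both parameters have to be given together, one option is to build them in a single step. The sketch below is an illustration only: mail_flags is a hypothetical helper, user@example.org is a placeholder address, and the sbatch call is left as a comment.

```bash
#!/usr/bin/env bash
# mail_flags TYPE ADDRESS: hypothetical helper that prints both mail options
# together and fails if either value is missing (both are required).
mail_flags() {
    if [ -z "$1" ] || [ -z "$2" ]; then
        echo "mail_flags: both a type and an address are needed" >&2
        return 1
    fi
    printf -- '--mail-type=%s --mail-user=%s\n' "$1" "$2"
}

# Several types can be combined in a comma-separated list.
mail_flags END,FAIL user@example.org
# On the cluster the result could be passed to sbatch, for example:
#   sbatch $(mail_flags END,FAIL user@example.org) test_job.sh
```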
-==== Estados de los trabajos en el sistema de colas ====
+==== Status of Jobs in the queuing system ====
 <code bash>
 hpc-login2 ~]# squeue -l
 JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
   defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1
+## Check status of queue use:
+hpc-login2 ~]$ estado_colas.sh
+JOBS PER USER:
+--------------
+       usuario.uno:  3
+       usuario.dos:  1
+JOBS PER QOS:
+--------------
+             regular:  3
+                long:  1
+JOBS PER STATE:
+--------------
+             RUNNING:  3
+             PENDING:  1
+==========================================
+Total JOBS in cluster:  4
 </code>
-Estados (STATE) más comunes de un trabajo:
+Common job states:
   * R RUNNING Job currently has an allocation.
   * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
@@ Line 416: / Line 462: @@
   * PD PENDING Job is awaiting resource allocation.
-[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Lista completa de posibles estados de un trabajo ]].\\
+[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Full list of possible job statuses ]].\\
-Si un trabajo no está en ejecución aparecerá una razón debajo de REASON:[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | Lista de las razones ]] por las que un trabajo puede estar esperando su ejecución.
+If a job is not running, the reason is shown in the NODELIST(REASON) column. See the [[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | list of reasons ]] why a job may be awaiting execution.
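A per-state summary similar to the "JOBS PER STATE" block shown earlier can be approximated with standard tools. This is a sketch and not the implementation of estado_colas.sh: count_states is a hypothetical helper, and the input below is made-up sample data so that the snippet runs without SLURM.

```bash
#!/usr/bin/env bash
# count_states: hypothetical helper that reads one job state per line on
# stdin and prints how many jobs are in each state.
count_states() {
    sort | uniq -c | awk '{ printf "%s: %d\n", $2, $1 }'
}

# Made-up sample input; on the cluster it could come from:
#   squeue -h -o "%T" | count_states
printf 'RUNNING\nPENDING\nRUNNING\nRUNNING\n' | count_states
```

Here squeue -h suppresses the header and -o "%T" prints only the state of each job.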