====== High Performance Computing (HPC) cluster ctcomp3 ======
[[ https://web.microsoftstream.com/video/f5eba154-b597-4440-9307-3befd7597d78 | Video of the presentation of the service (7/3/22) (Spanish only) ]]
===== Description =====
The computing part of the cluster is made up of:
  * 9 servers for general computing.
  * 1 "fat node" for memory-intensive jobs.
  * 4 servers for GPU computing.

Users only have direct access to the login node, which has more limited features and should not be used for computing. \\
All nodes are interconnected by a 10 Gb network. \\
There is distributed storage accessible from all nodes, with 220 TB of capacity, connected by a dual 25 Gb fibre network. \\
^ Name ^ Model ^ Processor ^ Memory ^
| hpc-login2 | | | |
| hpc-node[1-2] | | | |
| hpc-node[3-9] | | | |
| hpc-fat1 | Dell R840 | | |
| hpc-gpu[1-2] | Dell R740 | | |
| hpc-gpu3 | Dell R7525 | 2 x AMD EPYC 7543 @2.80 GHz (32c) | 256 GB |
| hpc-gpu4 | | | |
===== Accessing the cluster =====
Access to the cluster must be requested in advance. \\
The access is done through an SSH connection to the login node:
<code bash>
ssh <username>@hpc-login2
</code>
===== Storage, directories and filesystems =====
<note warning> None of the file systems in the cluster are backed up!!!</note>
The HOME of the users in the cluster is on the shared file system, so it is accessible from all nodes in the cluster. Its path is defined in the environment variable %%$HOME%%. \\
Each node has a local 1 TB scratch partition, which is deleted at the end of each job. It can be accessed through the %%$LOCAL_SCRATCH%% environment variable in the scripts. \\
For data to be shared by groups of users, you must request the creation of a folder in the shared storage that will only be accessible by the members of the group.\\
^ Directory ^ Environment variable ^
| Home * | %%$HOME%% |
| Local scratch | %%$LOCAL_SCRATCH%% |
| Group folder * | |
%%* shared storage%%
=== WARNING ===
The shared file system performs poorly when working with many small files. To improve performance in such scenarios, create a file system inside an image file and mount it to work directly on it. The procedure is as follows:
  * Create the image file in your home folder:
<code bash>
## truncate image.name -s SIZE_IN_BYTES
truncate example.ext4 -s 20G
</code>
  * Create a filesystem in the image file:
<code bash>
## mkfs.ext4 -T small -m 0 image.name
## -T small: options optimized for small files
## -m 0: do not reserve capacity for the root user
mkfs.ext4 -T small -m 0 example.ext4
</code>
  * Mount the image (using SUDO) with the script //mount_image.py//:
<code bash>
## By default it is mounted at /...
sudo mount_image.py example.ext4
</code>
  * To unmount the image, use the corresponding unmount script.
The mount script has these options:
<code>
--mount-point path <-- (optional) creates subdirectories under /...
--rw               <-- (optional) by default the image is mounted read-only; this option mounts it read-write
</code>
<note warning> Do not mount the image file read-write from more than one node!!!</note>
The unmount script has these options:
<code>
--mount-point path <-- (optional)
</code>
===== Transference of files and data =====
=== SCP ===
From your local machine to the cluster:
<code bash>
scp filename <username>@hpc-login2:<destination_path>
</code>
From the cluster to your local machine:
<code bash>
scp filename <username>@<local_machine>:<destination_path>
</code>
See the scp man page for more options.
=== SFTP ===
To transfer several files or to navigate through the filesystem:
<code bash>
sftp <username>@hpc-login2
sftp> ls
sftp> cd <path>
sftp> put <filename>
sftp> get <filename>
sftp> quit
</code>
See the sftp man page for more options.
=== RSYNC ===
rsync synchronizes files and directories between the local machine and the cluster, transferring only the differences. See the rsync man page for details.
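As an illustrative sketch (the directory name is just an example), a typical upload of a data directory with resumable transfer:
<code bash>
## -a preserves permissions and timestamps, -v is verbose, -P shows
## progress and allows resuming interrupted transfers.
rsync -avP mydata/ <username>@hpc-login2:~/mydata/
</code>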
=== SSHFS ===
Requires local installation of the sshfs package.\\
Allows, for example, mounting the user's local home in hpc-login2:
<code bash>
## Mount
sshfs <username>@<local_machine>:<remote_path> <mount_point>
## Unmount
fusermount -u <mount_point>
</code>
See the sshfs man page for more options.
===== Available Software =====
All nodes have the basic software that is installed by default in AlmaLinux 8.4, in particular:
  * GCC 8.5.0
  * Python 3.6.8
  * Perl 5.26.3
GPU nodes have, in addition:
  * nVidia Driver 510.47.03
  * CUDA 11.6
  * libcudnn 8.7
To use any other software not installed on the system, or a version other than the system one, there are three options:
  - Use modules (Lmod)
  - Use a container (uDocker or Apptainer/Singularity)
  - Use Conda
A module is the simplest solution for using software without modifications or hard-to-satisfy dependencies.\\
A container is ideal when the dependencies are complicated or when full control over the environment is needed.\\
Conda is the best solution if you need the latest version of a library or program, or packages not otherwise available.\\
==== Modules/Lmod ====
[[ https://lmod.readthedocs.io/en/latest/ | Lmod documentation ]]
<code bash>
# See available modules:
module avail
# Load a module:
module load <module_name>
# Unload a module:
module unload <module_name>
# List modules loaded in your environment:
module list
# ml can be used as a shorthand of the module command:
ml avail
# To get info about a module:
ml spider <module_name>
</code>
==== Software containers execution ====
=== uDocker ===
[[ https://indigo-dc.github.io/udocker/ | uDocker documentation ]]
udocker is installed as a module, so it needs to be loaded into the environment:
<code bash>
ml uDocker
</code>
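As a sketch of a typical uDocker workflow (the image and container names are just examples):
<code bash>
## Pull an image from Docker Hub:
udocker pull ubuntu:22.04
## Create a named container from the image:
udocker create --name=mycontainer ubuntu:22.04
## Run a command inside the container:
udocker run mycontainer cat /etc/os-release
</code>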
=== Apptainer/Singularity ===
[[ https://apptainer.org/docs/user/main/ | Apptainer documentation ]]
Apptainer/Singularity is installed directly on the system of every node, so no module needs to be loaded to use it.
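As a sketch of typical Apptainer usage (the image name is just an example):
<code bash>
## Pull an image from Docker Hub into a SIF file:
apptainer pull docker://almalinux:8
## Run a command inside the container:
apptainer exec almalinux_8.sif cat /etc/os-release
</code>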
==== CONDA ====
[[ https://docs.conda.io/en/latest/ | Conda documentation ]]
Miniconda is the minimal installer of Conda, including only the conda package manager and its dependencies:
<code bash>
# Get miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh
# Install it
sh Miniconda3-py39_4.11.0-Linux-x86_64.sh
</code>
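Once installed, environments are created and used as usual; an illustrative sketch (the environment and package names are examples):
<code bash>
# Create an environment with a specific Python version:
conda create -n myenv python=3.10
# Activate it and install packages:
conda activate myenv
conda install numpy
</code>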
===== Using SLURM =====
The cluster queue manager is [[ https://slurm.schedmd.com/documentation.html | SLURM ]]. \\
<note tip>The term CPU identifies a physical core in a socket. Hyperthreading is disabled, so each node has as many available CPUs as (number of sockets) * (number of physical cores per socket).</note>
== Available resources ==
<code bash>
# Global status of the cluster (core and memory usage on each node):
hpc-login2 ~]$ ver_estado.sh
# List of node resources (CPUs, memory, GPUs); ver_recursos is an alias
# for the equivalent sinfo command:
hpc-login2 ~]$ ver_recursos
NODELIST        CPUS   MEMORY(MB)   ...
hpc-fat1         ...
hpc-gpu[1-2]     ...
hpc-gpu3         ...
hpc-gpu4         ...
hpc-node[1-2]     36       187645   ...
hpc-node[3-9]     48       187645   ...
# Current resource use (CPUs as Allocated/Idle/Other/Total per node);
# ver_uso is an alias for the equivalent sinfo command:
hpc-login2 ~]$ ver_uso
</code>
==== Nodes ====
A node is SLURM's computing unit and corresponds to one physical server.
<code bash>
# Show node info:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=36 CPULoad=0.00
   ...
</code>
==== Partitions ====
Partitions in SLURM are logical groups of nodes. In the cluster there is a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.
<code bash>
# Show partition info:
hpc-login2 ~]$ sinfo
PARTITION          AVAIL   ...
defaultPartition*  up      ...
</code>
==== Jobs ====
Jobs in SLURM are resource allocations to a user for a given time. Jobs are identified by a sequential number or JOBID. \\
A JOB consists of one or more STEPS, each consisting of one or more TASKS that use one or more CPUs. There is one STEP for each program that executes sequentially in a JOB, and one TASK for each program that executes in parallel. Therefore, in the simplest case, such as a job that just runs the hostname command, the JOB has a single STEP and a single TASK. A sketch of a less trivial case is shown below.
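As an illustrative sketch, a job script with two sequential STEPS, the second one running four parallel TASKS:
<code bash>
#!/bin/bash
#SBATCH --ntasks=4

# STEP 1: a single TASK
srun -n1 hostname
# STEP 2: four TASKS running in parallel
srun -n4 hostname
</code>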
==== Queue system (QOS) ====
The queue to which each job is submitted defines its priority, its limits and also the relative "cost" of its use.
<code bash>
# Show the queues:
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the relevant info:
hpc-login2 ~]$ ver_colas
       Name   Priority   ...
 ---------- ----------   ...
    regular        ...
interactive        ...
     urgent        ...
       long        ...
        ...
</code>
# Priority: the relative priority of each queue. \\
# DenyOnLimit: jobs that do not respect the queue limits are rejected. \\
# UsageFactor: the relative cost of running in the queue. \\
# MaxTRES: limits applied per job. \\
# MaxWall: maximum time the job can run. \\
# MaxTRESPU: global limits per user. \\
# MaxJobsPU: maximum number of jobs a user can have running simultaneously. \\
# MaxSubmitPU: maximum number of jobs a user can have submitted (queued or running) at the same time. \\
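For example, to submit a job to a queue other than the default one (the script name is illustrative):
<code bash>
## Submit to the urgent queue:
sbatch --qos=urgent job.sh
</code>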

==== Sending a job to the queue system ====
== Requesting resources ==
By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it one node, one CPU and 4 GB of RAM. The time limit for job execution is that of the queue (4 days and 4 hours).
This is very inefficient; you should always specify at least:
  - %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
  - %%Memory (--mem) per node or memory per cpu (--mem-per-cpu).%%
  - %%Job execution time ( --time )%%
A minimal example is shown after the table below. In addition, it may be interesting to add the following parameters:
| -J | %%--job-name%% | Job name |
| -q | %%--qos%% | Queue (QOS) the job is submitted to |
| -o | %%--output%% | File where the job output is written |
| -C | %%--constraint%% | Request nodes with specific features |
|    | %%--exclusive%% | Request exclusive use of the node(s) |
| -w | %%--nodelist%% | Run on a specific list of nodes |
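A minimal sketch of an explicit resource request (the script name and the values are illustrative):
<code bash>
## 1 node, 4 tasks with 2 CPUs each, 8 GB per node, 2 hour limit:
sbatch -N1 -n4 -c2 --mem=8G --time=02:00:00 -J mytest job.sh
</code>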
== How resources are allocated ==
The default allocation method between nodes is block allocation (all available cores on a node are allocated before using another node). The default allocation method within a node is cyclic allocation (cores are distributed evenly among the sockets of the node).

== Priority calculation ==
When a job is submitted to the queue system, SLURM checks whether the requested resources are available.
If resources are available, the job is executed directly, but if not, it is queued. Each job is assigned a priority that determines the order in which queued jobs are executed when resources become available. The priority depends mainly on the queue, the time spent waiting and the user's fairshare.
The fairshare is a dynamic calculation made by SLURM for each user: it reflects the difference between the resources allocated to the user and the resources that user has actually consumed.
<code bash>
hpc-login2 ~]$ sshare
       User  RawShares  NormShares    RawUsage  FairShare
 ---------- ---------- ----------- ----------- ----------
                     1    0.500000         ...
  user_name        ...
</code>
# RawShares: the amount of resources allocated to the user in absolute terms. It is the same for all users.\\
# NormShares: RawShares normalised to the total amount of allocated shares.\\
# RawUsage: the number of seconds/cpu the user has consumed.\\
# NormUsage: RawUsage normalised to the total seconds/cpu consumed in the cluster.\\
# FairShare: the FairShare factor, between 0 and 1. The higher the user's cluster usage, the closer to 0 and the lower the resulting priority.\\
+ | |||
+ | == Job submission == | ||
+ | - sbatch | ||
+ | - salloc | ||
+ | - srun | ||
+ | |||
+ | 1. SBATCH \\ | ||
+ | Used to send a script | ||
+ | < | ||
+ | # Crear el script: | ||
+ | hpc-login2 ~]$ vim test_job.sh | ||
+ | # | ||
+ | # | ||
+ | #SBATCH --nodes=1 | ||
+ | #SBATCH --ntasks=1 | ||
+ | #SBATCH --cpus-per-task=1 | ||
+ | #SBATCH --mem=1gb | ||
+ | #SBATCH --time=00: | ||
+ | #SBATCH --qos=urgent | ||
+ | #SBATCH --output=test%j.log | ||
+ | |||
+ | echo "Hello World!" | ||
+ | |||
+ | hpc-login2 ~]$ sbatch test_job.sh | ||
</ | </ | ||
2. SALLOC \\
Used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.
<code bash>
# Get 5 nodes and launch a job:
hpc-login2 ~]$ salloc -N5 myprogram
# Get an interactive session on a node:
hpc-login2 ~]$ salloc -N1
</code>
3. SRUN \\
Used to launch a parallel program (preferable to using mpirun). It is interactive and blocking.
<code bash>
# Launch the hostname command on 2 nodes:
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
</code>
==== GPU use ====
To specifically request a GPU allocation for a job, options must be added to sbatch or srun:
| %%--gres%% | Request GPUs per NODE | %%--gres=gpu[:type]:count%% |
| %%--gpus%% or -G | Request GPUs per JOB | %%--gpus=[type:]count%% |
There are also the options %% --gpus-per-socket, --gpus-per-node and --gpus-per-task%%.
Examples:
<code bash>
## See the list of nodes and gpus:
hpc-login2 ~]$ ver_recursos
## Request any 2 GPUs for a JOB, add:
--gpus=2
## Request a 40G A100 at one node and an 80G A100 at another node, add:
--gres=gpu:...
</code>
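As a quick interactive check that a GPU has actually been allocated (nvidia-smi lists the GPUs visible to the job):
<code bash>
hpc-login2 ~]$ srun --gpus=1 nvidia-smi
</code>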
==== Job monitoring ====
<code bash>
## List all jobs in the queue:
hpc-login2 ~]$ squeue
## List a user's jobs:
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <JOBID>
## List recent jobs:
hpc-login2 ~]$ sacct -b
## Detailed historical information for a job:
hpc-login2 ~]$ sacct -l -j <JOBID>
## Debug information of a job for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
## View the resource usage of a running job:
hpc-login2 ~]$ sstat <JOBID>
</code>
==== Configure job output ====
== Exit codes ==
By default, these are the exit codes of the commands:
^ SLURM command ^ Exit code ^
| salloc | The exit code of the executed command or shell |
| srun | The highest among all executed tasks, or 253 for an out-of-mem error |
| sbatch | The exit code of the batch script |

== STDIN, STDOUT and STDERR ==
**SRUN:**\\
By default stdout and stderr are redirected from all TASKS to srun's stdout and stderr, and stdin is redirected from srun's stdin to all TASKS. This can be changed with:
| %%-i, --input=<option>%% |
| %%-o, --output=<option>%% |
| %%-e, --error=<option>%% |
And the options are:
  * //all//: the default.
  * //none//: nothing is redirected.
  * //taskid//: redirects only to and/or from the specified TASK id.
  * //filename//: redirects everything to and/or from the specified file.
  * //filename pattern//: same as the filename option, but with a file name defined by a pattern. An example is shown below.
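For instance, to get one output file per TASK using the %%%t%% (task identifier) pattern:
<code bash>
## Two tasks, each writing its output to result_<taskid>.out:
srun -n2 --output=result_%t.out hostname
</code>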

**SBATCH:**\\
By default %%"/dev/null"%% is open on the script's stdin, and stdout and stderr are redirected to a file named %%"slurm-%j.out"%%. This can be changed with:
| %%-i, --input=<filename_pattern>%% |
| %%-o, --output=<filename_pattern>%% |
| %%-e, --error=<filename_pattern>%% |
The reference for the filename patterns can be found in the [[ https://slurm.schedmd.com/sbatch.html | sbatch documentation ]].

==== Sending mail ====
JOBS can be configured to send mail in certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
| %%--mail-type=<type>%% | Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_80, TIME_LIMIT_50, ARRAY_TASKS |
| %%--mail-user=<email>%% | The destination email address |

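A sketch of the corresponding directives inside a job script (the address is illustrative):
<code bash>
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=username@example.com
</code>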
==== Status of Jobs in the queuing system ====
<code bash>
hpc-login2 ~]$ squeue -l
JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
 6547 defaultPa      ...

## Check the status of queue use:
hpc-login2 ~]$ estado_colas.sh
JOBS PER USER:
--------------
    <username>: 1
    ...

JOBS PER QOS:
--------------
          long: 1
    ...

JOBS PER STATE:
--------------
    ...

==========================================
Total JOBS in cluster: ...
</code>
Common job states:
  * R RUNNING: the job currently has an allocation.
  * CD COMPLETED: the job has terminated all processes on all nodes with an exit code of zero.
  * F FAILED: the job terminated with a non-zero exit code or another failure condition.
  * PD PENDING: the job is awaiting resource allocation.

The complete list of job state codes is in the [[ https://slurm.schedmd.com/squeue.html | squeue documentation ]].

If a job is not running, a reason will be displayed under REASON; the list of reason codes is also in the [[ https://slurm.schedmd.com/squeue.html | squeue documentation ]].