====== GPGPU computation servers ======

===== Service description =====

==== Servers with free access GPUs ====

  * ''ctgpgpu4'':
    * PowerEdge R730
    * 2 x [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]]
    * 128 GB RAM (4 DDR4 DIMM 2400 MHz)
    * 2 x Nvidia GP102GL 24 GB [Tesla P40]
    * AlmaLinux 9.1
    * CUDA 12.0
    * **Mandatory use of the Slurm queue manager**.
  * HPC cluster servers: [[ en:centro:servizos:hpc | HPC cluster ]]
  * CESGA servers: [[ en:centro:servizos:cesga | Access procedure info ]]

==== Restricted access GPU servers ====

  * ''ctgpgpu5'':
    * PowerEdge R730
    * 2 x [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]]
    * 128 GB RAM (4 DDR4 DIMM 2400 MHz)
    * 2 x Nvidia GP102GL 24 GB [Tesla P40]
    * Ubuntu 18.04
    * **Mandatory use of the Slurm queue manager**.
    * **Modules for library version management**.
    * CUDA 11.0
    * OpenCV 2.4 and 3.4
    * Atlas 3.10.3
    * MAGMA
    * TensorFlow
    * Caffe
  * ''ctgpgpu6'':
    * SIE LADON 4214 server
    * 2 x [[https://ark.intel.com/content/www/us/en/ark/products/193385/intel-xeon-silver-4214-processor-16-5m-cache-2-20-ghz.html|Intel Xeon Silver 4214]]
    * 192 GB RAM (12 DDR4 DIMM 2933 MHz)
    * Nvidia Quadro P6000 24 GB (2018)
    * Nvidia Quadro RTX 8000 48 GB (2019)
    * CentOS 7.7
    * Nvidia driver 418.87.00 for CUDA 10.1
    * Docker 19.03
    * [[https://github.com/NVIDIA/nvidia-docker | Nvidia-docker ]]
  * ''ctgpgpu9'':
    * Dell PowerEdge R750
    * 2 x [[ https://ark.intel.com/content/www/es/es/ark/products/215274/intel-xeon-gold-6326-processor-24m-cache-2-90-ghz.html |Intel Xeon Gold 6326 ]]
    * 128 GB RAM
    * 2 x NVIDIA Ampere A100 80 GB
    * AlmaLinux 8.6
    * NVIDIA driver 515.48.07 and CUDA 11.7
  * ''ctgpgpu10'':
    * PowerEdge R750
    * 2 x [[ https://ark.intel.com/content/www/es/es/ark/products/215272/intel-xeon-gold-5317-processor-18m-cache-3-00-ghz.html |Intel Xeon Gold 5317 ]]
    * 128 GB RAM
    * NVIDIA Ampere A100 80 GB
    * AlmaLinux 8.7
    * NVIDIA driver 525.60.13 and CUDA 12.0
  * ''ctgpgpu11'':
    * Gigabyte G482-Z54 server
    * 2 x [[ https://www.amd.com/es/products/cpu/amd-epyc-7413 | AMD EPYC 7413 @ 2.65 GHz, 24 cores ]]
    * 256 GB RAM
    * 4 x NVIDIA Ampere A100 80 GB
    * AlmaLinux 9.1
    * NVIDIA driver 520.61.05 and CUDA 11.8
  * ''ctgpgpu12'':
    * Dell PowerEdge R760 server
    * 2 x [[ https://ark.intel.com/content/www/xl/es/ark/products/232376.html |Intel Xeon Silver 4410Y ]]
    * 384 GB RAM
    * 2 x NVIDIA Hopper H100 80 GB
    * AlmaLinux 9.2
    * NVIDIA driver 535.104.12 and CUDA 12.2

===== Activation =====

Not all servers are freely available. Access must be requested by filling in the [[https://citius.usc.es/dashboard/enviar-incidencia|requests and problem reporting form]]. Users without access permission will receive an incorrect password error message.

===== User Manual =====

==== How to connect to the servers ====

Use SSH. Hostnames and IP addresses are:

  * ctgpgpu4.inv.usc.es - 172.16.242.201:22
  * ctgpgpu5.inv.usc.es - 172.16.242.202:22
  * ctgpgpu6.inv.usc.es - 172.16.242.205:22
  * ctgpgpu9.inv.usc.es - 172.16.242.94:22
  * ctgpgpu10.inv.usc.es - 172.16.242.95:22
  * ctgpgpu11.inv.usc.es - 172.16.242.96:22
  * ctgpgpu12.inv.usc.es - 172.16.242.97:22

Connection is only possible from inside the CITIUS network. To connect from other places or from the RAI network it is necessary to use the [[https://wiki.citius.usc.es/en:centro:servizos:vpn:start | VPN]] or the [[https://wiki.citius.usc.es/en:centro:servizos:pasarela_ssh|SSH gateway]].
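As an illustration, a minimal connection sketch is shown below. The username ''jane.doe'' and the ''<gateway-host>'' name are placeholders, not real values; check the SSH gateway page for the actual gateway hostname.

<code bash>
# Direct connection from inside the CITIUS network
# (jane.doe is a placeholder for your CITIUS username)
ssh jane.doe@ctgpgpu9.inv.usc.es

# From outside, jump through the SSH gateway with -J;
# <gateway-host> is a placeholder, see the SSH gateway page for the real hostname
ssh -J jane.doe@<gateway-host> jane.doe@ctgpgpu9.inv.usc.es
</code>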
==== Automatic server power off ====

The servers switch themselves off after an hour of being idle. To switch them on again, use the [[https://wiki.citius.usc.es/en:centro:servizos:acendido_remoto_de_equipos_wake_on_lan|remote power-on service]]. Servers won't switch themselves off if there is an open SSH or Screen session.

==== Job management with SLURM ====

On servers with queue management software installed, its use is mandatory for submitting jobs: it avoids conflicts between processes, since two jobs should not be executed at the same time.

To send a job to the queue, use the ''srun'' command:

<code bash>
srun cuda_program arguments_of_cuda_program
</code>

The ''srun'' process waits until the job has been executed before returning control to the user. If you don't want to wait, a console session manager such as ''screen'' can be used; this way you can leave the job in the queue and disconnect the session without losing the output of the job, which can be recovered at any later moment. Alternatively, ''nohup'' can be used and the job sent to the background with ''&''. In that case the output is written to the file ''nohup.out'':

<code bash>
nohup srun cuda_program cuda_program_arguments &
</code>

To check the queue status, use the ''squeue'' command. It shows an output similar to this one:

<code>
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    9 servidore ca_water pablo.qu PD  0:00     1 (Resources)
   10 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   11 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   12 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   13 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   14 servidore ca_water pablo.qu PD  0:00     1 (Priority)
    8 servidore ca_water pablo.qu  R  0:11     1 ctgpgpu2
</code>

An interactive view, refreshed every second, can be obtained with the ''smap'' command:

<code bash>
smap -i 1
</code>
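For longer jobs it can be convenient to write a small batch script and submit it with ''sbatch'', the standard SLURM batch submission command, instead of keeping an interactive ''srun'' session open. The script below is only a minimal sketch: the job name, output file and program name are placeholders, and the ''%%--gres%%'' request assumes the server's SLURM instance has GPU resources configured, which may differ per machine.

<code bash>
#!/bin/bash
#SBATCH --job-name=cuda_job        # job name (placeholder)
#SBATCH --output=cuda_job_%j.out   # output file, %j expands to the job id
#SBATCH --gres=gpu:1               # request one GPU (assumes GRES is configured)

# Run the CUDA program inside the allocation; program and arguments are placeholders
./cuda_program arguments_of_cuda_program
</code>

The script would then be submitted with ''sbatch job.sh'' and its progress followed with ''squeue'', exactly as for jobs started with ''srun''.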