====== CIQUS Cluster ======

=== Location in the data center ===

{{ :sysadm-private:clusterciqus:rc2_-_new_page.png?nolink&800x1200 |}}

===== Berta's Cluster =====

==== Servers ====

^ Name ^ Service Tag ^ Model ^ IP ^ Notes ^
| dell | 1BS314J | Dell 2950 | eth0:192.168.0.254/24 192.168.1.252/24 eth1:172.16.247.180/24 | |
| dell1 | D33414J | Dell 2950 | eth0:192.168.0.101/24 eth1: | |
| dell2 | F33414J | Dell 2950 | eth0:192.168.0.102/24 eth1: | |
| dell3 | 233414J | Dell 2950 | eth0:192.168.0.103/24 eth1: | |
| dell4 | H33414J | Dell 2950 | eth0:192.168.0.104/24 eth1: | |
| dell5 | J33414J | Dell 2950 | eth0:192.168.0.105/24 eth1: | |
| dell6 | 733414J | Dell 2950 | eth0:192.168.0.106/24 eth1: | |
| dell7 | 633414J | Dell 2950 | eth0:192.168.0.107/24 eth1: | |
| dell8 | C23414J | Dell 2950 | eth0:192.168.0.108/24 eth1: | |
| dell9 | 433414J | Dell 2950 | eth0:192.168.0.109/24 eth1: | |
| dell10 | H23414J | Dell 1950 | eth0:192.168.0.110/24 eth1: | |
| dell11 | JHB714J | Dell 1950 | eth0:192.168.0.111/24 eth1: | |
| dell12 | B33414J | Dell 1950 | eth0:192.168.0.112/24 eth1: | |
| dell13 | G23414J | Dell 1950 | eth0:192.168.0.113/24 eth1: | |
| dell14 | 243414J | Dell 1950 | eth0:192.168.0.114/24 eth1: | Error: E171F PCIE Fatal Err B0 DE F0. The XFS filesystem on sda2 is corrupted. Is the PCIe error causing the filesystem problems? |
| dell15 | F23414J | Dell 1950 | eth0:192.168.0.115/24 eth1: | |
| dell16 | 933414J | Dell 1950 | eth0:192.168.0.116/24 eth1: | Wrong root password. Network configuration error. |
| dell17 | B0NQX4J | Dell 1950 | eth0:192.168.0.117/24 eth1: | |
| dell18 | D0NQX4J | Dell 1950 | eth0:192.168.0.118/24 eth1: | Wrong root password. Network configuration error. |
| dell19 | C0NQX4J | Dell 1950 | eth0:192.168.0.119/24 eth1: | |
| dell20 | BCTQX4J | Dell 1950 | eth0:192.168.0.120/24 eth1: | |
| dell21 | 6RBGY4J | Dell 1950 | eth0:192.168.0.121/24 eth1: | Wrong root password. Network configuration error. |
| dell22 | 3RBGY4J | Dell 1950 | eth0:192.168.0.122/24 eth1: | |
| dell23 | 4RBGY4J | Dell 1950 | eth0:192.168.0.123/24 eth1: | Wrong root password. Network configuration error. |
| dell24 | 2RBGY4J | Dell 1950 | eth0:192.168.0.124/24 eth1: | |
| dell25 | 5RBGY4J | Dell 1950 | eth0:192.168.0.125/24 eth1: | |
| dell26 | 7RBGY4J | Dell 1950 | eth0:192.168.0.126/24 eth1: | |
| dell27 | CL3W102 | R420 | eth0:192.168.0.127/24 eth1: | Error: PCIe training error. The board was replaced. |
| dell28 | 7M3W102 | R420 | eth0:192.168.0.128/24 eth1: | |

==== Master ====

The OS is Scientific Linux SL release 5.2 (Boron). \\
The external access IP is 172.16.247.180 on eth1 (at CIQUS it was 172.16.249.180). It has an SNAT to 193.144.81.138 defined (22/tcp (ssh) reachable by everyone). \\
The cluster's internal network is 192.168.0.0/24 on eth0. \\
The address 192.168.1.252 on eth0 is probably a leftover from another cluster. It is not used for anything, although resolv.conf references a DNS server on that network. \\

Partitions:

  Filesystem            Size  Used Avail Use% Mounted on
  /dev/sda1             7.8G  4.2G  3.2G  57% /
  /dev/mapper/VG0-home  296G  140G  141G  50% /home
  tmpfs                 3.9G     0  3.9G   0% /dev/sh

IPTABLES:

  #*nat
  #:PREROUTING ACCEPT [181641:15178573]
  #:POSTROUTING ACCEPT [45392:2667940]
  #:OUTPUT ACCEPT [45392:2667940]
  #-A POSTROUTING -s 192.168.0.0/255.255.255.0 -o eth0 -j SNAT --to-source 172.16.249.180
  # the previous rule did not work, so I replaced it with the following one
  #-A POSTROUTING -j MASQUERADE
  #COMMIT
  *filter
  :INPUT ACCEPT [0:0]
  :FORWARD ACCEPT [0:0]
  :OUTPUT ACCEPT [0:0]
  :RH-Firewall-1-INPUT - [0:0]
  -A INPUT -j RH-Firewall-1-INPUT
  #-A FORWARD -j RH-Firewall-1-INPUT
  # Internal network
  -A RH-Firewall-1-INPUT -s 192.168.0.0/24 -i eth0 -j ACCEPT
  # BMC network
  -A RH-Firewall-1-INPUT -s 192.168.1.0/24 -i eth0 -j ACCEPT
  # qfqcpc06
  -A RH-Firewall-1-INPUT -s 193.144.87.45 -p tcp --dport 22 -j ACCEPT
  # gdrqcluster
  -A RH-Firewall-1-INPUT -s 193.144.87.51 -p tcp --dport 22 -j ACCEPT
  # gdrqcluster32
  -A RH-Firewall-1-INPUT -s 193.144.87.53 -p tcp --dport 22 -j ACCEPT
  # gdrqcluster64
  -A RH-Firewall-1-INPUT -s 193.144.87.52 -p tcp --dport 22 -j ACCEPT
  # Javier's home IP
  -A RH-Firewall-1-INPUT -s 83.165.109.52 -p tcp --dport 22 -j ACCEPT
  # qfqcpc05
  -A RH-Firewall-1-INPUT -s 193.144.87.44 -p tcp --dport 22 -j ACCEPT
  # pcsistema1.cesga.es
  -A RH-Firewall-1-INPUT -s 193.144.44.144 -p tcp --dport 22 -j ACCEPT
  #
  # Protection against ssh attacks
  #
  # If the connection is new, it is added to the SSH_LIST list.
  -A RH-Firewall-1-INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --set --name SSH_LIST --rsource
  # Update the list of new connections, keeping only the entries from the last 60 seconds.
  # If a source in that list has attempted 3 or more connections, DROP it
  -A RH-Firewall-1-INPUT -p tcp -m tcp --dport 22 -m state --state NEW -m recent --update --seconds 60 --hitcount 3 --name SSH_LIST --rsource -j DROP
  #
  # If it is not an attack, accept ssh from outside
  -A RH-Firewall-1-INPUT -p tcp --dport 22 -j ACCEPT
  #
  -A RH-Firewall-1-INPUT -i lo -j ACCEPT
  -A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  # DEFAULT REJECT
  -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited

SELinux is disabled.

TCP Wrappers is enabled:

  sshd: 172.16.64.75 , 172.16.243. , 172.16.247. , 172.16.249.

hosts.deny is full of IP addresses because a service called denyhosts adds them dynamically to prevent attacks.

/etc/hosts:

  127.0.0.1 localhost localhost.localdomain
  # For the nodes we need this !!!!!!!!!!!!!!!!!!!
  # uncomment it for the nodes and comment the above line
  #127.0.0.1 payne localhost
  192.168.0.1 max.cluster.org max
  192.168.0.6 payne5
  192.168.0.7 payne6
  192.168.0.8 payne7
  192.168.0.9 payne8
  192.168.0.10 payne9
  # for (( i=46; i<60; i++ )); do echo -e "192.168.0.$[i+1]\tpayne$i"; done
  # Dell
  # For the future:
  # The new node to be installed
  #192.168.0.254 nuevo
  # The Chemistry computers
  #193.144.87.30 qfqcpc02
  #193.144.74.229 qfqcpc02.usc.es qfqcpc02
  #193.144.74.199 qfbef01.usc.es qfbef01
  # The servers that host Debian
  193.146.38.146 toxo.com.uvigo.es
  130.206.1.5 ftp.rediris.es
  194.109.137.218 security.debian.org
  # The other cluster
  193.144.87.51 cluster2
  # The following lines are desirable for IPv6 capable hosts
  # (added automatically by netbase upgrade)
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
  ff02::3 ip6-allhosts
  192.168.0.2 payne1
  192.168.0.3 payne2
  192.168.0.4 payne3
  192.168.0.5 payne4
  192.168.0.254 dell
  192.168.0.11 payne10
  192.168.0.12 payne11
  192.168.0.13 payne12
  192.168.0.14 payne13
  192.168.0.15 payne14
  192.168.0.16 payne15
  192.168.0.17 payne16
  192.168.0.18 payne17
  192.168.0.19 payne18
  192.168.0.20 payne19
  192.168.0.21 payne20
  192.168.0.22 payne21
  192.168.0.23 payne22
  192.168.0.24 payne23
  192.168.0.25 payne24
  192.168.0.26 payne25
  192.168.0.27 payne26
  192.168.0.28 payne27
  192.168.0.29 payne28
  192.168.0.30 payne29
  192.168.0.31 payne30
  192.168.0.32 payne31
  192.168.0.33 payne32
  192.168.0.34 payne33
  192.168.0.35 payne34
  192.168.0.36 payne35
  192.168.0.37 payne36
  192.168.0.38 payne37
  192.168.0.39 payne38
  192.168.0.40 payne39
  192.168.0.41 payne40
  192.168.0.42 payne41
  192.168.0.43 payne42
  192.168.0.44 payne43
  192.168.0.45 payne44
  192.168.0.46 payne45
  192.168.0.101 dell1
  192.168.0.102 dell2
  192.168.0.103 dell3
  192.168.0.104 dell4
  192.168.0.105 dell5
  192.168.0.106 dell6
  192.168.0.107 dell7
  192.168.0.108 dell8
  192.168.0.109 dell9
  192.168.0.110 dell10
  192.168.0.111 dell11
  192.168.0.112 dell12
  192.168.0.113 dell13
  192.168.0.114 dell14
  192.168.0.115 dell15
  192.168.0.116 dell16
  192.168.0.117 dell17
  192.168.0.118 dell18
  192.168.0.119 dell19
  192.168.0.120 dell20
  192.168.0.121 dell21
  192.168.0.122 dell22
  192.168.0.123 dell23
  192.168.0.124 dell24
  192.168.0.125 dell25
  192.168.0.126 dell26
  192.168.0.127 dell27
  192.168.0.128 dell28
  192.168.0.129 dell29
  192.168.0.130 dell30
  192.168.0.131 dell31
  192.168.0.132 dell32
  192.168.0.133 dell33
  192.168.0.134 dell34
  192.168.0.135 dell35
  192.168.0.136 dell36
  192.168.0.137 dell37
  192.168.0.138 dell38
  192.168.0.139 dell39
  192.168.0.140 dell40
  192.168.0.141 dell41
  192.168.0.142 dell42
  192.168.0.143 dell43
  192.168.0.144 dell44
  192.168.0.145 dell45
  192.168.0.146 dell46
  192.168.0.147 dell47
  192.168.0.148 dell48
  192.168.0.149 dell49

Users:

  javier:x:1000:1000:javier,,,:/home/javier:/bin/bash
  uscqfjcf:x:1001:1001:Jose Luis,,,:/home/uscqfjcf:/bin/bash
  tbp:x:1002:1002:Thomas Bondo Pedersen,,,:/home/tbp:/bin/tcsh
  berta:x:1003:1003:Berta Fernandez Rodriguez,,,:/home/berta:/bin/bash
  domenico:x:1004:1004:Domenico,,,:/home/domenico:/bin/bash
  snfhko:x:1005:100::/home/snfhko:/bin/bash
  cristian:x:1006:1006:Cristian,,,:/home/cristian:/bin/bash
  jonathan:x:1007:1007:Jonathan,,,:/home/jonathan:/bin/bash
  alfredo:x:1008:1008:Alfredo Sanchez de Meras,,,:/home/alfredo:/bin/bash
  ganglia:x:102:101:Ganglia Monitor:/var/lib/ganglia:/bin/false
  siham:x:1009:1009:Siham Naima Derrar,1,,:/home/siham:/bin/bash
  stefan:x:1010:1010:Stefan Bilan,,,:/home/stefan:/bin/bash
  juanpablo:x:1011:1011:,,,:/home/juanpablo:/bin/tcsh
  silvia:x:1012:1012:,,,:/home/silvia:/bin/bash
  hubert:x:1013:1013:Hubert Cybulski,,,:/home/hubert:/bin/bash
  angelika:x:1014:1014:Angelika,Quimica Cuantica,,:/home/angelika:/bin/bash

Repositories:

  /etc/yum.repos.d/adobe.repo
  name=Adobe Systems Incorporated
  baseurl=http://linuxdownload.adobe.com/linux/i386/
  --
  /etc/yum.repos.d/atrpms.repo
  name=ATrpms rpms
  baseurl=http://ftp.scientificlinux.org/linux/extra/atrpms/sl5-$basearch/stable
  --
  /etc/yum.repos.d/atrpms.repo.rpmnew
  name=ATrpms rpms
  baseurl=http://ftp.scientificlinux.org/linux/extra/atrpms/sl5-$basearch/stable
  --
  /etc/yum.repos.d/dag.repo
  name=DAG rpms
  baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
  --
  /etc/yum.repos.d/dag.repo.rpmnew
  name=DAG rpms
  baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
  --
  /etc/yum.repos.d/epel.repo
  name=Extra Packages for Enterprise Linux 5 - $basearch
  baseurl=http://download.fedoraproject.org/pub/epel/5/$basearch
  --
  /etc/yum.repos.d/epel.repo
  name=Extra Packages for Enterprise Linux 5 - $basearch - Debug
  baseurl=http://download.fedoraproject.org/pub/epel/5/$basearch/debug
  --
  /etc/yum.repos.d/epel.repo
  name=Extra Packages for Enterprise Linux 5 - $basearch - Source
  baseurl=http://download.fedoraproject.org/pub/epel/5/SRPMS
  --
  /etc/yum.repos.d/epel-testing.repo
  name=Extra Packages for Enterprise Linux 5 - Testing - $basearch
  baseurl=http://download.fedoraproject.org/pub/epel/testing/5/$basearch
  --
  /etc/yum.repos.d/epel-testing.repo
  name=Extra Packages for Enterprise Linux 5 - Testing - $basearch - Debug
  baseurl=http://download.fedoraproject.org/pub/epel/testing/5/$basearch/debug
  --
  /etc/yum.repos.d/epel-testing.repo
  name=Extra Packages for Enterprise Linux 5 - Testing - $basearch - Source
  baseurl=http://download.fedoraproject.org/pub/epel/testing/5/SRPMS
  --
  /etc/yum.repos.d/sl-contrib.repo
  name=Scientific Linux 5 contrib area
  baseurl=http://ftp.scientificlinux.org/linux/scientific/52/$basearch/contrib
  --
  /etc/yum.repos.d/sl-debuginfo.repo
  name=Scientific Linux 5 debuginfo rpm's
  baseurl=http://ftp.scientificlinux.org/linux/scientific/5x/archive/debuginfo
  --
  /etc/yum.repos.d/sl-fastbugs.repo
  name=SL 5 fastbugs area
  baseurl=http://ftp.scientificlinux.org/linux/scientific/52/$basearch/updates/fastbugs
  --
  /etc/yum.repos.d/sl.repo
  name=SL 5 base
  #baseurl=http://ftp.scientificlinux.org/linux/scientific/52/$basearch/SL
  # ftp://ftp.scientificlinux.org/linux/scientific/52/$basearch/SL
  baseurl=http://linuxsoft.cern.ch/scientific/52/$basearch/SL
  --
  /etc/yum.repos.d/sl.repo.rpmnew
  name=SL 5 base
  baseurl=http://ftp.scientificlinux.org/linux/scientific/52/$basearch/SL
  --
  /etc/yum.repos.d/sl-security.repo
  name=SL 5 security updates
  baseurl=http://ftp.scientificlinux.org/linux/scientific/52/$basearch/updates/security
  --
  /etc/yum.repos.d/sl-srpms.repo
  name=Scientific Linux 5 source rpm's (src.rpm)
  baseurl=http://ftp.scientificlinux.org/linux/scientific/5x/SRPMS
  --
  /etc/yum.repos.d/sl-testing.repo
  name=Scientific Linux 5 testing area
  baseurl=http://ftp.scientificlinux.org/linux/scientific/5rolling/testing/$basearch

The Torque version is 2.0.0 and the SGE version is 6.0u7.

==== Services ====

=== DHCP ===

All node addresses on the 192.168.0.0/24 network are served by DHCP, filtered by MAC address:

  subnet 192.168.0.0 netmask 255.255.255.0 {
      # default gateway
      option routers 192.168.0.254;
      option subnet-mask 255.255.255.0;
      option domain-name "cluster.pri";
      option domain-name-servers 193.144.75.9, 192.168.0.254;
      # range dynamic-bootp 192.168.0.2 192.168.0.254;
      range 192.168.0.201 192.168.0.250;
      default-lease-time 600;
      max-lease-time 7200;
  }

It is also configured for PXE:

  # Needed for PXE (taken from the RHEL-3 sysadmin-guide)
  allow booting;
  allow bootp;
  class "pxeclients" {
      match if substring(option vendor-class-identifier, 0, 9) = "PXEClient";
      next-server 192.168.0.254;
      filename "pxelinux.0";
  }

=== TFTP server ===

Required for PXE.

  service tftp
  {
      socket_type = dgram
      protocol = udp
      wait = yes
      user = root
      server = /usr/sbin/in.tftpd
      server_args = -s /tftpboot
      disable = no
      per_source = 11
      cps = 100 2
      flags = IPv4
  }

=== NFS ===

  /home 192.168.0.0/255.255.255.0(rw,no_root_squash)

/home holds the users' home directories, /home/cluster the software to be installed, and /home/opt presumably the software that is already installed.

=== NIS ===

A NIS server is running.
nsswitch.conf:

  passwd: files nis
  shadow: files nis
  group: files nis
  hosts: files dns
  bootparams: nisplus [NOTFOUND=return] files
  ethers: files
  netmasks: files
  networks: files
  protocols: files
  rpc: files
  services: files
  netgroup: files
  publickey: nisplus
  automount: files
  aliases: files nisplus
  sudoers: files ldap

yp.conf:

  ypserver 192.168.0.254

ypserv.conf:

  #
  # ypserv.conf In this file you can set certain options for the NIS server,
  # and you can deny or restrict access to certain maps based
  # on the originating host.
  #
  # See ypserv.conf(5) for a description of the syntax.
  #
  # Some options for ypserv. This things are all not needed, if
  # you have a Linux net.
  # Should we do DNS lookups for hosts not found in the hosts table ?
  # This option is ignored in the moment.
  dns: no
  # How many map file handles should be cached ?
  files: 30
  # Should we register ypserv with SLP ?
  slp: no
  # After how many seconds we should re-register ypserv with SLP ?
  slp_timeout: 3600
  # xfr requests are only allowed from ports < 1024
  xfr_check_port: yes
  # The following, when uncommented, will give you shadow like passwords.
  # Note that it will not work if you have slave NIS servers in your
  # network that do not run the same server as you.
  # Host : Domain : Map : Security
  #
  # * : * : passwd.byname : port
  # * : * : passwd.byuid : port
  # Not everybody should see the shadow passwords, not secure, since
  # under MSDOG everbody is root and can access ports < 1024 !!!
  * : * : shadow.byname : port
  * : * : passwd.adjunct.byname : port
  # If you comment out the next rule, ypserv and rpc.ypxfrd will
  # look for YP_SECURE and YP_AUTHDES in the maps. This will make
  # the security check a little bit slower, but you only have to
  # change the keys on the master server, not the configuration files
  # on each NIS server.
  # If you have maps with YP_SECURE or YP_AUTHDES, you should create
  # a rule for them above, that's much faster.
  # * : * : * : none

=== FTP ===

A VSFTP server is running. It appears to serve /var/ftp/:

  * pub
  * SL52
  * SL53
  * SL55

Its configuration:

  anonymous_enable=YES
  local_enable=YES
  write_enable=YES
  local_umask=022
  dirmessage_enable=YES
  xferlog_enable=YES
  connect_from_port_20=YES
  xferlog_std_format=YES
  listen=YES
  pam_service_name=vsftpd
  userlist_enable=YES
  tcp_wrappers=YES

=== ntp ===

  # Hosts on local network are less restricted.
  restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap
  server hora.rediris.es
  server 0.rhel.pool.ntp.org
  server 1.rhel.pool.ntp.org
  server 2.rhel.pool.ntp.org

=== httpd ===

It runs only to execute a CGI script in /var/www that seems to be related to the PXE installation process (unconfirmed).

=== PBS and SGE ===

Both are running.

  root 3667 0.0 0.0 102512 5036 ? Sl Mar12 2:01 /opt/cluster/sge60/bin/lx24-amd64/sge_qmaster
  root 3687 0.2 0.0 50032 4592 ? Sl Mar12 7:04 /opt/cluster/sge60/bin/lx24-amd64/sge_schedd
  root 4086 0.0 0.0 6352 568 ? Ss Mar12 0:00 /usr/sbin/pbs_server
  root 4099 0.0 0.0 6056 316 ? Ss Mar12 0:00 /usr/sbin/pbs_sched

SGE lives in /home/opt/cluster/sge60/, so it is shared with the nodes. PBS lives in /var/lib/torque/.

=== PGI workstation ===

A license server is started for [[http://en.wikipedia.org/wiki/The_Portland_Group|PGI]], which appears to be a compiler suite.

  root 4268 0.0 0.0 16360 1428 ? S Mar12 0:00 /home/opt/pgi/linux86-64/13.4/bin/lmgrd -c /opt/pgi/license.dat -l /opt/pgi/flexlm.log
  root 4270 0.0 0.0 50544 2856 ? Ssl Mar12 0:01 pgroupd -T localhost 11.11 3 -c /opt/pgi/license.dat --lmgrd_start 53201b25

=== Later additions ===

[[http://172.16.247.180/ganglia/|Ganglia]] has been installed across the whole cluster.
A script allows powering the nodes off and on from the master:

  /usr/local/bin/gestionar_nodos

==== Nodes ====

All nodes run Scientific Linux SL release 5.3 (Boron).

  Filesystem  Size  Used Avail Use% Mounted on
  /dev/sda1   7.8G  1.7G  5.8G  23% /
  tmpfs       7.9G     0  7.9G   0% /dev/shm
  /dev/sda2   920G  270G  651G  30% /scratch
  dell:/home  296G  140G  141G  50% /home

/scratch is formatted as XFS.

eth0 is configured through DHCP, and eth1 has no configuration because it is not used.

There are no local users; NIS is used:

  passwd: files nis
  shadow: files nis
  group: files nis
  hosts: files dns
  bootparams: nisplus [NOTFOUND=return] files
  ethers: files
  netmasks: files
  networks: files
  protocols: files
  rpc: files
  services: files nis
  netgroup: files
  publickey: nisplus
  automount: files
  aliases: files nisplus

SGE is running:

  root 3271 0.0 0.0 5308 1616 ? S Mar10 0:02 /opt/cluster/sge60/bin/lx24-amd64/sge_execd

=== Installing a node ===

Add an entry with the node's MAC address to dhcpd.conf.

Create a configuration file (by copying an existing one) in /tftpboot/pxelinux.cfg/ whose name is the node's IP address in hexadecimal.

Edit this line so that its ks= parameter points to a valid configuration file:

  append initrd=pxeboot/initrd.img ramdisk_size=16384 ksdevice=link ks=http://192.168.0.254/installations/SL5x/sl53.x86_64_ks.cfg

=== Installing SGE on a node ===

javier's home directory contains some scripts and configuration files that give hints, but everything is badly out of date.

Create symbolic links in /opt:

  cd /opt/
  ln -s /home/cluster/ cluster
  ln -s /home/cluster/sge60/ sge60

Copy nsswitch.conf from another node.

Export the environment variables:

  export SGE_ROOT=/home/cluster/sge60/
  export PATH=$PATH:/home/cluster/sge60/bin/lx24-amd64
  export LD_LIBRARY_PATH=/home/cluster/sge60/lib/lx24-amd64

Add to /etc/services:

  sge_qmaster 1434/tcp
  sge_execd 1435/tcp

Install, accepting all defaults:
  cd /home/cluster/sge60
  ./install_execd

Add the host and modify the queues on the master (queue_name is a placeholder for the queue to edit):

  qconf -ah dellx
  qstat -f
  qconf -mq queue_name

Configure the node's characteristics in SGE by editing its execution host entry:

  qconf -me dellxx

and setting the complex_values line:

  complex_values arch=64,s_vmem=24G,num_proc=24

Here s_vmem is a soft memory limit (equal to the host's physical memory) and num_proc is the total number of cores.
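
For the node installation step that names the file in /tftpboot/pxelinux.cfg/ after the node's IP in hexadecimal, PXELINUX looks up the IPv4 address written as eight uppercase hexadecimal digits. The following is a minimal sketch of that conversion, assuming a bash-compatible shell; the function name ip_to_pxe_hex is hypothetical and is not part of the cluster's existing scripts.

```shell
#!/bin/bash
# Hypothetical helper: convert a dotted-quad IPv4 address into the
# 8-digit uppercase hex name that PXELINUX searches for in pxelinux.cfg/.
ip_to_pxe_hex() {
    # Split the argument on "." so the four octets become $1..$4.
    local IFS=.
    set -- $1
    # Print each octet as two uppercase hex digits.
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

# Example: the file name for dell1 (192.168.0.101).
ip_to_pxe_hex 192.168.0.101   # prints C0A80065
```

With this, an existing configuration file could be copied to /tftpboot/pxelinux.cfg/C0A80065 for dell1 before editing its append line.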