test2

Table of Contents
Becoming root
Adding a user account
Removing a user account
Setting Up ssh keys
Rebooting a node
Rebooting the cluster
Powering down the cluster
Powering up the cluster (reverse of power down instructions)
Setting clocks to the correct time
  Head node
  Compute nodes
Troubleshooting nodes and taking them off the batch queue
Queues
  blaze
  burn
Torque configuration changes
Adding torque queuing software to a cluster node
Restarting ganglia
Fixing cluster problems
VPN
Change email From addresses
Synching two svn repositories

Becoming root

To become root, type:

 su -

Then type the password.

Note: the "-" in "su -" is important. It starts a login shell, so root's startup files are read. With plain su those files are not read, and the commands on this page may not work.

Adding a user account

If the user has an account on the old cluster, use the uid defined in /etc/passwd.BAK on leo.cfr.nist.gov:

 useradd -u uid user

On burn use

 useradd -d /home4/user -u uid user

Set password and synchronize

 passwd user
 pwconv
 passsync

Add information to /etc/passwd about the user you just added. For example, change

 jdoe:x:12345:18660::/home/jdoe:/bin/bash

to

 jdoe:x:12345:18660:John Doe (NIST):/home/jdoe:/bin/bash

See /etc/passwd for other examples. Then type passsync to update your changes.
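
The edit above fills in the fifth (comment) field of the user's /etc/passwd entry. As a minimal sketch, the same change can be previewed with awk before touching the real file. The function name set_gecos is made up for this example, and it only writes to standard output.

```shell
# Hypothetical helper: print a passwd file with the comment (5th) field
# of one user replaced. Reads stdin, writes stdout; edits nothing in place.
set_gecos() {
    user="$1"
    gecos="$2"
    awk -F: -v OFS=: -v u="$user" -v g="$gecos" '
        $1 == u { $5 = g }   # only touch the matching account
        { print }
    '
}

# Example: preview the change for jdoe
# set_gecos jdoe "John Doe (NIST)" < /etc/passwd | grep '^jdoe:'
```

Once the preview looks right, make the same change in /etc/passwd and run passsync as described above.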

Add a samba account

 smbpasswd -a user

Removing a user account

Edit the /etc/passwd file to remove the user's password entry

Update the passwd file by typing:

 pwconv
 passsync

Remove the user's files with:

 cd /home
 rm -r user_name
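
Because rm -r cannot be undone, it may be worth checking the name before building the path. The sketch below is one way to do that; home_dir_for is a hypothetical helper, and the default base of /home is an assumption (on burn the home directories are under /home4, as noted above).

```shell
# Hypothetical guard: print the home directory path for a user name,
# refusing names that are empty, start with ".", or contain "/".
home_dir_for() {
    name="$1"
    base="${2:-/home}"      # on burn the base would be /home4
    case "$name" in
        ""|.*|*/*)
            echo "refusing suspicious user name: '$name'" >&2
            return 1
            ;;
    esac
    printf '%s/%s\n' "$base" "$name"
}

# Example (prints the command instead of running it):
# dir=$(home_dir_for jdoe) && echo rm -r "$dir"
```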

Setting Up ssh keys

Type:

 ssh-keygen -t rsa

Accept the default file location, then cd into the .ssh directory in your home directory and type

 cat id_rsa.pub >> authorized_keys2
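
Running the cat command twice adds the key twice, and sshd may ignore the key file if its permissions are too open. The sketch below appends the key only when it is missing and tightens the permissions. The function name install_pubkey is hypothetical; the file names are the ones used above.

```shell
# Hypothetical helper: add id_rsa.pub to authorized_keys2 once, and
# restrict permissions on the directory and the key list.
install_pubkey() {
    ssh_dir="$1"
    pub="$ssh_dir/id_rsa.pub"
    auth="$ssh_dir/authorized_keys2"
    [ -f "$pub" ] || { echo "no public key at $pub" >&2; return 1; }
    touch "$auth"
    # -x whole line, -F fixed string: append only if the key is not there yet
    if ! grep -qxF -- "$(cat "$pub")" "$auth"; then
        cat "$pub" >> "$auth"
    fi
    chmod 700 "$ssh_dir"
    chmod 600 "$auth"
}

# Example: install_pubkey "$HOME/.ssh"
```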

Rebooting a node

If a node becomes unresponsive (say node001), type the following command as root:

 rpower node001 cycle

Rebooting the cluster

Note: I talked with the vendor. They recommend powering down the entire cluster if both the head node and the compute nodes need to be rebooted. If only the compute nodes need to be rebooted, powering down is not necessary.

Log in to the blaze console (in the cluster room) as root.
Open a terminal by selecting the menu item: Applications>Accessories>Terminal
Type POWEROFF (all CAPS). This will power down all compute nodes. It will not power down blaze.
Type: reboot . Note that this reboots blaze; it does not shut it down.
After blaze comes back up, log into the console as root and open a terminal as before.
Type POWERON (all CAPS). This will power up all compute nodes.
Run the script /usr/local/bin/check_cluster.sh to verify that all nodes are up and accessible.
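
The contents of check_cluster.sh are not shown on this page. If that script is unavailable, a rough stand-in is to ping each compute node in turn. The sketch below is not that script: the function names are hypothetical, the node001 to node032 range is taken from the POWEROFF description later on this page, and the ping options (-c 1 -W 2) assume the Linux version of ping.

```shell
# Print node names such as node001 ... node032, one per line.
node_names() {
    i="$1"
    while [ "$i" -le "$2" ]; do
        printf 'node%03d\n' "$i"
        i=$((i + 1))
    done
}

# Ping each node once; report its state and fail if any node is down.
check_nodes() {
    down=0
    for n in $(node_names "$1" "$2"); do
        if ping -c 1 -W 2 "$n" >/dev/null 2>&1; then
            echo "$n up"
        else
            echo "$n DOWN"
            down=$((down + 1))
        fi
    done
    [ "$down" -eq 0 ]
}

# Example: check_nodes 1 32
```

A node that answers ping may still have other problems, so treat this only as a first check.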

Powering down the cluster

Log in to the blaze console as root.
Open a terminal by selecting the menu item: Applications>Accessories>Terminal
Type POWEROFF (all CAPS). This will power down all compute nodes. It will not power down blaze. This script powers down nodes: node001->node032
Repeat the step above on burn (ssh to burn from the blaze terminal window).
On burn, type: poweroff (lowercase) to shut down burn.
On blaze, type: poweroff (lowercase) to shut down blaze.
Turn off the UPS at the bottom of the cluster cabinet.
Pull the four large power switches on the wall to the right of the cluster cabinet down to the off position.
Turn off A/C (push small power off button, then turn large black switch to off).

Powering up the cluster (reverse of power down instructions)

Turn on the four main circuit boxes for blaze (2 on left) and burn (2 on right). Do not turn on blaze, burn, or firestore yet.
Wait a few minutes. This gives the network switches time to "boot up".
Turn on the UPS for the fire80's, etc.
Turn on the blaze and burn UPS (bottom of cabinets).
Pull out the blaze console and enter the password 00000000 (eight zeros). The password is located on the blaze console keyboard.
Press the power button on blaze. After a few moments the screen should start giving messages.
Push button 2 on the blaze KVM switch to activate the burn console.
Turn on the burn master node (red button on right).
After blaze has booted up, type: POWERON to turn on the blaze compute nodes.
After burn has booted up, type: POWERON to turn on the burn compute nodes.
Press the power button on firestore.
Turn on the A/C!!! (First turn the large black switch, then hold the power on button until the unit kicks on.)
Run the script /usr/local/bin/check_cluster.sh to verify that all nodes are up and accessible.

Setting clocks to the correct time

On blaze, run the following script as root

/usr/local/bin/RESET_CLOCK.sh

or perform the steps outlined below

Head node

/etc/init.d/ntpd stop
ntpdate "0.centos.pool.ntp.org"

Run the ntpdate command above several times until the displayed offset is close to 0.

Note: you may need to wait a few minutes after blaze boots up before this command will work.
hwclock --systohc
/etc/init.d/ntpd start
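
The "repeat until the offset is close to 0" step can be sketched as a loop. This assumes ntpdate prints a line ending in "offset <seconds> sec", and the 0.1 second tolerance and the limit of 5 tries are arbitrary choices. The function names are hypothetical.

```shell
# Return success if the offset reported in one line of ntpdate output
# is smaller in magnitude than the tolerance (in seconds).
offset_ok() {
    printf '%s\n' "$1" | awk -v tol="$2" '
        { for (i = 1; i < NF; i++) if ($i == "offset") off = $(i + 1) }
        END {
            if (off == "") exit 2          # no offset found in the line
            if (off < 0) off = -off
            exit !(off < tol)
        }'
}

# Repeat ntpdate (as root, with ntpd stopped) until the offset is small.
settle_clock() {
    server="$1"
    tries=0
    while [ "$tries" -lt 5 ]; do
        line=$(ntpdate "$server" | tail -n 1)
        echo "$line"
        offset_ok "$line" 0.1 && return 0
        tries=$((tries + 1))
    done
    return 1
}

# Example: settle_clock 0.centos.pool.ntp.org
```

On the head node this would sit between the ntpd stop and hwclock commands shown above.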

Compute nodes

fornodesp "/etc/init.d/ntpd stop"
fornodesp "ntpdate blaze"
Repeat the ntpdate command above until the displayed offset is close to 0.
fornodesp "hwclock --systohc"
fornodesp "/etc/init.d/ntpd start"

Troubleshooting nodes and taking them off the batch queue

For example, to take node011 off of the default (batch) queue, perform the following steps:

Edit /var/spool/torque/server_priv/nodes on blaze, and change the following line

 node011 np=8 16g compute

to

 node011 np=8 16g testqueue

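Create a new queue (called testing_queue here) using the following commands. Note that testqueue is the node property set in the nodes file above, while testing_queue is the name of the queue; the neednodes line ties the two together.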

 qmgr -c "create queue testing_queue"
 qmgr -c "set queue testing_queue queue_type = Execution"
 qmgr -c "set queue testing_queue resources_default.neednodes = testqueue"
 qmgr -c "set queue testing_queue resources_default.nodes = 1"
 qmgr -c "set queue testing_queue enabled = True"
 qmgr -c "set queue testing_queue started = True"

Restart pbs_server and maui

Important! You should restart the PBS server with the following commands (NOT pbs_server stop and NOT pbs_server restart), or you risk stopping jobs that are currently running:

 qterm -t quick
 /etc/init.d/pbs_server start
 /etc/init.d/maui restart

Now node011 will no longer accept jobs submitted to the default queue (batch). To run a job on it, name testing_queue explicitly, for example:

 qfds.sh -r -q testing_queue casename.fds
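
The nodes-file edit and the qmgr commands above can be parameterized so they are easy to repeat for another node or queue. The following is a dry-run sketch under that assumption: both function names are hypothetical, retag_node only filters text from standard input, and queue_commands only prints the qmgr commands shown above rather than running them.

```shell
# Replace one property on one node's line of a nodes file.
# Reads the file on stdin and writes the result to stdout.
retag_node() {
    awk -v n="$1" -v old="$2" -v new="$3" '
        $1 == n { for (i = 2; i <= NF; i++) if ($i == old) $i = new }
        { print }
    '
}

# Print the qmgr commands that create a queue tied to a node property.
queue_commands() {
    q="$1"
    prop="$2"
    cat <<EOF
qmgr -c "create queue $q"
qmgr -c "set queue $q queue_type = Execution"
qmgr -c "set queue $q resources_default.neednodes = $prop"
qmgr -c "set queue $q resources_default.nodes = 1"
qmgr -c "set queue $q enabled = True"
qmgr -c "set queue $q started = True"
EOF
}

# Examples:
# retag_node node011 compute testqueue < /var/spool/torque/server_priv/nodes
# queue_commands testing_queue testqueue
```

Printing the commands first leaves a chance to review them before pasting them into a root shell. The restart sequence above is still required afterwards.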

Useful queuing commands

  list available queues: qmgr -c 'p s'  
  delete a queue: qmgr -c 'delete queue fire60s'

  make sure the following is used when setting up torque
  qmgr -c "set server scheduling = True"

Queues

blaze

Note: only batch is available now.

batch - blaze001->blaze029 - original blaze queue
batch2 - blaze036->blaze071 - nodes in 2nd blaze cabinet
batch3 - blaze072->blaze107 - nodes in 3rd blaze cabinet
batch4 - blaze108->blaze119 - leftover nodes
batch23 - nodes blaze036->blaze107 - note: this queue would be used to run lots of single mesh cases or a case using the gigabit switch
batch123 - nodes blaze001->blaze107 - ditto

burn

batch - burn001->burn036

Torque configuration changes

A file named /etc/sysconfig/pbs_mom containing

 #!/bin/bash
 ulimit -s unlimited

was added to each compute node on the burn and blaze clusters. This ensures that an unlimited stack is available on each node of an openmpi job.

Adding torque queuing software to a cluster node

cd to /opt/atipa/RPMS

Install using:

 rpm -ivf torque-2.5.1-1cri.x86_64.rpm
 rpm -ivf torque-client-2.5.1-1cri.x86_64.rpm
 rpm -ivf torque-mom-2.5.1-1cri.x86_64.rpm

Point to the torque shared libraries: add /usr/local/lib to the end of /etc/ld.so.conf, then type:

 ldconfig
 qterm -t quick
 /etc/init.d/pbs_server start
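
If this procedure is repeated on a node, /usr/local/lib can end up in /etc/ld.so.conf more than once. A small sketch that appends the line only when it is missing follows; add_line_once is a hypothetical helper name.

```shell
# Hypothetical helper: append a line to a file only if no existing
# line matches it exactly.
add_line_once() {
    line="$1"
    file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root):
# add_line_once /usr/local/lib /etc/ld.so.conf && ldconfig
```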

Restarting ganglia

Restart gmond and gmetad on the head node with:

  /etc/init.d/gmond restart
  /etc/init.d/gmetad restart

Restart gmond on all nodes with (gmetad does not run on compute nodes):

  /usr/local/bin/ganglia_restart.sh

Fixing cluster problems

Make a User's home directory readable

  chmod 755 ~username

Make all files in a directory tree readable (i.e., accessible to everyone)

cd to one level above the directory you wish to make readable and type the following command. If you don't own the directory, you'll need to be root.

  chmod +r -R directory_name
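
Note that chmod +r only adds read permission, and a directory also needs the execute (search) bit before other users can enter it. A variant that handles both is sketched below, using a+rX. The capital X adds execute permission only to directories and to files that are already executable, so ordinary data files are not made executable. The function name make_readable is made up for this example.

```shell
# Make a directory tree readable by everyone: read on all files,
# read plus search (execute) on all directories.
make_readable() {
    chmod -R a+rX -- "$1"
}

# Example: make_readable directory_name
```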

Samba is not working

  See if the samba daemon is running by typing the following in a command shell:
  ps -el | grep smb

  If you don't see anything (or even if you do), type as root:

  /etc/init.d/smb restart

  to restart the daemon.

Checking disk usage

  To see how much space (in kilobytes) is used by the directory named dir, type:
  du -ks dir

  To see how much space is used by each file or directory in the current directory, type:
  du -ks `ls`
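
The backquoted ls form splits names that contain spaces. A sketch that avoids this by using a shell glob, and sorts the result so the largest entries come last, is shown below. The function name du_sorted is hypothetical, and the -- argument assumes GNU du.

```shell
# List the size in kilobytes of every (non-hidden) entry in a directory,
# smallest first. The subshell keeps the caller's working directory.
du_sorted() {
    ( cd "$1" && du -ks -- * | sort -n )
}

# Example: du_sorted /home
```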

VPN

 To activate vpn type:
 /etc/init.d/openvpn start

 To deactivate vpn type:
 /etc/init.d/openvpn stop

Change email From addresses

 Edit /etc/postfix/canonical by adding lines of the form:
 user@blaze.el.nist.gov user@nist.gov

 Then rebuild the lookup table and restart postfix:
 postmap /etc/postfix/canonical
 /etc/init.d/postfix restart

Synching two svn repositories

 Create mirror (say on firevis):
 cd /var/www
 svnadmin create svn

 Create hook file:
 cd /var/www/svn/hooks
 echo '#!/bin/sh' > pre-revprop-change
 chmod 755 pre-revprop-change

 First time sync is performed:
 svnsync init file:///var/www/svn http://blaze.el.nist.gov/svn

This returns output for revision 0 only.

 Command to sync repositories from then on:
 svnsync sync file:///var/www/svn

This returns output for all revisions not yet in the synched repository.
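
The hook file matters because svnsync stores its bookkeeping in revision properties of the mirror, and Subversion only allows revision property changes when a pre-revprop-change hook exists and exits with status 0. A sketch of the hook step with the exit status made explicit follows; make_revprop_hook is a hypothetical helper name, and this permissive hook allows every revision property change on the mirror.

```shell
# Hypothetical helper: create a pre-revprop-change hook that always
# succeeds, in the hooks directory of the given repository.
make_revprop_hook() {
    hook="$1/hooks/pre-revprop-change"
    printf '#!/bin/sh\nexit 0\n' > "$hook"
    chmod 755 "$hook"
}

# Example: make_revprop_hook /var/www/svn
```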

Note: The FDS-SMV and cfast repositories are synched in /home on blaze. The FIRE-LOCAL repository is synched in /var/www/html on firevis.
