- To become root, type:
su -
- Then type the password.
If the user has an account on the old cluster, use the uid defined in /etc/passwd.BAK on leo.cfr.nist.gov:
useradd -u uid user
On burn, use:
useradd -d /home4/user -u uid user
Set the password and synchronize the password files:
passwd user
pwconv
passsync
Add information to /etc/passwd about the user you just added. For example, change
jdoe:x:12345:18660::/home/jdoe:/bin/bash
to
jdoe:x:12345:18660:John Doe (NIST):/home/jdoe:/bin/bash
See /etc/passwd for other examples. Then type passsync to update your changes.
Add a samba account
smbpasswd -a user
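Putting the steps above together, here is a minimal sketch of the full sequence for adding a user, using the jdoe / uid 12345 values from the example above (substitute the real username, uid, and, on burn, the /home4 home directory):
useradd -u 12345 jdoe
passwd jdoe
# edit /etc/passwd to fill in the GECOS field, e.g. John Doe (NIST)
pwconv
passsync
smbpasswd -a jdoe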
Edit the /etc/passwd file to remove the user's password entry
Update the passwd file by typing:
pwconv
passsync
Remove the user's files with:
cd /home
rm -r user_name
Type:
ssh-keygen -t rsa
cd into the .ssh directory and type:
cat id_rsa.pub >> authorized_keys2
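If passwordless login still prompts for a password after this, the usual cause is file permissions on the key directory; as a sketch (assuming the default ~/.ssh location used above):
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys2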
If a node becomes unresponsive (say node001), type the following command as root:
rpower node001 cycle
Note: I talked with the vendor. They recommend powering down the entire cluster if both the head node and the compute nodes need to be rebooted. If only the compute nodes need to be rebooted, then powering down is not necessary.
- Log in to the blaze console (in the cluster room) as root
- Open terminal by selecting the menu item: Applications>Accessories>Terminal
- Type POWEROFF (all CAPS). This will power down all compute nodes. It will not power down blaze.
- Type: reboot. Note: you are rebooting blaze, not shutting it down.
- After blaze comes back up, log into the console as root and open a terminal as before.
- Type POWERON (all CAPS). This will power up all compute nodes.
- Run the script /usr/local/bin/check_cluster.sh to verify that all nodes are up and accessible.
- Log in to the blaze console as root
- Open a terminal by selecting the menu item: Applications>Accessories>Terminal
- Type POWEROFF (all CAPS). This will power down all compute nodes. It will not power down blaze. This script powers down nodes: node001->node032
- Repeat the procedure above for burn (ssh to burn from the blaze terminal window)
- Log in to burn and type: poweroff (lowercase) to shut down burn.
- Log in to blaze and type: poweroff (lowercase) to shut down blaze.
- Turn off the UPS at the bottom of the cluster cabinet.
- Pull down the four large power switches on the wall to the right of the cluster cabinet to the off position.
- Turn off the A/C (push the small power-off button, then turn the large black switch to off).
- Turn on the four main circuit boxes for blaze (2 on left) and burn (2 on right), but wait to turn on blaze, burn, and firestore.
- Wait a few minutes - this gives the network switches time to "boot up".
- Turn on UPS for fire80's, etc.
- Turn on blaze and burn UPS (bottom of cabinets).
- Pull out the blaze console and enter the password 00000000 (eight 0's). The password is located on the blaze console keyboard.
- Press power button on blaze. After a few moments the screen should start giving messages.
- Push button 2 on the blaze KVM switch to activate the burn console.
- Turn on burn master node (red button on right)
- After blaze has booted up, type: POWERON to turn on blaze compute nodes.
- After burn has booted up, type: POWERON to turn on burn compute nodes.
- Press power button on firestore.
- Turn on A/C!!! (first, turn large black switch, then hold power on button until unit kicks on).
- Run the script /usr/local/bin/check_cluster.sh to verify that all nodes are up and accessible
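If check_cluster.sh is not available, a rough equivalent is to loop over the compute nodes and check that each one responds. This is only a sketch; it assumes the nodes are named node001 through node032 as above:
for i in $(seq 1 32); do
  n=$(printf "node%03d" "$i")
  # one ping with a 1 second timeout; report nodes that do not answer
  ping -c 1 -W 1 "$n" > /dev/null && echo "$n up" || echo "$n DOWN"
done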
On blaze, run the following script as root
/usr/local/bin/RESET_CLOCK.sh
or perform the steps outlined below
- /etc/init.d/ntpd stop
- ntpdate "0.centos.pool.ntp.org"
- Type the above several times until the offset displayed is close to 0.
- Note: you may need to wait a few minutes after blaze boots up before this command will work.
- hwclock --systohc
- /etc/init.d/ntpd start
- fornodesp "/etc/init.d/ntpd stop"
- fornodesp "ntpdate blaze"
- Type the above several times until the offset displayed is close to 0.
- fornodesp "hwclock --systohc"
- fornodesp "/etc/init.d/ntpd start"
For example, to take node011 off of the default (batch) queue, perform the following steps:
- Edit /var/spool/torque/server_priv/nodes on blaze, and change the following lines
node011 np=8 16g compute
to
node011 np=8 16g testqueue
- Create a new queue (mine is called testing_queue) using the following commands:
qmgr -c "create queue testing_queue" qmgr -c "set queue testing_queue queue_type = Execution" qmgr -c "set queue testing_queue resources_default.neednodes = testqueue" qmgr -c "set queue testing_queue resources_default.nodes = 1" qmgr -c "set queue testing_queue enabled = True" qmgr -c "set queue testing_queue started = True"
- Restart pbs_server and maui
qterm -t quick
/etc/init.d/pbs_server start
/etc/init.d/maui restart
Now node011 will no longer accept jobs submitted to the default queue (batch); you have to explicitly request the testing_queue, for example:
qfds.sh -r -q testing_queue casename.fds
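To confirm that the node and queue changes took effect, the standard Torque query commands can be used (assuming pbs_server has been restarted as above):
pbsnodes node011   # the node's properties should now include testqueue
qstat -Q           # lists all queues and whether they are enabled/started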
- Useful queuing commands
list available queues: qmgr -c 'p s'
delete a queue: qmgr -c 'delete queue fire60s'
make sure the following is used when setting up torque:
qmgr -c "set server scheduling = True"
Note: only batch is available now.
- batch - blaze001->blaze029 - original blaze queue
- batch2 - blaze036->blaze071 - nodes in 2nd blaze cabinet
- batch3 - blaze072->blaze107 - nodes in 3rd blaze cabinet
- batch4 - blaze108->blaze119 - leftover nodes
- batch23 - nodes blaze036->blaze107 - note: these queues would be used to run lots of single-mesh cases or a case using the gigabit switch
- batch123 - nodes blaze001->blaze107 - ditto
- batch - burn001->burn036
A file named /etc/sysconfig/pbs_mom containing
#!/bin/bash
ulimit -s unlimited
was added to each compute node on the burn and blaze clusters. This ensures that an unlimited stack size is available on each node of an OpenMPI job.
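One way to push this file out to the compute nodes is sketched below; it assumes the nodes are named node001 through node032, that root can scp to them without a password, and that the torque-mom package installed an /etc/init.d/pbs_mom init script:
for i in $(seq 1 32); do
  n=$(printf "node%03d" "$i")
  scp /etc/sysconfig/pbs_mom "$n":/etc/sysconfig/pbs_mom
done
# pbs_mom must be restarted on each node for the new limit to take effect
fornodesp "/etc/init.d/pbs_mom restart"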
cd to /opt/atipa/RPMS
- install using:
rpm -ivf torque-2.5.1-1cri.x86_64.rpm
rpm -ivf torque-client-2.5.1-1cri.x86_64.rpm
rpm -ivf torque-mom-2.5.1-1cri.x86_64.rpm
- point to torque shared libraries:
add /usr/local/lib to the end of /etc/ld.so.conf, then type:
ldconfig
qterm -t quick
/etc/init.d/pbs_server start
- restart gmond and gmetad on head node with
/etc/init.d/gmond restart
/etc/init.d/gmetad restart
- restart gmond on all nodes (gmetad does not run on compute nodes) with
fornodesp "/etc/init.d/gmond restart"
- Make a User's home directory readable
chmod 755 ~username
- Make all files in a directory tree readable (i.e., accessible to everyone)
chmod +r -R directory_name
- Samba is not working
See if the samba daemon is running by typing ps -el | grep smb in a command shell. If you don't see anything (or even if you do), type as root:
/etc/init.d/smb restart
to restart the daemon
- Checking disk usage
To see how much space is used by the directory named dir, type: du -ks dir
To see how much space is used by all files/directories in the current directory, type: du -ks `ls`
To activate vpn type: /etc/init.d/openvpn start
To deactivate vpn type: /etc/init.d/openvpn stop
Edit /etc/postfix/canonical by adding lines of the form: user@blaze.el.nist.gov user@nist.gov
postmap /etc/postfix/canonical
/etc/init.d/postfix restart
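To check that a mapping took effect after rebuilding the map, postmap can query it directly (using the example address above):
postmap -q user@blaze.el.nist.gov /etc/postfix/canonical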
Create mirror (say on firevis):
cd /var/www
svnadmin create svn
Create hook file:
cd /var/www/svn/hooks
echo '#!/bin/sh' > pre-revprop-change
chmod 755 pre-revprop-change
The first time the sync is performed: svnsync init file:///var/www/svn http://blaze.el.nist.gov/svn
This returns output for just revision 0.
Command to sync repositories from then on: svnsync sync file:///var/www/svn
This returns output for all revisions not yet in the synced repository.
Note: The FDS-SMV and cfast repositories are synced in /home on blaze. The FIRE-LOCAL repository is synced in /var/www/html on firevis.
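To keep the mirror current automatically, the sync command can be run from cron on the mirror host. A hypothetical /etc/crontab entry (adjust the schedule and the svnsync path as needed):
0 * * * * root /usr/bin/svnsync sync file:///var/www/svn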