
Caltech Central Cluster


Getting set up

  1. Set up multifactor Duo access via the Caltech Helpdesk
    • IMSS > Information Security > Duo Request (Cell phone app)
    • IMSS > Information Security > Duo Request (Hardware Yubikey token)
  2. Get added to our HPC account: contact Simon Byrne.

The cluster login node is on Caltech's private network and can only be accessed through a VPN connection to Caltech's network or through an SSH jump host:

Steps:

  • VPN
    • Connect to Caltech VPN
    • ssh (user)@login.hpc.caltech.edu
    • Password: (user password)
    • 2FA authentication
    • HPC login node
  • SSH
    • ssh (user)@ssh.caltech.edu
    • Password:
    • ssh (user)@login.hpc.caltech.edu
    • Password: (user password)
    • 2FA authentication
    • HPC login node
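
Alternatively, recent OpenSSH clients can combine the two hops into a single command with the -J (ProxyJump) option; this is a sketch assuming your username is the same on both hosts:

ssh -J (user)@ssh.caltech.edu (user)@login.hpc.caltech.edu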

For more information regarding cluster login, consult the Caltech HPC documentation.

Cluster Documentation

See www.hpc.caltech.edu/documentation/

Tips

Enable sharing of SSH sessions

This lets SSH reuse existing connections: you only need to enter your password and 2FA token once, and while that master session stays connected, subsequent sessions reuse the same connection.

Add the following to your ~/.ssh/config:

Host login.hpc.caltech.edu
  ControlMaster auto
  ControlPath ~/.ssh/master-%r@%h:%p
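
You can then check on or close the shared connection with OpenSSH's standard control commands:

ssh -O check login.hpc.caltech.edu  # is a master connection running?
ssh -O exit login.hpc.caltech.edu   # close it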

Connecting from outside Caltech

Access to the HPC Cluster is blocked from outside Caltech: to connect you can either

  1. use the Caltech VPN, or
  2. tunnel via the Caltech Unix Cluster: add the following to your ~/.ssh/config:
Match host login.hpc.caltech.edu !exec "nc -z login.hpc.caltech.edu 22"
  ProxyJump ssh.caltech.edu

Host ssh.caltech.edu
  User <caltech username>

This experience can be further improved by adding your public key to ssh.caltech.edu.
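
For example, assuming you already have an SSH key pair (run ssh-keygen if not):

ssh-copy-id (user)@ssh.caltech.edu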

Use with a password manager

Unfortunately, the HPC cluster doesn't support passwordless login via public/private keys due to the 2FA requirement. However, combined with the trick above, you can integrate login with your password manager (such as LastPass):

  1. Install the command-line utility for your password manager (lastpass-cli for LastPass).
  • Figure out how to query it for your Caltech password (e.g. lpass show --color=never --password caltech.edu for LastPass).
  2. (If on a Mac) install util-linux from Homebrew.

  3. Make sure ~/bin is in your PATH environment variable.

  • If not, add export PATH="$PATH:$HOME/bin" to your ~/.profile or ~/.bashrc file.
  4. Create the following files:
  • ~/bin/sshhpc:
#! /usr/bin/env bash
set -euo pipefail

# answer ssh prompts via the helper script below rather than the terminal
export SSH_ASKPASS=~/bin/sshhpc-password
export SSH_ASKPASS_REQUIRE=prefer
# fetch your Caltech password from the password manager
export HPCPASS=$(lpass show --color=never --password caltech.edu) # or whatever query string you use

# reset the prompt-state file so the first prompt is answered with the password
rm -f ~/.nextprompt

# start a background connection (no remote command, no terminal)
ssh login.hpc.caltech.edu -fN
  • ~/bin/sshhpc-password:
#! /usr/bin/env bash
# Called by ssh for each prompt: the first prompt gets the password,
# the second (the Duo prompt) gets "1" to select the first 2FA option.
if [ -f ~/.nextprompt ]; then
    echo 1
    rm ~/.nextprompt
else
    echo "$HPCPASS"
    touch ~/.nextprompt
fi
  5. Make sshhpc and sshhpc-password executable with chmod +x <filepath>.

Now you can run sshhpc, which will start up a background ssh session, prompting for your password manager password and 2FA token. Subsequent calls to ssh login.hpc.caltech.edu should work without any prompts.
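
For example (the file names here are just illustrative):

sshhpc                                    # authenticate once via the password manager and Duo
ssh login.hpc.caltech.edu                 # reuses the master connection, no prompts
scp myscript.jl login.hpc.caltech.edu:~/  # file transfers reuse it too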

Interactive sessions

To get an interactive session with a GPU for 2 hours:

salloc -t 02:00:00 -n 1 -N 1 --gres=gpu:1

You can alternatively use

srun --pty -t 02:00:00 -n 1 -N 1 --gres gpu:1 /bin/bash -l

but once the session has started, you should

unset SLURM_STEP_ID

if using Julia with MPI.
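
Once the session has started, a quick sanity check that the GPU allocation is visible (assuming nvidia-smi is available on the node) is

nvidia-smi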

Note that Julia 1.5.2 is the currently supported version. You can load it with

module load julia/1.5.2

To use (CUDA-aware) MPI:

module load cuda/10.2 openmpi/4.0.4_cuda-10.2

Then rebuild MPI.jl:

julia --project -e 'ENV["JULIA_MPI_BINARY"]="system"; using Pkg; Pkg.build("MPI"; verbose=true)'
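
To confirm that the rebuilt MPI.jl picked up the CUDA-aware system MPI, recent versions of MPI.jl expose a query function (version-dependent, so treat this as a sketch):

julia --project -e 'using MPI; println(MPI.has_cuda())'  # should print true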

Reserved node

We have one reserved node with 4 GPUs on the cluster, which is accessible to all members of the group by adding

--reservation=clima

to your srun or sbatch command. This is intended for interactive development and short tests: please don't use this for long batch jobs.
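
For example, a one-hour interactive session on the reserved node might look like

srun --reservation=clima --pty -t 01:00:00 -n 1 -N 1 --gres=gpu:1 /bin/bash -l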

Sample sbatch scripts

CPU only

#!/bin/bash

#SBATCH --nodes=1          # number of nodes
#SBATCH --tasks-per-node=2 # number of MPI ranks per node
#SBATCH --cpus-per-task=1  # number of CPU threads per MPI rank
#SBATCH --time=1:00:00     # walltime

set -euo pipefail # kill the job if anything fails
set -x # echo script

module purge
module load julia/1.5.2 hdf5/1.10.1 netcdf-c/4.6.1 openmpi/4.0.1

export JULIA_NUM_THREADS=${SLURM_CPUS_PER_TASK:=1}
export JULIA_MPI_BINARY=system
export JULIA_CUDA_USE_BINARYBUILDER=false

# run instantiate/precompile serially
julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.build()'
julia --project -e 'using Pkg; Pkg.precompile()'
mpiexec julia --project myscript.jl

The ClimateMachine is CUDA-enabled and will use GPU(s) if available. To run on the CPU, set the CLIMATEMACHINE_SETTINGS_DISABLE_GPU environment variable to true. This can either be done inline with the Julia launch command using

CLIMATEMACHINE_SETTINGS_DISABLE_GPU=true julia --project

or for the whole shell session; for example, with bash this would be

export CLIMATEMACHINE_SETTINGS_DISABLE_GPU=true

If starting multiple jobs, then move the instantiate/build/precompile to a separate job, and add the other jobs as dependencies by passing the --dependency=afterok:<jobid> argument to sbatch.
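
A sketch of that pattern, with hypothetical script names setup.sbatch (instantiate/build/precompile only) and run.sbatch (the actual runs); sbatch --parsable prints just the job id:

setup=$(sbatch --parsable setup.sbatch)
sbatch --dependency=afterok:$setup run.sbatch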

GPU

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --tasks-per-node=2 # number of MPI ranks per node
#SBATCH --gres=gpu:2       # GPUs per node; should equal tasks-per-node
#SBATCH --time=01:00:00


set -euo pipefail # kill the job if anything fails
set -x # echo script

module purge
module load julia/1.5.2 hdf5/1.10.1 netcdf-c/4.6.1 cuda/10.2 openmpi/4.0.4_cuda-10.2 # CUDA-aware MPI

export JULIA_NUM_THREADS=${SLURM_CPUS_PER_TASK:=1}
export JULIA_MPI_BINARY=system
export JULIA_CUDA_USE_BINARYBUILDER=false

julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.build()'
julia --project -e 'using Pkg; Pkg.precompile()'
mpiexec julia --project myscript.jl

Stuff to be aware of

  • Some modules (notably openmpi and netcdf) can get messed up if you load them on the login node before submitting the job. The solution seems to be to run module purge before loading modules on the worker node.