Caltech Central Cluster
- Set up multifactor Duo access via the Caltech Helpdesk:
  - IMSS > Information Security > Duo Request (cell phone app)
  - IMSS > Information Security > Duo Request (hardware Yubikey token)
- Get added to our HPC account: contact Simon Byrne.
The cluster login node is on Caltech's private network and can only be accessed through a VPN connection to Caltech's network or through an SSH jump host.
Steps:
- VPN:
  - Connect to the Caltech VPN
  - `ssh (user)@login.hpc.caltech.edu`
  - Password: (user password)
  - 2FA authentication
  - HPC login node
- SSH jump host (a single-command variant is shown below):
  - `ssh (user)@ssh.caltech.edu`
  - Password:
  - `ssh (user)@login.hpc.caltech.edu`
  - Password: (user password)
  - 2FA authentication
  - HPC login node
For more information regarding cluster login, consult the Caltech HPC documentation: www.hpc.caltech.edu/documentation/
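If you take the jump-host route, the two hops above can also be collapsed into a single command using OpenSSH's `-J` (ProxyJump) flag; this is just a convenience sketch, not something the cluster requires:

    # jump through ssh.caltech.edu to reach the login node in one step
    ssh -J (user)@ssh.caltech.edu (user)@login.hpc.caltech.edu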
SSH connection sharing lets SSH reuse existing connections: you only need to enter your password and token once, and while that session stays connected, subsequent sessions can reuse the same connection. Add the following to your `~/.ssh/config`:

    Host login.hpc.caltech.edu
        ControlMaster auto
        ControlPath ~/.ssh/master-%r@%h:%p
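Optionally (this line is not part of the original snippet), you can add `ControlPersist` to the same `Host` block so the shared connection stays open in the background for a while after your last session exits:

        ControlPersist 8h   # keep the master connection alive for 8 hours after the last session closes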
Access to the HPC cluster is blocked from outside Caltech: to connect you can either

- use the Caltech VPN, or
- tunnel via the Caltech Unix Cluster by adding the following to your `~/.ssh/config`:

    Match host login.hpc.caltech.edu !exec "nc -z login.hpc.caltech.edu 22"
        ProxyJump ssh.caltech.edu

    Host ssh.caltech.edu
        User <caltech username>
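The `Match` line only applies the `ProxyJump` when the login node is not directly reachable. You can run the same reachability test by hand (purely illustrative):

    # exit status 0: port 22 on the login node is directly reachable (on campus or via VPN),
    # so no jump is needed; non-zero: the ProxyJump via ssh.caltech.edu kicks in
    nc -z login.hpc.caltech.edu 22; echo $?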
This can be further improved by adding your public key to ssh.caltech.edu, so that the jump host itself doesn't prompt for a password.
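One common way to do that (a sketch; it assumes you have, or first generate, an SSH key) is `ssh-copy-id`:

    ssh-keygen -t ed25519                # only needed if you don't already have a key
    ssh-copy-id (user)@ssh.caltech.edu   # appends your public key to ~/.ssh/authorized_keys on the jump host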
Unfortunately, the HPC cluster doesn't support passwordless login via public/private keys due to the 2FA requirement. However, combined with the trick above, you can integrate it with your password manager (such as LastPass):
- Install the command-line utility for your password manager (`lastpass-cli` for LastPass).
- Figure out how to query it for your Caltech password (e.g. `lpass show --color=never --password caltech.edu` for LastPass).
- (If on a Mac) install `util-linux` from Homebrew.
- Make sure `~/bin` is in your `PATH` environment variable; if not, add `export PATH="$PATH:$HOME/bin"` to your `~/.profile` or `~/.bashrc` file.
- Create the following files:
  - `~/bin/sshhpc`:

        #!/usr/bin/env bash
        set -euo pipefail
        # use the helper script below to answer SSH's prompts
        export SSH_ASKPASS=~/bin/sshhpc-password
        export SSH_ASKPASS_REQUIRE=prefer
        export HPCPASS=$(lpass show --color=never --password caltech.edu)  # or whatever query string you use
        rm -f ~/.nextprompt
        # open a background master connection (no command, no shell)
        ssh login.hpc.caltech.edu -fN

  - `~/bin/sshhpc-password`:

        #!/usr/bin/env bash
        if [ -f ~/.nextprompt ]; then
            # second prompt: answer the Duo 2FA prompt with option 1
            echo 1
            rm ~/.nextprompt
        else
            # first prompt: supply the Caltech password fetched by sshhpc
            echo "$HPCPASS"
            touch ~/.nextprompt
        fi

- Make `~/bin/sshhpc` and `~/bin/sshhpc-password` executable with `chmod +x <filepath>`.
Now you can run `sshhpc`, which will start up a background SSH session, prompting for your password manager password and 2FA token. Subsequent calls to `ssh login.hpc.caltech.edu` should then work without any prompts.
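A typical session then looks like this (the `scp` file name is just a placeholder):

    sshhpc                                    # prompts once: password manager + 2FA
    ssh login.hpc.caltech.edu                 # reuses the master connection, no prompts
    scp results.nc login.hpc.caltech.edu:~/   # scp/rsync go through the same shared connection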
To get an interactive session with a GPU for 2 hours:

    salloc -t 02:00:00 -n 1 -N 1 --gres=gpu:1
You can alternatively use

    srun --pty -t 02:00:00 -n 1 -N 1 --gres gpu:1 /bin/bash -l

but once the session has started, you should `unset SLURM_STEP_ID` if using Julia with MPI.
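For example, the first things to do in the shell that `srun` opens might be (a minimal sketch):

    # inside the interactive shell started by srun
    unset SLURM_STEP_ID      # see the note above: needed when using Julia with MPI
    module load julia/1.5.2  # the module listed below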
Note that Julia 1.5.2 is the currently supported version; you can load it with

    module load julia/1.5.2
To use (CUDA-aware) MPI:

    module load cuda/10.2 openmpi/4.0.4_cuda-10.2

Then rebuild MPI.jl:

    julia --project -e 'ENV["JULIA_MPI_BINARY"]="system"; using Pkg; Pkg.build("MPI"; verbose=true)'
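As an optional sanity check (not part of the original recipe), you can confirm that MPI.jl is using the system OpenMPI by running a trivial two-rank program from inside an allocation:

    # each rank prints its rank number; 0 and 1 should appear
    mpiexec -n 2 julia --project -e 'using MPI; MPI.Init(); println(MPI.Comm_rank(MPI.COMM_WORLD)); MPI.Finalize()'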
We have one reserved node with 4 GPUs on the cluster, which is accessible to all members of the group by adding `--reservation=clima` to your `srun` or `sbatch` command. This is intended for interactive development and short tests: please don't use it for long batch jobs.
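For example, combining the reservation with the interactive command from above:

    # 2-hour interactive session on the reserved clima node
    salloc -t 02:00:00 -n 1 -N 1 --gres=gpu:1 --reservation=clima

A sample batch script for a CPU-only job: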
    #!/bin/bash
    #SBATCH --nodes=1              # number of nodes
    #SBATCH --tasks-per-node=2     # number of MPI ranks per node
    #SBATCH --cpus-per-task=1      # number of CPU threads per MPI rank
    #SBATCH --time=1:00:00         # walltime

    set -euo pipefail  # kill the job if anything fails
    set -x             # echo script

    module purge
    module load julia/1.5.2 hdf5/1.10.1 netcdf-c/4.6.1 openmpi/4.0.1

    export JULIA_NUM_THREADS=${SLURM_CPUS_PER_TASK:=1}
    export JULIA_MPI_BINARY=system
    export JULIA_CUDA_USE_BINARYBUILDER=false

    # run instantiate/precompile serially
    julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.build()'
    julia --project -e 'using Pkg; Pkg.precompile()'

    mpiexec julia --project myscript.jl
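Assuming the script above is saved as, say, `job.sh` (a placeholder name), it can be submitted and monitored with standard Slurm commands:

    sbatch job.sh        # submit the job; prints the job id
    squeue -u $USER      # list your queued and running jobs
    scancel <jobid>      # cancel a job if needed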
The `ClimateMachine` is CUDA-enabled and will use GPU(s) if available. To run on the CPU, set the `CLIMATEMACHINE_SETTINGS_DISABLE_GPU` environment variable to `true`. This can either be done inline with the Julia launch command:

    CLIMATEMACHINE_SETTINGS_DISABLE_GPU=true julia --project

or for the whole shell session, which with bash would be:

    export CLIMATEMACHINE_SETTINGS_DISABLE_GPU=true
If starting multiple jobs, move the instantiate/build/precompile step into a separate job, and add the other jobs as dependencies by passing the `--dependency=afterok:<jobid>` argument to `sbatch`.
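For example (the script names here are placeholders), using `sbatch --parsable` to capture the job id:

    # one-off setup job: instantiate/build/precompile the project
    setup_id=$(sbatch --parsable setup.sh)
    # production jobs start only if the setup job finishes successfully
    sbatch --dependency=afterok:${setup_id} run_a.sh
    sbatch --dependency=afterok:${setup_id} run_b.sh

A sample batch script for a GPU job with CUDA-aware MPI: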
    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=2    # number of MPI ranks per node
    #SBATCH --gres=gpu:2          # GPUs per node; should equal tasks-per-node
    #SBATCH --time=01:00:00

    set -euo pipefail  # kill the job if anything fails
    set -x             # echo script

    module purge
    module load julia/1.5.2 hdf5/1.10.1 netcdf-c/4.6.1 cuda/10.2 openmpi/4.0.4_cuda-10.2  # CUDA-aware MPI

    export JULIA_NUM_THREADS=${SLURM_CPUS_PER_TASK:=1}
    export JULIA_MPI_BINARY=system
    export JULIA_CUDA_USE_BINARYBUILDER=false

    julia --project -e 'using Pkg; Pkg.instantiate(); Pkg.build()'
    julia --project -e 'using Pkg; Pkg.precompile()'

    mpiexec julia --project myscript.jl
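As an optional check (not from the original instructions), you can verify inside a GPU allocation that Julia actually sees a device; `CUDA.has_cuda_gpu()` is provided by CUDA.jl:

    # should print "true" on a GPU node with the cuda module loaded
    julia --project -e 'using CUDA; println(CUDA.has_cuda_gpu())'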
- Some modules (notably `openmpi` and `netcdf`) can get messed up if you load them on the login node before submitting the job. The solution seems to be to run `module purge` before loading modules on the worker node.