Simon Byrne edited this page Dec 1, 2023 · 9 revisions

clima.gps.caltech.edu is a GPU node with 8x NVIDIA A100 GPUs.

Getting access

Email help-gps@caltech.edu to request access.

Setting up

Unlike central, clima has only a handful of modules available. The recommended approach is to install any other software you need in your home directory.

SSH config

Add the following to your local ~/.ssh/config file:

Host clima
  HostName clima.gps.caltech.edu
  User [username]

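To access from outside the network, either use the Caltech VPN or add the following to your ~/.ssh/config to jump through ssh.caltech.edu. The rule only applies when login.hpc.caltech.edu is not directly reachable: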

Match final host !ssh.caltech.edu,*.caltech.edu !exec "nc -z -G 1 login.hpc.caltech.edu 22"
  ProxyJump ssh.caltech.edu
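
To confirm that these entries are picked up, `ssh -G` prints the settings OpenSSH would apply to a host without connecting. A minimal sketch, assuming an OpenSSH client recent enough to support `-G`; the `ssh_settings` helper name is made up for illustration:

```shell
# Hypothetical helper: show the resolved hostname, user and jump host for an alias.
# "ssh -G" only evaluates the configuration; it does not open a connection.
ssh_settings() {
    ssh -G "$@" | grep -i -E '^(hostname|user|proxyjump) '
}

# Only run the check when an ssh client is installed.
if command -v ssh >/dev/null 2>&1; then
    ssh_settings clima || echo "could not resolve settings for clima"
fi
```

If the `hostname` line shows clima.gps.caltech.edu, the alias is being used; a `proxyjump` line should appear only when the jump rule matched.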

About the machine

Storage

  • /home/[username] (capped at 1TB): mounted from sampo and backed up
  • /net/sampo/data1 (200TB): mounted from sampo. Not backed up, but somewhat protected by a redundant RAID partition
  • /scratch (70TB): fast SSD, not backed up and with no RAID redundancy
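
Since /net/sampo/data1 and /scratch are not backed up, it can help to check free space before writing large outputs. A rough sketch using the standard `df` tool; the `storage_report` function name is hypothetical, and the paths are the ones listed above:

```shell
# Hypothetical helper: print free space for each path that exists on this machine.
storage_report() {
    for path in "$@"; do
        if [ -d "$path" ]; then
            echo "== $path =="
            df -hP "$path"      # -h: human-readable sizes, -P: one line per filesystem
        else
            echo "$path: not found"
        fi
    done
}

storage_report "$HOME" /net/sampo/data1 /scratch
```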

CPU usage

  • top shows live per-process CPU and memory usage

GPUs

clima has 8× NVIDIA A100 80GB GPUs, connected via NVLink.

  • nvidia-smi gives a summary of all the GPUs
    • nvidia-smi topo -m shows the connections between GPUs and CPUs
  • nvtop gives you a live-refresh of current GPU usage
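
To see at a glance which GPUs are already occupied, the CSV output of `nvidia-smi --query-gpu` can be filtered. This is only a sketch: the `busy_gpus` helper and the 1000 MiB default threshold are illustrative assumptions, not part of the clima setup.

```shell
# Hypothetical helper: read "index, memory.used" CSV rows (MiB, no header or units)
# on stdin and print the index of every GPU using more than a threshold of memory.
busy_gpus() {
    threshold_mib="${1:-1000}"   # arbitrary cutoff chosen for illustration
    awk -F', *' -v t="$threshold_mib" '$2 + 0 > t + 0 { print $1 }'
}

# Only query the driver when nvidia-smi is installed (i.e. on the GPU node).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | busy_gpus
fi
```

An empty result suggests that no GPU is holding more memory than the threshold; it does not replace scheduling through Slurm.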

Software

clima has a single-node installation of Slurm.

We have set up a common environment. You can load it with

module load common

which currently loads

openmpi/4.1.5-cuda julia/1.9.3 cuda/julia-pref

This will set the appropriate Julia preferences, so you should not need to call, for example, MPIPreferences.use_system_binary().
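
After loading the module, a quick sanity check is to confirm that the expected executables are on your PATH. This sketch assumes the modules listed above provide `julia` and `mpiexec`; `check_common_env` is a made-up helper name.

```shell
# Hypothetical helper: report where julia and mpiexec resolve from, if anywhere.
check_common_env() {
    for tool in julia mpiexec; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: $(command -v "$tool")"
        else
            echo "$tool: not found"
        fi
    done
}

# "module" is only defined where environment modules are installed.
if command -v module >/dev/null 2>&1; then
    module load common
fi
check_common_env
```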

Usage etiquette

Please avoid using clima for long-running CPU-only jobs. The Resnick HPC cluster is better for that.

While GPUs can be used directly, it is always recommended to schedule jobs using Slurm: this prevents allocation of multiple jobs on the same GPU, which can cause significant performance degradation.

For example

$ srun --gpus=2 --pty bash -l # request a session with 2 GPUs

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-1768fcec-d945-7435-1f8e-85d30cdf310e)
GPU 1: NVIDIA A100-SXM4-80GB (UUID: GPU-6420b6b9-bb34-a58d-8090-61887fd97931)

See also notes on interactive jobs via Caltech-HPC: https://www.hpc.caltech.edu/documentation/slurm-commands
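
For non-interactive work, the same resource request can go in a batch script submitted with `sbatch`. A minimal sketch: the file name gpu_job.sbatch, the job name, the one-hour time limit and my_script.jl are all placeholders, and the `-l` on the shebang assumes that `module` is set up by login-shell startup files.

```shell
# Write a hypothetical batch script; every name in it is a placeholder.
cat > gpu_job.sbatch <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=example
#SBATCH --gpus=2
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j.out

module load common
nvidia-smi -L                  # record which GPUs were allocated
julia --project my_script.jl   # placeholder for your own script
EOF

# Submit only where Slurm is available (i.e. on clima itself).
if command -v sbatch >/dev/null 2>&1; then
    sbatch gpu_job.sbatch
else
    echo "sbatch not found; run this on clima"
fi
```

`squeue -u $USER` then lists your pending and running jobs.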
