Executing Albany on Ride or White

Jerry Watkins edited this page Feb 8, 2017 · 1 revision

These instructions describe how to run Albany on the Ride and White IBM Power8 GPU clusters at Sandia National Laboratories. Jobs are submitted to a queue manager with batch scripts, and a submitted script runs once the requested resources become available.

As of February 2017, Ride and White are each split into three queues, which differ in node count, GPU model, and number of GPUs per node:

Ride                             Queue Name  Node Names         Number of Nodes  GPU Model    Number of GPUs per Node
Firestone nodes (default queue)  rhel7F      ride7 - ride16     10               K80 (12GB)   4
Garrison nodes                   rhel7G      ride17 - ride28    12               P100 (16GB)  4
Tuleta nodes                     rhel7T      ride2 - ride5      4                K40m (12GB)  2

White                            Queue Name  Node Names         Number of Nodes  GPU Model    Number of GPUs per Node
Firestone nodes (default queue)  rhel7F      white20 - white27  8                K80 (12GB)   4
Garrison nodes                   rhel7G      white28 - white35  8                P100 (16GB)  4
Tuleta nodes                     rhel7T      white13 - white19  7                K40m (12GB)  2

Ride and White use LSF as a resource manager and job scheduler. Here is a list of useful commands:

  • bsub -Is bash - Submit an interactive job to the LSF system
  • bsub < [BatchScriptFile] - Submit a batch job to the LSF system where [BatchScriptFile] refers to the batch script file being used
  • bkill [JobID] - Kill a job, where [JobID] is the job number reported by bjobs
  • bjobs - See the status of user jobs in the LSF queue
  • bjobs -u all - See the status of all jobs in the LSF queue
  • bqueues - Information about LSF batch queues
  • bqueues -l - More detailed information about the settings for each queue

A useful reference for LSF commands can be found here.
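
As a rough illustration of how these commands fit together, the sketch below submits a batch script, extracts the job number from the confirmation message, and queries that job. It assumes bsub prints a line of the form "Job <ID> is submitted to ..." on standard output, which may vary between LSF versions. parse_job_id and the BATCH_SCRIPT variable are hypothetical names introduced here, not part of LSF or Albany.

```shell
#!/bin/bash
# Sketch: submit a batch script, capture the job ID, and check on the job.
# Assumption: bsub confirms with a line like "Job <12345> is submitted to queue <rhel7G>."

# parse_job_id (hypothetical helper): extract the numeric job ID from bsub output.
parse_job_id() {
  printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Only talk to LSF when bsub exists and BATCH_SCRIPT names a real file,
# so the sketch does nothing on machines without LSF.
if command -v bsub >/dev/null 2>&1 && [ -f "${BATCH_SCRIPT:-}" ]; then
  job_id=$(parse_job_id "$(bsub < "$BATCH_SCRIPT")")
  echo "Submitted job ${job_id}"
  bjobs "${job_id}"      # status of this job only
  # bkill "${job_id}"    # uncomment to cancel the job
fi
```

Capturing the job number at submission time avoids having to look it up with bjobs before calling bkill.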

Executing MPI+GPU jobs with Kokkos::Cuda

The following script executes Albany with 8 MPI ranks across 2 nodes (4 ranks per node). Because each pair of GPUs is attached to one socket, --map-by ppr:2:socket is used to place 2 MPI ranks on each socket. --kokkos-ndevices=4 sets the number of GPUs used per node.

#!/bin/bash -login

#BSUB -J MPIGPUjob          # Job Name
#BSUB -o MPIGPUjob.%J.out   # Standard output filename (%J is the job number)
#BSUB -e MPIGPUjob.%J.err   # Standard error filename
#BSUB -q rhel7G             # Queue Name
#BSUB -m "ride27 ride28"    # Node Names
#BSUB -n 8                  # Number of processors
#BSUB -R "span[ptile=4]"    # Number of processors per node
#BSUB -W 02:00              # Runtime limit [Hours]:[Minutes]
#BSUB -x                    # No other jobs can run on this node

# Disable core dump files to limit disk usage
ulimit -c 0

# Load modules
source ${HOME}/Albany/doc/ride-white/modules_cuda.sh

# Run MPIGPU job
mpirun -n 8 --map-by ppr:2:socket [AlbanyExecutable] [InputFile] --kokkos-ndevices=4
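
The rank counts in the script above follow from the node layout, and the sketch below makes that arithmetic explicit for a different number of nodes. It assumes the 4-GPU node layout described above (2 sockets per node, 2 GPUs per socket) and one MPI rank per GPU. The Tuleta nodes have only 2 GPUs per node and would need different values. print_layout is a hypothetical helper, not part of Albany or LSF.

```shell
#!/bin/bash
# Sketch only: print batch settings for a given number of 4-GPU nodes.
# Assumed layout: 2 sockets per node, 2 GPUs per socket.
SOCKETS_PER_NODE=2
GPUS_PER_SOCKET=2

# print_layout (hypothetical helper): derive the -n, ptile and mpirun settings.
print_layout() {
  nodes=$1
  ranks_per_node=$((SOCKETS_PER_NODE * GPUS_PER_SOCKET))  # one MPI rank per GPU
  total_ranks=$((nodes * ranks_per_node))
  echo "#BSUB -n ${total_ranks}"
  echo "#BSUB -R \"span[ptile=${ranks_per_node}]\""
  echo "mpirun -n ${total_ranks} --map-by ppr:${GPUS_PER_SOCKET}:socket [AlbanyExecutable] [InputFile] --kokkos-ndevices=${ranks_per_node}"
}

print_layout 2   # the 2-node case used in the script above
```

Calling print_layout with a different node count shows which three settings change together: the total rank count appears in both #BSUB -n and mpirun -n, while the per-node values stay fixed.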