Using Slurm to submit jobs on Cannon (Part 1)

The Cannon cluster uses the Slurm scheduler to manage its computational resources. Here we provide a brief overview of how you can use Slurm to schedule jobs.

We also recommend that you read the Running Jobs page on the FASRC documentation site, which contains more detailed information about Slurm.

Where will I run my jobs?

The SEAS partitions

As of Jan 1, 2023, all of the hardware owned by PIs in Harvard-SEAS has been pooled together into the following partitions:

  • huce_cascade: Approximately 6400 compute cores. Suitable for GEOS-Chem and/or GCHP simulations.
  • seas_compute: Approximately 5000 compute cores. Suitable for GEOS-Chem and/or GCHP simulations.
  • sapphire: Approximately 21,500 compute cores. Suitable for GEOS-Chem and/or GCHP simulations.
  • seas_gpu: Approximately 2500 Graphics Processing Units (GPUs). Suitable for Machine Learning and related applications.

For more information, please see the SEAS compute resources page at the FASRC documentation site.
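If you would like to check the current state of these partitions (node counts, time limits, and how busy they are), you can query them with the sinfo command. This is only a quick sketch; the partition names are taken from the list above.

# Show node counts, time limits, and node states for the SEAS partitions
sinfo -p huce_cascade,seas_compute,sapphire,seas_gpu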

The test partition

You can submit interactive jobs (i.e. a command-line window on a computational node) to any partition. Interactive jobs are particularly useful for compiling GEOS-Chem and/or GCHP, or for running interactive data analysis/plotting code.

Submitting your interactive jobs to the seas_compute partition might result in long wait times, and will also count against your group's fairshare. For this reason, we recommend using the test partition for all interactive sessions. The test partition allows you to use the following resources:

  • Up to 5 simultaneous interactive sessions
  • Up to 12 hours of requested time
  • 96 cores per user
  • 384 GB memory per user
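
You can verify the limits listed above yourself (they may change over time) by querying the partition with scontrol, for example:

# Print the test partition's configured limits (e.g. MaxTime, MaxNodes)
scontrol show partition test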

Other available partitions

Cannon has other partitions (described here in detail) that you can use. However, your job will be competing for resources with users across the entire Cannon cluster.

Useful SLURM commands

Before we go much further, please take a moment to review some of the more commonly used SLURM commands.
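
As a quick reference, here are a few Slurm commands that you are likely to use often. This is only a sketch; myjob.sbatch and <JOB-ID> are placeholders for your own script name and job ID.

squeue -u $USER          # List your pending and running jobs
sbatch myjob.sbatch      # Submit a batch job script
scancel <JOB-ID>         # Cancel a job by its job ID
sinfo -p test            # Show the state of nodes in the test partition
sacct -j <JOB-ID>        # Show accounting information for a job
sshare -u $USER          # Show your fairshare information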

Requesting interactive jobs

When you log into Cannon, you will be placed on a login node. The login nodes are sufficient for light computation, but for more CPU-intensive tasks (e.g. running GEOS-Chem in interactive mode, compiling with more than one processor, running IDL scripts), you should request resources with the SLURM salloc command and then log into the resources provided with ssh.

Example: Interactive job (4 cores, 8GB memory, 2 hours)

salloc --x11=all -c 4 -N 1 --mem=8000 -t 0-02:00 -p test
source ~/envs/gcc_cmake.gfortran102_cannon.env

Example: Interactive job (8 cores, 12GB memory, 8 hours)

salloc --x11=all -c 8 -N 1 --mem=12000 -t 0-08:00 -p test
source ~/envs/gcc_cmake.gfortran102_cannon.env

salloc command syntax

The SLURM salloc command takes these arguments:

salloc --x11=all -c <NUMBER-OF-CORES> -N <NUMBER-OF-NODES> --mem=<MEM> -t <TIME> -p <PARTITION>

where:

-p <PARTITION>

  • Requests a specific partition (aka queue) for the resource allocation. We recommend starting all interactive sessions in the Cannon test partition.

--x11=all

  • Starts X11 display (for graphical window display).

-c <NUMBER-OF-CORES>

  • Specifies the number of cores per node that your job will use.

-N <NUMBER-OF-NODES>

  • Requests the number of nodes that will be allocated to this job.
    • For GEOS-Chem "Classic" simulations, you can only use 1 node due to limitations of the OpenMP parallelization.
    • For GCHP simulations, you may use more than one node.

--mem=<MEM>

  • Specifies the real memory required per node, in megabytes (MB).

-t <TIME>

  • Specifies the time limit for the interactive job. Acceptable formats are minutes, hours:minutes:seconds, and days-hours:minutes (see the examples below).
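
For example, the following requests are equivalent ways of asking for a 90-minute interactive job (the core, memory, and partition values are only illustrative):

salloc --x11=all -c 4 -N 1 --mem=8000 -t 90 -p test         # 90 minutes
salloc --x11=all -c 4 -N 1 --mem=8000 -t 1:30:00 -p test    # hours:minutes:seconds
salloc --x11=all -c 4 -N 1 --mem=8000 -t 0-01:30 -p test    # days-hours:minutes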

After you request an interactive session, you may notice that your login prompt changes. For example, when you log into Cannon via login.rc.fas.harvard.edu, your Unix prompt may look like this:

USER@holylogin04 $

But in the interactive session, your prompt may look something like this:

USER@holyc19315 $

NOTE: If you are on one of the holy* nodes on Cannon, then you are on a machine in Holyoke, MA (about 100 miles from Harvard).

Make sure to set OMP_NUM_THREADS properly

Note that SLURM only allocates a number of CPUs to your job; it does not tell GEOS-Chem how many cores to use. Parallelized GEOS-Chem simulations will use the number of cores specified by the environment variable $OMP_NUM_THREADS.

$OMP_NUM_THREADS will be set automatically to the number of CPUs that you requested in your interactive session when you source one of the GEOS-Chem environment files.

If for some reason you want to change the value of $OMP_NUM_THREADS within an interactive session, simply type:

export OMP_NUM_THREADS=<NUMBER-OF-CORES>

where <NUMBER-OF-CORES> is the new number of cores that you want to use.
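
If you are not using one of the GEOS-Chem environment files, one way to keep $OMP_NUM_THREADS in sync with your Slurm allocation is to set it from $SLURM_CPUS_PER_TASK, which Slurm defines when you request cores with -c. This is just a sketch:

# Use the number of cores allocated by Slurm; fall back to 1 if the
# variable is not defined (e.g. outside of a Slurm job)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "GEOS-Chem will use $OMP_NUM_THREADS cores"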

Problem with Cannon interactive sessions freezing up

Cannon interactive sessions will freeze if left idle for more than an hour. An easy way to prevent this from happening is to open a new tmux session once your interactive job starts. Use this command:

$ tmux new -s my_session

Before logging out of the interactive session, terminate the tmux session by typing:

$ exit
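
If your connection drops, or if you detach from tmux (press Ctrl-b, then d), the session keeps running and you can reattach to it by name. For example:

$ tmux ls                       # List your running tmux sessions
$ tmux attach -t my_session     # Reattach to the session named my_session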

Accessing sites via SSH from Cannon interactive jobs

We have found that forwarding your SSH private key from your PC or Mac does not propagate to Cannon interactive sessions properly. We recommend setting up another keypair on Cannon and adding the corresponding public key to any sites that you need to access via ssh (such as GitHub). Once your Cannon interactive job starts, run the ssh-agent with these commands:

$ eval $(ssh-agent -s)
$ ssh-add ~/.ssh/YOUR-PRIVATE-KEY-ON-CANNON

Then add the corresponding public key to all websites (e.g. GitHub) that you would like to access from within the interactive session.

Note that YOUR-PRIVATE-KEY-ON-CANNON must be readable and writable only by you (i.e. with permissions rw-------, aka chmod 600). For more information, please see the Set up SSH keys page.
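
A minimal sketch of creating such a keypair on Cannon is shown below; the file name id_ed25519_cannon is just an example, and you may choose any name you like.

$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_cannon   # Generate a new keypair
$ chmod 600 ~/.ssh/id_ed25519_cannon                  # Private key: readable/writable only by you
$ cat ~/.ssh/id_ed25519_cannon.pub                    # Public key: paste this into GitHub, etc.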