When dealing with computational problems, limits on time and resources can make a solution hard to reach. Given enough time and computing power, we can often find the answer, and a high-performance supercomputer or cluster can help us tackle the problem. The following are some great workshops about HPC:
- Introduction to High-Performance Computing
- Introduction to using the shell in a High-Performance Computing context
Using HPC systems typically involves working with a shell through a command line interface, which is a prerequisite for this topic (see here).
This tutorial covers basic scheduling commands, submitting jobs, transferring files between a local computer and a cluster, and installing software on clusters.
On an HPC system, we need a scheduler to manage how jobs run on the cluster. One of the most common schedulers is SLURM. The following are some practical SLURM commands (quick start user guide):
sinfo -s # shows summary info about all partitions
sjstat -c # shows computing resources info
srun # run parallel jobs
sbatch # submit a job to the scheduler
JOB_ID=$(sbatch --parsable file.sh) # capture the job ID right after submitting
sbatch --dependency=afterok:$JOB_ID file.sh # submit a job that starts after another job finishes successfully
sbatch --dependency=singleton # submit a job that starts after earlier jobs with the same name end
sacct # displays accounting data for all jobs and job steps in the SLURM job accounting log
squeue -u <userid> # check on a user's job status
squeue -u <userid> --start # show estimated start time of pending jobs
scancel JOBID # cancel the job with JOBID
scancel -u <userid> # cancel all the user's jobs
To see more details about these commands, use <command> --help.
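For instance, a quick way to check the state of the cluster and of our own jobs might look like the following sketch (assuming the username is available in the $USER shell variable):
sinfo -s                                              # summary of partitions and their node states
squeue -u "$USER"                                     # our pending and running jobs
sacct -u "$USER" --format=JobID,JobName,State,Elapsed # accounting for our recent jobs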
Let’s connect to the cluster through ssh user@server and do some practice. For example, use nano example-job.sh to make a job file including:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --mem 16G
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --partition hpc0
#SBATCH --account general
#SBATCH --time 02-05:00
#SBATCH --job-name NewJobName
#SBATCH --mail-user your@email.com
#SBATCH --mail-type END
echo 'This script is running on:'
hostname
sleep 120
The special characters #! (shebang) at the beginning of a script specify which program should be used to run it (e.g. /bin/bash or /usr/bin/python3). SLURM uses the #SBATCH special comment to denote scheduler-specific options. To see more options, use sbatch --help.
For example, the above file requests 1 node, 16 gigabytes of memory, 1 task with 4 CPUs per task, the hpc0 partition, and the general account for 2 days and 5 hours of walltime; it also gives the job a new name and emails you when the job ends. Now we can submit the job file with sbatch example-job.sh. We can use squeue -u USER or sacct to check the job status, and scancel JOBID to cancel the job. You may find more sbatch options here.
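Putting these pieces together, a typical session might look like the following sketch, where <JOBID> stands for the ID printed by sbatch:
sbatch example-job.sh   # submit; prints "Submitted batch job <JOBID>"
squeue -u "$USER"       # watch the job while it is queued or running
sacct -j <JOBID>        # accounting info for this job
scancel <JOBID>         # cancel the job if needed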
To run a single command, we can use srun. For instance, srun -c 2 echo "This job will use 2 CPUs." submits a job and allocates 2 CPUs. We can also use srun to open a program in interactive mode. For example, srun --pty bash will open a Bash shell on a compute node (chosen by the scheduler).
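We can also request specific resources for an interactive session; for example (the partition name follows the earlier example and may differ on your cluster):
srun --partition=hpc0 --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash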
Note: in general, when we connect to a cluster we land on a node called the login node, which is not meant for heavy computational tasks. So, to run our computations properly, we should always use either sbatch or srun.
Usually there are many modules available on clusters. To find and load these modules use:
module avail # shows all available modules (programs) on the cluster
module load <name> # load a module, e.g. module load R or python
module list # shows the list of loaded modules
module unload <name> # unload a module
module purge # unload all modules
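For example, a typical sequence to find and use a Python module might look like this sketch (the exact module names depend on the cluster):
module avail python   # list available modules matching "python"
module load python    # load a module (the real name may include a version)
module list           # confirm it is loaded
module unload python  # unload it when finished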
To create a simple template sbatch job file, use the following steps:
- generate the files containing all the code we want to run on the cluster (these could be several Python, R, or other scripts)
- generate a Bash file that loads all modules required for the job (environment.sh)
- generate a Bash file that calls steps 1 and 2 and includes all #SBATCH options (job_file.sh)
- use sbatch to submit the file from step 3
For example, let’s run the following Python code called test.py:
#!/usr/bin/python3
print("Hello world")
Then use nano environment.sh to create the environment file including:
#!/bin/bash
module load miniconda3
Then use nano job-test.sh to make the job file including:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Test1
echo === $(date)
echo $SLURM_JOB_ID
source ./environment.sh
module list
srun python3 ./test.py
echo === $(date) $(hostname)
Now we can use sbatch job-test.sh to run this job.
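Assuming no --output option is set, sbatch writes the job's standard output to slurm-<jobid>.out in the submission directory, so one way to submit and then inspect the result is:
JID=$(sbatch --parsable job-test.sh)  # submit and capture the job ID
squeue -j "$JID"                      # check whether it is still queued or running
cat "slurm-${JID}.out"                # read the output once the job has finished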
If there are dependencies between jobs, SLURM can defer the start of a job until the specified dependencies have been satisfied. For instance, let’s create another job called job-test-2.sh:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Test2
echo === $(date)
echo $SLURM_JOB_ID
echo === This is a new job
echo === $(date) $(hostname)
We need another job, called job-test-3.sh, to run both job-test.sh and job-test-2.sh:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Dependency
echo === $(date)
JID=$(sbatch --parsable job-test.sh)
echo $JID
sbatch --dependency=afterok:$JID job-test-2.sh
echo === $(date) $(hostname)
Here JID is the job ID of sbatch job-test.sh, which job-test-2.sh depends on. Now, by running sbatch job-test-3.sh we make sure that job-test-2.sh runs only after job-test.sh has completed successfully.
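Besides afterok, SLURM supports other dependency types; for example, afterany starts the dependent job once the listed job ends regardless of its exit status, and singleton waits for earlier jobs with the same name and user to finish (replace <jobid> with a real job ID):
sbatch --dependency=afterany:<jobid> job-test-2.sh  # start after <jobid> ends, success or failure
sbatch --dependency=singleton job-test-2.sh         # start after earlier jobs named "Test2" finish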
Note that there are other tools, such as Snakemake, that can be used for workflow management.
We can use secure copy, or scp, to transfer files between a local computer and a cluster. For example, let’s transfer code_example.py from the temp/ directory on the remote.edu cluster to the Documents/ directory on your local computer, while working from the local computer. For this we can use:
cd ~/Documents
scp user@remote.edu:/temp/code_example.py .
The . at the end of the command means: copy the file here, keeping the same name as the source file. To do the reverse:
cd ~/Documents
scp code_example.py user@remote.edu:/temp/.
To recursively copy a directory (with all files in the directory), we just need to add the -r (recursive) flag. For example, to download the temp folder, use:
cd ~/Documents
scp -r user@remote.edu:/temp .
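If the cluster listens on a non-default SSH port, or we want compression for large text files, scp also accepts the -P (port) and -C (compression) flags; for example (the port number below is just a placeholder):
scp -P 2222 -C user@remote.edu:/temp/code_example.py .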
Rsync is a fast, versatile, remote (and local) file-copying tool. Rsync has two great features: first, it syncs your data (i.e. it only transfers files that have changed since the last transfer), and second, its compress option makes transferring large files easier. To use rsync, follow:
# From local to remote
rsync local-directory user@remote.edu:remote-directory
# From remote to local
rsync user@remote.edu:remote-directory local-directory
This transfers files from local-directory on the local machine into remote-directory on the remote machine (add -r to copy directories recursively). Some important options for rsync are (use rsync --help to see all options):
- -r, --recursive: recurse into directories
- -v, --verbose: increase verbosity
- -h, --human-readable: output numbers in a human-readable format
- -z, --compress: compress file data during the transfer
- -P, --partial --progress: keep partially transferred files, which should make a subsequent transfer of the rest of the file much faster
For example:
rsync -rPz ./home/myfiles user@remote.edu:./myproject
This will transfer files in “partial” mode from ./home/myfiles/ on the local machine to the remote ./myproject directory. Additionally, compression is used to reduce the size of the data portions of the transfer.
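Before a large transfer, it can be useful to preview what rsync would copy by adding the -n (--dry-run) flag, for example:
rsync -rPzv -n ./home/myfiles user@remote.edu:./myproject  # only show what would be transferred
rsync -rPzv ./home/myfiles user@remote.edu:./myproject     # then run the real transfer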
The other method to transfer data between a local machine and a cluster is the SSH file transfer protocol, or sftp. The great advantage of this method is tab completion on both the local and remote sides, which makes finding origins and destinations much easier. We can connect to a cluster through sftp much like ssh, by running sftp username@server. We can also use most Bash commands within sftp and access both the cluster and the local computer at the same time. Usually we can apply a command to the local system by adding l to the beginning of the command. For example:
pwd # print working directory on the cluster
lpwd # print working directory on the local computer
cd # change directory on the cluster
lcd # change directory on the local computer
Use put to upload and get to download a file.
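A short sftp session that uploads a script and downloads a result file might look like this sketch (file names are placeholders; lcd and cd move around the local and remote sides, and put and get copy files up and down):
sftp user@remote.edu
sftp> lcd ~/Documents
sftp> cd /temp
sftp> put code_example.py
sftp> get results.txt
sftp> exit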
We can also use the wget command to download files from the web directly onto the cluster.
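For example, while logged into the cluster (the URL below is just a placeholder):
wget https://example.com/data/archive.tar.gz                 # download into the current directory
wget -O data.tar.gz https://example.com/data/archive.tar.gz  # or save it under a chosen name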
When we log in to a cluster, as a user we only have permission to change user-level files (the home directory, cd ~, and its subdirectories). So we will never be able to install or update software located under the root directory (cd /). Note that we can find the location of a piece of software with the module show <software-name> command.
As a cluster user, we have several ways to build our own system and install and update the software we need:
- Python: If we only need a few Python packages, probably the easiest way is to make a virtual environment with the venv module in Python 3. After that we can use the pip package manager to install packages (see the sketch after this list).
- Miniconda: It lets us install many kinds of software, including Python, R, and their packages. We can try module load anaconda3 to load the module and then use conda to create a virtual environment and install software and packages. Note that if the cluster does not include miniconda3, you may use the third option to install it first. Review Virtual environments in Python to learn more.
- Spack: It offers a wider variety of software and packages to install (see here). To use Spack, we need to install it in a local directory (under the home directory, cd ~) and then use spack to install and load packages. Note that this way might take more time to install Spack and the required modules, so first make sure the second option cannot cover your requirements. Review Install software with Spack to learn more.
- Manually: There is still plenty of software that is not available through Conda or Spack. We should follow the software's instructions to install it. Make sure to review the README or INSTALL file (if they exist) and check the configure options, ./configure --help, in the installation directory. Since we are not using the root directory, make sure to use a directory where all the dependencies are already installed (e.g. ./configure --prefix=${PWD}). This can be the hardest way, so first make sure Conda and Spack cannot help you. Note that software names might be slightly different in Conda or Spack, so have a look at all names that are close.
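As a minimal sketch of the first (venv) option, assuming a Python module is available on the cluster and using numpy only as an example package:
module load python            # or python3, depending on the cluster
python3 -m venv ~/my-env      # create a virtual environment in the home directory
source ~/my-env/bin/activate  # activate the environment
pip install numpy             # install packages inside the environment, not system-wide
deactivate                    # leave the environment when done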