This guide is a good starting point for the requirements.
- Install docker and sudoless docker. More information is available in the RCP documentation on containers and on preparing environments.
- Install kubernetes
  - Follow the kubernetes instructions on wiki.rcp.epfl.ch to install kubernetes.
  - If running kubectl version gives a "The connection to the server localhost:8080 was refused..." message, you might need to create a .kube/config file and run curl https://wiki.rcp.epfl.ch/public/files/kube-config.yaml -o ~/.kube/config && chmod 600 ~/.kube/config to configure the cluster.
- Install runai using the instructions in the wiki
  - Log in to the RunAI platform using runai login. You should be able to run runai whoami afterwards.
- Log in to the registry at registry.rcp.epfl.ch
  - Go to registry.rcp.epfl.ch and log in.
  - Create your project with the UI. Your project should be named lts4-$USERNAME.
  - Log in with docker to the registry by running docker login registry.rcp.epfl.ch.
- (Optional) Create a wandb secret and name it wandb-secret. This is needed for the wandb integration. Follow this link: https://wiki.rcp.epfl.ch/en/home/CaaS/FAQ/how-to-use-secret-wandb
- For Visual Studio Code integration, follow this link: https://wiki.rcp.epfl.ch/en/home/CaaS/FAQ/how-to-vscode
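As a quick sanity check on the list above, the following sketch reports which command line tools are not yet on your PATH. The helper name missing_tools is hypothetical and not part of this repository; it only checks that the commands exist, not that they are configured or logged in.

```shell
# Hypothetical helper: print the name of every command in the argument
# list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Example usage (uncomment to run):
# missing=$(missing_tools docker kubectl runai)
# [ -z "$missing" ] && echo "all tools found" || echo "missing: $missing"
```

An empty result means docker, kubectl and runai were all found; logging in (runai login, docker login registry.rcp.epfl.ch) still has to be done separately.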
haas
- Make sure you have access to the haas storage by running ssh $USERNAME@haas001.rcp.epfl.ch (or ssh $USERNAME@jumphost.rcp.epfl.ch, which is the recommended host).
- Go to your mounted volume (should be /mnt/lts4/scratch for most) and create a directory with your name via mkdir -p /mnt/lts4/scratch/home/$USERNAME. The launch script assumes that you have done so.
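Since the launch script assumes the directory exists, it can be worth checking before submitting anything. A minimal sketch, to be run on the storage host; the function name scratch_home_ready is hypothetical, and the home/<user> layout is taken from the mkdir command above.

```shell
# Hypothetical helper: succeed only if <scratch_root>/home/<user> exists
# as a directory and is writable by the current user.
scratch_home_ready() {
    dir="$1/home/$2"
    [ -d "$dir" ] && [ -w "$dir" ]
}

# Example usage (uncomment to run):
# scratch_home_ready /mnt/lts4/scratch "$USERNAME" \
#     || echo "create it first: mkdir -p /mnt/lts4/scratch/home/$USERNAME"
```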
Now you can proceed with the next steps: building your docker image, pushing it to the registry, and launching jobs.
First, you must recover and save your LDAP accreditation codes. You can use the ldap_fetch.sh script as follows, where GASPAR is your EPFL username:
./ldap_fetch.sh GASPAR
This will store your credentials in the ~/.profile file, and make them available at startup by sourcing that file from your .bashrc or .zshrc file.
It will also define the RUNAI_OPTIONS environment variable, which will allow you to launch jobs with runai submit.
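To confirm that the variables are visible in a new shell, a small check like the one below can help. It is only a sketch: unset_vars is a hypothetical helper, and of the names in the usage line only RUNAI_OPTIONS is stated in this guide to be defined by the script; EPFL_USER is assumed because the commands in this guide reference it.

```shell
# Hypothetical helper: print the name of each listed variable that is
# unset or empty. Returns 0 when all are set, 1 otherwise.
# Arguments must be valid shell variable names.
unset_vars() {
    status=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf '%s\n' "$name"
            status=1
        fi
    done
    return "$status"
}

# Example usage (uncomment to run):
# unset_vars RUNAI_OPTIONS EPFL_USER || echo "re-run ldap_fetch.sh or open a new shell"
```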
The base image uses a specific pytorch image for reproducibility, adds several libraries, and adds the current user.
If you want to add more template images, create a directory in the dockerfiles directory and add a Dockerfile there.
Then, make a PR.
Then, run the following command to push your image to the registry (if you only want to build the image without pushing it to the registry, omit the --push option).
# Before running this command, make sure to change $GASPAR to your epfl username, or declare it as
# an environment variable
./publish.sh --path=dockerfiles/base \
--img=NAME_OF_YOUR_IMAGE \
--version=1 \
--push=True
The official way to launch and interact with jobs is through the RunAI command line interface, in particular runai submit, whose available options are covered in the RunAI documentation.
You need to pass $RUNAI_OPTIONS, which is set in your ~/.profile by the ldap_fetch.sh script.
Remark: If you're not a permanent member of LTS4 (PhD or Postdoc), verify that your EPFL_SCRATCH_HOME is correctly set. The command below should print the path shown on the line after it:
$ echo $EPFL_SCRATCH_HOME
/mnt/lts4/scratch/students/<gaspar>
# You can specify a fraction of the GPU to use with the `--gpus` flag
runai submit $RUNAI_OPTIONS \
--name <name-job> \
--image registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
--gpus 0.8 \
--interactive -- sleep infinity
Supposing that you want to launch the script train.py in the src directory of your scratch home folder (stored on haas), with arguments --arg1=1 --arg2=2, you can use the following command:
runai submit $RUNAI_OPTIONS \
--name <name-job> \
--gpus 1 \
--image registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
--command -- /bin/bash -c 'cd $SCRATCH_HOME && python src/train.py --arg1=1 --arg2=2'
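Note that the command string above is wrapped in single quotes, presumably so that $SCRATCH_HOME is expanded by the shell inside the container rather than by your local shell at submission time. The sketch below illustrates the difference using a stand-in variable, DEMO_HOME, which is purely illustrative.

```shell
# DEMO_HOME stands in for a variable that has a different value
# (or no value) on the machine you submit from.
DEMO_HOME=/on/my/laptop

# Single quotes: the text is kept literally, so the variable is only
# expanded later by whichever shell eventually runs the string.
deferred='cd $DEMO_HOME && python src/train.py'

# Double quotes: the variable is expanded immediately, using the
# local value.
expanded="cd $DEMO_HOME && python src/train.py"

printf 'single quotes: %s\n' "$deferred"
printf 'double quotes: %s\n' "$expanded"
```

With double quotes, the job would receive whatever value the variable has locally, which is rarely what you want for paths that only exist inside the container.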
More detailed information is coming soon; take a look at the launch.py script for now.
To use the launch script from anywhere, you can add an alias to your .bashrc or .zshrc file.
# Add the following line to your .bashrc or .zshrc
# ...for bash
echo 'alias rcplaunch="python /path/to/launch.py"' >> ~/.bashrc
source ~/.bashrc
# ...for zsh
echo 'alias rcplaunch="python /path/to/launch.py"' >> ~/.zshrc
source ~/.zshrc
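Running the echo ... >> line above more than once appends the alias each time. If that is a concern, a guarded append along these lines avoids duplicates; append_once is a hypothetical helper, not something shipped with this repository.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present. Creates the file if it does not exist.
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -qxF -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example usage (uncomment to run):
# append_once 'alias rcplaunch="python /path/to/launch.py"' ~/.bashrc
```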
Remark: If you're not a permanent member of LTS4 (PhD or Postdoc), include the flag --student in the command lines below.
# You can specify a fraction of the GPU to use with the `--gpus` flag
python launch.py \
--name=<name-job> \
--gpus=0.8 \
--image=registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
--interactive
# Run a training command (here with 1 GPU and 20 CPUs)
python launch.py \
--name=NAME_OF_JOB \
--gpus=1 \
--cpus=20 \
--image=registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
--command='cd path/to/code && python train.py --arg1=1 --arg2=2'
# Same job as the previous command, with --dry-run appended
python launch.py \
--name=NAME_OF_JOB \
--gpus=1 \
--cpus=20 \
--image=registry.rcp.epfl.ch/lts4-$EPFL_USER/<name-image> \
--command='cd path/to/code && python train.py --arg1=1 --arg2=2' \
--dry-run
The status of a job can be checked with the command runai logs job-name. If a run fails, runai will launch it again up to 6 times in pods with the name job-name-0-n. To check the logs of a specific run, you can run runai logs job-name --pod job-name-0-n, where n is the number of the pod you want to access.
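When a job has been retried several times, typing each pod name by hand gets tedious. The sketch below only prints the corresponding runai logs commands so you can review them before running any. The helper retry_log_cmds is hypothetical, and since this guide does not say whether n starts at 0 or 1, the first and last pod numbers are passed explicitly.

```shell
# Hypothetical helper: print one "runai logs" command per retry pod,
# for pod numbers first..last (inclusive).
retry_log_cmds() {
    job=$1
    i=$2
    last=$3
    while [ "$i" -le "$last" ]; do
        printf 'runai logs %s --pod %s-0-%s\n' "$job" "$job" "$i"
        i=$((i + 1))
    done
}

# Example usage (uncomment to run):
# retry_log_cmds job-name 0 2
```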
This guide builds upon https://github.com/epfml/getting-started.