This repository is a template for running Python projects on GPU nodes in NRP Nautilus. Only use this template when more conveniently available resources fail to meet your needs. config_k8s.py is a script that automatically generates a K8s job file and secrets based on your inputs. The instructions below provide a workflow for building and pushing Docker images to the NRP's GitLab container registry. You should follow the steps below inside either a Coder workspace or your own.
- Create an SSH key in your workspace and add it to GitLab for both authentication and signing
- Fork this repository privately on the NRP's GitLab instance
- Optionally, create a new branch in your fork and follow the steps in the Getting started section on this branch
- Install kubectl, git, and uv
- Save the NRP-provided K8s config to ~/.kube/config
- Create a Personal Access Token with the read_repository scope
- Create a deploy token for your fork with the read_registry scope
Follow these steps in your terminal:
- Clone your fork and enter it with the following command:
  git clone ssh://git@gitlab-ssh.nrp-nautilus.io:30622/<GITLAB_USERNAME>/<REPO_NAME>.git && cd <REPO_NAME>
- Generate a K8s job file named your_job.yml with the following command:
  python config_k8s.py --netid <NET_ID> --output-path your_job.yml --pat <GITLAB_PAT> --dt-username <DEPLOY_TOKEN_USERNAME> --dt-password <DEPLOY_TOKEN_PASSWORD>
  - --pat, --dt-username, and --dt-password are only required the first time you run this script
  - You may pass them in again to modify the values of their corresponding K8s secrets
- Update pyproject.toml to include your project's Python dependencies:
  - Run uv sync to install them in a new virtualenv
  - Activate the virtualenv with source .venv/bin/activate
- Add your Python code to the repo:
  - Place commands to run your code in entrypoint.sh
  - Commit and push all additions and changes
- Build your container image:
  - A build job will automatically trigger when you push commits that modify any of the following:
    - pyproject.toml
    - Dockerfile
    - .gitlab-ci.yml
  - Allow 30-90 minutes for the build to complete (only when the above files are modified)
  - Navigate to "Build" → "Jobs" in GitLab's web UI to monitor the build progress
  - Note: Changes to your Python code, entrypoint.sh, or other project files do NOT trigger a rebuild and will be pulled directly via git when your job runs
- Modify the corresponding lines in your_job.yml to suit your needs
- Run your job with the following command:
  kubectl create -f your_job.yml
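For orientation when editing your_job.yml, here is a rough sketch of the general shape of a Kubernetes batch/v1 Job that requests a GPU and pulls a private image. It is not the output of config_k8s.py: the function name, image path, secret name, and resource amounts are all hypothetical placeholders, and the generated file may be organized differently.

```python
import json


def build_job_manifest(job_name, image, command, gpus=1,
                       pull_secret="registry-credentials"):
    """Return a minimal batch/v1 Job manifest as a plain dict.

    Every default here (secret name, CPU, memory) is an illustrative
    assumption, not a value taken from this template.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job_name},
        "spec": {
            "backoffLimit": 0,  # do not automatically retry a failed pod
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    # Secret holding registry credentials (hypothetical name)
                    "imagePullSecrets": [{"name": pull_secret}],
                    "containers": [
                        {
                            "name": job_name,
                            "image": image,
                            "command": command,
                            "resources": {
                                "limits": {
                                    "nvidia.com/gpu": gpus,
                                    "cpu": "4",
                                    "memory": "16Gi",
                                }
                            },
                        }
                    ],
                }
            },
        },
    }


manifest = build_job_manifest(
    job_name="my-training-job",
    image="registry.example.com/<GITLAB_USERNAME>/<REPO_NAME>:latest",
    command=["/bin/bash", "entrypoint.sh"],
)
# kubectl accepts JSON as well as YAML, so this output could be saved to a file.
print(json.dumps(manifest, indent=2))
```

The lines most often adjusted in a job file of this kind are the resource limits and the container command.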
Once your job is running, follow these steps to monitor its performance and troubleshoot runtime errors:
- Check job status and logs:
  - Run kubectl get pods | grep <JOB_NAME> to get the name of the pod associated with your job
  - Run kubectl logs <POD_NAME> to view your job's output and check for errors
  - Run kubectl describe pod <POD_NAME> to get detailed information about the pod's status and events
  - Run kubectl exec -it <POD_NAME> -- /bin/bash to enter a shell in your (actively running) pod
- Monitor resource usage using kubectl:
  - Run kubectl top pod <POD_NAME> to monitor CPU/RAM utilization
  - Run kubectl exec -it <POD_NAME> -- nvidia-smi to monitor GPU utilization
- Monitor resource usage externally using Grafana dashboards
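If you check on jobs often, the kubectl steps above can be wrapped in a few small helpers. The following is a minimal sketch, not part of this template: the function names are hypothetical, the pod listing is assumed to use kubectl's default column layout, and the nvidia-smi query is assumed to return numeric values (some GPUs report [N/A] for certain fields, which this parser does not handle).

```python
import subprocess


def pods_for_job(get_pods_output, job_name):
    """Return pod names from `kubectl get pods` text whose name contains job_name."""
    names = []
    for line in get_pods_output.splitlines():
        fields = line.split()
        # Skip blank lines and the NAME/READY/STATUS header row.
        if not fields or fields[0] == "NAME":
            continue
        if job_name in fields[0]:
            names.append(fields[0])
    return names


def parse_gpu_csv(csv_text):
    """Parse one 'utilization, memory used, memory total' CSV row per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(value.strip()) for value in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def gpu_stats(pod_name):
    """Run nvidia-smi inside a running pod and return parsed per-GPU stats.

    Requires kubectl to be installed and configured for your namespace.
    """
    cmd = [
        "kubectl", "exec", pod_name, "--", "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

For example, passing the text printed by kubectl get pods to pods_for_job gives the pod names for one job, and gpu_stats can then be called on each name.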
Modify the following files along with your Python code:
- entrypoint.sh runs your code when the container starts
- pyproject.toml contains Python dependencies
- Dockerfile is used to build the Docker image
- your_job.yml specifies the K8s job configuration
Additional packages may be listed in the Dockerfile (line 14).
If the build job times out, remove unnecessary dependencies from both pyproject.toml and the Dockerfile. If that is not enough, you may extend the timeout in .gitlab-ci.yml (line 14).
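Because a rebuild can take 30-90 minutes and only changes to pyproject.toml, Dockerfile, or .gitlab-ci.yml trigger one, it can be useful to check a list of changed files before pushing. The sketch below assumes those three files sit at the repository root and are matched by exact path. The helper name is hypothetical, and the actual trigger rules live in .gitlab-ci.yml.

```python
# Files whose modification triggers an image rebuild, per the build step above.
BUILD_TRIGGER_FILES = {"pyproject.toml", "Dockerfile", ".gitlab-ci.yml"}


def rebuild_triggers(changed_paths):
    """Return the changed paths that would trigger a rebuild, sorted."""
    return sorted(path for path in changed_paths if path in BUILD_TRIGGER_FILES)


# Example: paths as printed by `git diff --name-only` (sample values).
changed = ["train.py", "entrypoint.sh", "pyproject.toml"]
hits = rebuild_triggers(changed)
if hits:
    print("Push will trigger a rebuild because of:", ", ".join(hits))
else:
    print("No rebuild expected; code is pulled via git when the job runs.")
```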
NRP-provided storage has usage restrictions. Notably, even accidentally storing Python dependencies in Ceph may result in a temporary ban from accessing Nautilus resources. Instead, use:
- Hugging Face Hub to efficiently store both datasets and model checkpoints
- wandb or Comet to log experiment results
Given the presence of these alternatives (which are not subject to the same usage restrictions), this template does not support NRP-provided storage.
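As an illustration of the alternatives listed above, the following sketch uploads a checkpoint directory to a private Hugging Face Hub repository and logs metrics to wandb. It assumes the huggingface_hub and wandb packages are listed in pyproject.toml and that credentials reach the job through the HF_TOKEN and WANDB_API_KEY environment variables. The function names, repository ID, and project name are hypothetical, and how secrets are injected into your job is not covered here.

```python
def upload_checkpoint(local_dir, repo_id):
    """Upload a local checkpoint directory to a private Hub model repo."""
    # Imported lazily so the module loads even where the package is absent.
    from huggingface_hub import HfApi

    api = HfApi()  # picks up the HF_TOKEN environment variable if set
    api.create_repo(repo_id=repo_id, repo_type="model", private=True, exist_ok=True)
    api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        repo_type="model",
        commit_message="Add checkpoint",
    )


def log_metrics(project, history):
    """Log a list of per-step metric dicts to a new wandb run."""
    import wandb  # authenticates via the WANDB_API_KEY environment variable

    run = wandb.init(project=project)
    for step, metrics in enumerate(history):
        run.log(metrics, step=step)
    run.finish()


# Example calls (placeholders; not executed here):
# upload_checkpoint("checkpoints/epoch_3", "<HF_USERNAME>/<MODEL_REPO>")
# log_metrics("my-project", [{"loss": 0.91}, {"loss": 0.74}])
```

Keeping checkpoints and logs in these external services means nothing needs to be written to NRP-provided storage from inside the job.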
Check out the NRP's official documentation for more information.