varuniyer/nrp-k8s-setup-template

K8s (Kubernetes) Setup Template

Overview

This repository is a template for running Python projects on GPU nodes in NRP Nautilus. Use it only when more conveniently available resources fail to meet your needs. The config_k8s.py script generates a K8s job file and the associated secrets from your inputs. The instructions below describe a workflow for building Docker images and pushing them to the NRP's GitLab container registry. Follow the steps either inside a Coder workspace or in your own environment.

Prerequisites

Getting started

Follow these steps in your terminal:

  1. Clone your fork and enter it with the following command:

    git clone ssh://git@gitlab-ssh.nrp-nautilus.io:30622/<GITLAB_USERNAME>/<REPO_NAME>.git && cd <REPO_NAME>
  2. Generate a K8s job file named your_job.yml with the following command:

    python config_k8s.py --netid <NET_ID> --output-path your_job.yml --pat <GITLAB_PAT> --dt-username <DEPLOY_TOKEN_USERNAME> --dt-password <DEPLOY_TOKEN_PASSWORD>
    • --pat, --dt-username, and --dt-password are only required the first time you run this script
      • You may pass them in again to modify the values of their corresponding K8s secrets
  3. Update pyproject.toml to include your project's Python dependencies:

    • Run uv sync to install them in a new virtualenv
    • Activate the virtualenv with source .venv/bin/activate
  4. Add your Python code to the repo:

    • Place commands to run your code in entrypoint.sh
    • Commit and push all additions and changes
  5. Build your container image:

    • A build job will automatically trigger when you push commits that modify any of the following:
      • pyproject.toml
      • Dockerfile
      • .gitlab-ci.yml
    • When a build is triggered (that is, only when one of the above files is modified), allow 30-90 minutes for it to complete
    • Navigate to "Build" → "Jobs" in GitLab's web UI to monitor the build progress
    • Note: Changes to your Python code, entrypoint.sh, or other project files do NOT trigger a rebuild; they are pulled directly via git when your job runs
  6. Modify the corresponding lines in your_job.yml to suit your needs:

    • The job name (line 7)
    • Environment variables inside your container's env section (line 30)
    • Your container's resource requests/limits (line 34)
    • The branch your job will pull code from (line 73)
  7. Run your job with the following command:

    kubectl create -f your_job.yml
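
If you would rather script step 7 than type it by hand, the following is a minimal sketch that submits the job and then blocks until it completes. It assumes kubectl is on your PATH and already configured for your namespace. The helper names (create_job_cmd, wait_for_job_cmd, submit_and_wait) and the default timeout are hypothetical and not part of this template.

```python
import subprocess


def create_job_cmd(job_file: str) -> list[str]:
    """Build the `kubectl create` command from step 7."""
    return ["kubectl", "create", "-f", job_file]


def wait_for_job_cmd(job_name: str, timeout: str = "24h") -> list[str]:
    """Build a `kubectl wait` command that blocks until the job completes.

    The default timeout is an arbitrary placeholder, not a value taken
    from this template.
    """
    return [
        "kubectl", "wait",
        "--for=condition=complete",
        f"--timeout={timeout}",
        f"job/{job_name}",
    ]


def submit_and_wait(job_file: str, job_name: str, timeout: str = "24h") -> None:
    """Submit the job, then wait for completion; raises if either command fails."""
    subprocess.run(create_job_cmd(job_file), check=True)
    subprocess.run(wait_for_job_cmd(job_name, timeout), check=True)
```

Calling submit_and_wait("your_job.yml", "<JOB_NAME>") runs both commands in sequence, where <JOB_NAME> is the job name you set in your_job.yml. Note that waiting on the complete condition does not return early if the job fails; in that case the wait simply runs until the timeout expires.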

Monitoring and Troubleshooting

Once your job is running, follow these steps to monitor its performance and troubleshoot runtime errors:

  1. Check job status and logs:

    • Run kubectl get pods | grep <JOB_NAME> to get the name of the pod associated with your job
    • Run kubectl logs <POD_NAME> to view your job's output and check for errors
    • Run kubectl describe pod <POD_NAME> to get detailed information about the pod's status and events
    • Run kubectl exec -it <POD_NAME> -- /bin/bash to enter a shell in your (actively running) pod
  2. Monitor resource usage using kubectl:

    • Run kubectl top pod <POD_NAME> to monitor CPU/RAM utilization
    • Run kubectl exec -it <POD_NAME> -- nvidia-smi to monitor GPU utilization
  3. Monitor resource usage externally using Grafana dashboards:
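
The kubectl-based GPU check in step 2 can also be scripted. The sketch below is one possible approach, not part of this template: it runs nvidia-smi inside the pod with its CSV query flags and parses the result. The names GpuSample, gpu_query_cmd, parse_gpu_csv, and sample_gpus are hypothetical, and the parser assumes every queried field comes back as a plain integer (it will raise if nvidia-smi reports a field as unavailable).

```python
import subprocess
from dataclasses import dataclass


@dataclass
class GpuSample:
    utilization_pct: int
    memory_used_mib: int
    memory_total_mib: int


# Fields requested from nvidia-smi, in the order the parser expects them.
QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def gpu_query_cmd(pod_name: str) -> list[str]:
    """Build a kubectl exec command that prints one CSV line per GPU."""
    return [
        "kubectl", "exec", pod_name, "--", "nvidia-smi",
        f"--query-gpu={QUERY_FIELDS}",
        "--format=csv,noheader,nounits",
    ]


def parse_gpu_csv(text: str) -> list[GpuSample]:
    """Parse lines of the form 'util, used, total' into GpuSample records."""
    samples = []
    for line in text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        samples.append(GpuSample(util, used, total))
    return samples


def sample_gpus(pod_name: str) -> list[GpuSample]:
    """Query the pod once and return one GpuSample per visible GPU."""
    result = subprocess.run(
        gpu_query_cmd(pod_name), check=True, capture_output=True, text=True
    )
    return parse_gpu_csv(result.stdout)
```

Calling sample_gpus("<POD_NAME>") in a loop with a sleep between calls gives a rough utilization trace for a running pod.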

FAQ

Which files should I modify for my own project?

Modify the following files along with your Python code:

  • entrypoint.sh runs your code when the container starts
  • pyproject.toml contains Python dependencies
  • Dockerfile is used to build the Docker image
  • your_job.yml specifies the K8s job configuration
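
Of these files, only some trigger an image rebuild when pushed (see step 5 of Getting started). As a rough illustration, the sketch below checks whether a set of changed paths would trigger one. It assumes the three trigger files live at the repository root, and the names BUILD_TRIGGER_FILES and triggers_rebuild are hypothetical rather than part of this template.

```python
from pathlib import PurePosixPath

# Files whose modification triggers a build job, per step 5 of Getting started.
# Assumption: all three sit at the repository root.
BUILD_TRIGGER_FILES = {"pyproject.toml", "Dockerfile", ".gitlab-ci.yml"}


def triggers_rebuild(changed_paths: list[str]) -> bool:
    """Return True if any changed path is one of the build-trigger files."""
    # PurePosixPath normalizes forms such as "./Dockerfile" to "Dockerfile".
    return any(
        str(PurePosixPath(path)) in BUILD_TRIGGER_FILES for path in changed_paths
    )
```

The list of changed paths could come from, for example, the output of git diff --name-only split into lines. If the function returns False, the push should not start a new build and your changes will be picked up via git when the job runs.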

What if I need to install other packages?

Additional packages may be listed in the Dockerfile (line 14).

How can I prevent my CI/CD pipeline from timing out?

Remove unnecessary dependencies from both pyproject.toml and the Dockerfile. If this is not enough, you may extend the timeout in .gitlab-ci.yml (line 14).

Why not include configuration for a PVC (to access CephFS) or rclone (to access Ceph S3)?

NRP-provided storage has usage restrictions. Notably, even accidentally storing Python dependencies in Ceph may result in a temporary ban from accessing Nautilus resources. Instead, use:

Given the presence of these alternatives (which are not subject to the same usage restrictions), this template does not support NRP-provided storage.

Where can I learn more about using Kubernetes on Nautilus?

Check out the NRP's official documentation for more information.

About

K8s setup template for NRP Nautilus (Mirror)
