Guide to GPUDrive setup on NYU HPC
Clone the gpudrive repository into your /home/$USER directory (info on HPC directories and data management).
git clone --recursive https://github.com/Emerge-Lab/gpudrive.git
Move into the cloned repository folder:
cd gpudrive
- Create a directory for overlay files in the scratch directory:
mkdir -p /scratch/$USER/images/gpudrive
cd /scratch/$USER/images/gpudrive
- Copy and decompress the overlay image:
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz
This may take a couple of minutes.
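The directory, copy, and decompress steps above can be wrapped so that re-running them is harmless. This is a minimal sketch, not part of GPUDrive or the NYU HPC tooling: `fetch_overlay` is a hypothetical helper name, and it only uses `mkdir`, `cp`, and `gunzip` as shown in this guide.

```shell
#!/usr/bin/env bash
# Sketch: fetch and unpack an overlay image only if it is not already present.
# fetch_overlay is a hypothetical helper, not part of GPUDrive or NYU HPC tooling.
fetch_overlay() {
    local src_gz="$1"      # e.g. /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz
    local dest_dir="$2"    # e.g. /scratch/$USER/images/gpudrive
    local name
    name="$(basename "$src_gz" .gz)"

    mkdir -p "$dest_dir" || return 1
    # Skip the slow copy and decompress if the image is already unpacked.
    if [ -s "$dest_dir/$name" ]; then
        echo "overlay already present: $dest_dir/$name"
        return 0
    fi
    cp "$src_gz" "$dest_dir/" || return 1
    gunzip "$dest_dir/$name.gz" || return 1
    # Confirm the decompressed image exists and is non-empty.
    [ -s "$dest_dir/$name" ] && echo "overlay ready: $dest_dir/$name"
}
```

With the paths from this guide, the call would be `fetch_overlay /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz /scratch/$USER/images/gpudrive`.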
- Verify the decompressed overlay image exists:
ls /scratch/$USER/images/gpudrive
What if I want to use a different overlay image?
To explore all available overlay images:
ls -l /scratch/work/public/overlay-fs-ext3/
Request an interactive compute node:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<ASK> --pty /bin/bash
Output:
>>> srun: job XXXXXXX queued and waiting for resources
>>> srun: job XXXXXXX has been allocated resources
You will see something like:
[08:33:52 Wed Dec 25 2024] username@ga021.hpc.nyu.edu ~/gpudrive
Ask Eugene for your account code if you don't have one yet.
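Since the same request is typed at the start of every session, it can help to assemble it from a few variables. The sketch below only prints the command. `build_srun_cmd` is a hypothetical helper name, and every flag is one already used in this guide.

```shell
#!/usr/bin/env bash
# Sketch: assemble the interactive srun request from a few variables.
# build_srun_cmd is a hypothetical helper; all flags are the ones used in this guide.
build_srun_cmd() {
    local account="$1"                 # your account code
    local time_limit="${2:-1:00:00}"   # wall-clock limit, defaults to one hour
    local mem="${3:-10GB}"             # memory request
    if [ -z "$account" ]; then
        echo "error: account code required" >&2
        return 1
    fi
    echo "srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=$mem --gres=gpu:1 --time=$time_limit --account=$account --pty /bin/bash"
}
```

Print the command with `build_srun_cmd <your_account_code>`, check it, and then run it. Printing rather than executing keeps the helper safe to source from a shell profile.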
Navigate back to main repo:
cd /home/$USER/gpudrive
Run the following to start the container with GPU support and the overlay image:
singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:rw \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
You should see:
Singularity>
See the NYU HPC documentation for details on Singularity and overlay images.
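This guide mounts the overlay with `:rw` during setup and with `:ro` in the later run steps. A reasonable reading, stated here as an assumption to check against the HPC documentation, is that write access is only needed while installing packages, and that read-only mounts are safer when several jobs share one overlay. The sketch below makes the mode an explicit argument and checks that the overlay exists. `build_singularity_cmd` is a hypothetical helper name, and it only prints the command.

```shell
#!/usr/bin/env bash
# Sketch: build the singularity command with a chosen overlay mode.
# build_singularity_cmd is a hypothetical helper; the image path is the one used in this guide.
SIF=/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif

build_singularity_cmd() {
    local overlay="$1"        # path to the decompressed .ext3 overlay
    local mode="${2:-ro}"     # rw while installing packages, ro when only running code
    case "$mode" in
        rw|ro) ;;
        *) echo "error: mode must be rw or ro" >&2; return 1 ;;
    esac
    if [ ! -f "$overlay" ]; then
        echo "error: overlay not found: $overlay" >&2
        return 1
    fi
    echo "singularity exec --nv --overlay $overlay:$mode $SIF /bin/bash"
}
```

For the setup step in this section, the call would be `build_singularity_cmd /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3 rw`.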
Inside the Singularity container, create a virtual environment:
- One-off step: Create a conda environment with Python 3.11
conda env create -f environment.yml
See the docs for how to set up a conda environment on Greene.
Why use conda? Currently, conda is an easy way to use a Python version > 3.8.6 on the NYU HPC without Docker.
- Activate conda environment
conda activate gpudrive
Now you should see:
(/scratch/username/.conda/gpudrive) Singularity>
We use the manual install option to set up GPUDrive; see the README for details.
If successful, you'll see
[100%] Linking CXX executable my_tests
[100%] Built target my_tests
Launch Python:
python3
Then run:
import gpudrive
If there are no errors, the installation was successful!
Set up Weights and Biases
- Set trusted certificates
export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
Set up PufferLib
Install PufferLib with SSL Certificate Fixes
- Update Certifi Package
Ensure the certifi package (provides root certificates) is up-to-date:
pip install --upgrade certifi
Why?
Keeps SSL certificates current to avoid issues with secure connections.
- Set Trusted Certificates Manually (if needed)
Explicitly set the certificate bundle:
export SSL_CERT_FILE=$(python -m certifi)
export REQUESTS_CA_BUNDLE=$(python -m certifi)
- Install PufferLib
pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
Run Self-Play PPO
Use the --help flag to see the configurable CLI arguments:
Singularity> python baselines/ppo/ppo_pufferlib.py --help
In short, use interactive nodes for code development and testing.
Steps:
- Request an interactive compute node, e.g.:
srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=10GB --gres=gpu:1 \
--time=1:00:00 --account=<account_number> --pty /bin/bash
Replace <account_number> with your project number.
- Navigate to repository:
cd /home/$USER/gpudrive
- Launch the Singularity image:
singularity exec --nv --overlay /scratch/$USER/images/gpudrive/overlay-50G-10M.ext3:ro \
/scratch/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif /bin/bash
- Activate the virtual environment:
conda activate gpudrive
- Run experiments!
To run the PufferLib PPO implementation, install PufferLib first (not in requirements.yaml):
pip install git+https://github.com/PufferAI/PufferLib.git@gpudrive
python baselines/ippo/ippo_pufferlib.py
In short, use sbatch for large runs, such as hyperparameter sweeps.
Steps:
- [Optional] Define run configurations and hyperparameters to sweep over in generate_sbatch.py. Running it stores an sbatch script.
python examples/experiments/scripts/generate_sbatch.py
- Submit sbatch jobs using
sbatch <your_sbatch_script>.sh
Through the use of job arrays, all the specified runs are launched at once.
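To illustrate how a job array fans out a sweep, here is a minimal sketch of an array script. It is not the script that generate_sbatch.py produces, which may look quite different. The learning-rate values and the `pick_learning_rate` helper are placeholders, and the account and training command are left out. Slurm starts one task per index in `--array` and sets `SLURM_ARRAY_TASK_ID` in each task.

```shell
#!/usr/bin/env bash
#SBATCH --array=0-2
#SBATCH --gres=gpu:1
#SBATCH --time=1:00:00
# Sketch of a job-array script: each array task reads its own index from
# SLURM_ARRAY_TASK_ID and maps it to one hyperparameter setting.
# The values below are placeholders, not GPUDrive defaults.

pick_learning_rate() {
    case "$1" in
        0) echo "1e-4" ;;
        1) echo "3e-4" ;;
        2) echo "1e-3" ;;
        *) echo "error: index $1 out of range" >&2; return 1 ;;
    esac
}

task_id="${SLURM_ARRAY_TASK_ID:-0}"   # defaults to 0 when run outside Slurm
lr="$(pick_learning_rate "$task_id")"
echo "task $task_id uses learning rate $lr"
# The training command for this task would go here.
```

Submitting the script once with `sbatch` queues all three tasks, which matches the "launched at once" behaviour described above.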
If you run into issues with any of the steps above, please reach out in the Emerge Lab #code-help channel!