[Examples] Add docker compose example to run multiple containers #2745
This is great to have @romilbhardwaj! One issue with running it on Azure:
This may be related to our default GPU image on Azure shipping an outdated CUDA version. Is it a quick fix?
Ah good catch - we do need to update our Azure image (#2751). For this PR, I've changed the version to 11.5.2 and tested that it works on AWS, Azure and GCP.
services:
  gpu-app1:
    image: nvidia/cuda:11.5.2-runtime-ubuntu20.04
    command: nvidia-smi
I added -l 1 to here and L17 for nvidia-smi to loop forever.
It appears both containers print the same GPU ID. Note the SkyPilot task has 2 GPUs assigned, so GPUs 0 and 1 are available.
Is there any env var (CUDA_VISIBLE_DEVICES?) we can add to this file to show how to distribute the containers to GPUs 0 and 1 respectively? Can even be a comment.
Good point - I've changed from count to explicit device_ids. Note that nvidia-docker remaps device IDs, so from within the gpu-app2 container the visible GPU ID will be ['0'] (though it maps to physical device 1). Also added this as a comment.
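For readers following the thread, the change described above could look roughly like the sketch below. It is not the file from this PR: the service names, image tag and the -l 1 flag come from the conversation, while the deploy/resources layout is the standard Compose GPU reservation syntax and is assumed here.

```yaml
# Sketch only: pins one physical GPU to each container via device_ids.
# device_ids replaces count; Compose does not allow both on the same device entry.
services:
  gpu-app1:
    image: nvidia/cuda:11.5.2-runtime-ubuntu20.04
    command: nvidia-smi -l 1   # loop every second
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']   # physical GPU 0
              capabilities: [gpu]
  gpu-app2:
    image: nvidia/cuda:11.5.2-runtime-ubuntu20.04
    command: nvidia-smi -l 1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1']   # physical GPU 1; appears as GPU 0 inside the container
              capabilities: [gpu]
```

Because of the remapping, nvidia-smi in both containers reports a single GPU with index 0, so matching output across the two does not by itself mean they share a device.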
LGTM!
Simple example showing how to use docker compose to launch multiple containers on a SkyPilot cluster.
sky launch -c myclus compose_example.yaml
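As a rough illustration of what a task file such as compose_example.yaml might contain, here is a minimal sketch. It is an assumption, not the file added in this PR: the accelerator type (T4) is hypothetical, the compose file is assumed to sit in the working directory under a default name such as docker-compose.yml, and Docker Compose is assumed to be already installed on the cluster image.

```yaml
# Hypothetical SkyPilot task sketch; values below are illustrative assumptions.
resources:
  accelerators: T4:2   # two GPUs, matching the "2 GPUs assigned" noted in the review; T4 is assumed

workdir: .             # syncs the local directory (including the compose file) to the cluster

run: |
  # Assumes the Docker Compose plugin is available on the VM image.
  docker compose up
```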