USE 342 - Build CPU and GPU Docker images #38
Conversation
Forgot to note in the PR description that this work was a cherry-picking process between two spike branches that I (https://github.com/MITLibraries/timdex-embeddings/tree/USE-334-spike-aws-batch) and @cabutlermit (https://github.com/MITLibraries/timdex-embeddings/tree/parallel-builds) were working on.
@ghukill Good news: I'm not sure why the CPU build showed no cache usage, but the GPU build showed 44% cache usage and a build time under 4 minutes (instead of the original 10 mins).
cabutlermit left a comment
In regard to the build/deploy portions (Makefile, GHA workflows for dev & stage) and the documentation (README, ADR, other docs), I approve!
jonavellecuerdo left a comment
While I did not run the code, from a high-level review these changes make sense. I expect we'll become more comfortable working with these Docker images in upcoming work.
Just had one small question for ya'!
# Install package into system python
RUN uv pip install --system .
RUN uv pip install --group dlc_arm64_cpu --system .
Hmm, what is the effect of this if the list of dependencies in pyproject.toml is empty? 🤔
Great question.
The short answer is that the empty list has no effect.
The longer answer is that, originally during this work, I thought the CPU and GPU Docker images might need different dependencies. It turns out that was not the case, and we could rely on the torch version in each base image.
But I figured leaving this structure in place couldn't hurt in the event we do eventually want to support different dependencies for CPU vs GPU. Dependencies are heavy and tricky for ML work, and I think we'd be well served to assume there will be fairly frequent and finicky dependency juggling to handle.
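To make the empty-group behavior concrete, here is a minimal sketch assuming the group is declared as a PEP 735 dependency group in `pyproject.toml`; the exact contents shown in the comment are an assumption, not the repository's actual file.

```shell
# Assumed pyproject.toml layout (illustrative only):
#
#   [dependency-groups]
#   dlc_arm64_cpu = []   # empty today; could hold CPU-only pins later
#
# With an empty group, the --group flag contributes zero extra requirements,
# so this resolves the same packages as `uv pip install --system .`
uv pip install --group dlc_arm64_cpu --system .
```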
Purpose and background context
Why these changes are being introduced:
We have decided to move to AWS Batch for running this application in a deployed context. AWS Batch is a container-based service and will use the ECR images this repository builds. Put succinctly, we have a CPU pipeline and a GPU pipeline in AWS Batch, and each pipeline needs a slightly different Docker image built and pushed to ECR. We also need to support local development, for which the CPU build works equally well.
How this addresses that need:
Two Dockerfiles are created: `Dockerfile-cpu` and `Dockerfile-gpu`. They are very similar and may be refactored to extend a common Dockerfile in the future, but are understandable as-is and successfully build Docker images that work in CPU- or GPU-enabled contexts.

Other parts of the application have been updated to support dual Dockerfiles:

- `Makefile`: dedicated Docker build and push commands (see the sketch after this list)
- `pyproject.toml`: build-specific dependencies, utilized in the build-specific Dockerfiles
- `.github` workflows: updated to build both CPU and GPU images and utilize GHA built-ins like `docker/build-push-action@v6`

This PR is a combination of two spike branches, tested and confirmed to build Docker images that target CPU or GPU contexts. There is opportunity for DRYing up some of the new code, but we are opting to keep things verbose and explicit until that time.
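As a rough illustration of the dedicated build commands, the shell sketch below builds both images from the two Dockerfiles; the image names, tags, and ECR values are placeholders, not the repository's actual Makefile targets.

```shell
# Hypothetical equivalents of the dedicated CPU/GPU build commands
# (image names and tags are illustrative placeholders):
docker build -f Dockerfile-cpu -t timdex-embeddings:cpu .
docker build -f Dockerfile-gpu -t timdex-embeddings:gpu .

# Pushing would then tag each image for its ECR repository, e.g.:
# docker tag timdex-embeddings:cpu <account>.dkr.ecr.<region>.amazonaws.com/<repo>:cpu
# docker push <account>.dkr.ecr.<region>.amazonaws.com/<repo>:cpu
```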
How can a reviewer manually see the effects of these changes?
This is straightforward to test, but time- and resource-intensive. We now build two Docker images of the following final sizes:
Depending on internet connections, it's a lot of network traffic. And once done, you just have images locally.
To confirm that both CPU and GPU builds work in Github Actions (GHA), and successfully push to AWS ECR, please see this GHA run: https://github.com/MITLibraries/timdex-embeddings/actions/runs/21292978393?pr=38.
To test that local development still works via Docker, it's recommended to build the lighter-weight CPU image and test that.
1- Build CPU image:
2- Retrieve AWS Dev1 credentials; these will need to be pasted into the following command
3- Run the CPU Docker image locally for a small embedding job:
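A sketch of what these three steps might look like on the command line follows; the image tag, credential handling, and job arguments are illustrative assumptions, not the repository's exact commands.

```shell
# 1. Build the CPU image (tag is illustrative)
docker build -f Dockerfile-cpu -t timdex-embeddings:cpu .

# 2. Export temporary AWS Dev1 credentials (paste the values retrieved above)
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...

# 3. Run the CPU image locally, passing the credentials through to the container
#    (replace --help with the arguments for a small embedding job)
docker run --rm \
  -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_SESSION_TOKEN \
  timdex-embeddings:cpu --help
```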
That demonstrates the CPU Docker image running locally.
Out of scope for this PR, there will be discussions and demonstrations of how AWS Batch jobs work and how these new Docker images in ECR fit into that.
UPDATE: can confirm successful AWS Batch jobs based on these containers for CPU, GPU, and even GPU spot instances.
Includes new or updated dependencies?
YES
Changes expectations for external applications?
YES: This repository now builds two Docker images. Dedicated build commands exist for the CPU and GPU builds, along with commands that perform both.
What are the relevant tickets?
* https://mitlibraries.atlassian.net/browse/USE-342
Code review