
Conversation

@ghukill
Collaborator

@ghukill ghukill commented Jan 23, 2026

Purpose and background context

Why these changes are being introduced:

We have decided to move to AWS Batch for running this application in a deployed context. AWS Batch is a container-based service and will use the ECR images this repository builds. Put succinctly, we have a CPU and a GPU pipeline in AWS Batch, and each pipeline needs a slightly different Docker image built and pushed to ECR. We also need to support local development, and the CPU build serves that purpose equally well.

How this addresses that need:

Two Dockerfiles are created: Dockerfile-cpu and Dockerfile-gpu. They are very similar and may be refactored to extend a common base Dockerfile in the future, but they are understandable as-is and successfully build Docker images that work in CPU- or GPU-enabled contexts.
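
As a rough, hedged sketch only (the base image and copy steps below are placeholders, not the actual contents of this PR's Dockerfiles), the two files differ mainly in their base image and in which build-specific dependency group they install:

# Dockerfile-cpu -- illustrative sketch; the real file may differ
FROM <cpu-base-image>                  # placeholder for a CPU-only base image
WORKDIR /app
COPY . .
# uv is assumed to be available in the base image or installed in an earlier step
RUN uv pip install --system .
RUN uv pip install --group dlc_arm64_cpu --system .

# Dockerfile-gpu -- same shape, but starting FROM a CUDA-enabled base image
# and installing a GPU-specific dependency group if one is defined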

Other parts of the application have been updated to support dual Dockerfiles:

  • Makefile: dedicated Docker build and push commands
  • pyproject.toml: build-specific dependency groups, utilized in the corresponding Dockerfiles
  • .github workflows: updated to build both CPU and GPU images and utilize GHA built-ins like docker/build-push-action@v6 (see the rough workflow sketch after this list)
  • etc.
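
As a rough illustration of the workflow shape only (the job name, registry, and tags below are placeholders, and ECR authentication steps are omitted; this is not the repository's actual workflow), a build-and-push step with docker/build-push-action@v6 looks something like:

# .github/workflows sketch -- illustrative only
jobs:
  build-cpu:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # (ECR login/credentials steps omitted)
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: Dockerfile-cpu          # the GPU job points at Dockerfile-gpu
          push: true
          tags: <ecr-registry>/timdex-embeddings-dev:latest-arm64-cpu
          # cache-from/cache-to can enable GHA layer caching if desired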

This PR is a combination of two spike branches, tested and confirmed to build Docker images that target CPU or GPU contexts. There is an opportunity to DRY up some of the new code, but we are opting to keep things verbose and explicit for now.

How can a reviewer manually see the effects of these changes?

This is straightforward to test, but time- and resource-intensive. We now build two Docker images with the following approximate final sizes:

  • CPU: ~4 GB
  • GPU: ~16 GB

Depending on your internet connection, that is a lot of network traffic, and once the builds finish you simply have the images locally.

To confirm that both CPU and GPU builds work in Github Actions (GHA), and successfully push to AWS ECR, please see this GHA run: https://github.com/MITLibraries/timdex-embeddings/actions/runs/21292978393?pr=38.

To test that local development still works via Docker, it's recommended to build the lighter-weight CPU image and test that.

1- Build CPU image:

make dist-dev-cpu

2- Retrieve AWS Dev1 credentials; these will need to be pasted into the following command (or exported first, as in the sketch below)
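
For example (values elided; this is just one convenient way to hold the temporary credentials before passing them to docker run):

# Example only: export the temporary Dev1 credentials as environment variables
export AWS_ACCESS_KEY_ID="........"
export AWS_SECRET_ACCESS_KEY="........"
export AWS_SESSION_TOKEN="........"
# with these exported, bare flags like `-e AWS_ACCESS_KEY_ID` (no value)
# will pass the host environment values through to the container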

3- Run the CPU Docker image locally for a small embedding job:

docker run -it \
-e AWS_ACCESS_KEY_ID="......." \
-e AWS_SECRET_ACCESS_KEY="........" \
-e AWS_SESSION_TOKEN="........" \
timdex-embeddings-dev:latest-arm64-cpu \
--verbose \
create-embeddings \
--strategy full_record \
--dataset-location=s3://timdex-extract-dev-222053980223/dataset \
--run-id=60b76094-5412-4f4b-8bde-24dc3753005a \
--record-limit=10

That demonstrates the CPU Docker image running locally.

Out of scope for this PR: there will be follow-up discussions and demonstrations of how AWS Batch jobs work and how these new Docker images in ECR fit into that.

UPDATE: can confirm successful AWS Batch jobs based on these containers for CPU, GPU, and even GPU spot instances.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: This repository now builds two Docker images. Build commands are specific to either the CPU or GPU build, with additional commands that perform both (illustrative examples below).
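
As a hedged illustration (dist-dev-cpu is confirmed in the testing steps above; the other target names are assumptions, so check the Makefile for the actual commands):

make dist-dev-cpu    # build the CPU image locally (confirmed above)
make dist-dev-gpu    # build the GPU image (name assumed)
make dist-dev        # build both images (name assumed)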

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-342

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill marked this pull request as ready for review January 23, 2026 18:15
@ghukill ghukill requested review from a team and cabutlermit January 23, 2026 18:15
@ghukill
Collaborator Author

ghukill commented Jan 23, 2026

Forgot to note in the PR description that this work was a cherry-picking process between two spike branches that I (https://github.com/MITLibraries/timdex-embeddings/tree/USE-334-spike-aws-batch) and @cabutlermit (https://github.com/MITLibraries/timdex-embeddings/tree/parallel-builds) were working on.

The Makefile and .github updates were 99% from @cabutlermit, with the Dockerfiles and pyproject.toml coming from my work.

@ghukill ghukill requested a review from cabutlermit January 23, 2026 19:06
@cabutlermit
Contributor

@ghukill Good news:
Your additional commit triggered the dev-build Action again (which is exactly what should happen). And the builds were significantly faster the second time around: https://github.com/MITLibraries/timdex-embeddings/actions/runs/21297910512

I'm not sure why the CPU build showed no cache usage, but the GPU build showed 44% cache usage and a build time under 4 minutes (instead of the original 10 mins).

Contributor

@cabutlermit cabutlermit left a comment


In regard to the build/deploy portions (Makefile, GHA workflows for dev & stage) and the documentation (README, ADR, other docs), I approve!

Contributor

@jonavellecuerdo jonavellecuerdo left a comment


While I did not run the test code, from a high-level review these changes and the code make sense. I expect we'll be more comfortable working with these Docker images in upcoming work.

Just had one small question for ya'!


# Install package into system python
RUN uv pip install --system .
RUN uv pip install --group dlc_arm64_cpu --system .
Contributor


Hmm, what is the effect of this if the list of dependencies in pyproject.toml is empty? 🤔

Collaborator Author


Great question.

The short answer is that the empty list has no effect.

The longer answer is that, originally during this work, I thought the CPU and GPU Docker images might need different dependencies. It turns out that was not the case, and we can rely on the torch version in each base image.

But I figured leaving this structure in place couldn't hurt in the event we eventually want to support different dependencies for CPU vs GPU. Dependencies are heavy and tricky for ML work, and I think we'd be well served to assume there will be fairly frequent and finicky dependency juggling to handle.
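
For illustration only (the dlc_arm64_cpu group name comes from the Dockerfile snippet quoted above; the GPU group name and example package are hypothetical), the dependency groups in pyproject.toml might look something like:

# pyproject.toml -- hedged sketch of PEP 735 dependency groups
[dependency-groups]
dlc_arm64_cpu = []                       # currently empty, so installing it is a no-op
dlc_gpu = ["some-gpu-only-package"]      # hypothetical: GPU-specific extras could live here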

@ghukill ghukill merged commit 4e2643e into main Jan 26, 2026
5 checks passed
