
Conversation

@ghukill
Collaborator

@ghukill ghukill commented Jan 23, 2026

Purpose and background context

Why these changes are being introduced:

We have decided to move to AWS Batch for running this application in a deployed context. AWS Batch is a container-based service and will use the ECR images this repository builds. Put succinctly, we have a CPU and a GPU pipeline in AWS Batch, and each pipeline needs a slightly different Docker image built and pushed to ECR. We also need to support local development, and the CPU build serves that purpose equally well.

How this addresses that need:

Two Dockerfiles are created: Dockerfile-cpu and Dockerfile-gpu. They are very similar and may be refactored to extend a common base Dockerfile in the future, but they are understandable as-is and successfully build Docker images that work in CPU- or GPU-enabled contexts.
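
As a rough, hedged sketch only (the base image and copy steps below are placeholders, not the actual contents of this PR's Dockerfiles), the two files differ mainly in their base image and in which build-specific dependency group they install:

# Dockerfile-cpu -- illustrative sketch; the real file may differ
FROM <cpu-base-image>                  # placeholder for a CPU-only base image
WORKDIR /app
COPY . .
# uv is assumed to be available in the base image or installed in an earlier step
RUN uv pip install --system .
RUN uv pip install --group dlc_arm64_cpu --system .

# Dockerfile-gpu -- same shape, but starting FROM a CUDA-enabled base image
# and installing a GPU-specific dependency group if one is defined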

Other parts of the application have been updated to support dual Dockerfiles:

  • Makefile: dedicated Docker build and push commands
  • pyproject.toml: build-specific dependency groups, utilized in the corresponding Dockerfiles
  • .github workflows: updated to build both CPU and GPU images and utilize GHA built-ins like docker/build-push-action@v6 (see the rough workflow sketch after this list)
  • etc.
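
As a rough illustration of the workflow shape only (the job name, registry, and tags below are placeholders, and ECR authentication steps are omitted; this is not the repository's actual workflow), a build-and-push step with docker/build-push-action@v6 looks something like:

# .github/workflows sketch -- illustrative only
jobs:
  build-cpu:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      # (ECR login/credentials steps omitted)
      - uses: docker/build-push-action@v6
        with:
          context: .
          file: Dockerfile-cpu          # the GPU job points at Dockerfile-gpu
          push: true
          tags: <ecr-registry>/timdex-embeddings-dev:latest-arm64-cpu
          # cache-from/cache-to can enable GHA layer caching if desired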

This PR is a combination of two spike branches, tested and confirmed to build Docker images that target CPU or GPU contexts. There is an opportunity to DRY up some of the new code, but we are opting to keep things verbose and explicit for now.

How can a reviewer manually see the effects of these changes?

This is straightforward to test, but time- and resource-intensive. We now build two Docker images with the following approximate final sizes:

  • CPU: ~4 GB
  • GPU: ~16 GB

Depending on your internet connection, that is a lot of network traffic, and once the builds finish you simply have the images locally.

To confirm that both CPU and GPU builds work in Github Actions (GHA), and successfully push to AWS ECR, please see this GHA run: https://github.com/MITLibraries/timdex-embeddings/actions/runs/21292978393?pr=38.

To test that local development still works via Docker, it's recommended to build the lighter-weight CPU image and test that.

1- Build CPU image:

make dist-dev-cpu

2- Retrieve AWS Dev1 credentials; these will need to be pasted into the following command (or exported first, as in the sketch below)
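
For example (values elided; this is just one convenient way to hold the temporary credentials before passing them to docker run):

# Example only: export the temporary Dev1 credentials as environment variables
export AWS_ACCESS_KEY_ID="........"
export AWS_SECRET_ACCESS_KEY="........"
export AWS_SESSION_TOKEN="........"
# with these exported, bare flags like `-e AWS_ACCESS_KEY_ID` (no value)
# will pass the host environment values through to the container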

3- Run the CPU Docker image locally for a small embedding job:

docker run -it \
-e AWS_ACCESS_KEY_ID="......." \
-e AWS_SECRET_ACCESS_KEY="........" \
-e AWS_SESSION_TOKEN="........" \
timdex-embeddings-dev:latest-arm64-cpu \
--verbose \
create-embeddings \
--strategy full_record \
--dataset-location=s3://timdex-extract-dev-222053980223/dataset \
--run-id=60b76094-5412-4f4b-8bde-24dc3753005a \
--record-limit=10

That demonstrates the CPU Docker image running locally.

Out of scope for this PR: there will be follow-up discussions and demonstrations of how AWS Batch jobs work and how these new Docker images in ECR fit into that.

UPDATE: can confirm successful AWS Batch jobs based on these containers for CPU, GPU, and even GPU spot instances.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: This repository now builds two Docker images. Build commands are specific to either the CPU or GPU build, with additional commands that perform both (illustrative examples below).
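
As a hedged illustration (dist-dev-cpu is confirmed in the testing steps above; the other target names are assumptions, so check the Makefile for the actual commands):

make dist-dev-cpu    # build the CPU image locally (confirmed above)
make dist-dev-gpu    # build the GPU image (name assumed)
make dist-dev        # build both images (name assumed)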

What are the relevant tickets?

  • https://mitlibraries.atlassian.net/browse/USE-342

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

@ghukill ghukill marked this pull request as ready for review January 23, 2026 18:15
@ghukill ghukill requested review from a team and cabutlermit January 23, 2026 18:15
@ghukill
Collaborator Author

ghukill commented Jan 23, 2026

Forgot to note in the PR description that this work was a cherry-picking process between two spike branches that I (https://github.com/MITLibraries/timdex-embeddings/tree/USE-334-spike-aws-batch) and @cabutlermit (https://github.com/MITLibraries/timdex-embeddings/tree/parallel-builds) were working on.

The Makefile and .github updates were 99% from @cabutlermit, with the Dockerfiles and pyproject.toml coming from my work.

@ghukill ghukill requested a review from cabutlermit January 23, 2026 19:06
@cabutlermit
Contributor

@ghukill Good news:
Your additional commit triggered the dev-build Action again (which is exactly what should happen). And the builds were significantly faster the second time around: https://github.com/MITLibraries/timdex-embeddings/actions/runs/21297910512

I'm not sure why the CPU build showed no cache usage, but the GPU build showed 44% cache usage and a build time under 4 minutes (instead of the original 10 mins).

Contributor

@cabutlermit cabutlermit left a comment


In regard to the build/deploy portions (Makefile, GHA workflows for dev & stage) and the documentation (README, ADR, other docs), I approve!

Contributor

@jonavellecuerdo jonavellecuerdo left a comment


While I did not run the test code, from a high-level review these changes and the code make sense. I expect we'll be more comfortable working with these Docker images in upcoming work.

Just had one small question for ya'!


# Install package into system python
RUN uv pip install --system .
RUN uv pip install --group dlc_arm64_cpu --system .
Contributor


Hmm, what is the effect of this if the list of dependencies in pyproject.toml is empty? 🤔

Collaborator Author


Great question.

The short answer is that the empty list has no effect.

The longer answer is that, originally during this work, I thought the CPU and GPU Docker images might need different dependencies. It turns out that was not the case, and we can rely on the torch version in each base image.

But I figured leaving this structure in place couldn't hurt in the event we eventually want to support different dependencies for CPU vs GPU. Dependencies are heavy and tricky for ML work, and I think we'd be well served to assume there will be fairly frequent and finicky dependency juggling to handle.
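
For illustration only (the dlc_arm64_cpu group name comes from the Dockerfile snippet quoted above; the GPU group name and example package are hypothetical), the dependency groups in pyproject.toml might look something like:

# pyproject.toml -- hedged sketch of PEP 735 dependency groups
[dependency-groups]
dlc_arm64_cpu = []                       # currently empty, so installing it is a no-op
dlc_gpu = ["some-gpu-only-package"]      # hypothetical: GPU-specific extras could live here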

@ghukill ghukill merged commit 4e2643e into main Jan 26, 2026
5 checks passed
