Add pathogen-embed to the base image #221

Merged
merged 2 commits into from
Jun 24, 2024

Conversation

huddlej
Contributor

@huddlej huddlej commented Jun 20, 2024

Description of proposed changes

Adds pathogen-embed to the base image so we can use its tools to analyze reassortment in our influenza builds.

Related issue(s)

nextstrain/conda-base#71

Checklist

  • Checks pass

@@ -312,6 +312,7 @@ RUN if [[ "$TARGETPLATFORM" == linux/arm64 ]]; then \
; \
fi

+RUN pip3 install pathogen-embed==2.0.0
Member

(non-blocking)

Noting from the run summary that this added 8.5 minutes to the build time:

[linux/arm64] RUN pip3 install pathogen-embed==2.0.0     | 496.7s (61.9%)  ████████████████████
[linux/amd64] RUN pip3 install pathogen-embed==2.0.0     | 17.0s (2.1%)    ▋

I'm not worried since this dependency is pinned, meaning the cached result will be used most of the time and that the increased build time would only be noticeable during (infrequent) cache misses.
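As a rough check on the figure above, the two timing lines can be parsed and summed. This is only a sketch: the regular expression and the `step_seconds` helper are illustrative names, not part of the CI tooling, and it assumes the quoted 8.5 minutes refers to the combined time of both platform steps.

```python
import re

# The two timing lines quoted from the run summary.
SUMMARY = """\
[linux/arm64] RUN pip3 install pathogen-embed==2.0.0     | 496.7s (61.9%)  ████████████████████
[linux/amd64] RUN pip3 install pathogen-embed==2.0.0     | 17.0s (2.1%)    ▋
"""

# Capture the platform in brackets and the duration in seconds after the pipe.
LINE = re.compile(r"^\[(?P<platform>[^\]]+)\].*\|\s*(?P<seconds>[\d.]+)s")


def step_seconds(summary):
    """Map each platform to the number of seconds its build step took."""
    times = {}
    for line in summary.splitlines():
        match = LINE.match(line)
        if match:
            times[match["platform"]] = float(match["seconds"])
    return times


times = step_seconds(SUMMARY)
total = sum(times.values())
print(f"total: {total:.1f}s ({total / 60:.2f} min)")
for platform, seconds in times.items():
    print(f"{platform}: {seconds / total:.0%} of the added time")
```

The sum is about 513.7 seconds, nearly all of it from the linux/arm64 step.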

Contributor Author

This one change also adds ~57 MB to the image file size... 😕

I expected an increase, since we need to install scikit-learn, HDBSCAN, and UMAP, but it is kind of a bummer.

@huddlej
Contributor Author

huddlej commented Jun 20, 2024

Even though the CI passed, this PR doesn't currently work as expected. When I download the image for this branch and run a Nextstrain shell like so, the image is missing the executables for pathogen-distance, pathogen-embed, etc.:

$ nextstrain shell --docker --image docker.io/nextstrain/base:branch-add-pathogen-embed .
~/build $ which pathogen-distance
~/build $ which pathogen-embed
~/build $ 

However, the Python package is installed, since I can load the Python REPL from the Nextstrain shell and import it:

~/build $ python
>>> import pathogen_embed
>>>
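The mismatch above (package importable, executables absent) can also be checked from Python. The following is a sketch that assumes pathogen-embed declares its command line tools as `console_scripts` entry points, which this thread does not confirm; the helper names `console_script_names` and `missing_from_path` are hypothetical.

```python
import shutil
from importlib.metadata import PackageNotFoundError, distribution


def console_script_names(dist_name):
    """Names of the console_scripts entry points declared by an installed distribution."""
    dist = distribution(dist_name)  # raises PackageNotFoundError if absent
    return sorted(ep.name for ep in dist.entry_points if ep.group == "console_scripts")


def missing_from_path(names, which=shutil.which):
    """Subset of names that cannot be resolved to an executable on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    try:
        names = console_script_names("pathogen-embed")
    except PackageNotFoundError:
        print("pathogen-embed is not installed in this environment")
    else:
        print("declared:", names)
        print("missing from PATH:", missing_from_path(names))
```

Run inside the image, a non-empty "missing from PATH" list would point to scripts that were installed in the builder stage but never copied into the final stage.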

@huddlej
Contributor Author

huddlej commented Jun 20, 2024

I forgot this step:

docker-base/Dockerfile

Lines 433 to 452 in 3f15ce3

# Add installed Python scripts that we need.
#
# XXX TODO: This isn't great. It's prone to needing manual updates because it
# doesn't pull in scripts which got installed but that we don't list. Consider
# alternatives (like installing the deps into an empty prefix tree and then
# copying the whole prefix tree, or using pip's installed-files.txt manifests
# as the set of things to copy) in the future if the maintenance burden becomes
# troublesome or excessive.
# -trs, 15 June 2018
COPY --from=builder-target-platform \
    /usr/local/bin/augur \
    /usr/local/bin/aws \
    /usr/local/bin/envdir \
    /usr/local/bin/nextstrain \
    /usr/local/bin/pangolin \
    /usr/local/bin/pangolearn.smk \
    /usr/local/bin/scorpio \
    /usr/local/bin/snakemake \
    /usr/local/bin/treetime \
    /usr/local/bin/

Fixing in the next commit.
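The TODO in the Dockerfile comment mentions using pip's install manifests as the set of things to copy. One way that idea might look is sketched below using `importlib.metadata.files`, which reads a distribution's install manifest (`RECORD` for wheel installs). This is an illustration, not what the Dockerfile does: the helper names and the default `/usr/local/bin` argument are assumptions taken from the paths in the `COPY` step.

```python
from importlib.metadata import PackageNotFoundError, files
from pathlib import Path


def scripts_in(bin_dir, installed_paths):
    """Keep only the installed paths that live directly inside bin_dir."""
    bin_dir = Path(bin_dir).resolve()
    resolved = (Path(p).resolve() for p in installed_paths)
    return sorted(str(path) for path in resolved if path.parent == bin_dir)


def installed_scripts(dist_name, bin_dir="/usr/local/bin"):
    """Scripts that dist_name's install manifest says were written into bin_dir."""
    manifest = files(dist_name)  # None when no manifest was recorded
    if manifest is None:
        return []
    return scripts_in(bin_dir, (entry.locate() for entry in manifest))


if __name__ == "__main__":
    try:
        print(installed_scripts("pathogen-embed"))
    except PackageNotFoundError:
        print("pathogen-embed is not installed in this environment")
```

A list generated this way in the builder stage could replace the hand-maintained paths, at the cost of an extra build step.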

Adds paths for the three command line scripts provided by the
pathogen-embed Python package, which are the primary interface to that
package.
@huddlej
Contributor Author

huddlej commented Jun 22, 2024

I confirmed that copying the scripts in the last commit properly provided each of the pathogen-embed tools. I additionally tested that these tools worked in this image with pathogen-embed's cram tests (from the pathogen-embed repo) like so:

$ nextstrain shell --docker --image docker.io/nextstrain/base:branch-add-pathogen-embed .
~/build $ python -m pip install cram
~/build $ ~/.local/bin/cram --shell=/bin/bash tests
.............................
# Ran 29 tests, 0 skipped, 0 failed.
~/build $

I'll plan to merge this PR and the sibling conda-base PR on Monday.

@huddlej huddlej merged commit 009b7af into master Jun 24, 2024
31 checks passed
@huddlej huddlej deleted the add-pathogen-embed branch June 24, 2024 16:35