Fix (DeepSpeed) docker image build issue #21002
Conversation
@@ -117,7 +117,6 @@ jobs:
       name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)"
       # Can't run in parallel, otherwise get an error:
       # `Error response from daemon: Get "https://registry-1.docker.io/v2/": received unexpected HTTP status: 503 Service Unavailable`
-      needs: latest-torch-deepspeed-docker
This is no longer needed - and we need to remove it to avoid an overly long build time caused by the MAX_JOBS=1 change below.
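For readers less familiar with GitHub Actions: `needs:` makes one job wait for another to finish, so removing it lets the two image-build jobs run concurrently. A minimal sketch of the workflow shape this touches (job ids, runner label, and action version are illustrative, not copied from the actual transformers workflow file):

    jobs:
      latest-torch-deepspeed-docker:
        name: "Latest PyTorch + DeepSpeed"
        runs-on: ubuntu-latest
        steps:
          - name: Build and push the daily-CI image
            uses: docker/build-push-action@v3   # illustrative; real workflow pins its own version

      latest-torch-deepspeed-docker-for-push-ci:
        name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)"
        runs-on: ubuntu-latest
        # needs: latest-torch-deepspeed-docker   <- removed by this PR, so both
        #                                            jobs can now run in parallel
        steps:
          - name: Build and push the push-CI image
            uses: docker/build-push-action@v3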
@@ -27,7 +27,7 @@ RUN python3 -m pip install torch-tensorrt==1.3.0 --find-links https://github.com
 # recompile apex
 RUN python3 -m pip uninstall -y apex
 RUN git clone https://github.com/NVIDIA/apex
-RUN cd apex && python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
+RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
To avoid the memory issue that happened here.
The documentation is not available anymore as the PR was closed or merged.
This is interesting, as I have always seen apex build serially and, no matter what I tried, I couldn't make it build in parallel. Perhaps this has changed recently.
So yes, your fix is perfect, @ydshieh - and I'd add a comment explaining why it's there, so future readers won't delete it thinking it would make things go faster (as apex takes forever to build).
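As an illustration of the kind of comment that could sit next to the change (a sketch only, based on the diff above; the exact wording added in the PR may differ):

    # recompile apex
    RUN python3 -m pip uninstall -y apex
    RUN git clone https://github.com/NVIDIA/apex
    # MAX_JOBS=1 limits the apex compilation to a single worker: the parallel
    # build was running out of memory and breaking the image build.
    # Do NOT remove it just to make the build faster.
    RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .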
* Fix docker image build issue
* remove comment
* Add comment
* Update docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
Currently, the docker image build job Latest PyTorch + DeepSpeed fails from time to time. The issue started after #20788, where apex is recompiled during the build. It seems to be a resource issue (most likely a memory issue) caused by the parallel build (multiple workers), so this PR sets MAX_JOBS=1 to avoid the failure.
This increases the build time to 1h30m. Since we have to build 2 identical images (for daily CI and push CI), building them sequentially would take 3h, which is way too long. Previously those 2 images were built sequentially due to some issue, but that issue now seems to be gone, so we can build them in parallel.