Fix (DeepSpeed) docker image build issue #21002

Merged: 4 commits, Jan 4, 2023

Collaborator

@ydshieh ydshieh commented Jan 4, 2023

What does this PR do?

Currently, the docker image build job Latest PyTorch + DeepSpeed fails from time to time. The issue started after #20788, where apex is recompiled during the build. It seems to be a resource issue (most likely memory) caused by the parallel build (multiple workers). So set MAX_JOBS=1 to avoid the failure.
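For machines with more memory, the job count could be derived from a memory budget instead of being pinned to 1. The following is a minimal sketch, not what this PR does: pick_max_jobs is a hypothetical helper, and the per-job memory figure is an assumed input that was not measured here. MAX_JOBS is the environment variable read by PyTorch's C++ extension builder, which the apex build uses.

```shell
#!/bin/sh
# Hypothetical helper: choose a parallel compile job count that fits in memory.
#   $1 = available memory in GiB
#   $2 = assumed peak memory per compile job in GiB (illustrative, not measured)
#   $3 = number of CPUs
pick_max_jobs() {
    avail_gib=$1
    per_job_gib=$2
    cpus=$3

    # Integer division: how many jobs fit in the memory budget.
    jobs=$((avail_gib / per_job_gib))

    # Never use more jobs than CPUs.
    if [ "$jobs" -gt "$cpus" ]; then
        jobs=$cpus
    fi

    # Always run at least one job, which matches the MAX_JOBS=1 fallback.
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi

    echo "$jobs"
}

# Example: 7 GiB available, 8 GiB assumed per job, 2 CPUs -> falls back to 1.
MAX_JOBS=$(pick_max_jobs 7 8 2)
export MAX_JOBS
echo "MAX_JOBS=$MAX_JOBS"
```

On a small CI runner this degrades to the same single-worker build that the PR hard-codes.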

This will increase the build time to 1h30m. We have to build the same image twice (for daily CI and push CI), which would take 3h in total if done sequentially, and that is far too long. Previously, these 2 images were built sequentially because of an issue, but the issue now seems to be gone, so we can build them in parallel.
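The timing argument can be checked with a few lines of arithmetic. This is a rough sketch: the 90-minute figure comes from the description above, and the parallel estimate assumes both jobs start at the same time on separate runners.

```shell
#!/bin/sh
# Build time per image in minutes (1h30m, from the PR description).
build_minutes=90
# Number of identical images to build (daily CI and push CI).
images=2

# Sequential: the second build waits for the first to finish.
sequential=$((build_minutes * images))

# Parallel: assumes each build runs on its own runner and both start together.
parallel=$build_minutes

echo "sequential: ${sequential} min"
echo "parallel: ${parallel} min"
```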

@@ -117,7 +117,6 @@ jobs:
name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)"
# Can't run in parallel, otherwise get an error:
# `Error response from daemon: Get "https://registry-1.docker.io/v2/": received unexpected HTTP status: 503 Service Unavailable`
-needs: latest-torch-deepspeed-docker
Collaborator Author

This is no longer needed, and it has to be removed to avoid an overly long build time caused by MAX_JOBS=1 below.

@@ -27,7 +27,7 @@ RUN python3 -m pip install torch-tensorrt==1.3.0 --find-links https://github.com
# recompile apex
RUN python3 -m pip uninstall -y apex
RUN git clone https://github.com/NVIDIA/apex
-RUN cd apex && python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
+RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
Collaborator Author

This avoids the memory issue that occurred here.

@ydshieh ydshieh changed the title Fix docker image build issue [WIP] Fix docker image build issue Jan 4, 2023
@ydshieh ydshieh requested a review from stas00 January 4, 2023 17:44
@ydshieh ydshieh changed the title [WIP] Fix docker image build issue [WIP] Fix (DeepSpeed) docker image build issue Jan 4, 2023
@ydshieh ydshieh changed the title [WIP] Fix (DeepSpeed) docker image build issue Fix (DeepSpeed) docker image build issue Jan 4, 2023
@ydshieh ydshieh marked this pull request as ready for review January 4, 2023 17:45
Contributor

@stas00 stas00 left a comment

This is interesting as I have always seen apex build serially and no matter what I tried I couldn't make it build in parallel. Perhaps this has changed recently.

So yes, your fix is perfect, @ydshieh. I'd also add a comment explaining why it's there, so that future readers don't delete it to, say, make things go faster (as apex takes forever to build).

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
@ydshieh ydshieh merged commit 94db825 into main Jan 4, 2023
@ydshieh ydshieh deleted the fix_docker_image_build branch January 4, 2023 20:28
silverriver pushed a commit to silverriver/transformers that referenced this pull request Jan 6, 2023
* Fix docker image build issue

* remove comment

* Add comment

* Update docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
venkat-natchi pushed a commit to venkat-natchi/transformers that referenced this pull request Jan 22, 2023
miyu386 pushed a commit to miyu386/transformers that referenced this pull request Feb 9, 2023