Fix (DeepSpeed) docker image build issue #21002
Conversation
@@ -117,7 +117,6 @@ jobs:
       name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)"
       # Can't run in parallel, otherwise get an error:
       # `Error response from daemon: Get "https://registry-1.docker.io/v2/": received unexpected HTTP status: 503 Service Unavailable`
-      needs: latest-torch-deepspeed-docker
This is no longer needed - and we need to remove it to avoid an overly long build time caused by the MAX_JOBS=1 change below.
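For readers less familiar with GitHub Actions: `needs:` makes one job wait for another to finish, so removing it lets the two image-build jobs run concurrently. A minimal sketch of the workflow shape this touches (job ids, runner label, and action version are illustrative, not copied from the actual transformers workflow file):

    jobs:
      latest-torch-deepspeed-docker:
        name: "Latest PyTorch + DeepSpeed"
        runs-on: ubuntu-latest
        steps:
          - name: Build and push the daily-CI image
            uses: docker/build-push-action@v3   # illustrative; real workflow pins its own version

      latest-torch-deepspeed-docker-for-push-ci:
        name: "Latest PyTorch + DeepSpeed (Push CI - Daily Build)"
        runs-on: ubuntu-latest
        # needs: latest-torch-deepspeed-docker   <- removed by this PR, so both
        #                                            jobs can now run in parallel
        steps:
          - name: Build and push the push-CI image
            uses: docker/build-push-action@v3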
@@ -27,7 +27,7 @@ RUN python3 -m pip install torch-tensorrt==1.3.0 --find-links https://github.com
 # recompile apex
 RUN python3 -m pip uninstall -y apex
 RUN git clone https://github.com/NVIDIA/apex
-RUN cd apex && python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
+RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .
To avoid the memory issue that happened here.
The documentation is not available anymore as the PR was closed or merged.
This is interesting, as I have always seen apex build serially and, no matter what I tried, I couldn't make it build in parallel. Perhaps this has changed recently.
So yes, your fix is perfect, @ydshieh - and I'd add a comment explaining why it's there, so future readers won't delete it thinking it would make things go faster (as apex takes forever to build).
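As an illustration of the kind of comment that could sit next to the change (a sketch only, based on the diff above; the exact wording added in the PR may differ):

    # recompile apex
    RUN python3 -m pip uninstall -y apex
    RUN git clone https://github.com/NVIDIA/apex
    # MAX_JOBS=1 limits the apex compilation to a single worker: the parallel
    # build was running out of memory and breaking the image build.
    # Do NOT remove it just to make the build faster.
    RUN cd apex && MAX_JOBS=1 python3 -m pip install --global-option="--cpp_ext" --global-option="--cuda_ext" --no-cache -v --disable-pip-version-check .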
* Fix docker image build issue
* remove comment
* Add comment
* Update docker/transformers-pytorch-deepspeed-latest-gpu/Dockerfile

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
What does this PR do?
Currently, the docker image build job Latest PyTorch + DeepSpeed fails from time to time. The issue started after #20788, where apex is recompiled during the build. It seems to be a resource issue (most likely a memory issue) caused by the parallel build (multiple workers), so this PR sets MAX_JOBS=1 to avoid the failure.
This increases the build time to 1h30m. Since we have to build 2 identical images (for daily CI and push CI), building them sequentially would take 3h, which is way too long. Previously those 2 images were built sequentially due to some issue, but that issue now seems to be gone, so we can build them in parallel.