Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable pytorch nightly CI #17335

Merged
merged 3 commits into from
Jun 17, 2022
Merged

Enable pytorch nightly CI #17335

merged 3 commits into from
Jun 17, 2022

Conversation

ydshieh
Copy link
Collaborator

@ydshieh ydshieh commented May 18, 2022

What does this PR do?

  • Make necessary changes to build huggingface/transformers-pytorch-nightly-gpu
  • Update self-nightly-scheduled.yml to run PyTorch nightly build CI (almost a copy from scheduled CI)

test workflow run here.
docker build run

Print versions

Screenshot 2022-05-19 214304

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 18, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh ydshieh force-pushed the enable_pytorch_nightly_ci branch from 51f3a91 to b605c9c Compare May 19, 2022 12:04
@ydshieh
Copy link
Collaborator Author

ydshieh commented May 19, 2022

@stas00, @LysandreJik

I run the docker image transformers-pytorch-deepspeed-latest-gpu and found it's PyTroch has version 1.9.0.
This image is based on nvcr.io/nvidia/pytorch:21.03-py3, which is used in the job Test Torch CUDA Extensions for both daily scheduled CI and push CI.

We can discuss about this. But so far I won't include DeepSpeed test with nightly-built PyTorch.

>>> import torch
>>> torch.__version__
'1.9.0a0+df837d0'
>>> exit()

@ydshieh ydshieh requested review from LysandreJik and stas00 May 19, 2022 13:26
@stas00
Copy link
Contributor

stas00 commented May 19, 2022

I run the docker image transformers-pytorch-deepspeed-latest-gpu and found it's PyTroch has version 1.9.0.
This image is based on nvcr.io/nvidia/pytorch:21.03-py3, which is used in the job Test Torch CUDA Extensions for both daily scheduled CI and push CI.

But that's not nightly. I don't think you can rely on any pre-made docker to run nightly. It has to be manually installed since it is a new version every day. You probably want to switch to 22.04 (latest at the moment) and then update it to the actual nightly. Does it make sense?

I also don't understand why deepspeed tests were removed. It's critical that we run deepspeed tests on nightly.

@ydshieh
Copy link
Collaborator Author

ydshieh commented May 19, 2022

@stas00

The non DeepSpeed tests added in this PR use pytorch-nightly. You can verify in this run page https://github.com/huggingface/transformers/runs/6507879699?check_suite_focus=true and click Echo versions, and you will see Pytorch Version: 1.12.0.dev20220519+cu102. (It will change everyday).

DeepSpeed

Regarding this, we have something to discuss:

  • For the push and scheuled CIs currently running, the deepspeed tests are run with PyTorch 1.9.0.

    • Before going to the nightly pytorch with deepspeed, it might be better to decide what should we test for push and scheduled CI for deepspeed test. Should we use the latest stable Pytorch instead?
  • I don't mean to remove DeepSpeed tests with nightly PyTorch. The reason is that I am not able to make a docker image with PyTorch Nightly + DeepSpeed. I even tried with PyTorch Stable + DeepSpeed and the docker image also fails.

Details

In the Dockerfile transformers-pytorch-deepspeed-latest-gpu:

RUN python3 -m pip install --no-cache-dir -e ./transformers[deepspeed-testing]

# This fail with if we install Pytorch Nightly or Stable above.
RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
    DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1

@stas00
Copy link
Contributor

stas00 commented May 19, 2022

Should we use the latest stable Pytorch instead?

Yes, please.

I think we should use the latest stable pytorch for all our tests unless we explicitly test older pytorch versions every few days or so. And to ensure we update to the new stable once it gets released.

I even tried with PyTorch Stable + DeepSpeed and the docker image also fails.

Could you please point me to the actual issues and I will help to sort them out?

We can discuss it on slack if it's easier.

@ydshieh ydshieh marked this pull request as draft May 19, 2022 16:40
@ydshieh ydshieh marked this pull request as ready for review May 19, 2022 19:46
Copy link
Contributor

@stas00 stas00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you, @ydshieh!

.github/workflows/self-nightly-scheduled.yml Show resolved Hide resolved
Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, @ydshieh! Thanks for taking care of that.

# this line must be added in order for python to be aware of transformers.
RUN cd transformers && python3 setup.py develop

RUN python3 -c "from deepspeed.launcher.runner import main"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to test it!

@ydshieh ydshieh force-pushed the enable_pytorch_nightly_ci branch from 618ef3a to 5909e53 Compare June 13, 2022 09:22
Copy link
Collaborator Author

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @LysandreJik @stas00

After the recent changes regarding tests, I have to update this PR. In particular, I would like to make sure the goal of the scheduled nightly test(s) are:

  • tests with nightly torch + latest TF (+ stable Flax)
  • tests with latest torch + nightly TF (+ stable Flax)
  • tests with latest torch + latest TF (+ nightlyFlax)

Here the summary of the latest changes in this PR

  • build a new docker image with nightly torch + latest stable TF
  • make self-nightly-scheduled the same as self-scheduled, except:
    • only keep the following tests (so far), running with docker images having nightly torch and/or deepspeed + stable release TF
      • run_tests_single_gpu
      • run_tests_multi_gpu
      • run_all_tests_torch_cuda_extensions_gpu
    • (we could add run_examples_gpu and run_pipelines_torch_gpu in a follow up PR if necessary)

@LysandreJik
Copy link
Member

I don't think we really care about nightly flax right now, so I would keep the following:

  • tests with nightly torch + latest TF
  • tests with latest torch + nightly TF

The rest sounds good to me!

@ydshieh ydshieh force-pushed the enable_pytorch_nightly_ci branch from 5909e53 to 3096e2a Compare June 13, 2022 15:54
Copy link
Collaborator Author

@ydshieh ydshieh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @LysandreJik @stas00 !

I have updated this PR. I think the best way to save your time reviewing is to check my comments below.

If everything is fine, I will run some tests to make sure before merge.

context: ./docker/transformers-all-latest-gpu
build-args: |
REF=main
PYTORCH=pre
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to install nightly build torch

@@ -1,250 +1,235 @@
name: Self-hosted runner; Nightly (scheduled)
name: Self-hosted runner (nightly)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically copied from self-scheduled.yml, with the necessary changes to use nightly built torch & deepspeed master.

Copy link
Collaborator Author

@ydshieh ydshieh Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I can change this back if prefered - I didn't pay attention here.

RUN_SLOW: yes
SIGOPT_API_TOKEN: ${{ secrets.SIGOPT_API_TOKEN }}
TF_FORCE_GPU_ALLOW_GROWTH: true
RUN_PT_TF_CROSS_TESTS: 1
Copy link
Collaborator Author

@ydshieh ydshieh Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the values in self-scheduled.yml instead of the values in the old self-nightly-scheduled.yml, hope this makes sense.

@@ -3,6 +3,9 @@ LABEL maintainer="Hugging Face"

ARG DEBIAN_FRONTEND=noninteractive

# Used to read variables from `~/.profile` (to pass dynamic created variables between RUN commands)
SHELL ["sh", "-lc"]
Copy link
Collaborator Author

@ydshieh ydshieh Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dockfile is quite limited, and I really need to pass some variables between RUN command.

@@ -21,11 +24,20 @@ ARG REF=main
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
RUN python3 -m pip install --no-cache-dir -e ./transformers[dev,onnxruntime]

RUN python3 -m pip install --no-cache-dir -U torch==$PYTORCH torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA
# TODO: Handle these in a python utility script
RUN [ ${#PYTORCH} -gt 0 -a "$PYTORCH" != "pre" ] && VERSION='torch=='$PYTORCH'.*' || VERSION='torch'; echo "export VERSION='$VERSION'" >> ~/.profile
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where we need SHELL ["sh", "-lc"] in order to pass VERSION to the next line.

# `torchvision` and `torchaudio` should be installed along with `torch`, especially for nightly build.
# Currently, let's just use their latest releases (when `torch` is installed with a release version)
# TODO: We might need to specify proper versions that work with a specific torch version (especially for past CI).
RUN [ "$PYTORCH" != "pre" ] && python3 -m pip install --no-cache-dir -U $VERSION torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/$CUDA || python3 -m pip install --no-cache-dir -U --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/$CUDA
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think we can make things and life easier here. I have some POC, and can do it in a follow-up PR.

# This has to be run inside the GPU VMs running the tests. (So far, it fails here due to GPU checks during compilation.)
# Issue: https://github.com/microsoft/DeepSpeed/issues/2010
# RUN git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build && \
# DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to fix this one quickly I think on the DS side - let's see what the maintainers think - I can make a PR to not have it fail easily.

Copy link
Collaborator Author

@ydshieh ydshieh Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no rush, as pre-build works inside docker run. We can always change it back (to pre-build DeepSpeed in Dockerfile) once the issue is fixed on DeepSpeed side.

I will double check that issue before merging this PR (and make the cleaning up if the fix is done at that time).

# This has to be run (again) inside the GPU VMs running the tests.
# The installation works here, but some tests fail, if we don't pre-build deepspeed again in the VMs running the tests.
# TODO: Find out why test fail.
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to pre-build again in workflow file (i.e. when docker is launched) - until a fix in deepspeed is done.

@stas00
Copy link
Contributor

stas00 commented Jun 13, 2022

Looks good to me, @ydshieh! Thank you!

python3 -m pip uninstall -y deepspeed
rm -rf DeepSpeed
git clone https://github.com/microsoft/DeepSpeed && cd DeepSpeed && rm -rf build
DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 python3 -m pip install . --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-build deepspeed (master version), which is currently necessary.

@ydshieh
Copy link
Collaborator Author

ydshieh commented Jun 14, 2022

A full workflow run is here.

@ydshieh ydshieh force-pushed the enable_pytorch_nightly_ci branch from 1decdc0 to 417c586 Compare June 17, 2022 09:04
@ydshieh ydshieh merged commit ca169db into main Jun 17, 2022
@ydshieh ydshieh deleted the enable_pytorch_nightly_ci branch June 17, 2022 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants