-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add deepspeed test to amd scheduled CI #27633
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
@@ -271,3 +271,39 @@ jobs: | |||
REF=main | |||
push: true | |||
tags: huggingface/transformers-tensorflow-gpu | |||
|
|||
# latest-pytorch-deepspeed-amd: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ydshieh is there something we need to do here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another issue that we can deal with outside this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But here we have to build the image manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just added one manually so that we can verify the deepspeed tests : echarlaix/amd-deepspeed-test
@@ -5,7 +5,7 @@ on: | |||
- cron: "17 2 * * *" | |||
push: | |||
branches: | |||
- run_amd_scheduled_ci_caller* | |||
- run_amd_scheduled_ci_caller__* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove this modification before merging (added to disable all the other AMD scheduled tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @echarlaix !
LGTM 🚀 , except one question in the docker file.
FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 | ||
LABEL maintainer="Hugging Face" | ||
|
||
ARG DEBIAN_FRONTEND=noninteractive | ||
ARG PYTORCH='2.0.1' | ||
ARG ROCM='5.7' | ||
|
||
RUN apt update && \ | ||
apt install -y --no-install-recommends libaio-dev git && \ | ||
apt clean && \ | ||
rm -rf /var/lib/apt/lists/* | ||
|
||
RUN python3 -m pip install --no-cache-dir --upgrade pip | ||
|
||
RUN python3 -m pip uninstall -y apex | ||
|
||
ARG REF=main | ||
WORKDIR / | ||
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF | ||
RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing] | ||
|
||
# When installing in editable mode, `transformers` is not recognized as a package. | ||
# this line must be added in order for python to be aware of transformers. | ||
RUN cd transformers && python3 setup.py develop | ||
|
||
RUN python3 -c "from deepspeed.launcher.runner import main" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will (always) use the torch 2.0.1
which is not what we want. I suggest to
-
either based on
docker/transformers-pytorch-amd-gpu/Dockerfile
on currentmain
-
or not even to build an AMD+deepespeed image: just build+install deepspeed at CI runtime
-
otherwise, based on this image, but need to install the
- ARG PYTORCH='2.1.0'
- ARG TORCH_VISION='0.16.0'
- ARG TORCH_AUDIO='2.1.0'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ydshieh that makes sense, just upgraded the torch version in f846b80.
I cannot use docker/transformers-pytorch-amd-gpu/Dockerfile
at the moment as there is an incompatibility with ROCm and deepspeed for some reason (coming from the adam extension), but can modify it and change its parent image if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, try to build the image + run it to make sure we are still good.
(more importantly, check the versions of installed torch etc. do give the expected versions - we do get some surprise sometimes :-) )
Regarding where to build the image, let's talk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I can update huggingface/transformers-pytorch-deepspeed-amd-gpu
and trigger the tests again so that we can check everything is working as expected before merging. When running the tests locally (with the updated image so with torch==2.1.0+rocm5.6
), the failing tests looked like the same as for the current CI
Let's cancel it. I will show you how we usually do the experimentation tomorrow. |
Sorry! |
# Invalidate docker cache from here if new commit is available. | ||
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary? I mean, do you see any issue so adding this line to avoid it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the motivation: https://stackoverflow.com/questions/36996046/how-to-prevent-dockerfile-caching-git-clone
We do not want docker to cache the following git clone
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with this part. Is it relevant only if we build the image on the same machine multiple times?
So far we don't have such issue, but maybe that is because we build the image on Github Actions hosted machine rather than self-hosted VM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the issue locally when using this docker image, where I would not pick up the latest transformers commit due to docker cache. I think it is useful to make sure we use the latest commit - even though in the CI this is not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok。
So the issue occurs when you run the docker image, not at the time of docker build, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is more on when we use the image, I think it is expected. These images are not built for long-term usage: they are re-built on a daily basis to run the CI at a specific and the same commit .
Anyone wants to use the image locally is responsible to do git fetch && git pull
(or git checkout
).
If we accept this change (and IIUC), it means, on the CI (GitHub Actions), each job may get different commits to run the test against, which is not what we want.
The above is just to explain the current behavior (before this change), not meaning we have issue, as on CI, we have
git fetch && git checkout ${{ github.sha }}
so we are safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the issue occurs when you run the docker image, not at the time of docker build, right?
No, docker build
caches intermediate layers, one of them being RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF
. If we use REF=main
and actually want to use the latest commit, we need to invalidate the docker cache and this is what this ADD
is doing.
Otherwise, docker cached intermediate layer is used and we may use an outdated commit compared to the latest commit available at build time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so during build time. Thanks for the detailed explanation.
@@ -271,3 +271,39 @@ jobs: | |||
REF=main | |||
push: true | |||
tags: huggingface/transformers-tensorflow-gpu | |||
|
|||
# latest-pytorch-deepspeed-amd: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another issue that we can deal with outside this PR.
ADD https://api.github.com/repos/huggingface/transformers/git/refs/heads/main version.json | ||
RUN git clone https://github.com/huggingface/transformers && cd transformers && git checkout $REF | ||
|
||
RUN python3 -m pip install --no-cache-dir ./transformers[deepspeed-testing] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest to have
RUN DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check 2>&1
or whatever equivalent for deepspeed in ROCM if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure we can pre-compile deepspeed for this set of ops, was just wondering whether we can keep it in jit mode so that all the machine compatible ops can be dynamically build at runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jit mode will make some tests slower and potentially timeout, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @echarlaix . The most important is to do this again at CI time, as mentioned in the comment below.
(It may or may not relevant now, but I never checked again. I keep both to just avoid potential issue popping up)
- name: Reinstall transformers in edit mode (remove the one installed during docker image build) | ||
working-directory: /transformers | ||
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e . | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add
# To avoid unknown test failures
- name: Pre build DeepSpeed *again*
working-directory: /workspace
run: |
python3 -m pip uninstall -y deepspeed
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check
to be the same as in other workflow file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure to understand why we need to uninstall and reinstall deepspeed here, what issue does it solve ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember exactly, it has been one year or more ago. I can try to find from the history if you would like to have the information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in our case we don't need it at the moment, so does it work if we keep it that way ? If you want me to uninstall / reinstall it in the tests, I can directly update and use huggingface/transformers-pytorch-amd-gpu
and just install deepspeed in the tests directly (to avoid this step)
@@ -271,3 +271,39 @@ jobs: | |||
REF=main | |||
push: true | |||
tags: huggingface/transformers-tensorflow-gpu | |||
|
|||
# latest-pytorch-deepspeed-amd: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But here we have to build the image manually.
Hi @fxmarty Let me know if you have any question or need help regarding my above comments. |
Hi @echarlaix You don't need to run all the tests. Just make sure
|
There is 11 failing tests for AMD vs 7 for the current CI, these 4 tests are bf16 variant of already failing tests. Failing test current CI :
Failing tests for AMD :
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fxmarty @echarlaix ! I pushed a commit to finalize it.
Ping @LysandreJik for a core maintainer's review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all!
I will merge after a fix #27951 being merged. |
* add deepspeed scheduled test for amd * fix image * add dockerfile * add comment * enable tests * trigger * remove trigger for this branch * trigger * change runner env to trigger the docker build image test * use new docker image * remove test suffix from docker image tag * replace test docker image with original image * push new image * Trigger * add back amd tests * fix typo * add amd tests back * fix * comment until docker image build scheduled test fix * remove deprecated deepspeed build option * upgrade torch * update docker & make tests pass * Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile * fix * tmp disable test * precompile deepspeed to avoid timeout during tests * fix comment * trigger deepspeed tests with new image * comment tests * trigger * add sklearn dependency to fix slow tests * enable back other tests * final update --------- Co-authored-by: Felix Marty <felix@hf.co> Co-authored-by: Félix Marty <9808326+fxmarty@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* add deepspeed scheduled test for amd * fix image * add dockerfile * add comment * enable tests * trigger * remove trigger for this branch * trigger * change runner env to trigger the docker build image test * use new docker image * remove test suffix from docker image tag * replace test docker image with original image * push new image * Trigger * add back amd tests * fix typo * add amd tests back * fix * comment until docker image build scheduled test fix * remove deprecated deepspeed build option * upgrade torch * update docker & make tests pass * Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile * fix * tmp disable test * precompile deepspeed to avoid timeout during tests * fix comment * trigger deepspeed tests with new image * comment tests * trigger * add sklearn dependency to fix slow tests * enable back other tests * final update --------- Co-authored-by: Felix Marty <felix@hf.co> Co-authored-by: Félix Marty <9808326+fxmarty@users.noreply.github.com> Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
No description provided.