Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deepspeed test to amd scheduled CI #27633

Merged
merged 35 commits into from
Dec 11, 2023
Merged
Changes from 1 commit
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
1e8ce66
add deepspeed scheduled test for amd
echarlaix Nov 21, 2023
bf276ed
fix image
echarlaix Nov 21, 2023
2cfb53d
add dockerfile
echarlaix Nov 23, 2023
5a9a529
add comment
echarlaix Nov 23, 2023
af46e87
enable tests
echarlaix Nov 23, 2023
c29d249
trigger
echarlaix Nov 27, 2023
a0c3daf
remove trigger for this branch
echarlaix Nov 27, 2023
4cb9d6f
trigger
echarlaix Nov 28, 2023
a703349
change runner env to trigger the docker build image test
echarlaix Nov 28, 2023
a47ac2c
use new docker image
echarlaix Nov 28, 2023
233bd7f
remove test suffix from docker image tag
echarlaix Nov 28, 2023
971ba80
replace test docker image with original image
echarlaix Nov 28, 2023
da4774c
push new image
echarlaix Nov 29, 2023
cbe995f
Trigger
echarlaix Nov 30, 2023
e16c271
add back amd tests
echarlaix Nov 30, 2023
70c3580
fix typo
echarlaix Nov 30, 2023
090b88e
add amd tests back
echarlaix Nov 30, 2023
508ae29
fix
echarlaix Nov 30, 2023
09fee9e
comment until docker image build scheduled test fix
echarlaix Nov 30, 2023
407cfe9
remove deprecated deepspeed build option
echarlaix Nov 30, 2023
f846b80
upgrade torch
echarlaix Nov 30, 2023
785b63a
update docker & make tests pass
Dec 4, 2023
ba8cc9f
Merge branch 'main' into run_amd_scheduled_ci_caller_deepspeed_test
fxmarty Dec 4, 2023
f0f931e
Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile
fxmarty Dec 5, 2023
40398b9
fix
echarlaix Dec 5, 2023
3332cd2
tmp disable test
echarlaix Dec 5, 2023
9696cc4
precompile deepspeed to avoid timeout during tests
echarlaix Dec 5, 2023
84a7a33
fix comment
echarlaix Dec 5, 2023
fc6d890
Merge branch 'main' into run_amd_scheduled_ci_caller_deepspeed_test
echarlaix Dec 5, 2023
df00cff
trigger deepspeed tests with new image
echarlaix Dec 5, 2023
92c402d
comment tests
echarlaix Dec 5, 2023
fa82a9c
trigger
echarlaix Dec 6, 2023
ecb9239
add sklearn dependency to fix slow tests
echarlaix Dec 6, 2023
cfcc312
enable back other tests
echarlaix Dec 6, 2023
ae82b3f
final update
ydshieh Dec 7, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next Next commit
add deepspeed scheduled test for amd
  • Loading branch information
echarlaix committed Nov 21, 2023
commit 1e8ce6607b49ca32c009e4e86467df8ae11a84b6
60 changes: 58 additions & 2 deletions .github/workflows/self-scheduled-amd.yml
Original file line number Diff line number Diff line change
Expand Up @@ -356,6 +356,62 @@ jobs:
name: ${{ matrix.machine_type }}_run_tests_torch_pipeline_gpu
path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_pipeline_gpu

run_all_tests_torch_rocm_deepspeed_gpu:
name: Torch ROCm deepspeed tests
strategy:
fail-fast: false
matrix:
machine_type: [single-gpu, multi-gpu]

runs-on: [self-hosted, docker-gpu, amd-gpu, '${{ matrix.machine_type }}', '${{ inputs.gpu_flavor }}']
needs: setup
container:
image: huggingface/transformers-pytorch-amd-gpu
options: --device /dev/kfd --device /dev/dri --env ROCR_VISIBLE_DEVICES --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface:/mnt/cache/
steps:
- name: Update clone
working-directory: /transformers
run: git fetch && git checkout ${{ github.sha }}

- name: Reinstall transformers in edit mode (remove the one installed during docker image build)
working-directory: /transformers
run: python3 -m pip uninstall -y transformers && python3 -m pip install -e .[deepspeed]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add

      # To avoid unknown test failures
      - name: Pre build DeepSpeed *again*
        working-directory: /workspace
        run: |
          python3 -m pip uninstall -y deepspeed
          DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 python3 -m pip install deepspeed --global-option="build_ext" --global-option="-j8" --no-cache -v --disable-pip-version-check

to be the same as in other workflow file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure to understand why we need to uninstall and reinstall deepspeed here, what issue does it solve ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly, it has been one year or more ago. I can try to find from the history if you would like to have the information.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in our case we don't need it at the moment, so does it work if we keep it that way ? If you want me to uninstall / reinstall it in the tests, I can directly update and use huggingface/transformers-pytorch-amd-gpu and just install deepspeed in the tests directly (to avoid this step)

- name: ROCM-SMI
run: |
rocm-smi
- name: ROCM-INFO
run: |
rocminfo | grep "Agent" -A 14
- name: Show ROCR environment
run: |
echo "ROCR: $ROCR_VISIBLE_DEVICES"

- name: Environment
working-directory: /transformers
run: |
python3 utils/print_env.py

- name: Show installed libraries and their versions
working-directory: /transformers
run: pip freeze

- name: Run all tests on GPU
working-directory: /transformers
run: python3 -m pytest -v --make-reports=${{ matrix.machine_type }}_tests_torch_rocm_deepspeed_gpu tests/deepspeed tests/extended

- name: Failure short reports
if: ${{ failure() }}
continue-on-error: true
run: cat /transformers/reports/${{ matrix.machine_type }}_tests_torch_rocm_deepspeed_gpu/failures_short.txt

- name: Test suite reports artifacts
if: ${{ always() }}
uses: actions/upload-artifact@v3
with:
name: ${{ matrix.machine_type }}_run_tests_torch_rocm_deepspeed_gpu_test_reports
path: /transformers/reports/${{ matrix.machine_type }}_tests_torch_rocm_deepspeed_gpu

run_extract_warnings:
name: Extract warnings in CI artifacts
runs-on: ubuntu-22.04
Expand All @@ -368,7 +424,7 @@ jobs:
run_tests_multi_gpu,
run_examples_gpu,
run_pipelines_torch_gpu,
# run_all_tests_torch_cuda_extensions_gpu
run_all_tests_torch_rocm_deepspeed_gpu
]
steps:
- name: Checkout transformers
Expand Down Expand Up @@ -417,7 +473,7 @@ jobs:
run_tests_multi_gpu,
run_examples_gpu,
run_pipelines_torch_gpu,
# run_all_tests_torch_cuda_extensions_gpu,
run_all_tests_torch_rocm_deepspeed_gpu,
run_extract_warnings
]
steps:
Expand Down
Loading