Add apex install to pytorch wheel #2280
…apex wheel and output in write_torch_versions, extract apex version in .yml file and use it to copy the wheel to s3 bucket
Comment from @ScottTodd #2248 (comment) When these changes are ready there are some docs to update too in https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/README.md
Also in https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-pytorch-python-packages, we should mention
Do not test in prod. Nightly releases are reserved for scheduled runs on approved/merged code ONLY. I missed this while you were testing but please contact us before running any such workflows. @marbre was in the process of making this ROCm/TheRock repository technically unable to upload nightly releases before he went on vacation (nightly releases will be built by https://github.com/ROCm/rockrel too). Until that work is complete, only use the default "dev" release type which uploads to https://rocm.devreleases.amd.com/ instead of https://rocm.nightlies.amd.com/.
checkout_p.add_argument(
    "--hipify",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Run hipify",
)
Out of curiosity, why does https://github.com/ROCm/apex require running HIPIFY here? As that repository is forked from https://github.com/NVIDIA/apex, could the sources be run through HIPIFY at rest?
(I think running HIPIFY after checkout for consistency with how other repositories are set up is fine, but I do want to confirm first)
exec(
    [
        sys.executable,
        "-m",
        "build",
        "--wheel",
        "--no-isolation",
        "-C--build-option=--cpp_ext",
        "-C--build-option=--cuda_ext",
    ],
    cwd=apex_dir,
    env=env,
)
I pulled down a whl from https://rocm.nightlies.amd.com/v2-staging/gfx94X-dcgpu/apex/ (^), I see a bunch of cuda and nccl shared object files:
- Is that expected? Should there be HIP / RCCL file names instead?
- Are these libraries portable across Linux distributions since we build under manylinux, or do they have dependencies on system libraries?
- Is the binary size of 80MB expected? Seems on the larger side.
(^)(again do NOT test in prod - you should test with https://rocm.devreleases.amd.com/)
I think we should start with just the changes in external-builds/pytorch/ to allow local builds of apex and then in a second PR include the other changes to get the automation building the wheels. That may have also caught the "don't test in prod" issues and allowed for earlier feedback on the approach.
@ScottTodd Which file(s) should I include in the second PR? Do you mean .github/workflows/build_portable_linux_pytorch_wheels.yml?
1st PR: the changes to external-builds/pytorch/
2nd PR: all other changes (.github/workflows/, build_tools/github_actions/)
The changes to build_tools/third_party/s3_management/manage.py could go in either PR
update comment Co-authored-by: Scott Todd <scott.todd0@gmail.com>
Motivation
Create apex wheels in https://rocm.nightlies.amd.com/v2 for Linux.
https://github.com/NVIDIA/apex contains NVIDIA-maintained utilities that streamline mixed precision training, distributed training, and fused kernels in PyTorch. Some modules are eventually ported into PyTorch itself. The intent of Apex is to make up-to-date utilities available to users as quickly as possible. https://github.com/rocm/apex is a ROCm port of the NVIDIA/apex repository.
Apex is used in deep learning projects such as Megatron-LM, Hugging Face Transformers, OpenSORA, etc.
Apex is installed using the following command:
python setup.py install --cpp_ext --cuda_ext
Technical Details
This is the main workflow file run by CI to create PyTorch wheels: https://github.com/ROCm/TheRock/blob/main/.github/workflows/build_portable_linux_pytorch_wheels.yml
To add apex:
The following diagram and description explain the main steps for creating the apex wheel.
Based on the PyTorch version, check out either the related commit or master (for nightly) in the .yml file:
a. Create Python code to handle checking out the GitHub source, under https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch
pytorch_apex_repo.py
This script checks out the GitHub repository into the THIS_MAIN_REPO_NAME folder, using the git hash or branch passed in from the .yml file.
This is handled by the following lines in the .yml file:
b. Add a method to https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py for building apex
This code checks whether the apex folder was created by the checkout step and, if so, builds the wheel into the dist folder.
c. Add code to extract the apex wheel version in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/write_torch_versions.py
This code outputs the wheel version in the variable all_versions.
Add code to the .yml file that first extracts the apex wheel version from that output, then uses that version to upload the wheel to the S3 bucket.
d. Add an apex entry in https://github.com/ROCm/TheRock/blob/main/build_tools/third_party/s3_management/manage.py
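Step b can be sketched roughly as follows. This is a minimal illustration, not the code in build_prod_wheels.py: the helper names apex_build_command and plan_apex_build are hypothetical, and only the command-line arguments are taken from the diff quoted in the review. It assumes the `build` package is installed in the active environment.

```python
import sys
from pathlib import Path
from typing import List, Optional


def apex_build_command(python_exe: str = sys.executable) -> List[str]:
    """Return the wheel build command (arguments as quoted in the review diff)."""
    return [
        python_exe,
        "-m",
        "build",
        "--wheel",
        "--no-isolation",
        # Forward the extension flags to apex's setup.py.
        "-C--build-option=--cpp_ext",
        "-C--build-option=--cuda_ext",
    ]


def plan_apex_build(apex_dir: Path) -> Optional[List[str]]:
    """Hypothetical helper: return the command to run inside apex_dir,
    or None when the checkout step did not create the folder."""
    if not apex_dir.is_dir():
        return None
    return apex_build_command()
```

A caller would then run the returned command with something like `subprocess.run(cmd, cwd=apex_dir, env=env, check=True)`, after which the wheel is expected under `apex_dir / "dist"`.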
Test Plan
Run the action in GitHub:
https://github.com/ROCm/TheRock/actions/workflows/build_portable_linux_pytorch_wheels.yml
with the following arguments:
Test Result
Generated apex wheels at https://rocm.devreleases.amd.com/v2-staging/gfx94X-dcgpu/apex/
apex-1.10.0+rocmsdk20251122-cp312-cp312-linux_x86_64.whl
apex-1.9.0+rocmsdk20251125-cp312-cp312-linux_x86_64.whl
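The version that the workflow needs for the S3 upload is embedded in these filenames. A minimal sketch of extracting it, assuming the standard wheel filename layout (name-version-[build tag-]python tag-abi tag-platform tag.whl); the function name wheel_version is hypothetical and this is not the code in write_torch_versions.py:

```python
def wheel_version(filename: str) -> str:
    """Return the version field of a wheel filename."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(suffix)].split("-")
    # Five fields without a build tag, six with one; version is always second.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return parts[1]


if __name__ == "__main__":
    print(wheel_version("apex-1.10.0+rocmsdk20251122-cp312-cp312-linux_x86_64.whl"))
```

Because the version contains a `+` local version segment, it may need URL-encoding (as `%2B`) if it is later placed in a download URL.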
release2.9 - https://github.com/ROCm/TheRock/actions/runs/19788483745
master - https://github.com/ROCm/TheRock/actions/runs/19823557793
Submission Checklist