amd-sriram commented Nov 24, 2025

Motivation

Create apex wheels in https://rocm.nightlies.amd.com/v2 for Linux.

https://github.com/NVIDIA/apex contains NVIDIA-maintained utilities to streamline mixed precision, distributed training, and fused kernels in PyTorch. Some modules are eventually ported into PyTorch itself; the intent of Apex is to make up-to-date utilities available to users as quickly as possible. https://github.com/rocm/apex is the ROCm port of the nvidia/apex repo.

Apex is used by deep learning projects such as Megatron-LM, Hugging Face Transformers, OpenSORA, etc.

Apex is installed using the following command:

python setup.py install --cpp_ext --cuda_ext

Technical Details

This is the main file run by CI for creating PyTorch wheels: https://github.com/ROCm/TheRock/blob/main/.github/workflows/build_portable_linux_pytorch_wheels.yml

To add apex, the following diagram and description explain the main steps for creating the apex wheel.

[Diagram: main steps of the wheel build workflow]
  1. Checkout PyTorch Source Repos

Based on the PyTorch version, check out either the related commit or master (if nightly) in the .yml file:

./external-builds/pytorch/pytorch_audio_repo.py checkout --require-related-commit
./external-builds/pytorch/pytorch_apex_repo.py checkout --repo-hashtag master

a. Create Python code to handle the checkout of the GitHub source at https://github.com/ROCm/TheRock/tree/main/external-builds/pytorch

pytorch_apex_repo.py

THIS_MAIN_REPO_NAME = "apex"
DEFAULT_ORIGIN = "https://github.com/ROCm/apex.git"
DEFAULT_HASHTAG = "master"

This script checks out the GitHub repo into the THIS_MAIN_REPO_NAME folder, using the git hash or branch passed in the .yml file.
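
For orientation, here is a minimal, self-contained sketch of what such a checkout script does. This is an illustration only: the real pytorch_apex_repo.py reuses TheRock's shared checkout helpers and exposes a checkout subcommand, so everything here beyond the constants above is assumed.

import argparse
import subprocess
from pathlib import Path

THIS_MAIN_REPO_NAME = "apex"
DEFAULT_ORIGIN = "https://github.com/ROCm/apex.git"
DEFAULT_HASHTAG = "master"

def checkout(repo_hashtag: str) -> None:
    # Clone the repo next to this script if it is not already present,
    # then check out the requested branch or commit hash.
    repo_dir = Path(__file__).parent / THIS_MAIN_REPO_NAME
    if not repo_dir.exists():
        subprocess.run(["git", "clone", DEFAULT_ORIGIN, str(repo_dir)], check=True)
    subprocess.run(["git", "fetch", "origin"], cwd=repo_dir, check=True)
    subprocess.run(["git", "checkout", repo_hashtag], cwd=repo_dir, check=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--repo-hashtag", default=DEFAULT_HASHTAG)
    args = parser.parse_args()
    checkout(args.repo_hashtag)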

  2. Build PyTorch Wheels

This is handled by the following lines in the .yml file:

      - name: Build PyTorch Wheels
        id: build-pytorch-wheels
        run: |
          echo "Building PyTorch wheels for ${{ inputs.amdgpu_family }}"
          ./external-builds/pytorch/build_prod_wheels.py \
            ...
          python ./build_tools/github_actions/write_torch_versions.py --dist-dir ${{ env.PACKAGE_DIST_DIR }}

b. Add a method to https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py for building apex

def do_build_apex(
    args: argparse.Namespace, apex_dir: Path, env: dict[str, str]
):
    # Compute the apex build version from version.txt plus the ROCm version suffix.
    build_version = (apex_dir / "version.txt").read_text().strip()
    build_version += args.version_suffix
    print(f"  Default apex BUILD_VERSION: {build_version}")
    env["BUILD_VERSION"] = build_version
    env["BUILD_NUMBER"] = args.pytorch_build_number

    # Start from a clean dist/ directory (and build/ directory if --clean).
    remove_dir_if_exists(apex_dir / "dist")
    if args.clean:
        remove_dir_if_exists(apex_dir / "build")

    # Build the wheel with apex's C++ and CUDA/HIP extensions enabled.
    # Note: exec() here is the script's subprocess helper, not the Python builtin.
    exec(
        [
            sys.executable,
            "-m",
            "build",
            "--wheel",
            "--no-isolation",
            "-C--build-option=--cpp_ext",
            "-C--build-option=--cuda_ext",
        ],
        cwd=apex_dir,
        env=env,
    )
    built_wheel = find_built_wheel(apex_dir / "dist", "apex")
    print(f"Found built wheel: {built_wheel}")
    copy_to_output(args, built_wheel)

This code computes the build version from the apex folder created in the checkout step, builds the wheel into its dist folder, and copies the built wheel to the output directory.
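
The snippet relies on helpers such as find_built_wheel from build_prod_wheels.py. As a rough illustration of what that helper does (the actual implementation in the script may differ), it amounts to something like:

from pathlib import Path

def find_built_wheel(dist_dir: Path, dist_package: str) -> Path:
    # Locate the single wheel produced for the given package name.
    candidates = sorted(dist_dir.glob(f"{dist_package}-*.whl"))
    if not candidates:
        raise FileNotFoundError(f"No {dist_package} wheel found in {dist_dir}")
    if len(candidates) > 1:
        raise RuntimeError(f"Multiple {dist_package} wheels found: {candidates}")
    return candidates[0]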

c. Add code to extract the apex wheel version in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/write_torch_versions.py

apex_version = get_wheel_version(package_dist_dir, "apex")
if apex_version:
    all_versions = all_versions | {"apex_version": apex_version}
elif os.lower() == "windows":
    _log("Did not find apex (that's okay, apex is not currently built on Windows)")
else:
    raise FileNotFoundError("Did not find apex wheel")

This code adds the apex wheel version to the all_versions dictionary that the script outputs.
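
As a rough illustration of how get_wheel_version can recover the version (the actual helper in write_torch_versions.py may be implemented differently), the version string is the second dash-separated field of the wheel filename:

from pathlib import Path

def get_wheel_version(package_dist_dir: Path, dist_package: str) -> str | None:
    # Wheel filenames look like:
    #   apex-1.10.0+rocmsdk20251122-cp312-cp312-linux_x86_64.whl
    # The second dash-separated field is the version string.
    wheels = sorted(Path(package_dist_dir).glob(f"{dist_package}-*.whl"))
    if not wheels:
        return None
    return wheels[0].name.split("-")[1]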

  3. Release PyTorch Wheels to S3

Add code to the .yml file to expose the apex wheel version as an output of the build step, pass it to the release job, and then use that version to select the apex wheel when uploading to the S3 bucket.

apex_version: ${{ steps.build-pytorch-wheels.outputs.apex_version }}

APEX_VERSION: "${{ needs.build_pytorch_wheels.outputs.apex_version }}"

aws s3 cp \
--include "apex-${APEX_VERSION}-${CP_VERSION}-linux_x86_64.whl" \
  4. Add apex wheel index to https://rocm.nightlies.amd.com/

d. Add an apex entry in https://github.com/ROCm/TheRock/blob/main/build_tools/third_party/s3_management/manage.py
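
Purely as an illustration of the kind of change this is (the variable name PACKAGE_ALLOW_LIST and the listed package names are hypothetical; manage.py's actual structure may differ), the entry could be as small as adding "apex" to the list of packages that get an index page:

# Hypothetical excerpt from manage.py: if the script keeps an allow-list of
# package names whose wheels get an index page, apex needs one new entry.
PACKAGE_ALLOW_LIST = [
    "torch",
    "torchaudio",
    "torchvision",
    "apex",  # new: generate an index for the apex wheels
]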

Test Plan

Run the action in GitHub:
https://github.com/ROCm/TheRock/actions/workflows/build_portable_linux_pytorch_wheels.yml

with the following arguments:

branch - add_apex_install_to_pytorch_wheel
type of release - dev
pytorch version to checkout - release/2.9, nightly (contains JIT load code)
rocm version - rocm_2.9, nightly

Test Result

Generated apex wheels at https://rocm.devreleases.amd.com/v2-staging/gfx94X-dcgpu/apex/:

apex-1.10.0+rocmsdk20251122-cp312-cp312-linux_x86_64.whl
apex-1.9.0+rocmsdk20251125-cp312-cp312-linux_x86_64.whl

release/2.9 - https://github.com/ROCm/TheRock/actions/runs/19788483745
master - https://github.com/ROCm/TheRock/actions/runs/19823557793

Submission Checklist

amd-sriram commented Nov 24, 2025

Comment from @ScottTodd #2248 (comment)

When these changes are ready there are some docs to update too in https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/README.md

  • Add to support status, either nested under "torch" if part of the "torch" python package or with a new row if this is a new package (it looks like it is)
  • Add information about which branches are checked out to the table a bit below that
  • Mention the new checkout command in the quickstart instructions and in the "alternate branches / patch sets" instructions (lots of boilerplate there now and this will expand it, we may want to refactor the docs)

Also in https://github.com/ROCm/TheRock/blob/main/RELEASES.md#installing-pytorch-python-packages, we should mention

  • compatible versions of this new package
  • install instructions (or omit? we may also want to mention pytorch-triton-rocm but the docs are currently used for both Windows and Linux)

amd-sriram marked this pull request as ready for review November 26, 2025 14:28
@ScottTodd (Member) commented:

Test Plan

Run action in github: https://github.com/ROCm/TheRock/actions/workflows/build_portable_linux_pytorch_wheels.yml

with the following arguments:

branch - add_apex_install_to_pytorch_wheel
type of release - nightly
pytorch version to checkout - release/2.9, JIT load, nightly
rocm version - rocm_2.9, nightly

Do not test in prod. Nightly releases are reserved for scheduled runs on approved/merged code ONLY. I missed this while you were testing but please contact us before running any such workflows. @marbre was in the process of making this ROCm/TheRock repository technically unable to upload nightly releases before he went on vacation (nightly releases will be built by https://github.com/ROCm/rockrel too). Until that work is complete, only use the default "dev" release type which uploads to https://rocm.devreleases.amd.com/ instead of https://rocm.nightlies.amd.com/.

Comment on lines +110 to +115
checkout_p.add_argument(
    "--hipify",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Run hipify",
)
Member

Out of curiosity, why does https://github.com/ROCm/apex require running HIPIFY here? As that repository is forked from https://github.com/NVIDIA/apex, could the sources be run through HIPIFY at rest?

(I think running HIPIFY after checkout for consistency with how other repositories are set up is fine, but I do want to confirm first)

Comment on lines +906 to +918
exec(
    [
        sys.executable,
        "-m",
        "build",
        "--wheel",
        "--no-isolation",
        "-C--build-option=--cpp_ext",
        "-C--build-option=--cuda_ext",
    ],
    cwd=apex_dir,
    env=env,
)
Member

I pulled down a whl from https://rocm.nightlies.amd.com/v2-staging/gfx94X-dcgpu/apex/ (^) and I see a bunch of cuda and nccl shared object files:

[Image: listing of the shared object files bundled inside the apex wheel]
  1. Is that expected? Should there be HIP / RCCL file names instead?
  2. Are these libraries portable across Linux distributions since we build under manylinux, or do they have dependencies on system libraries?
  3. Is the binary size of 80MB expected? Seems on the larger side.

(^)(again do NOT test in prod - you should test with https://rocm.devreleases.amd.com/)

Member

I think we should start with just the changes in external-builds/pytorch/ to allow local builds of apex and then in a second PR include the other changes to get the automation building the wheels. That may have also caught the "don't test in prod" issues and allowed for earlier feedback on the approach.

Author

@ScottTodd Which file(s) should I include in the second PR? Do you mean .github/workflows/build_portable_linux_pytorch_wheels.yml?

Member

1st PR: the changes to external-builds/pytorch/
2nd PR: all other changes (.github/workflows/, build_tools/github_actions/)

The changes to build_tools/third_party/s3_management/manage.py could go in either PR

update comment

Co-authored-by: Scott Todd <scott.todd0@gmail.com>