
[reland2][ROCm] preshuffled weight mm #2207


Open · wants to merge 5 commits into base: main
Conversation

jeffdaily (Contributor)

No description provided.


pytorch-bot bot commented May 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2207

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures

As of commit b4115d3 with merge base 5549da8:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 13, 2025
@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


pytorch-bot bot commented May 14, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/rocm label May 14, 2025
@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mxz297

mxz297 commented May 14, 2025

@jeffdaily I am having trouble importing this PR. Can you first try to resolve the build errors?

@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mxz297

mxz297 commented May 15, 2025

@jeffdaily there is a linter failure.

@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mxz297

mxz297 commented May 15, 2025

@jeffdaily there is also a failure in the ROCm test:

module = Linear(in_features=32, out_features=128, bias=False)
config = MXFPInferenceConfig(block_size=32, activation_dtype=torch.float4_e2m1fn_x2, weight_dtype=torch.float4_e2m1fn_x2, gemm_kernel_choice=<MXGemmKernelChoice.CUTLASS: 'cutlass'>, set_inductor_config=False)

    @register_quantize_module_handler(MXFPInferenceConfig)
    def _mx_inference_linear_transform(
        module: torch.nn.Module, config: MXFPInferenceConfig
    ):
        # TODO Sm120 has slightly more restrictive reqs
        # TODO handle AMD
>       assert is_sm_at_least_100(), "MXFP is only supported on sm100 machiens for now"
E       AssertionError: MXFP is only supported on sm100 machiens for now

but it looks like this test should not be run on AMD at all?

cc @drisspg @atalman @jerryzh168
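A guard along these lines could skip the test on ROCm builds (a minimal sketch: `is_rocm_build` and the test class are hypothetical names, and the `torch.version.hip` check is an assumption about how ROCm wheels report themselves, not the actual torchao fix):

```python
import unittest

def is_rocm_build():
    # Hypothetical helper: torch.version.hip is a version string on
    # ROCm wheels and None on CUDA wheels; False if torch is absent.
    try:
        import torch
        return torch.version.hip is not None
    except ImportError:
        return False

@unittest.skipIf(is_rocm_build(), "MXFP CUTLASS path requires CUDA sm100+")
class TestMXFPInference(unittest.TestCase):
    def test_smoke(self):
        # Placeholder body; the real test would exercise the
        # quantized Linear module from the failure above.
        self.assertTrue(True)
```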

@pytorch-bot pytorch-bot bot removed the ciflow/rocm label May 15, 2025
@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@drisspg drisspg added the topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) label May 16, 2025
@drisspg (Contributor)

drisspg commented May 16, 2025

@mxz297 Yeah, this should be skipped; can you rebase past #2209?

@facebook-github-bot (Contributor)

@mxz297 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mxz297

mxz297 commented May 16, 2025

@pytorchbot run all


pytorch-bot bot commented May 16, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'run' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick', 'close')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick,close} ...

Try @pytorchbot --help for more info.

@mxz297

mxz297 commented May 16, 2025

@pytorchbot drci

@mxz297

mxz297 commented May 16, 2025

@drisspg @atalman @jerryzh168

There seem to be some CUDA test failures where arch string parsing has an issue. It feels unlikely that this PR caused them, but I want to double check with you folks:

Processing /pytorch/ao
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [13 lines of output]
      W0516 16:40:07.414810 215 site-packages/torch/utils/cpp_extension.py:118] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-12.6'
      W0516 16:40:07.421015 215 site-packages/torch/utils/cpp_extension.py:2414] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
      W0516 16:40:07.421015 215 site-packages/torch/utils/cpp_extension.py:2414] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 35, in <module>
        File "/pytorch/ao/setup.py", line 544, in <module>
          ext_modules=get_extensions(),
        File "/pytorch/ao/setup.py", line 432, in get_extensions
          cuda_arch_flags = _get_cuda_arch_flags()
        File "/opt/conda/envs/venv/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2434, in _get_cuda_arch_flags
          arch_list[-1] += '+PTX'
      IndexError: list index out of range

@mxz297

mxz297 commented May 16, 2025

Also a noob question: how do I restart CI, or is CI automatically restarted after a new commit is pushed?

@drisspg (Contributor)

drisspg commented May 16, 2025

@mxz297 If you are a Meta employee it will automatically restart on commit push, but unfortunately everyone else needs to kick it off manually.

@mxz297

mxz297 commented May 19, 2025

@drisspg @atalman @jerryzh168

Any insight on the following error?

Processing /pytorch/ao
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [13 lines of output]
      W0516 16:40:07.414810 215 site-packages/torch/utils/cpp_extension.py:118] No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-12.6'
      W0516 16:40:07.421015 215 site-packages/torch/utils/cpp_extension.py:2414] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
      W0516 16:40:07.421015 215 site-packages/torch/utils/cpp_extension.py:2414] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 35, in <module>
        File "/pytorch/ao/setup.py", line 544, in <module>
          ext_modules=get_extensions(),
        File "/pytorch/ao/setup.py", line 432, in get_extensions
          cuda_arch_flags = _get_cuda_arch_flags()
        File "/opt/conda/envs/venv/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2434, in _get_cuda_arch_flags
          arch_list[-1] += '+PTX'
      IndexError: list index out of range

@drisspg (Contributor)

drisspg commented May 19, 2025

Taking a look

@drisspg (Contributor)

drisspg commented May 19, 2025

Okay, so this is coming from this line:

>>> from torch.utils.cpp_extension import _get_cuda_arch_flags
>>> _get_cuda_arch_flags()
/Users/drisspg/.conda/envs/nightly/lib/python3.13/site-packages/torch/utils/cpp_extension.py:2410:
UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    _get_cuda_arch_flags()
    ~~~~~~~~~~~~~~~~~~~~^^
  File "/Users/drisspg/.conda/envs/nightly/lib/python3.13/site-packages/torch/utils/cpp_extension.py", line 2430, in _get_cuda_arch_flags
    arch_list[-1] += '+PTX'
    ~~~~~~~~~^^^^
IndexError: list index out of range

This happens when _get_cuda_arch_flags is called with no args and the default system arch is not picked up by this logic:

https://github.com/pytorch/pytorch/blob/6487ea30b3fb3fe550d0e8e7feaf25bc3cffb626/torch/utils/cpp_extension.py#L2360
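The failing pattern can be sketched without torch at all (a simplified, illustrative stand-in for torch's _get_cuda_arch_flags, not the real implementation): when neither a TORCH_CUDA_ARCH_LIST-style value nor a detected device arch produces any entries, the unconditional `arch_list[-1] += '+PTX'` append hits an empty list.

```python
def cuda_arch_flags(env_arch_list=None, detected_archs=()):
    # Simplified stand-in for torch.utils.cpp_extension._get_cuda_arch_flags:
    # build the arch list from the env-var string, else from detected GPUs.
    if env_arch_list:
        arch_list = env_arch_list.replace(" ", ";").split(";")
    else:
        arch_list = sorted(detected_archs)
    # The real code appends '+PTX' to the last entry unconditionally,
    # so an empty arch list raises IndexError -- the CI failure above.
    arch_list[-1] += "+PTX"
    return arch_list

# No env var and no detected card reproduces the crash:
try:
    cuda_arch_flags()
except IndexError as e:
    print(e)  # list index out of range

# Pinning the arch list (as the warning suggests) avoids it:
print(cuda_arch_flags(env_arch_list="8.0;9.0"))  # ['8.0', '9.0+PTX']
```

This suggests the CI environment detected no visible CUDA device and had TORCH_CUDA_ARCH_LIST unset, which matches the "No CUDA runtime is found" warning in the log.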

@drisspg (Contributor)

drisspg commented May 22, 2025

@jeffdaily Can you rebase? I am still a little confused by this CI.

Labels: ci-no-td · CLA Signed · module: rocm · topic: improvement

5 participants