
Remove int_scaled_mm's dependency on triton for cpu #128


Merged
merged 10 commits into pytorch:main on Oct 29, 2024

Conversation

Xia-Weiwen
Collaborator

@Xia-Weiwen commented Apr 8, 2024

The int_scaled_mm op in Torchao was originally designed for CUDA, but it is also needed for CPU. This PR adds a CPU path in intmm.int_scaled_mm. It is not registered as an implementation of torchao.int_scaled_mm because we want to use Inductor for further optimization, and Inductor cannot recognize the torchao.int_scaled_mm op.

This change requires pytorch/pytorch#136942; otherwise there may be numerical issues. It therefore works with PyTorch nightly builds from 20241026 onward.

Tests are covered by test/kernel/test_autotuner.py and test/prototype/test_smoothquant.py.
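For reference, a minimal sketch of the CPU path described above (illustrative only; the names follow the snippet discussed later in this thread, and the merged code may differ):

import torch

def int_scaled_matmul(a, b, scales1):
    if a.device.type == "cpu":
        # Eager CPU path: int8 x int8 -> int32 matmul via torch._int_mm,
        # then cast to the scale dtype and multiply by the scales.
        # Deliberately not registered as a torchao op so Inductor can see
        # this pattern and optimize it further.
        c = torch._int_mm(a, b)
        return c.to(scales1.dtype) * scales1
    # CUDA keeps using the registered torchao op.
    return torch.ops.torchao.int_scaled_matmul(a, b, scales1)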

@facebook-github-bot added the CLA Signed label Apr 8, 2024
@Xia-Weiwen
Collaborator Author

Hi @cpuhrsch, could you please review and see if the changes look reasonable to you? Thanks.

@Xia-Weiwen
Collaborator Author

Hi @cpuhrsch, could you please suggest how to deal with the issue that the CPU impl's availability depends on triton and AUTOTUNER_ENABLE? Thanks!

@cpuhrsch
Contributor

Hey @Xia-Weiwen - Thank you for the PR! Sorry for the delay in review. Also, please note the CI hasn't run green.

Another way to resolve this could be to move

@torch.library.impl(lib, "int_scaled_matmul", "CPU")
def int_scaled_matmul_cpu(a, b, scales1):
    c = torch._int_mm(a, b)
    return c.to(scales1.dtype) * scales1

into torchao/kernel/intmm.py, which shouldn't have a dependency on triton. Just be sure to also define lib = torch.library.Library("torchao", "FRAGMENT").
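Put together, the suggestion might look roughly like this inside torchao/kernel/intmm.py (a sketch only; it assumes the int_scaled_matmul op schema is already defined elsewhere in torchao):

import torch

# Library fragment so the CPU impl can be registered without importing triton.
lib = torch.library.Library("torchao", "FRAGMENT")

@torch.library.impl(lib, "int_scaled_matmul", "CPU")
def int_scaled_matmul_cpu(a, b, scales1):
    # int8 x int8 -> int32 matmul, then cast to the scale dtype and scale.
    c = torch._int_mm(a, b)
    return c.to(scales1.dtype) * scales1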

@Xia-Weiwen
Collaborator Author

@cpuhrsch Thanks! I will give it a try. One question: what is AUTOTUNER_ENABLE, and should the CPU impl depend on it?

@cpuhrsch
Contributor

@Xia-Weiwen - it's used for a Triton autotuner that allows us to cycle over a very large number of configs for a given fixed input shape. See https://github.com/pytorch-labs/ao/tree/main/torchao/kernel#autotuner-and-custom-triton-kernels

@Xia-Weiwen
Collaborator Author

Thank you @cpuhrsch. Looks like the CPU impl does not need this.
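For illustration, the kind of gate being discussed might look roughly like this, with the Triton path optional and the CPU impl kept outside it (the environment variable name and module path are assumptions, not the exact torchao code):

import os

# Only import the Triton-based autotuned kernels when explicitly enabled,
# so a CPU-only environment never needs triton installed.
if os.getenv("TORCHAO_AUTOTUNER_ENABLE", "0") == "1":
    from torchao.kernel import intmm_triton  # noqa: F401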


pytorch-bot bot commented Oct 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/128

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 97dfea8 with merge base cbd90e3:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Xia-Weiwen requested a review from cpuhrsch October 29, 2024 01:55
@Xia-Weiwen
Collaborator Author

Hi @cpuhrsch, this PR requires the latest torch nightly (after 20241026) to pass CI. May I know when the torch nightly will be updated in CI? Thanks.

@cpuhrsch
Contributor

@Xia-Weiwen - can you try merging the latest version of main? You might be built on top of a commit that pinned the nightly version. I see dev20241022 here: https://github.com/pytorch/ao/actions/runs/11550979799/job/32147052309?pr=128

@Xia-Weiwen
Collaborator Author


Thanks

@Xia-Weiwen
Collaborator Author

Hi @cpuhrsch, CI is green. Could you please review? Thanks.

@cpuhrsch merged commit 5cfc4c7 into pytorch:main Oct 29, 2024
17 checks passed
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024