[release/2.6] [SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute #2276

jataylo · 2025-06-16T12:46:12Z

Ensure fused nodes that allocate buffers come before kernels that use those buffers

In one example we observed:

op8 creates buf10 which mutates buf8
triton_poi_fused_index_put_lift_fresh_2 kernel tries to use buf8 and buf9
op6_op7_op16 (fused node) creates buf8 and buf9

But the standard topological sort didn't ensure that the fused node creating buf8 and buf9 came before the kernel using them.

After this PR, we identify that op8 performs a mutation on buf8, find the node responsible for creating that buffer (op6_op7_op16), and add an explicit dependency so that op8 depends on op6_op7_op16. The graph is then ordered accordingly.
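The dependency fix described above can be sketched outside of Inductor as follows. This is a minimal illustration, not the code in this PR: the `Node` class and the `add_mutation_deps` and `schedule` helpers are hypothetical names, and the standard-library `graphlib.TopologicalSorter` stands in for the scheduler's own topological sort.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    # Hypothetical stand-in for a scheduler node; not an Inductor class.
    name: str
    creates: set = field(default_factory=set)   # buffers this node allocates
    mutates: set = field(default_factory=set)   # buffers this node writes in place
    deps: set = field(default_factory=set)      # names of nodes that must run first


def add_mutation_deps(nodes):
    """For every mutated buffer, make the mutating node depend on the creator."""
    creator = {buf: node.name for node in nodes for buf in node.creates}
    for node in nodes:
        for buf in node.mutates:
            owner = creator.get(buf)
            if owner is not None and owner != node.name:
                node.deps.add(owner)


def schedule(nodes):
    """Return node names in an order that respects the recorded dependencies."""
    sorter = TopologicalSorter({node.name: node.deps for node in nodes})
    return list(sorter.static_order())


# The example from this PR: op8 mutates buf8, which op6_op7_op16 creates.
nodes = [
    Node("op8", creates={"buf10"}, mutates={"buf8"}),
    Node("op6_op7_op16", creates={"buf8", "buf9"}),
]
add_mutation_deps(nodes)
order = schedule(nodes)
```

With the extra edge in place, any valid topological order puts op6_op7_op16 ahead of op8, which is the property the generated wrapper needs.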

Note that this issue is not seen in PT2.7; it is not clear why. We will hold off on upstreaming this until we observe a similar issue on nightly.

Reproducer code (simplified from Megatron):
https://gist.github.com/jataylo/10bedef08323441c588d2965ad963ae8

Execute with

torchrun --nproc_per_node 1 repro.py

Before PR

[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 466, in __call__
[rank0]:     return self.current_callable(inputs)
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2128, in run
[rank0]:     return model(new_inputs)
[rank0]:   File "/tmp/torchinductor_root/gp/cgpe6weswyihhm442ugdhqxypbr7urxgk3adfr25onncik6tvthr.py", line 423, in call
[rank0]:     triton_poi_fused_index_put_lift_fresh_2.run(buf9, buf8, 256, grid=grid(256), stream=stream0)
[rank0]: UnboundLocalError: local variable 'buf9' referenced before assignment
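The failure mode in the traceback above can be reproduced in plain Python. The sketch below is a hypothetical stand-in for a generated wrapper, not actual Inductor output: `call_misordered` and `kernel` are illustrative names, and the point is only that a local name used before the statement that assigns it raises `UnboundLocalError`.

```python
def call_misordered():
    # Hypothetical stand-in for a generated wrapper whose statements were
    # emitted in the wrong order: the kernel launch precedes the allocation.
    def kernel(a, b):
        return a + b

    result = kernel(buf9, buf8)  # buf9 and buf8 are locals not yet assigned
    buf8 = 1                     # the "allocation" arrives too late
    buf9 = 2
    return result


try:
    call_misordered()
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__
```

Because `buf8` and `buf9` are assigned somewhere in the function body, Python treats them as locals throughout, so the early read fails instead of falling back to a global lookup.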

Note that the simplified repro fails on both CUDA and ROCm and shows a logic issue in PT2.6; more details are in the gist.

rocm-repo-management-api · 2025-06-16T12:55:41Z

Jenkins build for 031bef105e88333bdde283491951000086ed5722 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

jataylo · 2025-06-24T14:29:50Z

@jithunnair-amd

Ensure fused nodes that allocate buffers come before kernels that use those buffers

031bef1

jataylo requested review from jithunnair-amd and pruthvistony June 23, 2025 09:05

jataylo marked this pull request as ready for review June 23, 2025 09:06

jithunnair-amd merged commit 8b22352 into ROCm:release/2.6 Jun 27, 2025
1 of 6 checks passed

jithunnair-amd changed the title ~~[SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute~~ [release/2.6] [SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute Jun 27, 2025
