[release/2.6] [SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute #2276

Merged
merged 1 commit into from
Jun 27, 2025

@jataylo jataylo commented Jun 16, 2025

Ensure fused nodes that allocate buffers come before kernels that use those buffers.

In one example we observed:

  • op8 creates buf10 which mutates buf8
  • triton_poi_fused_index_put_lift_fresh_2 kernel tries to use buf8 and buf9
  • op6_op7_op16 (fused node) creates buf8 and buf9

But the standard topological sort didn't ensure that the fused node creating buf8 and buf9 came before the kernel using them.

With this PR, we identify that op8 performs a mutation on buf8, find the node responsible for creating that buffer (op6_op7_op16), and add an explicit dependency so that op8 depends on op6_op7_op16. The graph is then ordered accordingly.
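
The dependency fix-up described above can be sketched roughly as follows. This is a toy model, not the actual Inductor scheduler code: the `nodes` dictionary, `add_allocation_deps`, and `schedule` are hypothetical names, and the ordering uses Python's standard `graphlib` rather than Inductor's own topological sort.

```python
from graphlib import TopologicalSorter

# Hypothetical toy model of the nodes from the example above.
# Each node records the buffers it creates, the buffers it mutates,
# and the names of the nodes it already depends on.
nodes = {
    "op8": {"creates": {"buf10"}, "mutates": {"buf8"}, "deps": set()},
    "op6_op7_op16": {"creates": {"buf8", "buf9"}, "mutates": set(), "deps": set()},
}


def add_allocation_deps(nodes):
    """Make every node that mutates a buffer depend on the node that creates it."""
    creator = {buf: name for name, node in nodes.items() for buf in node["creates"]}
    for name, node in nodes.items():
        for buf in node["mutates"]:
            owner = creator.get(buf)
            if owner is not None and owner != name:
                node["deps"].add(owner)
    return nodes


def schedule(nodes):
    """Return node names in an order where dependencies come first."""
    graph = {name: node["deps"] for name, node in nodes.items()}
    return list(TopologicalSorter(graph).static_order())


order = schedule(add_allocation_deps(nodes))
print(order)
```

In this sketch, op8 picks up a dependency on op6_op7_op16 because it mutates buf8, so any valid ordering places the allocating node first.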

Note: this issue is not seen in PT2.7, and it is not yet clear why. We will hold off on upstreaming this until we observe a similar issue on nightly.

Reproducer code (simplified from megatron)
https://gist.github.com/jataylo/10bedef08323441c588d2965ad963ae8

Execute with

torchrun --nproc_per_node 1 repro.py

Before PR

[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 466, in __call__
[rank0]:     return self.current_callable(inputs)
[rank0]:   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2128, in run
[rank0]:     return model(new_inputs)
[rank0]:   File "/tmp/torchinductor_root/gp/cgpe6weswyihhm442ugdhqxypbr7urxgk3adfr25onncik6tvthr.py", line 423, in call
[rank0]:     triton_poi_fused_index_put_lift_fresh_2.run(buf9, buf8, 256, grid=grid(256), stream=stream0)
[rank0]: UnboundLocalError: local variable 'buf9' referenced before assignment
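
For context, the error above is what Python raises when a function reads a local variable before the statement that assigns it has run. A minimal sketch of that failure mode, assuming the generated wrapper emits the kernel launch ahead of the buffer allocations (`fake_kernel`, `call_misordered`, and `call_ordered` are hypothetical stand-ins, not the generated code):

```python
def fake_kernel(out_buf, in_buf):
    """Hypothetical stand-in for a kernel launch; it only returns its arguments."""
    return out_buf, in_buf


def call_misordered():
    # The launch is emitted before the buffers are allocated, so buf9 is an
    # unassigned local at this point and Python raises UnboundLocalError.
    result = fake_kernel(buf9, buf8)
    buf8 = [0] * 256
    buf9 = [0] * 256
    return result


def call_ordered():
    # Allocation comes first, so the launch sees both buffers.
    buf8 = [0] * 256
    buf9 = [0] * 256
    return fake_kernel(buf9, buf8)


try:
    call_misordered()
except UnboundLocalError as exc:
    print("misordered wrapper failed:", exc)

print("ordered wrapper returned buffers of length", len(call_ordered()[0]))
```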

Note that the simpler repro fails on both CUDA and ROCm, which points to a logic issue in PT2.6; more details are in the gist.

rocm-repo-management-api bot commented Jun 16, 2025

Jenkins build for 031bef105e88333bdde283491951000086ed5722 commit finished as FAILURE
Links: Blue Ocean view / Build artifacts

@jataylo jataylo marked this pull request as ready for review June 23, 2025 09:06
jataylo commented Jun 24, 2025

@jithunnair-amd

@jithunnair-amd jithunnair-amd merged commit 8b22352 into ROCm:release/2.6 Jun 27, 2025
1 of 6 checks passed
@jithunnair-amd jithunnair-amd changed the title [SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute [release/2.6] [SWDEV-531526] [SWDEV-527340] Allocation of buffers ordered before compute Jun 27, 2025