[Inductor] Improve memory locality by iterating over y dimension before x (pytorch#149339)
# Feature
Fixes pytorch#148718 by reordering the tensor dims to `(z, y, x)`.
As a bonus refactor, block pointer analysis no longer needs the `reorder=True` argument to `self.active_range_trees()`. Since that argument is no longer used anywhere, this PR deletes it rather than updating its logic for the new iteration order.
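To make the change concrete, here is a minimal hand-written Triton sketch of a 2D pointwise kernel, not actual Inductor output; the kernel name and signature are illustrative, but the `YBLOCK`/`XBLOCK` and `yindex`/`xindex` names follow Inductor's conventions. The point is that the unit-stride x dimension varies fastest within a tile, so adjacent lanes touch adjacent addresses:

```python
import triton
import triton.language as tl

@triton.jit
def pointwise_2d(in_ptr, out_ptr, ynumel, xnumel,
                 YBLOCK: tl.constexpr, XBLOCK: tl.constexpr):
    yoffset = tl.program_id(1) * YBLOCK
    xoffset = tl.program_id(0) * XBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[:, None]   # rows of the tile
    xindex = xoffset + tl.arange(0, XBLOCK)[None, :]   # columns of the tile
    mask = (yindex < ynumel) & (xindex < xnumel)
    # For a contiguous (ynumel, xnumel) tensor, x is the unit-stride dimension,
    # so adjacent lanes along x read and write adjacent addresses.
    offsets = yindex * xnumel + xindex
    val = tl.load(in_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, val + 1.0, mask=mask)
```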
# Perf impact
It looks like there's a decent perf bump on A100, with cudagraphs enabled. Granted, perf runs seem to have some noise between commits. ([Workflow run](https://github.com/pytorch/pytorch/actions/runs/13914815576).)
Training (all neutral or positive):

Inference (one positive, one very small negative):

As reported in pytorch#148718, this PR makes consecutive threads access consecutive memory addresses. This should theoretically give the GPU more opportunities to coalesce loads and stores. From Nvidia's [kernel profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html):
> Local memory is private storage for an executing thread and is not visible outside of that thread. It is intended for thread-local data like thread stacks and register spills. Local memory addresses are translated to global virtual addresses by the AGU unit. Local memory has the same latency as global memory. One difference between global and local memory is that local memory is arranged such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (e.g., same index in an array variable, same member in a structure variable, etc.).
I couldn't find any information on how coalescing works for other kinds of memory, but the guide mentions it is also supported for accesses to the L2 cache.
> The L2 Request Coalescer (LRC) processes incoming requests for L2 and tries to coalesce read requests before forwarding them to the L2 cache. It also serves programmatic multicast requests from the SM and supports compression for writes.
The [answer to this Stack Overflow post](https://stackoverflow.com/a/5044424) also explains coalescing in a straightforward way. Inductor's current iteration order corresponds to the first (uncoalesced) example in that answer, while the order after this PR corresponds to the second (coalesced) example.
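To make the difference concrete, here is a small back-of-the-envelope calculation (the shapes and strides are illustrative, not taken from any profiled kernel) of the byte addresses one warp of 32 threads would touch when reading a contiguous float32 tensor under the two orderings:

```python
# Addresses touched by one warp of 32 threads loading from a contiguous
# float32 tensor of shape (ynumel, xnumel) = (128, 128).
ynumel, xnumel, elem_size = 128, 128, 4
threads = range(32)

# Old order: consecutive threads step along y, so each thread sits xnumel
# elements (512 bytes) away from its neighbor -- many separate transactions.
uncoalesced = [t * xnumel * elem_size for t in threads]

# New order: consecutive threads step along x, so the warp covers one
# contiguous 128-byte span that can be served by a few coalesced transactions.
coalesced = [t * elem_size for t in threads]

print(uncoalesced[:4])  # [0, 512, 1024, 1536]
print(coalesced[:4])    # [0, 4, 8, 12]
```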
Besides GPUs, this order of accessing data is highly advantageous for systems relying on DMAs, as those are designed to access contiguous spans of memory. This change improves the performance of an elementwise add kernel on an internal model, using internal hardware, by 1.76x. I will share the details with reviewers who are Meta employees via a private channel.
# Test plan
- Updated expected code on CI tests.
- Added a new test checking the `{x,y,z}index` expressions and block pointers of a 3D pointwise kernel (see the sketch below).
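
For reference, a rough sketch of this kind of check, in the spirit of Inductor's existing test utilities; the inputs, the single-kernel assumption, and the asserted strings are illustrative, not the exact test added by this PR:

```python
import torch
from torch._inductor.utils import run_and_get_code

def pointwise(a, b):
    return a + b

# A transposed operand makes the inputs disagree on which dim has stride 1,
# which typically pushes Inductor toward a tiled (multi-dimensional) kernel.
a = torch.randn(64, 64, device="cuda")
b = torch.randn(64, 64, device="cuda").t()

_, source_codes = run_and_get_code(torch.compile(pointwise), a, b)
generated = "\n".join(source_codes)

# Exact strings depend on Inductor's tiling decisions; the real test inspects
# the emitted {x,y,z}index expressions and block pointers directly.
assert "xindex" in generated and "yindex" in generated
```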
Pull Request resolved: pytorch#149339
Approved by: https://github.com/jansel