[mlir][vector] Tighten the semantics of vector.{load|store} #135151
Conversation
This change refines the verifier for `vector.load` and `vector.store` to disallow the use of vectors with higher rank than the source or destination memref. For example, the following is now rejected:

```mlir
%0 = vector.load %src[%c0] : memref<?xi8>, vector<16x16xi8>
vector.store %vec, %dest[%c0] : memref<?xi8>, vector<16x16xi8>
```

This pattern was previously used in SME end-to-end tests and "happened" to work by implicitly assuming row-major memory layout. However, there is no guarantee that such an assumption will always hold, and we should avoid relying on it unless it can be enforced deterministically. Notably, production ArmSME lowering pipelines do not rely on this behavior. Instead, the expected usage (illustrated here with scalable vector syntax) would be:

```mlir
%0 = vector.load %src[%c0, %c0] : memref<?x?xi8>, vector<[16]x[16]xi8>
```

This PR updates the verifier accordingly and adjusts all affected tests. These tests are either removed (if no longer relevant) or updated to use memrefs with appropriately matching rank.
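For downstream code that previously loaded a 2-D vector straight from a rank-1 buffer, one possible migration (not prescribed by this PR) is to reshape the memref so the ranks match before loading. A minimal sketch, assuming a row-major buffer with a known inner size of 16; `%expanded`, `%d0`, and the reassociation are illustrative, not taken from the patch:

```mlir
// Hypothetical rewrite: give the rank-1 buffer an explicit 2-D shape first.
// %d0 (the number of rows) is assumed to be computed elsewhere.
%expanded = memref.expand_shape %src [[0, 1]] output_shape [%d0, 16]
    : memref<?xi8> into memref<?x16xi8>
// Now the memref rank (2) matches the vector rank (2), so this verifies.
%0 = vector.load %expanded[%c0, %c0] : memref<?x16xi8>, vector<16x16xi8>
```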
@llvm/pr-subscribers-mlir-sme @llvm/pr-subscribers-mlir

Author: Andrzej Warzyński (banach-space)

Changes: see the description above.

Patch is 20.83 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/135151.diff

7 Files Affected:
diff --git a/mlir/lib/Dialect/Vector/IR/VectorOps.cpp b/mlir/lib/Dialect/Vector/IR/VectorOps.cpp
index 98d98f067de14..8b70a6b60a1ec 100644
--- a/mlir/lib/Dialect/Vector/IR/VectorOps.cpp
+++ b/mlir/lib/Dialect/Vector/IR/VectorOps.cpp
@@ -5099,6 +5099,10 @@ LogicalResult vector::LoadOp::verify() {
if (failed(verifyLoadStoreMemRefLayout(*this, resVecTy, memRefTy)))
return failure();
+ if (memRefTy.getRank() < resVecTy.getRank())
+ return emitOpError(
+ "destination memref has lower rank than the result vector");
+
// Checks for vector memrefs.
Type memElemTy = memRefTy.getElementType();
if (auto memVecTy = llvm::dyn_cast<VectorType>(memElemTy)) {
@@ -5131,6 +5135,9 @@ LogicalResult vector::StoreOp::verify() {
if (failed(verifyLoadStoreMemRefLayout(*this, valueVecTy, memRefTy)))
return failure();
+ if (memRefTy.getRank() < valueVecTy.getRank())
+ return emitOpError("source memref has lower rank than the vector to store");
+
// Checks for vector memrefs.
Type memElemTy = memRefTy.getElementType();
if (auto memVecTy = llvm::dyn_cast<VectorType>(memElemTy)) {
diff --git a/mlir/test/Conversion/VectorToArmSME/vector-to-arm-sme.mlir b/mlir/test/Conversion/VectorToArmSME/vector-to-arm-sme.mlir
index 0f973af799634..c8a434bb8f5de 100644
--- a/mlir/test/Conversion/VectorToArmSME/vector-to-arm-sme.mlir
+++ b/mlir/test/Conversion/VectorToArmSME/vector-to-arm-sme.mlir
@@ -718,18 +718,6 @@ func.func @vector_load_i8_with_offset(%arg0 : memref<?x?xi8>) -> vector<[16]x[16
// -----
-// CHECK-LABEL: @vector_load_i8_from_rank_1_memref(
-// CHECK-SAME: %[[MEMREF:.*]]: memref<?xi8>)
-// CHECK: %[[C0:.*]] = arith.constant 0 : index
-// CHECK: arm_sme.tile_load %[[MEMREF]][%[[C0]]] : memref<?xi8>, vector<[16]x[16]xi8>
-func.func @vector_load_i8_from_rank_1_memref(%arg0 : memref<?xi8>) -> vector<[16]x[16]xi8> {
- %c0 = arith.constant 0 : index
- %tile = vector.load %arg0[%c0] : memref<?xi8>, vector<[16]x[16]xi8>
- return %tile : vector<[16]x[16]xi8>
-}
-
-// -----
-
// CHECK-LABEL: @vector_load_i16(
// CHECK: arm_sme.tile_load {{.*}} : memref<?x?xi16>, vector<[8]x[8]xi16>
func.func @vector_load_i16(%arg0 : memref<?x?xi16>) -> vector<[8]x[8]xi16> {
diff --git a/mlir/test/Dialect/MemRef/fold-memref-alias-ops.mlir b/mlir/test/Dialect/MemRef/fold-memref-alias-ops.mlir
index 067cdb5c5fd20..3160fd9c65c04 100644
--- a/mlir/test/Dialect/MemRef/fold-memref-alias-ops.mlir
+++ b/mlir/test/Dialect/MemRef/fold-memref-alias-ops.mlir
@@ -819,18 +819,29 @@ func.func @test_ldmatrix(%arg0: memref<4x32x32xf16, 3>, %arg1: index, %arg2: ind
// -----
-func.func @fold_vector_load_subview(
- %arg0 : memref<12x32xf32>, %arg1 : index, %arg2 : index) -> vector<12x32xf32> {
- %0 = memref.subview %arg0[%arg1, %arg2][1, 1][1, 1] : memref<12x32xf32> to memref<f32, strided<[], offset: ?>>
- %1 = vector.load %0[] : memref<f32, strided<[], offset: ?>>, vector<12x32xf32>
- return %1 : vector<12x32xf32>
+func.func @fold_vector_load_subview(%src : memref<24x64xf32>,
+ %off1 : index,
+ %off2 : index,
+ %dim1 : index,
+ %dim2 : index,
+ %idx : index) -> vector<12x32xf32> {
+
+ %0 = memref.subview %src[%off1, %off2][%dim1, %dim2][1, 1] : memref<24x64xf32> to memref<?x?xf32, strided<[64, 1], offset: ?>>
+ %1 = vector.load %0[%idx, %idx] : memref<?x?xf32, strided<[64, 1], offset: ?>>, vector<12x32xf32>
+ return %1 : vector<12x32xf32>
}
-// CHECK: func @fold_vector_load_subview
-// CHECK-SAME: %[[ARG0:[a-zA-Z0-9_]+]]: memref<12x32xf32>
-// CHECK-SAME: %[[ARG1:[a-zA-Z0-9_]+]]: index
-// CHECK-SAME: %[[ARG2:[a-zA-Z0-9_]+]]: index
-// CHECK: vector.load %[[ARG0]][%[[ARG1]], %[[ARG2]]] : memref<12x32xf32>, vector<12x32xf32>
+// CHECK: #[[$ATTR_46:.+]] = affine_map<()[s0, s1] -> (s0 + s1)>
+// CHECK-LABEL: func.func @fold_vector_load_subview(
+// CHECK-SAME: %[[SRC:[a-zA-Z0-9$._-]*]]: memref<24x64xf32>,
+// CHECK-SAME: %[[OFF_1:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[OFF_2:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[DIM_1:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[DIM_2:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[IDX:[a-zA-Z0-9$._-]*]]: index) -> vector<12x32xf32> {
+// CHECK: %[[VAL_6:.*]] = affine.apply #[[$ATTR_46]](){{\[}}%[[OFF_1]], %[[IDX]]]
+// CHECK: %[[VAL_7:.*]] = affine.apply #[[$ATTR_46]](){{\[}}%[[OFF_2]], %[[IDX]]]
+// CHECK: %[[VAL_8:.*]] = vector.load %[[SRC]]{{\[}}%[[VAL_6]], %[[VAL_7]]] : memref<24x64xf32>, vector<12x32xf32>
// -----
@@ -851,20 +862,32 @@ func.func @fold_vector_maskedload_subview(
// -----
-func.func @fold_vector_store_subview(
- %arg0 : memref<12x32xf32>, %arg1 : index, %arg2 : index, %arg3: vector<2x32xf32>) -> () {
- %0 = memref.subview %arg0[%arg1, %arg2][1, 1][1, 1] : memref<12x32xf32> to memref<f32, strided<[], offset: ?>>
- vector.store %arg3, %0[] : memref<f32, strided<[], offset: ?>>, vector<2x32xf32>
- return
+func.func @fold_vector_store_subview(%src : memref<24x64xf32>,
+ %off1 : index,
+ %off2 : index,
+ %vec: vector<2x32xf32>,
+ %idx : index,
+ %dim1 : index,
+ %dim2 : index) -> () {
+
+ %0 = memref.subview %src[%off1, %off2][%dim1, %dim2][1, 1] : memref<24x64xf32> to memref<?x?xf32, strided<[64, 1], offset: ?>>
+ vector.store %vec, %0[%idx, %idx] : memref<?x?xf32, strided<[64, 1], offset: ?>> , vector<2x32xf32>
+ return
}
-// CHECK: func @fold_vector_store_subview
-// CHECK-SAME: %[[ARG0:[a-zA-Z0-9_]+]]: memref<12x32xf32>
-// CHECK-SAME: %[[ARG1:[a-zA-Z0-9_]+]]: index
-// CHECK-SAME: %[[ARG2:[a-zA-Z0-9_]+]]: index
-// CHECK-SAME: %[[ARG3:[a-zA-Z0-9_]+]]: vector<2x32xf32>
-// CHECK: vector.store %[[ARG3]], %[[ARG0]][%[[ARG1]], %[[ARG2]]] : memref<12x32xf32>, vector<2x32xf32>
-// CHECK: return
+// CHECK: #[[$ATTR_47:.+]] = affine_map<()[s0, s1] -> (s0 + s1)>
+
+// CHECK-LABEL: func.func @fold_vector_store_subview(
+// CHECK-SAME: %[[SRC:[a-zA-Z0-9$._-]*]]: memref<24x64xf32>,
+// CHECK-SAME: %[[OFF1:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[OFF_2:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[VEC:[a-zA-Z0-9$._-]*]]: vector<2x32xf32>,
+// CHECK-SAME: %[[IDX:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[VAL_5:[a-zA-Z0-9$._-]*]]: index,
+// CHECK-SAME: %[[VAL_6:[a-zA-Z0-9$._-]*]]: index) {
+// CHECK: %[[VAL_7:.*]] = affine.apply #[[$ATTR_47]](){{\[}}%[[OFF1]], %[[IDX]]]
+// CHECK: %[[VAL_8:.*]] = affine.apply #[[$ATTR_47]](){{\[}}%[[OFF_2]], %[[IDX]]]
+// CHECK: vector.store %[[VEC]], %[[SRC]]{{\[}}%[[VAL_7]], %[[VAL_8]]] : memref<24x64xf32>, vector<2x32xf32>
// -----
diff --git a/mlir/test/Dialect/Vector/invalid.mlir b/mlir/test/Dialect/Vector/invalid.mlir
index ea6d0021391fb..f7192fbf68b4e 100644
--- a/mlir/test/Dialect/Vector/invalid.mlir
+++ b/mlir/test/Dialect/Vector/invalid.mlir
@@ -1743,13 +1743,11 @@ func.func @invalid_outerproduct(%src : memref<?xf32>) {
// -----
-func.func @invalid_outerproduct1(%src : memref<?xf32>) {
+func.func @invalid_outerproduct1(%src : memref<?xf32>, %lhs : vector<[4]x[4]xf32>, %rhs : vector<[4]xf32>) {
%idx = arith.constant 0 : index
- %0 = vector.load %src[%idx] : memref<?xf32>, vector<[4]x[4]xf32>
- %1 = vector.load %src[%idx] : memref<?xf32>, vector<[4]xf32>
// expected-error @+1 {{'vector.outerproduct' op expected 1-d vector for operand #1}}
- %op = vector.outerproduct %0, %1 : vector<[4]x[4]xf32>, vector<[4]xf32>
+ %op = vector.outerproduct %lhs, %rhs : vector<[4]x[4]xf32>, vector<[4]xf32>
}
// -----
@@ -1870,3 +1868,29 @@ func.func @flat_transpose_scalable(%arg0: vector<[16]xf32>) -> vector<[16]xf32>
: vector<[16]xf32> -> vector<[16]xf32>
return %0 : vector<[16]xf32>
}
+
+// -----
+
+//===----------------------------------------------------------------------===//
+// vector.load
+//===----------------------------------------------------------------------===//
+
+func.func @vector_load(%src : memref<?xi8>) {
+ %c0 = arith.constant 0 : index
+ // expected-error @+1 {{'vector.load' op destination memref has lower rank than the result vector}}
+ %0 = vector.load %src[%c0] : memref<?xi8>, vector<16x16xi8>
+ return
+}
+
+// -----
+
+//===----------------------------------------------------------------------===//
+// vector.store
+//===----------------------------------------------------------------------===//
+
+func.func @vector_store(%dest : memref<?xi8>, %vec : vector<16x16xi8>) {
+ %c0 = arith.constant 0 : index
+ // expected-error @+1 {{'vector.store' op source memref has lower rank than the vector to store}}
+ vector.store %vec, %dest[%c0] : memref<?xi8>, vector<16x16xi8>
+ return
+}
diff --git a/mlir/test/Dialect/Vector/vector-transfer-to-vector-load-store.mlir b/mlir/test/Dialect/Vector/vector-transfer-to-vector-load-store.mlir
index fd50acf03e79b..511ab70f35086 100644
--- a/mlir/test/Dialect/Vector/vector-transfer-to-vector-load-store.mlir
+++ b/mlir/test/Dialect/Vector/vector-transfer-to-vector-load-store.mlir
@@ -2,8 +2,8 @@
// CHECK-LABEL: func @vector_transfer_ops_0d_memref(
// CHECK-SAME: %[[MEM:.*]]: memref<f32>
-// CHECK-SAME: %[[VEC:.*]]: vector<1x1x1xf32>
-func.func @vector_transfer_ops_0d_memref(%mem: memref<f32>, %vec: vector<1x1x1xf32>) {
+// CHECK-SAME: %[[VEC:.*]]: vector<f32>
+func.func @vector_transfer_ops_0d_memref(%mem: memref<f32>, %vec: vector<f32>) {
%f0 = arith.constant 0.0 : f32
// CHECK-NEXT: %[[S:.*]] = vector.load %[[MEM]][] : memref<f32>, vector<f32>
@@ -12,8 +12,8 @@ func.func @vector_transfer_ops_0d_memref(%mem: memref<f32>, %vec: vector<1x1x1xf
// CHECK-NEXT: vector.store %[[S]], %[[MEM]][] : memref<f32>, vector<f32>
vector.transfer_write %0, %mem[] : vector<f32>, memref<f32>
-// CHECK-NEXT: vector.store %[[VEC]], %[[MEM]][] : memref<f32>, vector<1x1x1xf32>
- vector.store %vec, %mem[] : memref<f32>, vector<1x1x1xf32>
+// CHECK-NEXT: vector.store %[[VEC]], %[[MEM]][] : memref<f32>, vector<f32>
+ vector.store %vec, %mem[] : memref<f32>, vector<f32>
return
}
diff --git a/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/transpose.mlir b/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/transpose.mlir
index ff20f99b63cd1..b44658eef4e11 100644
--- a/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/transpose.mlir
+++ b/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/transpose.mlir
@@ -17,7 +17,7 @@ func.func @entry() {
%za_s_size = arith.muli %svl_s, %svl_s : index
// Allocate memory.
- %mem1 = memref.alloca(%za_s_size) : memref<?xi32>
+ %mem1 = memref.alloca(%za_s_size, %svl_s) : memref<?x?xi32>
// Fill each "row" of "mem1" with row number.
//
@@ -29,15 +29,15 @@ func.func @entry() {
// 3, 3, 3, 3
//
%init_0 = arith.constant 0 : i32
- scf.for %i = %c0 to %za_s_size step %svl_s iter_args(%val = %init_0) -> (i32) {
+ scf.for %i = %c0 to %svl_s step %c1 iter_args(%val = %init_0) -> (i32) {
%splat_val = vector.broadcast %val : i32 to vector<[4]xi32>
- vector.store %splat_val, %mem1[%i] : memref<?xi32>, vector<[4]xi32>
+ vector.store %splat_val, %mem1[%i, %c0] : memref<?x?xi32>, vector<[4]xi32>
%val_next = arith.addi %val, %c1_i32 : i32
scf.yield %val_next : i32
}
// Load tile from "mem1".
- %tile = vector.load %mem1[%c0] : memref<?xi32>, vector<[4]x[4]xi32>
+ %tile = vector.load %mem1[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
// Transpose tile.
%transposed_tile = vector.transpose %tile, [1, 0] : vector<[4]x[4]xi32> to vector<[4]x[4]xi32>
diff --git a/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir b/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
index 6e25bee65f095..09d68661c6e9d 100644
--- a/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
+++ b/mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
@@ -34,11 +34,11 @@ func.func @za0_d_f64() -> i32 {
// 3.1, 3.1, 3.1, 3.1
//
%tilesize = arith.muli %svl_d, %svl_d : index
- %mem1 = memref.alloca(%tilesize) : memref<?xf64>
+ %mem1 = memref.alloca(%svl_d, %svl_d) : memref<?x?xf64>
%init_0 = arith.constant 0.1 : f64
- scf.for %i = %c0 to %tilesize step %svl_d iter_args(%val = %init_0) -> (f64) {
+ scf.for %i = %c0 to %svl_d step %c1_index iter_args(%val = %init_0) -> (f64) {
%splat_val = vector.broadcast %val : f64 to vector<[2]xf64>
- vector.store %splat_val, %mem1[%i] : memref<?xf64>, vector<[2]xf64>
+ vector.store %splat_val, %mem1[%i, %c0] : memref<?x?xf64>, vector<[2]xf64>
%val_next = arith.addf %val, %c1_f64 : f64
scf.yield %val_next : f64
}
@@ -48,27 +48,29 @@ func.func @za0_d_f64() -> i32 {
//
// CHECK-ZA0_D: ( 0.1, 0.1
// CHECK-ZA0_D-NEXT: ( 1.1, 1.1
- scf.for %i = %c0 to %tilesize step %svl_d {
- %tileslice = vector.load %mem1[%i] : memref<?xf64>, vector<[2]xf64>
+ scf.for %i = %c0 to %svl_d step %c1_index {
+ %tileslice = vector.load %mem1[%i, %c0] : memref<?x?xf64>, vector<[2]xf64>
vector.print %tileslice : vector<[2]xf64>
}
// Load ZA0.D from "mem1"
- %za0_d = vector.load %mem1[%c0] : memref<?xf64>, vector<[2]x[2]xf64>
+ %za0_d = vector.load %mem1[%c0, %c0] : memref<?x?xf64>, vector<[2]x[2]xf64>
// Allocate "mem2" to store ZA0.D to
- %mem2 = memref.alloca(%tilesize) : memref<?xf64>
+ %mem2 = memref.alloca(%svl_d, %svl_d) : memref<?x?xf64>
// Zero "mem2"
- scf.for %i = %c0 to %tilesize step %c1_index {
- memref.store %c0_f64, %mem2[%i] : memref<?xf64>
+ scf.for %i = %c0 to %svl_d step %c1_index {
+ scf.for %j = %c0 to %svl_d step %c1_index {
+ memref.store %c0_f64, %mem2[%i, %j] : memref<?x?xf64>
+ }
}
// Verify "mem2" is zeroed by doing an add reduction with initial value of
// zero
%init_0_f64 = arith.constant 0.0 : f64
- %add_reduce = scf.for %vnum = %c0 to %tilesize step %svl_d iter_args(%iter = %init_0_f64) -> (f64) {
- %row = vector.load %mem2[%vnum] : memref<?xf64>, vector<[2]xf64>
+ %add_reduce = scf.for %vnum = %c0 to %svl_d step %c1_index iter_args(%iter = %init_0_f64) -> (f64) {
+ %row = vector.load %mem2[%vnum, %c0] : memref<?x?xf64>, vector<[2]xf64>
%inner_add_reduce = scf.for %offset = %c0 to %svl_d step %c1_index iter_args(%inner_iter = %init_0_f64) -> (f64) {
%t = vector.extractelement %row[%offset : index] : vector<[2]xf64>
@@ -88,16 +90,16 @@ func.func @za0_d_f64() -> i32 {
//
// CHECK-ZA0_D-NEXT: ( 0, 0
// CHECK-ZA0_D-NEXT: ( 0, 0
- scf.for %i = %c0 to %tilesize step %svl_d {
- %tileslice = vector.load %mem2[%i] : memref<?xf64>, vector<[2]xf64>
+ scf.for %i = %c0 to %svl_d step %c1_index{
+ %tileslice = vector.load %mem2[%i, %c0] : memref<?x?xf64>, vector<[2]xf64>
vector.print %tileslice : vector<[2]xf64>
}
// Verify "mem1" != "mem2"
%init_1 = arith.constant 1 : i64
- %mul_reduce_0 = scf.for %vnum = %c0 to %tilesize step %svl_d iter_args(%iter = %init_1) -> (i64) {
- %row_1 = vector.load %mem1[%vnum] : memref<?xf64>, vector<[2]xf64>
- %row_2 = vector.load %mem2[%vnum] : memref<?xf64>, vector<[2]xf64>
+ %mul_reduce_0 = scf.for %vnum = %c0 to %svl_d step %c1_index iter_args(%iter = %init_1) -> (i64) {
+ %row_1 = vector.load %mem1[%vnum, %c0] : memref<?x?xf64>, vector<[2]xf64>
+ %row_2 = vector.load %mem2[%vnum, %c0] : memref<?x?xf64>, vector<[2]xf64>
%cmp = arith.cmpf one, %row_1, %row_2 : vector<[2]xf64>
%inner_mul_reduce = scf.for %i = %c0 to %svl_d step %c1_index iter_args(%inner_iter = %init_1) -> (i64) {
@@ -115,12 +117,12 @@ func.func @za0_d_f64() -> i32 {
vector.print %mul_reduce_0 : i64
// Store ZA0.D to "mem2"
- vector.store %za0_d, %mem2[%c0] : memref<?xf64>, vector<[2]x[2]xf64>
+ vector.store %za0_d, %mem2[%c0, %c0] : memref<?x?xf64>, vector<[2]x[2]xf64>
// Verify "mem1" == "mem2"
- %mul_reduce_1 = scf.for %vnum = %c0 to %tilesize step %svl_d iter_args(%iter = %init_1) -> (i64) {
- %row_1 = vector.load %mem1[%vnum] : memref<?xf64>, vector<[2]xf64>
- %row_2 = vector.load %mem2[%vnum] : memref<?xf64>, vector<[2]xf64>
+ %mul_reduce_1 = scf.for %vnum = %c0 to %svl_d step %c1_index iter_args(%iter = %init_1) -> (i64) {
+ %row_1 = vector.load %mem1[%vnum, %c0] : memref<?x?xf64>, vector<[2]xf64>
+ %row_2 = vector.load %mem2[%vnum, %c0] : memref<?x?xf64>, vector<[2]xf64>
%cmp = arith.cmpf oeq, %row_1, %row_2 : vector<[2]xf64>
%inner_mul_reduce = scf.for %i = %c0 to %svl_d step %c1_index iter_args(%inner_iter = %init_1) -> (i64) {
@@ -142,8 +144,8 @@ func.func @za0_d_f64() -> i32 {
//
// CHECK-ZA0_D-NEXT: ( 0.1, 0.1
// CHECK-ZA0_D-NEXT: ( 1.1, 1.1
- scf.for %i = %c0 to %tilesize step %svl_d {
- %tileslice = vector.load %mem2[%i] : memref<?xf64>, vector<[2]xf64>
+ scf.for %i = %c0 to %svl_d step %c1_index{
+ %tileslice = vector.load %mem2[%i, %c0] : memref<?x?xf64>, vector<[2]xf64>
vector.print %tileslice : vector<[2]xf64>
}
@@ -170,8 +172,9 @@ func.func @load_store_two_za_s_tiles() -> i32 {
// Allocate memory for two 32-bit element tiles.
%size_of_tile = arith.muli %svl_s, %svl_s : index
+ %svl_s_x_2 = arith.muli %svl_s, %c2_index : index
%size_of_two_tiles = arith.muli %size_of_tile, %c2_index : index
- %mem1 = memref.alloca(%size_of_two_tiles) : memref<?xi32>
+ %mem1 = memref.alloca(%svl_s_x_2, %svl_s) : memref<?x?xi32>
// Fill memory that tile 1 will be loaded from with '1' and '2' for tile 2.
//
@@ -191,15 +194,15 @@ func.func @load_store_two_za_s_tiles() -> i32 {
// 2, 2, 2, 2
// 2, 2, 2, 2
//
- scf.for %i = %c0 to %size_of_two_tiles step %svl_s {
- %isFirstTile = arith.cmpi ult, %i, %size_of_tile : index
+ scf.for %i = %c0 to %svl_s_x_2 step %c1_index {
+ %isFirstTile = arith.cmpi ult, %i, %svl_s : index
%val = scf.if %isFirstTile -> i32 {
scf.yield %c1_i32 : i32
} else {
scf.yield %c2_i32 : i32
}
%splat_val = vector.broadcast %val : i32 to vector<[4]xi32>
- vector.store %splat_val, %mem1[%i] : memref<?xi32>, vector<[4]xi32>
+ vector.store %splat_val, %mem1[%i, %c0] : memref<?x?xi32>, vector<[4]xi32>
}
// Dump "mem1". The smallest SVL is 128-bits so each tile will be at least
@@ -213,32 +216,32 @@ func.func @load_store_two_za_s_tiles() -> i32 {
// CHECK-NEXT: ( 2, 2, 2, 2
// CHECK-NEXT: ( 2, 2, 2, 2
// CHECK-NEXT: ( 2, 2, 2, 2
- scf.for %i = %c0 to %size_of_two_tiles step %svl_s {
- %tileslice = vector.load %mem1[%i] : memref<?xi32>, vector<[4]xi32>
+ scf.for %i = %c0 to %svl_s_x_2 step %c1_index {
+ %tileslice = vector.load %mem1[%i, %c0] : memref<?x?xi32>, vector<[4]xi32>
vector.print %tileslice : vector<[4]xi32>
}
// Load tile 1 from memory
- %za0_s = vector.load %mem1[%c0] : memref<?xi32>, vector<[4]x[4]xi32>
+ %za0_s = vector.load %mem1[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
// Load tile 2 from memory
- %za1_s = vector.load %mem1[%size_of_tile] : memref<?xi32>, vector<[4]x[4]xi32>
+ %za1_s = vector.load %mem1[%svl_s, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
// Allocate new memory to store tiles to
- %mem2 = memref.alloca(%size_of_two_tiles) : memref<?xi32>
+ %mem2 = memref.alloca(%svl_s_x_2, %svl_s) : memref<?x?xi32>
// Zero new memory
- scf.for %i = %c0 to %size_of_two_tiles step %c1_index {
- memref.store %c0_i32, %mem2[%i] : memref<?xi32>
+ scf.for %i = %c0 to %svl_s_x_2 step %c1_index {
+ memref.store %c0_i32, %mem2[%i, %c0] : memref<?x?xi32>
}
// Stores tiles back to (new) memory in reverse order
// Store tile 2 to memory
- vector.store %za1_s, %mem2[%c0] : memref<?xi32>, vector<[4]x[4]xi32>
+ vector.store %za1_s, %mem2[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
// Store tile 1 to memory
- vector.store %za0_s, %mem2[%size_of_tile] : memref<?xi32>, vector<[4]x[4]xi32>
+ vector.store %za0_s, %mem2[%svl_s, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
// Dump "mem2" and check the tiles were stored in reverse order. The smallest
// SV...
[truncated]
// CHECK-LABEL: @vector_load_i8_from_rank_1_memref(
// CHECK-SAME: %[[MEMREF:.*]]: memref<?xi8>)
// CHECK: %[[C0:.*]] = arith.constant 0 : index
// CHECK: arm_sme.tile_load %[[MEMREF]][%[[C0]]] : memref<?xi8>, vector<[16]x[16]xi8> |
You could follow up to apply the same changes to ArmSME. There's some unused code in ArmSMEToSCF.cpp for dealing with rank-1 memrefs (e.g. in getMemrefIndices), and it's somewhat broken: #118769.
Thanks
LGTM, just spotted a few little things:
mlir/test/Integration/Dialect/Vector/CPU/ArmSME/vector-load-store.mlir
Address comments from Ben
This patch updates the following ArmSME ops to require that input and output element types match:

* `arm_sme.tile_load`, `arm_sme.tile_store`, `arm_sme.tile_load_slice`, `arm_sme.tile_store_slice`.

In addition, it ensures that the base memref operand for `tile_load` and `tile_store` is always rank-2, aligning with the semantics of Arm SME tiles (always rank-2). This change is effectively a follow-up to #135151:

* "[mlir][vector] Tighten the semantics of vector.{load|store}"

The patch also updates `createLoadStoreForOverTileSlices` in ArmSMEToSCF.cpp to fail when processing invalid tile stores like the following:

```mlir
arm_sme.tile_store %arg0, %arg1[%c0] : memref<?x4xi8>, vector<[4]x[4]xi32>
```

This particular change fixes #118769. As noted in the TODO, we should further extend op verification logic; I plan to address that in a follow-up patch.
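For contrast, a form that the tightened ArmSME verifier accepts pairs a rank-2 base memref with matching element types. A minimal sketch; the operand names `%tile` and `%dest` are illustrative, not taken from the patch:

```mlir
// Accepted: rank-2 base memref, i32 elements on both the memref and the tile.
arm_sme.tile_store %tile, %dest[%c0, %c0] : memref<?x?xi32>, vector<[4]x[4]xi32>
```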