
[MLIR] Add continuous tiling to transform dialect #82792


Merged (5 commits) on Jun 21, 2024

Conversation

@muneebkhan85 (Contributor) commented on Feb 23, 2024

This patch enables continuous tiling of a target structured op using diminishing tile sizes. When a tensor dimension is not exactly divisible by the tile size, a leftover chunk remains that cannot be tiled regularly. Continuous tiling tiles this leftover chunk with a smaller tile size and repeats the process recursively with exponentially diminishing tile sizes, eventually generating a chain of loops, each of which tiles with a smaller size than the previous one.

Adds the continuous_tile_sizes op to the transform dialect. Given a tile size and a dimension, this op computes a series of diminishing tile sizes that can be used to tile the target along the given dimension, together with a series of chunk sizes that the corresponding tile sizes should be applied to along that dimension.
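For example (as worked out later in this thread), a target size of 9 yields the tile-size series 9, 8, 4, 2, 1, that is, the target size followed by successively smaller powers of two, with the chunk sizes indicating how much of the dimension each of those tile sizes should cover.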

Adds a multiway attribute to transform.structured.split that enables multiway splitting of a single target op along the given dimension, at the chunk sizes enumerated in a list.
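As a rough sketch of how the two additions compose (condensed from the examples worked out later in this thread; result names and exact syntax are approximations, and some types are elided):

%tile_sizes, %chunk_sizes = transform.structured.continuous_tile_sizes %linalg { dimension = 0, target_size = 9 } : (!transform.any_op) -> (!transform.param<i64>, !transform.param<i64>)
%splits, %empty = transform.structured.split %linalg after %chunk_sizes { dimension = 0, multiway }
transform.foreach %splits, %tile_sizes {
^bb1(%split: !transform.any_op, %tile_size: !transform.param<i64>):
  %tiled, %loop = transform.structured.tile_using_for %split [%tile_size]
  transform.yield
}

The zipped transform.foreach used here relies on a generalization proposed further down in this thread and pursued as a separate PR.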


Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be
notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write
permissions for the repository. In which case you can instead tag reviewers by
name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review
by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate
is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot (Member) commented on Feb 23, 2024

@llvm/pr-subscribers-mlir-scf
@llvm/pr-subscribers-mlir-linalg

@llvm/pr-subscribers-mlir

Author: None (muneebkhan85)

Changes

This patch adds continuous tiling options to TileUsingForOp. In cases where the tensor dimensions are not exactly divisible by the tile size, we are left with leftover tensor chunks that are irregularly tiled. This approach attempts to tile the leftover chunk with a smaller tile size and repeats this process recursively using exponentially diminishing tile sizes. The transform eventually generates a chain of loops that apply tiling using diminishing tile sizes. The transform lowers from the linalg dialect to scf dialect.


Patch is 62.12 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/82792.diff

6 Files Affected:

  • (modified) mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td (+8-1)
  • (modified) mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h (+14)
  • (modified) mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp (+32-8)
  • (modified) mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp (+457)
  • (modified) mlir/python/mlir/dialects/transform/structured.py (+4)
  • (added) mlir/test/Dialect/Linalg/continuous-tiling.mlir (+390)
diff --git a/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td b/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
index 309573a562872f..4976c8e82db4df 100644
--- a/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
+++ b/mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
@@ -1833,7 +1833,9 @@ def TileUsingForOp : Op<Transform_Dialect, "structured.tile_using_for",
     be as many handles as `ShapedType::kDynamic` values in the
     `static_sizes` attribute. A static size of `0` indicates that the dimension
     should not be tiled. No loop will be generated for such dimensions. If all
-    tile sizes are `0`, this transform is effectively a no-op.
+    tile sizes are `0`, this transform is effectively a no-op. To apply
+    continuous tiling `continuous_tiles` needs to be supplied with as many
+    boolean values as there are nested loops.
 
     This op returns handles to the tiled op (in the generated loop nest) and the
     generated loops. The number of loops is the number of tile sizes that are
@@ -1859,6 +1861,7 @@ def TileUsingForOp : Op<Transform_Dialect, "structured.tile_using_for",
   let arguments = (ins TransformHandleTypeInterface:$target,
                    Variadic<TransformAnyParamTypeOrAnyHandle>:$dynamic_sizes,
                    DefaultValuedOptionalAttr<DenseI64ArrayAttr, "{}">:$static_sizes,
+                   DefaultValuedOptionalAttr<DenseBoolArrayAttr, "{}">:$continuous_tiles,
                    DefaultValuedOptionalAttr<DenseI64ArrayAttr, "{}">:$interchange,
                    DefaultValuedOptionalAttr<DenseBoolArrayAttr, "{}">:$scalable_sizes);
   let results = (outs TransformHandleTypeInterface:$tiled_linalg_op,
@@ -1867,22 +1870,26 @@ def TileUsingForOp : Op<Transform_Dialect, "structured.tile_using_for",
     OpBuilder<(ins "TypeRange":$loopTypes,
                    "Value":$target,
                    "ArrayRef<int64_t>":$staticTileSizes,
+                   CArg<"ArrayRef<bool>", "{}">:$continuousTiles,
                    CArg<"ArrayRef<int64_t>", "{}">:$interchange,
                    CArg<"std::optional<ArrayRef<bool>>", "std::nullopt">:
                       $scalableSizes)>,
     OpBuilder<(ins "TypeRange":$loopTypes,
                    "Value":$target,
                    "ArrayRef<OpFoldResult>":$mixedTileSizes,
+                   CArg<"ArrayRef<bool>", "{}">:$continuousTiles,
                    CArg<"ArrayRef<int64_t>", "{}">:$interchange,
                    CArg<"std::optional<ArrayRef<bool>>", "std::nullopt">:
                       $scalableSizes)>,
     OpBuilder<(ins "Value":$target,
                    "ArrayRef<int64_t>":$staticTileSizes,
+                   CArg<"ArrayRef<bool>", "{}">:$continuousTiles,
                    CArg<"ArrayRef<int64_t>", "{}">:$interchange,
                    CArg<"std::optional<ArrayRef<bool>>", "std::nullopt">:
                       $scalableSizes)>,
     OpBuilder<(ins "Value":$target,
                    "ArrayRef<OpFoldResult>":$mixedTileSizes,
+                   CArg<"ArrayRef<bool>", "{}">:$continuousTiles,
                    CArg<"ArrayRef<int64_t>", "{}">:$interchange,
                    CArg<"std::optional<ArrayRef<bool>>", "std::nullopt">:
                       $scalableSizes)>,
diff --git a/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h b/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
index 965ef9e203be28..b40291b5a80da5 100644
--- a/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
+++ b/mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
@@ -71,6 +71,14 @@ struct SCFTilingOptions {
         mapping, [](auto attr) -> Attribute { return attr; });
     return *this;
   }
+
+  /// Specify which loops in the loop nest are to be continuously tiled.
+  SmallVector<bool> continuousTileMappingVector = {};
+  SCFTilingOptions &setCTileMapping(ArrayRef<bool> ctile) {
+    continuousTileMappingVector =
+        llvm::map_to_vector(ctile, [](auto attr) -> bool { return attr; });
+    return *this;
+  }
 };
 
 /// Transformation information returned after tiling.
@@ -92,6 +100,12 @@ FailureOr<SCFTilingResult> tileUsingSCF(RewriterBase &rewriter,
                                         TilingInterface op,
                                         const SCFTilingOptions &options);
 
+/// Method to continuously tile an op that implements the `TilingInterface`
+/// using `scf.for` for iterating over the tiles.
+FailureOr<SCFTilingResult>
+continuousTileUsingSCF(RewriterBase &rewriter, TilingInterface op,
+                            const SCFTilingOptions &options);
+
 /// Options used to control tile + fuse.
 struct SCFTileAndFuseOptions {
   /// The tiling options used to control the tiling of the consumer.
diff --git a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
index 299965bcfc3ab3..75080ea7721beb 100644
--- a/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
+++ b/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp
@@ -2476,39 +2476,41 @@ DiagnosedSilenceableFailure transform::TileReductionUsingForallOp::applyToOne(
 void transform::TileUsingForOp::build(
     OpBuilder &builder, OperationState &result, TypeRange loopTypes,
     Value target, ArrayRef<int64_t> staticTileSizes,
-    ArrayRef<int64_t> interchange,
+    ArrayRef<bool> continuousTiles, ArrayRef<int64_t> interchange,
     std::optional<ArrayRef<bool>> scalableSizes) {
   return build(builder, result, loopTypes,
                /*target=*/target,
                /*mixedTileSizes=*/
                getAsOpFoldResult(builder.getI64ArrayAttr(staticTileSizes)),
-               interchange, scalableSizes);
+               continuousTiles, interchange, scalableSizes);
 }
 
 void transform::TileUsingForOp::build(
     OpBuilder &builder, OperationState &result, Value target,
-    ArrayRef<int64_t> staticTileSizes, ArrayRef<int64_t> interchange,
+    ArrayRef<int64_t> staticTileSizes, ArrayRef<bool> continuousTiles,
+    ArrayRef<int64_t> interchange,
     std::optional<ArrayRef<bool>> scalableSizes) {
   build(builder, result, target,
         getAsOpFoldResult(builder.getI64ArrayAttr(staticTileSizes)),
-        interchange, scalableSizes);
+        builder.getDenseBoolArrayAttr(continuousTiles), interchange, scalableSizes);
 }
 
 void transform::TileUsingForOp::build(
     OpBuilder &builder, OperationState &result, Value target,
-    ArrayRef<OpFoldResult> mixedTileSizes, ArrayRef<int64_t> interchange,
+    ArrayRef<OpFoldResult> mixedTileSizes, ArrayRef<bool> continuousTiles,
+    ArrayRef<int64_t> interchange,
     std::optional<ArrayRef<bool>> scalableSizes) {
   // Loop types are automaticaly splat by the callee, setting up one is
   // enough.
   SmallVector<Type> loopTypes(1, builder.getType<transform::AnyOpType>());
-  build(builder, result, loopTypes, target, mixedTileSizes, interchange,
+  build(builder, result, loopTypes, target, mixedTileSizes, continuousTiles, interchange,
         scalableSizes);
 }
 
 void transform::TileUsingForOp::build(
     OpBuilder &builder, OperationState &result, TypeRange loopTypes,
     Value target, ArrayRef<OpFoldResult> mixedTileSizes,
-    ArrayRef<int64_t> interchange,
+    ArrayRef<bool> continuousTiles, ArrayRef<int64_t> interchange,
     std::optional<ArrayRef<bool>> scalableSizes) {
   SmallVector<int64_t> staticTileSizes;
   SmallVector<Value> dynamicTileSizes;
@@ -2517,6 +2519,7 @@ void transform::TileUsingForOp::build(
   // attributes for multiple variadic operands. In the absence of this,
   // horrible bugs ensue.
   auto staticTileSizesAttr = builder.getDenseI64ArrayAttr(staticTileSizes);
+  auto continuousTilesAttr = builder.getDenseBoolArrayAttr(continuousTiles);
   unsigned numExpectedLoops =
       staticTileSizes.size() - llvm::count(staticTileSizes, 0);
   SmallVector<Type> resultTypes;
@@ -2535,6 +2538,7 @@ void transform::TileUsingForOp::build(
         /*target=*/target,
         /*dynamic_sizes=*/dynamicTileSizes,
         /*static_sizes=*/staticTileSizesAttr,
+        /*continuous_tiles=*/continuousTilesAttr,
         /*interchange=*/builder.getDenseI64ArrayAttr(interchange),
         /*scalable_sizes=*/expandedScalableSizes);
 }
@@ -2675,8 +2679,15 @@ transform::TileUsingForOp::apply(transform::TransformRewriter &rewriter,
     }
 
     tilingOptions.setInterchange(getInterchange());
-    FailureOr<scf::SCFTilingResult> maybeTilingResult =
+    tilingOptions.setCTileMapping(getContinuousTiles());
+
+    FailureOr<scf::SCFTilingResult> maybeTilingResult;
+    if (tilingOptions.continuousTileMappingVector.empty())
+      maybeTilingResult =
         tileUsingSCF(rewriter, tilingInterface, tilingOptions);
+    else
+      maybeTilingResult =
+          continuousTileUsingSCF(rewriter, tilingInterface, tilingOptions);
     if (failed(maybeTilingResult))
       return DiagnosedSilenceableFailure::definiteFailure();
 
@@ -2725,6 +2736,18 @@ ParseResult parseOptionalInterchange(OpAsmParser &parser,
   return success();
 }
 
+ParseResult parseOptionalContinuousTiles(OpAsmParser &parser,
+                                         OperationState &result) {
+  if (failed(parser.parseOptionalKeyword("continuous_tiles")))
+    return success();
+  if (failed(parser.parseEqual()))
+    return failure();
+  result.addAttribute(
+      transform::TileUsingForOp::getContinuousTilesAttrName(result.name),
+      DenseBoolArrayAttr::parse(parser, Type{}));
+  return success();
+}
+
 void printOptionalInterchange(OpAsmPrinter &p,
                               ArrayRef<int64_t> interchangeVals) {
   if (!interchangeVals.empty()) {
@@ -2747,6 +2770,7 @@ ParseResult transform::TileUsingForOp::parse(OpAsmParser &parser,
   if (parser.parseOperand(target) || parser.getCurrentLocation(&operandLoc) ||
       parseDynamicIndexList(parser, dynamicSizes, staticSizes, scalableVals) ||
       parseOptionalInterchange(parser, result) ||
+      parseOptionalContinuousTiles(parser, result) ||
       parser.parseOptionalAttrDict(result.attributes) ||
       parser.parseColonType(functionalType))
     return ParseResult::failure();
diff --git a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
index 1a84a59ddb69df..f81a901c82db99 100644
--- a/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
+++ b/mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
@@ -31,6 +31,8 @@
 
 using namespace mlir;
 
+static constexpr char kLoopIndexLabel[] = "__loop_index__";
+
 scf::SCFTilingOptions &
 scf::SCFTilingOptions::setTileSizes(ArrayRef<OpFoldResult> ts) {
   assert(!tileSizeComputationFunction && "tile sizes already set");
@@ -309,6 +311,189 @@ static LogicalResult generateLoopNest(RewriterBase &rewriter, Location loc,
   return rewriter.notifyMatchFailure(loc, "unhandled loop type");
 }
 
+static void continuousLoopNestHelper(
+    OpBuilder &builder, Location loc, ArrayRef<Range> loopRanges,
+    SmallVector<LoopLikeOpInterface> &loops, uint64_t loopLevelIdx, uint64_t &loopIdx,
+    ArrayRef<OpFoldResult> tileSizes, SmallVector<bool> &CTileVector,
+    std::map<int, OpFoldResult> &sizesMap,
+    SmallVector<scf::ForOp> &innermostLoops, ValueRange destinationTensors = {},
+    bool isHeadOrInsideHeadLoop = false) {
+
+  Value offset = getValueOrCreateConstantIndexOp(
+      builder, loc, loopRanges[loopLevelIdx].offset);
+  Value size = getValueOrCreateConstantIndexOp(builder, loc,
+                                               loopRanges[loopLevelIdx].size);
+  Value tileSize =
+      getValueOrCreateConstantIndexOp(builder, loc, tileSizes[loopLevelIdx]);
+
+  AffineExpr sym0, sym1, sym2;
+  bindSymbols(builder.getContext(), sym0, sym1, sym2);
+  AffineMap defaultSplitMap =
+      AffineMap::get(0, 3, {sym1 - ((sym1 - sym0) % sym2)});
+  // Simplified map for use when step is power of 2 and lower bound
+  // is exactly divisble by step.
+  AffineMap powerSplitMap = AffineMap::get(0, 3, {sym1 - (sym1 % sym2)});
+
+  uint64_t tileSizeInt = *getConstantIntValue(tileSize);
+
+  // Enforce no tiling when tile size is zero.
+  // No need to create a loop here.
+  if (tileSizeInt == 0) {
+    continuousLoopNestHelper(builder, loc, loopRanges, loops, loopLevelIdx + 1,
+                             loopIdx, tileSizes, CTileVector, sizesMap,
+                             innermostLoops, destinationTensors,
+                             isHeadOrInsideHeadLoop);
+    return;
+  }
+
+  // The head loop is always tiled using the tile size specified
+  // in the size parameters to tile_using_for transform.
+  auto loop = builder.create<scf::ForOp>(
+      loc, offset, size, tileSize, destinationTensors,
+      [&](OpBuilder &bodyBuilder, Location bodyLoc, Value iv,
+          ValueRange /*iterArgs*/) {
+        sizesMap[loopIdx] =
+            getBoundedTileSize(bodyBuilder, bodyLoc, loopRanges[loopLevelIdx],
+                               iv, getAsOpFoldResult(tileSize));
+      });
+
+  loop->setAttr(kLoopIndexLabel, builder.getIndexAttr(loopIdx));
+  ++loopIdx;
+
+  scf::ForOp currentLoop = loop;
+  auto lbInt = getConstantIntValue(currentLoop.getLowerBound());
+  // Use simplified powerSplitMap instead of the default when possible.
+  bool usePowerSplit = (lbInt.has_value()) &&
+                       (*lbInt % tileSizeInt == static_cast<int64_t>(0)) &&
+                       (tileSizeInt == llvm::bit_floor(tileSizeInt));
+
+  AffineMap splitMap = usePowerSplit ? powerSplitMap : defaultSplitMap;
+
+  bool isInnermostLoop = loopLevelIdx == loopRanges.size() - 1;
+  if (isInnermostLoop)
+    innermostLoops.push_back(currentLoop);
+
+  if (isHeadOrInsideHeadLoop)
+    loops.push_back(loop);
+
+  builder.setInsertionPointToEnd(loop.getBody());
+
+  // Create the nested loop inside current loop.
+  if (!isInnermostLoop)
+    continuousLoopNestHelper(builder, loop->getLoc(), loopRanges, loops,
+                             loopLevelIdx + 1, loopIdx, tileSizes, CTileVector,
+                             sizesMap, innermostLoops, loop.getRegionIterArgs(),
+                             isHeadOrInsideHeadLoop);
+
+  // Apply continuous tiling to current loop if continuous_tiles
+  // specifies so.
+  while (CTileVector[loopLevelIdx] && tileSizeInt > 1) {
+
+    uint64_t maxPower = llvm::bit_floor(tileSizeInt);
+    tileSizeInt = maxPower == tileSizeInt ? maxPower >> 1 : maxPower;
+
+    builder.setInsertionPoint(currentLoop);
+
+    auto constStepOp = builder.create<arith::ConstantIndexOp>(loc, tileSizeInt);
+
+    Value splitBound = builder.createOrFold<affine::AffineApplyOp>(
+        loc, splitMap,
+        ValueRange{currentLoop.getLowerBound(), currentLoop.getUpperBound(),
+                   currentLoop.getStep()});
+
+    builder.setInsertionPointAfter(currentLoop);
+    auto additionalLoop =
+        builder.create<scf::ForOp>(currentLoop->getLoc(), splitBound, size,
+                                   constStepOp, destinationTensors);
+
+    additionalLoop.getInitArgsMutable().assign(currentLoop->getResults());
+    currentLoop.getUpperBoundMutable().assign(splitBound);
+
+    builder.setInsertionPointToStart(additionalLoop.getBody());
+    AffineExpr s0, s1, d0;
+    bindDims(builder.getContext(), d0);
+    bindSymbols(builder.getContext(), s0, s1);
+    AffineMap minMap = AffineMap::get(1, 1, {s0}, builder.getContext());
+    auto additionalLoopAffineMin = affine::makeComposedAffineMin(
+        builder, loc, minMap,
+        SmallVector<OpFoldResult>{splitBound, getAsOpFoldResult(constStepOp),
+                                  size});
+
+    currentLoop = additionalLoop;
+
+    sizesMap[loopIdx] = getAsOpFoldResult(additionalLoopAffineMin);
+
+    // Add custom loop-indexing attribute to each loop op.
+    // Continuous tiling ends up generating many loop nestings and
+    // each loop can be identified with its loop-index attribute.
+    // This is needed later to retrieve the sizes from sizesMap.
+    currentLoop->setAttr(kLoopIndexLabel, builder.getIndexAttr(loopIdx));
+
+    ++loopIdx;
+
+    if (isInnermostLoop)
+      innermostLoops.push_back(currentLoop);
+
+    builder.setInsertionPointToEnd(currentLoop.getBody());
+
+    // Create the nested loop inside current loop.
+    if (!isInnermostLoop)
+      continuousLoopNestHelper(builder, currentLoop->getLoc(), loopRanges,
+                               loops, loopLevelIdx + 1, loopIdx, tileSizes,
+                               CTileVector, sizesMap, innermostLoops,
+                               currentLoop.getRegionIterArgs());
+  }
+
+  // Always yield the result of the tail-end loop as this
+  // will have all the processed tiles.
+  if (!isa<func::ReturnOp>(currentLoop->getBlock()->back())) {
+    builder.setInsertionPointToEnd(currentLoop->getBlock());
+    builder.create<scf::YieldOp>(currentLoop.getLoc(),
+                                 currentLoop.getResults());
+  }
+  /// For the outermost loop insert the tail-end loop in front of loops
+  /// structure so that it's results can be used for replacements in the
+  /// function return. This is removed from the head of loops later.
+  else
+    loops.insert(loops.begin(), currentLoop);
+
+  destinationTensors = loop.getRegionIterArgs();
+}
+
+/// Generate an empty loop nest that represents the continuous-tiled loop nest
+/// shell.
+/// - `loopRanges` specifies the lb, ub and step of the untiled iteration space.
+/// - `tileSizes` is the tile sizes to use in the first tiling attempt. Zero
+/// represent untiled loops.
+/// - In ``sizesMap` return the multi-dimensional size of
+///   the tile processed within the inner most loop.
+/// - `CTileVector` specifies which loop nest should be continuously tiled.
+/// Note that this methods adds `scf.yield` operation for all but the innermost
+/// loop. These yield the value returned by the immediately inner tail-end loop.
+/// The caller is expected to add the scf.yield operation for all innermost
+/// loops.
+static SmallVector<LoopLikeOpInterface> generateContinuousTileLoopNest(
+    OpBuilder &builder, Location loc, ArrayRef<Range> loopRanges,
+    ArrayRef<OpFoldResult> tileSizes, SmallVector<bool> CTileVector,
+    std::map<int, OpFoldResult> &sizesMap,
+    SmallVector<scf::ForOp> &innermostLoops,
+    ValueRange destinationTensors = {}) {
+  if (loopRanges.empty())
+    return {};
+  assert(loopRanges.size() == tileSizes.size() &&
+         "expected as many tile sizes as loop ranges");
+  OpBuilder::InsertionGuard guard(builder);
+  SmallVector<LoopLikeOpInterface> loops;
+
+  uint64_t loopIdx = 0;
+  continuousLoopNestHelper(builder, loc, loopRanges, loops, 0, loopIdx,
+                           tileSizes, CTileVector, sizesMap, innermostLoops,
+                           destinationTensors, true);
+
+  return loops;
+}
+
+
 /// Append the specified additional `newInitOperands` operands to the
 /// loops existing `init` operands (or similar), and replace `loopOp` with
 /// the new loop that has the additional init operands. The loop body of
@@ -679,6 +864,278 @@ mlir::scf::tileUsingSCF(RewriterBase &rewriter, TilingInterface op,
   return scf::SCFTilingResult{tilingResult->tiledOps, loops, replacements};
 }
 
+/// Implementation of continuous-tiling transformation of `op`
+/// that implements the `TilingInterface` using a sequence of
+/// `scf.for` loops to iterate over tiles of exponentially
+/// diminishing sizes.
+///
+/// The generated sequence of `scf.for` loops iterate over tiles of
+/// exponentially diminishing sizes as opposed to vanilla tiling scheme where
+/// only the tile sizes specified as transform parameters are used.
+/// continuous-tiling first applies regular tiling using the specified
+/// tile size, then halves the tile size and uses it to tile the leftover
+/// chunk. This process of halving the tile size and tiling the leftover
+/// chunk is repeated until tile size reaches 1. The transform parameter
+/// continuous_tiles controls which nested loop should be tiled. If all
+/// arguments in continuous_tiles are set to false, the result is identical
+/// to vanilla tiling transform.
+///
+/// When tiling a tensor of size M with tile size 8, it generates
+/// the following loop to tile
+///
+/// for (int i = 0; i < M; i += 8) {
+///   int size = min(8, M-i);
+///   // size is unknown at compile time because M is dynamic
+///   // compute using dynami...
[truncated]


github-actions bot commented Feb 23, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@nicolasvasilache (Contributor) left a comment

Thanks for proposing to extend tiling @muneebkhan85

Quick question, have you looked at the MultiTileSizes op?

%1:3 = transform.structured.multitile_sizes %0 { dimension = 0, target_size = 3} : (!transform.any_op) -> !transform.any_op

Can the 2 approaches be (partially) merged to have less code overall?

Also adding @ftynse for visibility.

@MaheshRavishankar (Contributor) left a comment

Just putting a marker cause I need to look more deeply into this. This is a significant change and needs some time to review.

@MaheshRavishankar (Contributor) left a comment

Surfacing comments as I review. More to come. This definitely is interesting, I am trying to understand the implementation. Is it possible to not duplicate all the tiling logic?

/// - `CTileVector` specifies which loop nest should be continuously tiled.
/// Note that this methods adds `scf.yield` operation for all but the innermost
/// loop. These yield the value returned by the immediately inner tail-end loop.
/// The caller is expected to add the scf.yield operation for all innermost
Contributor

I have found this to be a bit of a foot gun. At some point of the implementation I did the same where the caller would apply the scf.yield. But that is awkward for the caller to do. Can we instead try to generate the scf.yield during the transformation itself?

@muneebkhan85 (Contributor Author) commented on Feb 27, 2024

Sorry, but this was a leftover from the previous "copied" comment. The scf.yield is actually generated for all inner loops as part of the call itself after their creation (as you suggested). Fixed comment now.


// Always yield the result of the tail-end loop as this
// will have all the processed tiles.
if (!isa<func::ReturnOp>(currentLoop->getBlock()->back())) {
Contributor

I am not sure why this is looking for a func.return

@muneebkhan85 (Contributor Author) commented on Feb 26, 2024

Ok, perhaps the comments aren't that good here. It only inserts a YieldOp when there's no return at the end of the block. If there is a return, then we are dealing with the outermost loop and we don't want to create a YieldOp; instead, the return needs to be rewritten to return the right value. The 'if' makes sure that we are handling the first case and creating a YieldOp for the inner loops only.

Contributor

The return at the end is an artifact of the test. This seems very fragile.

Contributor Author

I have fixed this now, so that it doesn't depend on func.return to detect the outermost loop.

@nicolasvasilache self-requested a review on February 25, 2024 at 20:03
@nicolasvasilache (Contributor) left a comment

Just putting a marker cause I need to look more deeply into this. This is a significant change and needs some time to review.

Indeed it is.

I will add that we need the prerequisite cleanups (discussed here and here) to land first before any new code is added on these paths.

Speaking of which @MaheshRavishankar, what is the status of the required cleanup?

@MaheshRavishankar (Contributor)

Just putting a marker cause I need to look more deeply into this. This is a significant change and needs some time to review.

Indeed it is.

I will add that we need the prerequisite cleanups (discussed here and here) to land first before any new code is added on these paths.

Speaking of which @MaheshRavishankar, what is the status of the required cleanup?

Fighting my backlog a bit. I am planning to start it in a couple of weeks.

@MaheshRavishankar (Contributor)

High level comment, why can't this just be an after-tile-and-fuse peeling transformation? What is happening here seems to be peeling applied repeatedly.

@ftynse (Member) left a comment

I left a bunch of mostly stylistic comments that I think are useful regardless of whether this code goes in as is or not. I only commented once on each particular kind of issue and there may be other issues of the same kind in the code; please fix them all.

Overall, this looks quite interesting. I agree with other reviewers in my desire to see more code reuse. It is possible to alter the existing code to allow for more reuse if needed. If this is approached from the transform dialect perspective, it is possible to decompose this transformation into pieces similar to what was implemented for multisize tiling in

// This implements a 2D multisize tiling with target sizes [3, 10].

Specifically, one can separate emission of the IR computing tile sizes (before all loops, since they won't change), splitting the operation into pieces of appropriate size, and tiling those pieces. I suppose that the C++ code can be structured similarly and reuse the existing functions for splitting/tiling, so that in the end only the size computation is emitted. If a higher-level transform operation is still desired, I'd consider implementing it as a rewrite on the transform dialect itself, rather than as complex logic within C++. This was considered for multisize tiling, but not implemented given the lack of interest.


/// Specify which loops in the loop nest are to be continuously tiled.
SmallVector<bool> continuousTileMappingVector = {};
SCFTilingOptions &setCTileMapping(ArrayRef<bool> ctile) {
Member

Nit: let's use full names, setContinuousTileMapping.

SmallVector<bool> continuousTileMappingVector = {};
SCFTilingOptions &setCTileMapping(ArrayRef<bool> ctile) {
continuousTileMappingVector =
llvm::map_to_vector(ctile, [](auto attr) -> bool { return attr; });
Member

Nit: llvm::to_vector works just fine, no need to have an additional map with identity function.

tilingOptions.setCTileMapping(getContinuousTiles());

FailureOr<scf::SCFTilingResult> maybeTilingResult;
if (tilingOptions.continuousTileMappingVector.empty())
Member

Nit: please add braces if bodies don't fit into one line.

@@ -309,6 +311,188 @@ static LogicalResult generateLoopNest(RewriterBase &rewriter, Location loc,
return rewriter.notifyMatchFailure(loc, "unhandled loop type");
}

static void continuousLoopNestHelper(
Member

Nit: please document top-level functions appropriately.

@@ -309,6 +311,188 @@ static LogicalResult generateLoopNest(RewriterBase &rewriter, Location loc,
return rewriter.notifyMatchFailure(loc, "unhandled loop type");
}

static void continuousLoopNestHelper(
OpBuilder &builder, Location loc, ArrayRef<Range> loopRanges,
SmallVector<LoopLikeOpInterface> &loops, uint64_t loopLevelIdx,
Member

Take a (const) reference to SmallVectorImpl to avoid copying on-stack elements of the vector. If the callee doesn't modify the length of the vector, take a MutableArrayRef/ArrayRef instead.

currentLoop.getRegionIterArgs());
}

// Yiled results for all loops in the loop nest except for
Member

Nit typo: yield.

// will have all the processed tiles.
if (loopLevelIdx != 0) {
builder.setInsertionPointToEnd(currentLoop->getBlock());
builder.create<scf::YieldOp>(currentLoop.getLoc(),
Member

Nit: we probably want to assert somewhere that there are results. Otherwise, the default builder creates a yield operation and this will be the second one (yes, in hindsight, this wasn't a good choice).

Comment on lines 922 to 923
return rewriter.notifyMatchFailure(
op, "missing tile size computation function");
Member

These error messages are invisible unless the function is called from a pattern rewriting engine, which I suspect is not the case here. If you want to provide a better error description, you can either use DiagnosedSilenceableFailure from the Transform dialect or ask the caller to give an llvm::raw_ostream, defaulted to llvm::nulls that will accept the error message.

LLVM_DEBUG({
if (!forLoops.empty()) {
llvm::dbgs() << "LoopNest shell :\n";
forLoops.front()->getBlock()->dump();
Member

You likely want to ->print(llvm::dbgs()) for the case when error stream and debug stream differ.

Comment on lines 1119 to 1122
/// Remove outermost tailend loop as its only use was to compute
/// replacements.
forLoops.erase(forLoops.begin());

Member

Not quite sure what happens here: isn't this loop produced by tiling and therefore should appear in the results?

@muneebkhan85 (Contributor Author)

High level comment, why can't this just be an after-tile-and-fuse peeling transformation? What is happening here seems to be peeling applied repeatedly.

An issue here is that the static tile size is created using AffineMin at the time of creating the loop nests. This makes the issue more complicated than just having to create new tail-loops and clone regions from the original loop into it.
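(For context, the bounded tile size emitted inside each tiled loop body is an affine.min of the static tile size and the remaining trip count, roughly of the form below; the map and value names are illustrative, not taken from the patch:)

%size = affine.min affine_map<(d0)[s0, s1] -> (s0, s1 - d0)>(%iv)[%tile_size, %upper_bound]

A peeling pass applied afterwards would have to prove this min redundant in the main loop, which is part of what makes the post-hoc approach more involved.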

@rolfmorel (Contributor) commented on Mar 21, 2024

Hi @ftynse, @nicolasvasilache, @MaheshRavishankar

Thank you for pointing out how multisize tiling is handled and suggesting that the continuous tiling transform can be modelled after it.

@muneebkhan85 and I had a discussion on how continuous tiling might be expressible using the transform dialect. The main issue we come up against is that rather than a statically known number of loops being generated, as with multisize tiling, continuous tiling can generate an arbitrary number of tiling loops. That is, whereas transform.structured.multitile_sizes returns two tile sizes and one split point, something like transform.structured.continuous_tile_sizes would output n tile sizes and n-1 split points (based on the target size and the size of the dimension). As such we need one or more ops to encapsulate iterating over n, the number of splits/tiling loops.

By modelling continuous tiling after multisize tiling, we came up with the following:

%linalg = transform.structured.match ops{["linalg.generic"]} in %payload
%tile_sizes, %split_points = transform.structured.continuous_tile_sizes %linalg { dimension = 0, target_size = 9} : (!transform.any_op) -> (!transform.param<i64>, !transform.param<i64>)
%head_linalg, %tail_linalgs = transform.structured.split %linalg after %split_points { dimension  = 0 }
%linalg_splits = transform.merge_handles %head_linalg, %tail_linalgs
%tiled_splits = transform.foreach %linalg_splits, %tile_sizes {
  ^bb1(%linalg_split: !transform.any_op, %tile_size: !transform.any_param):
  %tiled_linalg_split = transform.structured.tile_using_for %linalg_split [%tile_size]
  transform.yield %tiled_linalg_split
}

The above departs from the current transform dialect in three ways:

  • transform.structured.continuous_tile_sizes is a new op with two results:
    • The first result (%tile_sizes) is a list containing the target size followed by all smaller powers of 2 (9, 8, 4, 2, 1 in this example)
    • The second result (%split_points) is a list of split points of the iteration space of the given dimension, specifying where each of the tile sizes should be applied (à la the split point of multitile_sizes).
  • transform.structured.split is changed such that it performs a multiway split at once. The above assumes that split would still produce only two outputs, though the second handle (%tail_linalgs) would collect all but the first split-off part.
  • transform.foreach is changed to take multiple handles by making target variadic and iterates over all handles at once by "zipping" the lists associated with the handles. The enclosed block would take as many arguments as there are targets.

For the generalisation of foreach we have come across other use cases as well (in particular for fusing loops coming from co-indexed handles).

Could you let us know if the above is similar to what you had in mind? We would be happy to work on landing the above three changes.

If there are other more preferable paths to supporting continuous tiling, do let us know. @ftynse, you mentioned an approach whereby the transform dialect would itself be subject to rewrites. Do you have a reference for us to look at?

@ftynse (Member) commented on Mar 21, 2024

SGTM in principle. I'm not quite sure how this tiling would work for multiple dimensions though.

transform.structured.split is changed such that it performs a multiway split at once. The above assumes that split would still produce only two outputs, though the second handle (%tail_linalgs) would collect all but the first split-off part.

Nit: I'd rather just have one result, and your example merges the two results anyway. If the two distinct results are necessary, we can either use transform.split_handle or have an attribute controlling this.

transform.foreach is changed to take multiple handles by making target variadic and iterates over all handles at once by "zipping" the lists associated with the handles. The enclosed block would take as many arguments as there are targets.

Nit: make sure to define behavior when lengths don't match, e.g. zip_shortest vs. zip (produces failure on mismatch). Making it visible in the syntax is also nice.

@ftynse, you mentioned an approach whereby the transform dialect would itself be subject to rewrites. Do you have a reference for us to look at?

I don't think anything was committed upstream because there was no pressing need for that. However, the transform dialect is just a dialect. One should be able to write a pass that applies patterns rewriting transform dialect operations quite easily. One of these patterns may expand a hypothetical high-level op into the actual implementation as proposed above.

@muneebkhan85 (Contributor Author)

@ftynse Thanks for the prompt feedback on this.

SGTM in principle. I'm not quite sure how this tiling would work for multiple dimensions though.

It should work the same way as it does for multitile_sizes, where each dimension is tiled separately. The above example from @rolfmorel only shows applying continuous tiling for dimension 0. The same could be repeated for dimension 1 after utilizing replicate and split in the same way as for multisize tiling.

@rolfmorel and I will work on the proposed changes for transform.structured.split and transform.foreach in separate PRs (that will refer to the discussion here) while repurposing this PR to introduce the transform.structured.continuous_tile_sizes op.

@rolfmorel (Contributor) commented on Mar 22, 2024

Here's an example of how, with the proposed interface, continuous tiling can be applied to two dimensions. This has the exponentially decreasing pattern of tile sizes for the inner dimension occurring within each tiling loop of the outer dimension (where these outer loops themselves have exponentially decreasing tile sizes):

%linalg = transform.structured.match ops{["linalg.generic"]} in %payload // assume only one linalg matched
%tile_sizes_dim0, %split_points_dim0 = transform.structured.continuous_tile_sizes %linalg { dimension = 0, target_size = 9} : (!transform.any_op) -> (!transform.param<i64>, !transform.param<i64>)
%tile_sizes_dim1, %split_points_dim1 = transform.structured.continuous_tile_sizes %linalg { dimension = 1, target_size = 7} : (!transform.any_op) -> (!transform.param<i64>, !transform.param<i64>)
%linalg_dim0_splits = transform.structured.split %linalg after %split_points_dim0 { dimension  = 0, multiway }
transform.foreach %linalg_dim0_splits, %tile_sizes_dim0 {
  ^bb1(%linalg_dim0_split: !transform.any_op, %tile_size_dim0: !transform.any_param):
  %linalg_dim0_split_tiled, %tile_loop = transform.structured.tile_using_for %linalg_dim0_split [%tile_size_dim0]
  %linalg_dim1_splits = transform.structured.split %linalg_dim0_split_tiled after %split_points_dim1 { dimension = 1, multiway }
  transform.foreach %linalg_dim1_splits, %tile_sizes_dim1 {
    ^bb2(%linalg_dim1_split: !transform.any_op, %tile_size_dim1: !transform.any_param):
    transform.structured.tile_using_for %linalg_dim1_split [0, %tile_size_dim1]
    transform.yield
  }
  transform.yield
}

In the above, the multiway attribute tells transform.structured.split to interpret the vector of split points associated with the split_points argument as a list of multiple points to split (one dimension of) the target linalg along (rather than the handles for target and split_points being co-indexed with just one split point for each linalg).

This pattern naturally extends to higher dimensions.

@ftynse (Member) commented on Mar 28, 2024

Ok, this design is fine with me.

@MaheshRavishankar (Contributor)

High level comment, why can't this just be an after-tile-and-fuse peeling transformation? What is happening here seems to be peeling applied repeatedly.

An issue here is that the static tile size is created using AffineMin at the time of creating the loop nests. This makes the issue more complicated than just having to create new tail-loops and clone regions from the original loop into it.

Sorry for the delay. I am still not convinced this is the easiest way of doing this. Looking at the description, you are trying to split full tiles and partial tiles (and then applying that recursively). An easier approach to me is to do something like this:

%result = <linalg_op>(%operand1, %operand2, ...)

you can first split into two

%operand0_slice = tensor.extract_slice %operand0
%operand1_slice = tensor.extract_slice %operand1
....
%result_slice = <linalg_op>(%operand0_slice, %operand1_slice,...)
...
%operand0_remaining = tensor.extract_slice %operand0
%operand1_remaining = tensor.extract_slice %operand1
....
%result_remaining = <linalg_op>(%operand0_remaining, %operand1_remaining)
....
%result = tensor.insert_slice %result_slice into ....
%result1 = tensor.insert_slice %result_remaining  into ...

Now you can tile the individual ops. This does not need any changes to the tiling implementation itself, and is also modular. You can take the second linalg op and apply the same procedure repeatedly to get all the different loops. That would be up to the caller to do (or wrapped in a helper function), but the core logic is just this one step.
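For illustration, a hypothetical concrete instance of this one-step split, assuming a unary elementwise op over a 100-element tensor and a tile size of 8 (so the head covers 96 elements and the tail the remaining 4); names and shapes are made up for the example:

// Head: the part of the dimension evenly covered by the tile size.
%in_head  = tensor.extract_slice %arg0[0] [96] [1] : tensor<100xf32> to tensor<96xf32>
%out_head = tensor.extract_slice %init[0] [96] [1] : tensor<100xf32> to tensor<96xf32>
%res_head = linalg.elemwise_unary ins(%in_head : tensor<96xf32>) outs(%out_head : tensor<96xf32>) -> tensor<96xf32>
// Tail: the leftover chunk, which can be split again with a smaller tile size.
%in_tail  = tensor.extract_slice %arg0[96] [4] [1] : tensor<100xf32> to tensor<4xf32>
%out_tail = tensor.extract_slice %init[96] [4] [1] : tensor<100xf32> to tensor<4xf32>
%res_tail = linalg.elemwise_unary ins(%in_tail : tensor<4xf32>) outs(%out_tail : tensor<4xf32>) -> tensor<4xf32>
// Stitch the two results back into the destination.
%full   = tensor.insert_slice %res_head into %init[0] [96] [1] : tensor<96xf32> into tensor<100xf32>
%result = tensor.insert_slice %res_tail into %full[96] [4] [1] : tensor<4xf32> into tensor<100xf32>

Each of the two resulting linalg ops can then be tiled independently, and the same split can be reapplied to the tail op to obtain the chain of diminishing tile sizes.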

@muneebkhan85 (Contributor Author)

High level comment, why can't this just be an after-tile-and-fuse peeling transformation? What is happening here seems to be peeling applied repeatedly.

An issue here is that the static tile size is created using AffineMin at the time of creating the loop nests. This makes the issue more complicated than just having to create new tail-loops and clone regions from the original loop into it.

Sorry for the delay. I am still not convinced this is the easiest way of doing this. Looking at the description, you are trying to split full tiles and partial tiles (and then applying that recursively). An easier approach to me is to do something like this:

%result = <linalg_op>(%operand1, %operand2, ...)

you can first split into two

%operand0_slice = tensor.extract_slice %operand0
%operand1_slice = tensor.extract_slice %operand1
....
%result_slice = <linalg_op>(%operand0_slice, %operand1_slice,...)
...
%operand0_remaining = tensor.extract_slice %operand0
%operand1_remaining = tensor.extract_slice %operand1
....
%result_remaining = <linalg_op>(%operand0_remaining, %operand1_remaining)
....
%result = tensor.insert_slice %result_slice into ....
%result1 = tensor.insert_slice %result_remaining  into ...

Now you can tile the individual ops. This does not need any changes to the tiling implementation itself, and is also modular. You can take the second linalg op and apply the same procedure repeatedly to get all the different loops. That would be up to the caller to do (or wrapped in a helper function), but the core logic is just this one step.

@MaheshRavishankar Thank you for the comments. After some deliberation we have decided to take another route for this without requiring any changes to the tiling logic. Please see the discussions here

#82792 (comment)

together with the example posted here

#82792 (comment)

The idea is to create an op continuous_tile_sizes that -- when provided a linalg op, dimension to tile and tile size -- would return as result 1) a list of tile sizes, and 2) a list of split points in a pattern similar to multitile_sizes except that each of these lists would contain more than one element. Then, using a multi-way split, we could use the existing tiling op to achieve continuous tiling.

@muneebkhan85 (Contributor Author) commented on May 13, 2024

@ftynse @MaheshRavishankar @nicolasvasilache I have re-purposed this PR in line with our earlier discussion here. My latest commits (1) introduce a continuous tiling operation (structured.continuous_tile_sizes) that computes the various (exponentially diminishing) tile sizes and split points along the given dimension of a target linalg op, (2) enable multiway splitting in the SplitOp transform.structured.split, which allows the target op to be split at multiple points along the given dimension, and (3) include a test that also serves as an example of how to use the results of structured.continuous_tile_sizes with the multiway split op.

The third part of the proposal was to modify transform.foreach as described in the comment above, also accessible here

#82792 (comment)

@rolfmorel is currently working on that and will submit it as a separate PR.

Comment on lines 1419 to 1423
the iteration space of the specified dimension. `static_split_point` and
`dynamic_split_point` in this case is a list of chunk sizes that the given
dimension should be split into. With `multiway` it produces two handles;
Member

"split point" accepting chunk sizes sounds misleading. We should rename it. It is okay for the two-way case to take one size, which is same as the split point.

Also, document what happens when the sum of chunk sizes does not add up to the total size of the op.

Contributor Author

Changed to chunk sizes now with the description saying how it works when given just one value. The description also documents what happens when the chunk sizes do not cover the entire iteration space.

the first handle is a list of the multiple parts of the structured op
after splitting, where the target dimensions for each linalg op in the
list corresponds to the chunk sizes specfied in the input split list.
The second handle is empty.
Member

I'd consider just returning one handle always. It can be later split into parts using split_handle.

Contributor Author

I have tested this out. It can surely be a single handle and can be split into multiple handles using split_handle, though there's a caveat that split_handle needs to specify the expected number of outputs (as list of output types) depending on the number of payloads it receives in the handle.

In the multitile-size example mlir/test/Dialect/Linalg/multisize-tiling-full.mlir, the 2nd split op generates two lists that are used by the two subsequent tile_using_for ops. Creating a single output for the split op would mean merging these two lists, and split_handle would then see a handle of 4 payloads. This complicates things a bit in the sense that some reworking of the usage and examples is needed. Keeping it to produce two outputs for now. I think this change would be good for a separate PR.
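For reference, a minimal sketch of that caveat (handle names are hypothetical): transform.split_handle is told how many pieces to expect through its result types, e.g. two here.

%first, %rest = transform.split_handle %linalg_splits : (!transform.any_op) -> (!transform.any_op, !transform.any_op)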

DeclareOpInterfaceMethods<TransformOpInterface>,
ReportTrackingListenerFailuresOpTrait]> {
let description = [{
This transform takes a linalg as target and a dimension and target size
Member

Suggested change:
- This transform takes a linalg as target and a dimension and target size
+ This transform takes a handle to a linalg op as target and a dimension and target size

Member

Here and below. We don't usually refer to an op as "the linalg".

Contributor Author

Fixed.

@@ -2269,8 +2269,20 @@ SplitOp::apply(transform::TransformRewriter &rewriter,
// Collect the dynamic split points if provided.
SmallVector<Operation *> payload =
llvm::to_vector(state.getPayloadOps(getTarget()));

bool isMultiwaySplit = getMultiway() ? true : false;
Member

? true : false is a tautology. It can be omitted.

Contributor Author

Don't know what I was thinking here. Fixed.

bool isMultiwaySplit = getMultiway() ? true : false;

if (isMultiwaySplit && !llvm::hasSingleElement(payload)) {
return emitDefiniteFailure() << "requires exactly one target when "
Member

Let's use silenceable failure.

Contributor Author

Fixed.

splitPoints.push_back(splitPoint);
}

auto makeOpFromValue = [&](ArrayRef<Value> values) {
Member

Suggested change:
- auto makeOpFromValue = [&](ArrayRef<Value> values) {
+ auto getDefiningOps = [&](ArrayRef<Value> values) {

Member

It doesn't actually make the op.

Contributor Author

Fixed.

Comment on lines 2738 to 2739
return llvm::to_vector(
llvm::map_range(values, [&](Value value) -> Operation * {
Member

map_to_vector

Contributor Author

Fixed.

Comment on lines 2744 to 2747
transformResults.set(cast<OpResult>(getTileSizes()),
makeOpFromValue(spec->tileSizes));
transformResults.set(cast<OpResult>(getSplitPoints()),
makeOpFromValue(splitPoints));
Member

FYI, we have value handles. They are rather underused and I won't push for them here to maintain consistency, but it would make this cleaner.

@muneebkhan85 (Contributor Author) commented on May 23, 2024

This is a good suggestion and good to know that it can be done using Values, but skipping for consistency.

Comment on lines 179 to 189
// Find the trip count of the iteration space dimension for which the tile
// sizes are computed.
SmallVector<OpFoldResult> allShapes =
op.createFlatListOfOperandDims(b, b.getLoc());
AffineMap shapesToLoops = op.getShapesToLoopsMap();
SmallVector<OpFoldResult> loopRanges =
makeComposedFoldedMultiResultAffineApply(b, op.getLoc(), shapesToLoops,
allShapes);

Value loopRange =
getValueOrCreateConstantIndexOp(b, op.getLoc(), loopRanges[dimension]);
Member

Can we use/adapt TilingInterface for this?

@muneebkhan85 (Contributor Author) commented on May 23, 2024

I have adapted the dynamic case computeContinuousTileSizes to entirely use TilingInterface. For consistency, I have modified the static case computeStaticContinuousTileSizes to take a TilingInterface op, but it still relies on a casted LinalgOp internally.

Member

Note to self (@ftynse): I haven't double-checked the math here.

@muneebkhan85 changed the title from "[MLIR] Add continuous tiling to TileUsingForOp" to "[MLIR] Add continuous tiling to transform dialect" on May 30, 2024
@muneebkhan85 force-pushed the continuous_tiling branch 2 times, most recently from f4f657a to f7a1dec, on May 31, 2024 at 10:59
@MaheshRavishankar (Contributor) left a comment

Just a fly-by comment. I can't say for sure, but please refrain from using any of the linalg::tile* methods and use the scf::tileUsingSCF* methods. The linalg:: ones are being removed.

@rolfmorel (Contributor) commented on Jun 6, 2024

Here's a suggestion for a test case. It requires the pending change to transform.foreach to land first though (#93705).

PAYLOAD_CONTAINING_ANY_LINALG_HERE

module attributes {transform.with_named_sequence} {
  transform.named_sequence @tile_dim0_continuously(%linalg: !transform.any_op {transform.readonly}, %target_size: !transform.param<i64>) -> !transform.any_op, !transform.any_op {
    %tile_sizes, %split_points = transform.structured.continuous_tile_sizes %linalg descending_from %target_size { dimension = 0 } : (!transform.any_op, !transform.param<i64>) -> (!transform.param<i64>, !transform.param<i64>)
    %linalg_splits, %empty = transform.structured.split %linalg after %split_points { dimension  = 0, multiway } : ...
    %tiled_splits, %loops = transform.foreach %linalg_splits, %tile_sizes {
      ^bb1(%linalg_split: !transform.any_op, %tile_size: !transform.param<i64>):
      %tiled_linalg_split, %dim0_loop = transform.structured.tile_using_for %linalg_split [%tile_size] : ...
      transform.yield %tiled_linalg_split, %dim0_loop : ...
    }
    transform.yield %tiled_splits, %loops : ...
  }
  transform.named_sequence @__transform_main(%payload: !transform.any_op) {
    %linalg = transform.structured.match ops{["linalg.generic"]} in %payload  : ...
    %target_size = transform.param.constant 9 : !transform.param<index>
    // NB: we would need a foreach here/in the named sequence if %linalg would associate multiple linalg.generics
    %tiled_linalg_splits, %tiling_loops = transform.include @tile_dim0_continuously failures(propagate) (%linalg) : (!transform.any_op) -> !transform.any_op, !transform.any_op
    transform.yield 
  }
}

Note the above assumes that transform.structured.continuous_tile_sizes is able to take target_size as a param instead of requiring it to be passed as an attribute. (Note as well that the above is not quite syntactically correct: e.g., some types are left as homework.)

Demonstrating abstraction

The main point of interest regarding the test case is that it demonstrates how to do abstraction with the Transform dialect. That is, it shows how a higher-level transform can be implemented with the Transform dialect while the details of this implementation can be hidden. In this way, the higher-level transform's functionality is exposed basically* as if it were available like any other Transform op.

(*: Note that (currently) a named sequence cannot be parametric w.r.t. attribute arguments that its contained ops take as attributes and not as params. So each op that we would like to use in named sequences - and whose attribute arguments we'd like to expose as arguments of the sequence - needs to be modified to take the relevant attributes as transform.param arguments as well. Additionally these attribute arguments need to be packaged up as params before being handed to the named sequence - see transform.param.constant right before transform.include up above. This is not ideal.)

@ftynse
Member

ftynse commented Jun 7, 2024


I'm open to suggestions here. Note that, even though people write a lot of it directly, the transform dialect is still an IR, so some choices are made to let a compiler reason about the IR more easily, for example param constants being operations just like all other constants in MLIR.

@rolfmorel
Contributor

rolfmorel commented Jun 7, 2024


Yes, valid point: having the param constants out in front of transform.includes is mostly a stylistic concern, and could even be useful if we were to do things to Transform IR itself.

The more serious issue we are seeing is that, at present, every op whose attribute arguments we would like to abstract over has to be modified, op by op and argument by argument, to also take these attributes as params. While this approach is feasible, it feels like there is potentially a better mechanism for addressing this need.

As this is becoming off-topic for this PR, we will post on Discourse to start a discussion regarding better ways of dealing with the above.

@ftynse
Member

ftynse commented Jun 14, 2024


Please do.

Let me know if you need help merging this; it's approved.

@muneebkhan85
Contributor Author

@ftynse I have introduced new test cases under mlir/test/Dialect/Linalg/continuous-tiling-full.mlir to demonstrate the full idea of how continuous tiling is applied.

While the examples here test static dimension sizes, we have encountered an issue in the dynamic case where dimension sizes are not known, i.e., ?x?. It turns out that in this case multiway splitOp generates an "additional" trailing linalg op whose dimensions would evaluate to zero at runtime (when the chunk_sizes are generated by continuous_tile_sizes). This can be attributed to linalg::splitOp creating a valid head and tail on the last chunkSize. As a result, the foreach from #93705 fails because linalg_splits and tile_sizes do not have the same length. We could have a fully working dynamic example if the last element in linalg_splits could be "dropped" in some way. One way to do this is to add zip_shortest to foreach when it takes more than one handle. If you have another suggestion or preference, please let us know.

Here's an example for the dynamic dimensions case. linalg_splits and tile_sizes do not have the same payload size (6 vs 5), which makes the foreach fail.

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1: !transform.any_op {transform.readonly}) {
    %0 = transform.structured.match ops{["linalg.matmul"]} in %arg1 : (!transform.any_op) -> !transform.any_op
    %tile_sizes, %chunk_sizes = transform.structured.continuous_tile_sizes %0 { dimension = 0, target_size = 9 } : (!transform.any_op) -> !transform.any_op
    %linalg_splits, %empty = transform.structured.split %0 after %chunk_sizes { dimension = 0, multiway } : !transform.any_op, !transform.any_op
    transform.foreach %linalg_splits, %tile_sizes : !transform.any_op, !transform.any_op {
    ^bb1(%linalg_split: !transform.any_op, %tile_size: !transform.any_op):
      %tiled_linalg_split, %dim0_loop = transform.structured.tile_using_for %linalg_split tile_sizes [%tile_size] : (!transform.any_op, !transform.any_op) -> (!transform.any_op, !transform.any_op)
      transform.yield
    }
    transform.yield
  }
}

func.func @continuous_tile_linalg_matmul(
  %arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %arg2: tensor<?x?xf32>)
    -> tensor<?x?xf32> {
  %0 = linalg.matmul  ins(%arg0, %arg1: tensor<?x?xf32>, tensor<?x?xf32>)
                     outs(%arg2: tensor<?x?xf32>)
    -> tensor<?x?xf32>

  return %0 : tensor<?x?xf32>
}

Add continuous tiling op `structured.continuous_tile_sizes`
to the transform dialect that returns as result (1) a list of
exponentially diminishing tile sizes, and (2) a list of chunk
sizes -- along the specified dimension of the target --
where the corresponding tile sizes from (1) can be applied.
The list of chunk sizes from (2) covers the entire iteration
space along the given dimension of the target.
Add functionality that enables SplitOp to do a multiway split of
a target op along a given dimension. With the multiway attribute,
SplitOp takes a list of chunk sizes and applies it to a single
target along the given dimension to generate multiple
structured ops extracted from the target.
Tests SplitOp for multiway splitting of a structured op using
the result of `continuous_tile_sizes` to specify the
chunk sizes that the target's specified dimension should be
split into.
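
For intuition, here is a minimal standalone sketch of the idea in plain C++ (assuming a power-of-two halving scheme; the op's actual computation may differ in details):

// Illustrative only: compute exponentially diminishing tile sizes and, for
// each tile size, the chunk of the iteration space it should be applied to.
#include <cstdint>
#include <utility>
#include <vector>

static std::pair<std::vector<int64_t>, std::vector<int64_t>>
continuousTileAndChunkSizes(int64_t tripCount, int64_t targetSize) {
  std::vector<int64_t> tileSizes, chunkSizes;
  int64_t tile = 1;
  while (tile * 2 <= targetSize) // largest power of two <= targetSize
    tile *= 2;
  int64_t remaining = tripCount;
  for (; tile >= 1; tile /= 2) {
    int64_t chunk = (remaining / tile) * tile; // largest multiple of `tile`
    tileSizes.push_back(tile);
    chunkSizes.push_back(chunk);
    remaining -= chunk;
  }
  return {tileSizes, chunkSizes};
}

// E.g. tripCount = 25 and targetSize = 9 would give tile sizes {8, 4, 2, 1}
// and chunk sizes {24, 0, 0, 1}, which together cover the whole dimension.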
@muneebkhan85
Contributor Author

@ftynse I have included a commit that solves the aforementioned issue with dynamic bounds by introducing a zip_shortest attribute and the corresponding functionality in the foreach op. I have also included a test case for dynamic bounds in mlir/test/Dialect/Linalg/continuous-tiling-full.mlir.

If this looks like a viable solution to you, please help with merging this PR. Otherwise, please let me know if you have other suggestions.

Adds `zip_shortest` functionality to `foreach` so that, when it takes
multiple handles of varying lengths, it shrinks all payloads to the
length of the shortest payload instead of failing.
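
A minimal sketch of the zip_shortest behaviour in plain C++ (an assumed illustration, not the actual ForeachOp code):

#include <algorithm>
#include <cstddef>
#include <vector>

// Before iterating several payload lists in lockstep, shrink every list to
// the length of the shortest one so the zip never runs past the end.
template <typename T>
void zipShortest(std::vector<std::vector<T>> &lists) {
  if (lists.empty())
    return;
  std::size_t shortest = lists.front().size();
  for (const auto &list : lists)
    shortest = std::min(shortest, list.size());
  for (auto &list : lists)
    list.resize(shortest);
}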
Introduce a full test case demonstrating the continuous tiling
idea end-to-end.
Member

@ftynse ftynse left a comment

I have a syntactic comment, but I'll land this and let you consider that comment in a separate PR.

let results = (outs Variadic<Transform_AnyHandleOrParamType>:$results);
let regions = (region SizedRegion<1>:$body);
let assemblyFormat =
"$targets `:` type($targets) (`->` type($results)^)? $body attr-dict";
"$targets attr-dict `:` type($targets) (`->` type($results)^)? $body";
Member

How about turning the attribute into a keyword rather than putting the entire dictionary here, which looks rather awkward?

@ftynse ftynse merged commit a9efcbf into llvm:main Jun 21, 2024
6 checks passed

@muneebkhan85 Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested
by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as
the builds can include changes from many authors. It is not uncommon for your
change to be included in a build that fails due to someone else's changes, or
infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself.
This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
ftynse pushed a commit that referenced this pull request Jul 19, 2024
#98492)

This PR addresses a [comment] made by @ftynse about the syntax for
`ForeachOp`. The syntax was modified by @muneebkhan85 in #82792, where
the attribute dictionary was moved to the middle.
This patch moves it back to its original place at the end, and
introduces an optional keyword for `zip_shortest`.

[comment]:
#82792 (review)
yuxuanchen1997 pushed a commit that referenced this pull request Jul 25, 2024
#98492)

pabloantoniom pushed a commit that referenced this pull request Nov 15, 2024
#111171)

Follow-up to a review comment from
#82792 (comment)
as a separate PR:

	E.g.:
	```
	%0:2 = transform.structured.split
	```
	is changed to
	```
	%t = transform.structured.split
	%0:2 = transform.split_handle %t
	```