Skip to content

[mlir] Enable decoupling two kinds of greedy behavior. #104649

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Dec 20, 2024

Conversation

jpienaar
Copy link
Member

@jpienaar jpienaar commented Aug 16, 2024

The greedy rewriter is used in many different flows and it has a lot of convenience (work list management, debugging actions, tracing, etc). But it combines two kinds of greedy behavior 1) how ops are matched, 2) folding wherever it can.

These are independent forms of greedy and leads to inefficiency. E.g., cases where one need to create different phases in lowering and is required to applying patterns in specific order split across different passes. Using the driver one ends up needlessly retrying folding/having multiple rounds of folding attempts, where one final run would have sufficed.

Of course folks can locally avoid this behavior by just building their own, but this is also a common requested feature that folks keep on working around locally in suboptimal ways.

For downstream users, there should be no behavioral change. Updating from the deprecated should just be a find and replace (e.g., find ./ -type f -exec sed -i 's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} \; variety) as the API hasn't changed between the two.

@llvmbot llvmbot added mlir:core MLIR Core Infrastructure mlir labels Aug 16, 2024
@llvmbot
Copy link
Member

llvmbot commented Aug 16, 2024

@llvm/pr-subscribers-flang-fir-hlfir
@llvm/pr-subscribers-mlir-vector
@llvm/pr-subscribers-mlir-neon
@llvm/pr-subscribers-mlir-llvm
@llvm/pr-subscribers-mlir-gpu
@llvm/pr-subscribers-mlir-sve
@llvm/pr-subscribers-mlir-memref
@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-mlir-core

Author: Jacques Pienaar (jpienaar)

Changes

The greedy rewriter is used in many different flows and it has a lot of convenience (work list management, debugging actions, tracing, etc). But it combines two kinds of greedy behavior 1) how ops are matched, 2) folding wherever it can.

These are independent forms of greedy and leads to inefficiency. E.g., cases where one need to create different phases in lowering and is required to applying patterns in specific order split across different passes. Using the driver one ends up needlessly retrying folding/having multiple rounds of folding attempts, where one final run would have sufficed.

Of course folks can locally avoid this behavior by just building their own, but this is also a common requested feature that folks keep on working around locally in suboptimal ways.


Full diff: https://github.com/llvm/llvm-project/pull/104649.diff

2 Files Affected:

  • (modified) mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h (+46-12)
  • (modified) mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp (+11-6)
diff --git a/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h b/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h
index eaff85804f6b3d..061cdd4b7d4d94 100644
--- a/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h
+++ b/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h
@@ -91,6 +91,15 @@ class GreedyRewriteConfig {
 
   /// An optional listener that should be notified about IR modifications.
   RewriterBase::Listener *listener = nullptr;
+
+  // Whether this should fold while greedily rewriting.
+  //
+  // Note: greedy here generally refers to two forms, 1) greedily applying
+  // patterns based purely on benefit and applying without backtracking using
+  // default cost model, 2) greedily folding where possible while attempting to
+  // match and rewrite using the provided patterns. With this option set to
+  // false it only does the former.
+  bool fold = true;
 };
 
 //===----------------------------------------------------------------------===//
@@ -104,8 +113,8 @@ class GreedyRewriteConfig {
 /// The greedy rewrite may prematurely stop after a maximum number of
 /// iterations, which can be configured in the configuration parameter.
 ///
-/// Also performs folding and simple dead-code elimination before attempting to
-/// match any of the provided patterns.
+/// Also performs simple dead-code elimination before attempting to match any of
+/// the provided patterns.
 ///
 /// A region scope can be set in the configuration parameter. By default, the
 /// scope is set to the specified region. Only in-scope ops are added to the
@@ -117,10 +126,18 @@ class GreedyRewriteConfig {
 ///
 /// Note: This method does not apply patterns to the region's parent operation.
 LogicalResult
+applyPatternsGreedily(Region &region, const FrozenRewritePatternSet &patterns,
+                      GreedyRewriteConfig config = GreedyRewriteConfig(),
+                      bool *changed = nullptr);
+/// Same as `applyPatternsAndGreedily` above with folding.
+inline LogicalResult
 applyPatternsAndFoldGreedily(Region &region,
                              const FrozenRewritePatternSet &patterns,
                              GreedyRewriteConfig config = GreedyRewriteConfig(),
-                             bool *changed = nullptr);
+                             bool *changed = nullptr) {
+  config.fold = true;
+  return applyPatternsGreedily(region, patterns, config, changed);
+}
 
 /// Rewrite ops nested under the given operation, which must be isolated from
 /// above, by repeatedly applying the highest benefit patterns in a greedy
@@ -129,8 +146,8 @@ applyPatternsAndFoldGreedily(Region &region,
 /// The greedy rewrite may prematurely stop after a maximum number of
 /// iterations, which can be configured in the configuration parameter.
 ///
-/// Also performs folding and simple dead-code elimination before attempting to
-/// match any of the provided patterns.
+/// Also performs simple dead-code elimination before attempting to match any of
+/// the provided patterns.
 ///
 /// This overload runs a separate greedy rewrite for each region of the
 /// specified op. A region scope can be set in the configuration parameter. By
@@ -147,10 +164,9 @@ applyPatternsAndFoldGreedily(Region &region,
 ///
 /// Note: This method does not apply patterns to the given operation itself.
 inline LogicalResult
-applyPatternsAndFoldGreedily(Operation *op,
-                             const FrozenRewritePatternSet &patterns,
-                             GreedyRewriteConfig config = GreedyRewriteConfig(),
-                             bool *changed = nullptr) {
+applyPatternsGreedily(Operation *op, const FrozenRewritePatternSet &patterns,
+                      GreedyRewriteConfig config = GreedyRewriteConfig(),
+                      bool *changed = nullptr) {
   bool anyRegionChanged = false;
   bool failed = false;
   for (Region &region : op->getRegions()) {
@@ -164,6 +180,15 @@ applyPatternsAndFoldGreedily(Operation *op,
     *changed = anyRegionChanged;
   return failure(failed);
 }
+/// Same as `applyPatternsGreedily` above with folding.
+inline LogicalResult
+applyPatternsAndFoldGreedily(Operation *op,
+                             const FrozenRewritePatternSet &patterns,
+                             GreedyRewriteConfig config = GreedyRewriteConfig(),
+                             bool *changed = nullptr) {
+  config.fold = true;
+  return applyPatternsGreedily(op, patterns, config, changed);
+}
 
 /// Rewrite the specified ops by repeatedly applying the highest benefit
 /// patterns in a greedy worklist driven manner until a fixpoint is reached.
@@ -171,8 +196,8 @@ applyPatternsAndFoldGreedily(Operation *op,
 /// The greedy rewrite may prematurely stop after a maximum number of
 /// iterations, which can be configured in the configuration parameter.
 ///
-/// Also performs folding and simple dead-code elimination before attempting to
-/// match any of the provided patterns.
+/// Also performs simple dead-code elimination before attempting to match any of
+/// the provided patterns.
 ///
 /// Newly created ops and other pre-existing ops that use results of rewritten
 /// ops or supply operands to such ops are also processed, unless such ops are
@@ -194,10 +219,19 @@ applyPatternsAndFoldGreedily(Operation *op,
 /// the IR was modified at all. `allOpsErased` is set to "true" if all ops in
 /// `ops` were erased.
 LogicalResult
+applyOpPatternsGreedily(ArrayRef<Operation *> ops,
+                        const FrozenRewritePatternSet &patterns,
+                        GreedyRewriteConfig config = GreedyRewriteConfig(),
+                        bool *changed = nullptr, bool *allErased = nullptr);
+/// Same as `applyOpPatternsGreedily` with folding.
+inline LogicalResult
 applyOpPatternsAndFold(ArrayRef<Operation *> ops,
                        const FrozenRewritePatternSet &patterns,
                        GreedyRewriteConfig config = GreedyRewriteConfig(),
-                       bool *changed = nullptr, bool *allErased = nullptr);
+                       bool *changed = nullptr, bool *allErased = nullptr) {
+  config.fold = true;
+  return applyOpPatternsGreedily(ops, patterns, config, changed, allErased);
+}
 
 } // namespace mlir
 
diff --git a/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp b/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
index e0d0acd122e26b..4e8b74620da5fe 100644
--- a/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
+++ b/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
@@ -6,7 +6,7 @@
 //
 //===----------------------------------------------------------------------===//
 //
-// This file implements mlir::applyPatternsAndFoldGreedily.
+// This file implements mlir::applyPatternsGreedily.
 //
 //===----------------------------------------------------------------------===//
 
@@ -488,7 +488,7 @@ bool GreedyPatternRewriteDriver::processWorklist() {
     // infinite folding loop, as every constant op would be folded to an
     // Attribute and then immediately be rematerialized as a constant op, which
     // is then put on the worklist.
-    if (!op->hasTrait<OpTrait::ConstantLike>()) {
+    if (config.fold && !op->hasTrait<OpTrait::ConstantLike>()) {
       SmallVector<OpFoldResult> foldResults;
       if (succeeded(op->fold(foldResults))) {
         LLVM_DEBUG(logResultWithLine("success", "operation was folded"));
@@ -840,6 +840,11 @@ LogicalResult RegionPatternRewriteDriver::simplify(bool *changed) && {
     // regions to enable more aggressive CSE'ing).
     OperationFolder folder(ctx, this);
     auto insertKnownConstant = [&](Operation *op) {
+      // This hoisting is to enable more folding, so skip checking if known
+      // constant, updating dense map etc if not doing folding.
+      if (!config.fold)
+        return false;
+
       // Check for existing constants when populating the worklist. This avoids
       // accidentally reversing the constant order during processing.
       Attribute constValue;
@@ -894,9 +899,9 @@ LogicalResult RegionPatternRewriteDriver::simplify(bool *changed) && {
 }
 
 LogicalResult
-mlir::applyPatternsAndFoldGreedily(Region &region,
-                                   const FrozenRewritePatternSet &patterns,
-                                   GreedyRewriteConfig config, bool *changed) {
+mlir::applyPatternsGreedily(Region &region,
+                            const FrozenRewritePatternSet &patterns,
+                            GreedyRewriteConfig config, bool *changed) {
   // The top-level operation must be known to be isolated from above to
   // prevent performing canonicalizations on operations defined at or above
   // the region containing 'op'.
@@ -1012,7 +1017,7 @@ static Region *findCommonAncestor(ArrayRef<Operation *> ops) {
   return region;
 }
 
-LogicalResult mlir::applyOpPatternsAndFold(
+LogicalResult mlir::applyOpPatternsGreedily(
     ArrayRef<Operation *> ops, const FrozenRewritePatternSet &patterns,
     GreedyRewriteConfig config, bool *changed, bool *allErased) {
   if (ops.empty()) {

@llvmbot
Copy link
Member

llvmbot commented Aug 16, 2024

@llvm/pr-subscribers-mlir

Author: Jacques Pienaar (jpienaar)

Changes

The greedy rewriter is used in many different flows and it has a lot of convenience (work list management, debugging actions, tracing, etc). But it combines two kinds of greedy behavior 1) how ops are matched, 2) folding wherever it can.

These are independent forms of greedy and leads to inefficiency. E.g., cases where one need to create different phases in lowering and is required to applying patterns in specific order split across different passes. Using the driver one ends up needlessly retrying folding/having multiple rounds of folding attempts, where one final run would have sufficed.

Of course folks can locally avoid this behavior by just building their own, but this is also a common requested feature that folks keep on working around locally in suboptimal ways.


Full diff: https://github.com/llvm/llvm-project/pull/104649.diff

2 Files Affected:

  • (modified) mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h (+46-12)
  • (modified) mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp (+11-6)
diff --git a/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h b/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h
index eaff85804f6b3d..061cdd4b7d4d94 100644
--- a/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h
+++ b/mlir/include/mlir/Transforms/GreedyPatternRewriteDriver.h
@@ -91,6 +91,15 @@ class GreedyRewriteConfig {
 
   /// An optional listener that should be notified about IR modifications.
   RewriterBase::Listener *listener = nullptr;
+
+  // Whether this should fold while greedily rewriting.
+  //
+  // Note: greedy here generally refers to two forms, 1) greedily applying
+  // patterns based purely on benefit and applying without backtracking using
+  // default cost model, 2) greedily folding where possible while attempting to
+  // match and rewrite using the provided patterns. With this option set to
+  // false it only does the former.
+  bool fold = true;
 };
 
 //===----------------------------------------------------------------------===//
@@ -104,8 +113,8 @@ class GreedyRewriteConfig {
 /// The greedy rewrite may prematurely stop after a maximum number of
 /// iterations, which can be configured in the configuration parameter.
 ///
-/// Also performs folding and simple dead-code elimination before attempting to
-/// match any of the provided patterns.
+/// Also performs simple dead-code elimination before attempting to match any of
+/// the provided patterns.
 ///
 /// A region scope can be set in the configuration parameter. By default, the
 /// scope is set to the specified region. Only in-scope ops are added to the
@@ -117,10 +126,18 @@ class GreedyRewriteConfig {
 ///
 /// Note: This method does not apply patterns to the region's parent operation.
 LogicalResult
+applyPatternsGreedily(Region &region, const FrozenRewritePatternSet &patterns,
+                      GreedyRewriteConfig config = GreedyRewriteConfig(),
+                      bool *changed = nullptr);
+/// Same as `applyPatternsAndGreedily` above with folding.
+inline LogicalResult
 applyPatternsAndFoldGreedily(Region &region,
                              const FrozenRewritePatternSet &patterns,
                              GreedyRewriteConfig config = GreedyRewriteConfig(),
-                             bool *changed = nullptr);
+                             bool *changed = nullptr) {
+  config.fold = true;
+  return applyPatternsGreedily(region, patterns, config, changed);
+}
 
 /// Rewrite ops nested under the given operation, which must be isolated from
 /// above, by repeatedly applying the highest benefit patterns in a greedy
@@ -129,8 +146,8 @@ applyPatternsAndFoldGreedily(Region &region,
 /// The greedy rewrite may prematurely stop after a maximum number of
 /// iterations, which can be configured in the configuration parameter.
 ///
-/// Also performs folding and simple dead-code elimination before attempting to
-/// match any of the provided patterns.
+/// Also performs simple dead-code elimination before attempting to match any of
+/// the provided patterns.
 ///
 /// This overload runs a separate greedy rewrite for each region of the
 /// specified op. A region scope can be set in the configuration parameter. By
@@ -147,10 +164,9 @@ applyPatternsAndFoldGreedily(Region &region,
 ///
 /// Note: This method does not apply patterns to the given operation itself.
 inline LogicalResult
-applyPatternsAndFoldGreedily(Operation *op,
-                             const FrozenRewritePatternSet &patterns,
-                             GreedyRewriteConfig config = GreedyRewriteConfig(),
-                             bool *changed = nullptr) {
+applyPatternsGreedily(Operation *op, const FrozenRewritePatternSet &patterns,
+                      GreedyRewriteConfig config = GreedyRewriteConfig(),
+                      bool *changed = nullptr) {
   bool anyRegionChanged = false;
   bool failed = false;
   for (Region &region : op->getRegions()) {
@@ -164,6 +180,15 @@ applyPatternsAndFoldGreedily(Operation *op,
     *changed = anyRegionChanged;
   return failure(failed);
 }
+/// Same as `applyPatternsGreedily` above with folding.
+inline LogicalResult
+applyPatternsAndFoldGreedily(Operation *op,
+                             const FrozenRewritePatternSet &patterns,
+                             GreedyRewriteConfig config = GreedyRewriteConfig(),
+                             bool *changed = nullptr) {
+  config.fold = true;
+  return applyPatternsGreedily(op, patterns, config, changed);
+}
 
 /// Rewrite the specified ops by repeatedly applying the highest benefit
 /// patterns in a greedy worklist driven manner until a fixpoint is reached.
@@ -171,8 +196,8 @@ applyPatternsAndFoldGreedily(Operation *op,
 /// The greedy rewrite may prematurely stop after a maximum number of
 /// iterations, which can be configured in the configuration parameter.
 ///
-/// Also performs folding and simple dead-code elimination before attempting to
-/// match any of the provided patterns.
+/// Also performs simple dead-code elimination before attempting to match any of
+/// the provided patterns.
 ///
 /// Newly created ops and other pre-existing ops that use results of rewritten
 /// ops or supply operands to such ops are also processed, unless such ops are
@@ -194,10 +219,19 @@ applyPatternsAndFoldGreedily(Operation *op,
 /// the IR was modified at all. `allOpsErased` is set to "true" if all ops in
 /// `ops` were erased.
 LogicalResult
+applyOpPatternsGreedily(ArrayRef<Operation *> ops,
+                        const FrozenRewritePatternSet &patterns,
+                        GreedyRewriteConfig config = GreedyRewriteConfig(),
+                        bool *changed = nullptr, bool *allErased = nullptr);
+/// Same as `applyOpPatternsGreedily` with folding.
+inline LogicalResult
 applyOpPatternsAndFold(ArrayRef<Operation *> ops,
                        const FrozenRewritePatternSet &patterns,
                        GreedyRewriteConfig config = GreedyRewriteConfig(),
-                       bool *changed = nullptr, bool *allErased = nullptr);
+                       bool *changed = nullptr, bool *allErased = nullptr) {
+  config.fold = true;
+  return applyOpPatternsGreedily(ops, patterns, config, changed, allErased);
+}
 
 } // namespace mlir
 
diff --git a/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp b/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
index e0d0acd122e26b..4e8b74620da5fe 100644
--- a/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
+++ b/mlir/lib/Transforms/Utils/GreedyPatternRewriteDriver.cpp
@@ -6,7 +6,7 @@
 //
 //===----------------------------------------------------------------------===//
 //
-// This file implements mlir::applyPatternsAndFoldGreedily.
+// This file implements mlir::applyPatternsGreedily.
 //
 //===----------------------------------------------------------------------===//
 
@@ -488,7 +488,7 @@ bool GreedyPatternRewriteDriver::processWorklist() {
     // infinite folding loop, as every constant op would be folded to an
     // Attribute and then immediately be rematerialized as a constant op, which
     // is then put on the worklist.
-    if (!op->hasTrait<OpTrait::ConstantLike>()) {
+    if (config.fold && !op->hasTrait<OpTrait::ConstantLike>()) {
       SmallVector<OpFoldResult> foldResults;
       if (succeeded(op->fold(foldResults))) {
         LLVM_DEBUG(logResultWithLine("success", "operation was folded"));
@@ -840,6 +840,11 @@ LogicalResult RegionPatternRewriteDriver::simplify(bool *changed) && {
     // regions to enable more aggressive CSE'ing).
     OperationFolder folder(ctx, this);
     auto insertKnownConstant = [&](Operation *op) {
+      // This hoisting is to enable more folding, so skip checking if known
+      // constant, updating dense map etc if not doing folding.
+      if (!config.fold)
+        return false;
+
       // Check for existing constants when populating the worklist. This avoids
       // accidentally reversing the constant order during processing.
       Attribute constValue;
@@ -894,9 +899,9 @@ LogicalResult RegionPatternRewriteDriver::simplify(bool *changed) && {
 }
 
 LogicalResult
-mlir::applyPatternsAndFoldGreedily(Region &region,
-                                   const FrozenRewritePatternSet &patterns,
-                                   GreedyRewriteConfig config, bool *changed) {
+mlir::applyPatternsGreedily(Region &region,
+                            const FrozenRewritePatternSet &patterns,
+                            GreedyRewriteConfig config, bool *changed) {
   // The top-level operation must be known to be isolated from above to
   // prevent performing canonicalizations on operations defined at or above
   // the region containing 'op'.
@@ -1012,7 +1017,7 @@ static Region *findCommonAncestor(ArrayRef<Operation *> ops) {
   return region;
 }
 
-LogicalResult mlir::applyOpPatternsAndFold(
+LogicalResult mlir::applyOpPatternsGreedily(
     ArrayRef<Operation *> ops, const FrozenRewritePatternSet &patterns,
     GreedyRewriteConfig config, bool *changed, bool *allErased) {
   if (ops.empty()) {

@joker-eph
Copy link
Collaborator

cases where one need to create different phases in lowering and is required to applying patterns in specific order split across different passes

I don’t understand what this means, or how it relates to greedy, or what is the actual problem to solve right now…

this is also a common requested feature that folks keep on working around locally in suboptimal ways.

I am slightly concerned that this request comes from some wrong design or incorrect root causing.
(The request isn’t common enough that I faced it by the way)

@jpienaar
Copy link
Member Author

I don’t understand what this means, or how it relates to greedy, or what is the actual problem to solve right now…

If one has a set of patterns that have to all be applied before another set of patterns, then one needs to split it into multiple passes today (the priorities are not sufficient depending on iteration order). Now one ends up with trying to do the same folding multiple times needlessly vs if you just did it in one. This enables using different sets of patterns relatively cheaply, rather than all checks for folding being wasted until last/if ever. A lot of interface and dialect fallback hook querying until then, all wasted: 38% of total execution time spent in mlir::detail::constant_op_binder::match, of that significant time in hasTrait called from there in the profile I'm looking at. I don't think there is any folding even happening in the end, just a lot of attempts made (e.g., its not expensive folders being an issue, just actually querying values to see if they are constant and a folder can be used is).

@jpienaar
Copy link
Member Author

To make it more concrete:

All passes/pipelines combined 140s
applyPatternsAndFoldGreedily 105s

  • fold 93s
    • constant matching 63s (there are some additional constant matching outside folding, < 0.5s)
      • hasTrait related to 57s

So ~50% of time is spent just checking if an operation is a constant (don't have all the line info to see time spent in repopulating constant hash map etc). The only ways to reduce overhead at the moment is to combine into larger patternsets (only sometimes possible) or not using the rewrite driver (which means you either lose the worklist management & nicer debug/tracing facilities or just create duplication).

Copy link
Member

@matthias-springer matthias-springer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to make the driver more customizable, esp. if it can help with performance. By specifying patterns, users have control over which ops will be processed. But it is currently not possible to control which ops will be visited for folding purposes, so all ops are visited.

@@ -91,6 +91,15 @@ class GreedyRewriteConfig {

/// An optional listener that should be notified about IR modifications.
RewriterBase::Listener *listener = nullptr;

// Whether this should fold while greedily rewriting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: tripple ///

@@ -840,6 +840,11 @@ LogicalResult RegionPatternRewriteDriver::simplify(bool *changed) && {
// regions to enable more aggressive CSE'ing).
OperationFolder folder(ctx, this);
auto insertKnownConstant = [&](Operation *op) {
// This hoisting is to enable more folding, so skip checking if known
// constant, updating dense map etc if not doing folding.
if (!config.fold)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

insertKnownConstant is not really doing "folding": it does not call the folder. What it does is CSE'ing constants. So I would go as far as adding a second config flag: cseConstants (see #89552). Alternatively, update the GreedyRewriteConfig::fold documentation and mention that setting it to "false" also disables the CSE'ing of constants.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite concern with the accumulation of flags: this seems to me like working around problems instead of addressing them, and adding complexity that I don't see manageable on the long term.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What problem do you think is being worked around?

For me I see this as I'm a user that wants a worklist driven driver, I have a curated pipeline where I know there isn't any folding here (e.g., the ops being created have no folding, its part of pipeline where folding is ~0% likely, so the ROI on checking if it can be folded is none), but they have no way of communicating that intent. The problem is expert users not being able to express intent without crafting own/local forking.

Now, constant CSE'ing as Matthias correctly points out is different. It is documented here as enabling more folding, so disabled too (this is cheaper here too as all constant hoisting would have happened earlier and there may be no additional chance here to produce additional constants that would need CSE'ing). But it is different so I could see controlling separately is good. And don't mind doing that. I think this is a framework for high performance, so reasonably defaults that can be tuned seems reasonable to me. I considered the maintenance cost but specialized drivers that share most of the same code seemed higher maintenance cost to me.

@River707
Copy link
Contributor

River707 commented Aug 28, 2024

To make it more concrete:

All passes/pipelines combined 140s applyPatternsAndFoldGreedily 105s

  • fold 93s

    • constant matching 63s (there are some additional constant matching outside folding, < 0.5s)

      • hasTrait related to 57s

So ~50% of time is spent just checking if an operation is a constant (don't have all the line info to see time spent in repopulating constant hash map etc). The only ways to reduce overhead at the moment is to combine into larger patternsets (only sometimes possible) or not using the rewrite driver (which means you either lose the worklist management & nicer debug/tracing facilities or just create duplication).

Is the hasTrait purely the ConstantLike trait check? or cumulative checking for the various operands/etc.? Feels like there is a lot of low hanging fruit here for optimization, even outside of decoupling folding. Also, how many iterations per driver run are you seeing? I've kind of always felt that the multiple iteration stuff was a big hammer and not optimal in many ways.

Copy link
Collaborator

@joker-eph joker-eph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There aren't any tests.

@jpienaar
Copy link
Member Author

jpienaar commented Sep 2, 2024

Is the hasTrait purely the ConstantLike trait check? or cumulative checking for the various operands/etc.? Feels like there is a lot of low hanging fruit here for optimization, even outside of decoupling folding.

Its cumulative. This is across all pass executions for the pipeline, so there are multiple greedy pattern applications along with other passes/walks. There 3 iterations at most for any of them (2 being minimum if mutation is made). Now, this is the entire pipeline so at the start of pipeline there may be some cleanup value. But none of the dialects matched/produced implement folders, so in this pipeline beyond the first pass, making checking for constants cheaper just makes the wasted time lower (it doesn't really have a benefit, disabling it improves compile time significantly with no negative performance impact).

@jpienaar
Copy link
Member Author

There aren't any tests.

Fixed.

// NOCSE-DAG: arith.constant 2 : i32
// NOCSE-DAG: arith.constant 1 : i32
// NOCSE-DAG: arith.constant 2 : i32
// NOCSE-DAG: arith.constant 3 : i32
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really the behavior we want here? Seems like we still have some constant folding (more constants are introduced) but they aren't deduped!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused for second too :) This is the no CSE case. So we do have folding, just not constant depupe. This is flag Matthias suggested, so these two features are independent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for fold says:

 /// Whether this should fold while greedily rewriting. This also disables
   /// CSE'ing constants.
   bool fold = true;

Why aren't we keeping these independent then? Disabling Fold shouldn't disable CSE if the CSE option is true...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of date comment, sorry. It is/should be independent. I'll double check now.

Copy link
Member

@kuhar kuhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we separate the new API (applyPatternsGreedily), switching the in-tree pases to use it, and deprecation into it into a few PRs?

I'm worried that landing all of these in one step can results in reverts that we can avoid with a gradual rollout... Especially with lots of folks disappearing for vacation and potentially not being around to investigate breakages in their areas of interest.

@MaheshRavishankar
Copy link
Contributor

Should we separate the new API (applyPatternsGreedily), switching the in-tree pases to use it, and deprecation into it into a few PRs?

I'm worried that landing all of these in one step can results in reverts that we can avoid with a gradual rollout... Especially with lots of folks disappearing for vacation and potentially not being around to investigate breakages in their areas of interest.

Won't the fold being set to true by default make this an NFC change?

@kuhar
Copy link
Member

kuhar commented Dec 19, 2024

Won't the fold being set to true by default make this an NFC change?

What tripped me is that in applyPatternsAndFoldGreedily this PR explicitly sets it to true, making me think the default must be false:

  config.fold = true;
  return applyPatternsGreedily(region, patterns, config, changed);
}

@jpienaar could you add a tl;dr section to the PR description that lists the API-level changes and suggests how to integrate this in downstream projects? The motivation is very clear, but what to do with the PR not necessarily.

Copy link
Contributor

@aartbik aartbik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay for sparse (assuming fold remains the default, since all tests pass... ;-)

@joker-eph
Copy link
Collaborator

What tripped me is that in applyPatternsAndFoldGreedily this PR explicitly sets it to true, making me think the default must be false:

No this does not change the default: it is because this API has "Fold" in the name that we override the flag to ensure preserving the existing behavior of this API. Having this override or not is NFC in the current condition though, it seems like mostly "defensive programming" here.

Folks having the deprecated message should not have anything else to do than a mass replace of applyPatternsAndFoldGreedily to applyPatternsGreedily (all the adjustments upstream in this diff are demonstrating this I believe).

@jpienaar
Copy link
Member Author

Yes this should be a NFC change. And yes just defense (the default is to fold, but the current API with fold in the name does take an option and so someone could have disabled folding in the config, called function that says it folds and then be surprised - I mean they really shouldn't :)). Yes I updated by simply doing find and replace textually. With current default nothing else needed. Let me update description to capture this.

The greedy rewriter is used in many different flows and it has a lot of
convenience (work list management, debugging actions, tracing, etc). But
it combines two kinds of greedy behavior 1) wrt how ops are matched, 2)
folding wherever it can.

These are independent forms of greedy and leads to inefficiency. E.g.,
cases where one need to create different phases in lowering, one is
required to applying patterns in specific order/different passes. But if
using the driver one ends up needlessly retrying folding or having
multiple rounds of folding attempts, where one final run would have
sufficed. It also is rather confusing to users that just want to apply
some patterns while having all the convenience and structure to have
unrelated changes to IR.

Of course folks can locally avoid this behavior by just building their
own, but this is also a common requested feature that folks keep on
working around locally in suboptimal ways.
This partially incorporates llvm#89552. I haven't exposed it on
canonicalizer pass as that could be distinct discussion.
@llvmbot llvmbot added flang Flang issues not falling into any other category flang:fir-hlfir labels Dec 20, 2024
Copy link

github-actions bot commented Dec 20, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@jpienaar jpienaar merged commit 09dfc57 into llvm:main Dec 20, 2024
4 of 6 checks passed
@kazutakahirata
Copy link
Contributor

With this patch, I am getting:

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp:535:13: error: 'applyPatternsAndF
oldGreedily' is deprecated: Use applyPatternsGreedily() instead [-Werror,-Wdeprecated-declarations]                                      
            applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))                                                          
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~                                                                                                 
            applyPatternsGreedily

Is there any way you could take a look at this? Thanks!

@MacDue
Copy link
Member

MacDue commented Dec 20, 2024

As per the commit message, you should just replace call:

For downstream users, there should be no behavioral change. Updating from the deprecated should just be a find and replace (e.g., find ./ -type f -exec sed -i 's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} ; variety) as the API hasn't changed between the two.

@kazutakahirata
Copy link
Contributor

As per the commit message, you should just replace call:

For downstream users, there should be no behavioral change. Updating from the deprecated should just be a find and replace (e.g., find ./ -type f -exec sed -i 's|applyPatternsAndFoldGreedily|applyPatternsGreedily|g' {} ; variety) as the API hasn't changed between the two.

Thanks! I've checked in 9901906.

paul0403 added a commit to PennyLaneAI/catalyst that referenced this pull request Jun 3, 2025
**Context:**
We update the llvm version tagged by jax 0.6.0:
```
mhlo=617a9361d186199480c080c9e8c474a5e30c22d1
llvm=179d30f8c3fddd3c85056fd2b8e877a4a8513158
```

We also update Enzyme to the latest version, which is 0.0.180, at commit
`db0181320d6e425ee963bd496ed0d8dbb615be18`


**Description of the Change:**

Firstly, jax recently moved from the Google github organization to its
own jax-ml organization.
This means the urls, and the retrieval method for the underlying llvm
and mhlo git commit tags, needs to be updated. (Thanks @mehrdad2m !)

Now on to the actual changes. I will list the changes in increasing
complexity.
1. The new enzyme cmake target is `EnzymeStatic-21` (from 20)

2. Enzyme works with a later llvm then our target, so it has some llvm
intrinsics unknown to the one we are targeting. We patch them away. They
do not concern us since they are all intrinsics for nvidia backends.

3. `applyPatternsAndFoldGreedily` is removed. Drop-in replacement is
`applyPatternsGreedily`.
llvm/llvm-project#104649,
llvm/llvm-project#126701

4. ops with `CallOpInterface` must have two new optional attributes
`arg_attrs` and `res_attrs`
llvm/llvm-project#123176

5. `CallInterfaceCallable` objects now must be directly casted to the
callee `SymbolRefAttr`, i.e. `callee.get<SymbolRefAttr>()` ->
`cast<SymbolRefAttr>(callee)`
llvm/llvm-project@35e8989

6. The `lookupOrCreateFn` family of functions now return
`FailureOr<funcop>` instead of just `funcop`, so a `.value()` needs to
be used to retrieve the underlying `funcop`.
llvm/llvm-project@e84f6b6

7. The cpp api for `OneShotBufferizePassOptions` no longer needs
complicated lambdas for the type converter options. They can be set with
the `mlir::bufferization::LayoutMapOption::IdentityLayoutMap` options
directly.

8. The individual `match` and `rewrite` methods in pattern rewrites are
removed. Use the two-in-one `matchAndRewrite` instead.
llvm/llvm-project#129861

9. For rewrite patterns with 1-to-N convertions, a new `macthAndRewrite`
overload with `OneToNOpAdaptor` must be used. For us, this is only the
`catalyst.list*` ops. llvm/llvm-project#116470

10. The lowering of `cf::AssertOp` to llvm was split from the
overall`--covert-cf-to-llvm` pass. We need to manually call this
separate pattern for cf.assert duriing quantum to llvm dialect lowering,
where we also convert cf to llvm.
https://github.com/llvm/llvm-project/pull/120431/files 

11. The new mhlo depends on a
[shardy](https://github.com/openxla/shardy) dialect. Shardy is built
with bazel, not cmake. Building shardy ourselves would be very difficult
(not having bazel in our build ecosystem is a hard constraint, cc @mlxd
), and also not necessary (we just use mhlo for their "standard"
passes). We thus patch out all shardy components.

12. Three necessary passes were removed in mhlo:
`mhlo-legalize-control-flow`, `mhlo-legalize-to-std`,
`hlo-legalize-sort`
tensorflow/mlir-hlo@4a640be#diff-ef0d7e30da19a396ba036405a9ef636f8b1be194618b0a90f4602671fc2ef34d

tensorflow/mlir-hlo@2a5e267#diff-f8c7cb07b43593403e00e0dbf9983f0186b4eb70368cc99af3b924061f1ea46f
- Alongside the removal of `mhlo-legalize-to-std`, the cmake target
`MhloToStandard` was removed too.
  We simply patch them back for now. 
  
**For the above two points, note that there will be an overall migration
to the stablehlo repo, as mhlo is sunseting.
Therefore, spending too much time on this isn't necessary, so we just
patch.**

13. The new pattern applicator (`applyPatternsGreedily`) is more
aggressive in dead code elimination,
and is eliminating dead `Value`s in the adjoint gradient method.
The `nodealloc` function we generate for adjoint gradient lowering used
to only return the qreg, not the expval result. This causes the expval
op to be eliminated since it has no users.
This further causes wrong gradient results, since the entire program,
all ops included (regardless
of dead or not), impacts the gradient through chain rule.
To avoid this, we return the expval result as well.
In doing this, we implicitly assume that differentiated qnodes can only
return expval.
Although this assumption is true and also restricted by frontend,
ideally we should not have it hard coded.
We leave this as a TODO for a future feature.

14. The old `--buffer-deallocation` pass is removed. Intended
replacement is `--buffer-deallocation-pipeline`.
This migration is very complicated. We simply add back the old buffer
deallocation pass in the catalyst dialect as a util for now.
We will revisit this in #1778 . 


mlir lit test updates:
1. `bufferization.to_tensor/memref` updated assembly format
2. gradient adjoint lowering test returns both qreg and expval
3. Some inverse unrealized conversion cast pairs are canceled by the new
pattern rewriter.
4. `llvm.mlir.undef` is deprecated, use `llvm.mlir.poison` instead.
llvm/llvm-project#125629


**Benefits:**
Up to date with upstream versions.


[sc-92017]

---------

Co-authored-by: Tzung-Han Juang <tzunghan.juang@gmail.com>
Co-authored-by: Ritu Thombre <42207923+ritu-thombre99@users.noreply.github.com>
Co-authored-by: Mehrdad Malekmohammadi <mehrdad.malek@xanadu.ai>
Co-authored-by: Mehrdad Malek <39844030+mehrdad2m@users.noreply.github.com>
Co-authored-by: David Ittah <dime10@users.noreply.github.com>
Co-authored-by: Joey Carter <joseph.carter@xanadu.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants