[flang][OpenMP] Map simple `do concurrent` loops to OpenMP host constructs #127633

ergawy · 2025-02-18T14:00:14Z

Upstreams one more part of the ROCm do concurrent to OpenMP mapping pass. This PR add support for converting simple loops to the equivalent OpenMP constructs on the host: omp parallel do. Towards that end, we have to collect more information about loop nests for which we add new utils in the looputils name space.

PR stack:

llvmbot · 2025-02-18T14:00:59Z

@llvm/pr-subscribers-flang-fir-hlfir

Author: Kareem Ergawy (ergawy)

Changes

Upstreams one more part of the ROCm do concurrent to OpenMP mapping pass. This PR add support for converting simple loops to the equivalent OpenMP constructs on the host: omp parallel do. Towards that end, we have to collect more information about loop nests for which we add new utils in the looputils name space.

Patch is 21.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/127633.diff

6 Files Affected:

(modified) flang/docs/DoConcurrentConversionToOpenMP.md (+47)
(modified) flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp (+246-10)
(modified) flang/test/Transforms/DoConcurrent/basic_host.f90 (+5-9)
(added) flang/test/Transforms/DoConcurrent/basic_host.mlir (+62)
(added) flang/test/Transforms/DoConcurrent/non_const_bounds.f90 (+45)
(added) flang/test/Transforms/DoConcurrent/not_perfectly_nested.f90 (+45)

diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index de2525dd8b57d..914ace0813f0e 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -126,6 +126,53 @@ see the "Data environment" section below.
 See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
 of what is and is not detected as a perfect loop nest.
 
+### Single-range loops
+
+Given the following loop:
+```fortran
+  do concurrent(i=1:n)
+    a(i) = i * i
+  end do
+```
+
+#### Mapping to `host`
+
+Mapping this loop to the `host`, generates MLIR operations of the following
+structure:
+
+```
+%4 = fir.address_of(@_QFEa) ...
+%6:2 = hlfir.declare %4 ...
+
+omp.parallel {
+  // Allocate private copy for `i`.
+  // TODO Use delayed privatization.
+  %19 = fir.alloca i32 {bindc_name = "i"}
+  %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
+
+  omp.wsloop {
+    omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
+      %23 = fir.convert %arg0 : (index) -> i32
+      // Use the privatized version of `i`.
+      fir.store %23 to %20#1 : !fir.ref<i32>
+      ...
+
+      // Use "shared" SSA value of `a`.
+      %42 = hlfir.designate %6#0
+      hlfir.assign %35 to %42
+      ...
+      omp.yield
+    }
+    omp.terminator
+  }
+  omp.terminator
+}
+```
+
+#### Mapping to `device`
+
+<!-- TODO -->
+
 <!--
 More details about current status will be added along with relevant parts of the
 implementation in later upstreaming patches.
diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
index 19048a03e254e..dc797877ac87b 100644
--- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -11,6 +11,7 @@
 #include "flang/Optimizer/OpenMP/Utils.h"
 #include "mlir/Analysis/SliceAnalysis.h"
 #include "mlir/Dialect/OpenMP/OpenMPDialect.h"
+#include "mlir/IR/IRMapping.h"
 #include "mlir/Transforms/DialectConversion.h"
 #include "mlir/Transforms/RegionUtils.h"
 
@@ -24,8 +25,126 @@ namespace flangomp {
 
 namespace {
 namespace looputils {
-using LoopNest = llvm::SetVector<fir::DoLoopOp>;
+/// Stores info needed about the induction/iteration variable for each `do
+/// concurrent` in a loop nest. This includes:
+/// * the operation allocating memory for iteration variable,
+/// * the operation(s) updating the iteration variable with the current
+///   iteration number.
+struct InductionVariableInfo {
+  mlir::Operation *iterVarMemDef;
+  llvm::SetVector<mlir::Operation *> indVarUpdateOps;
+};
+
+using LoopNestToIndVarMap =
+    llvm::MapVector<fir::DoLoopOp, InductionVariableInfo>;
+
+/// Given an operation `op`, this returns true if one of `op`'s operands is
+/// "ultimately" the loop's induction variable. This helps in cases where the
+/// induction variable's use is "hidden" behind a convert/cast.
+///
+/// For example, give the following loop:
+/// ```
+///   fir.do_loop %ind_var = %lb to %ub step %s unordered {
+///     %ind_var_conv = fir.convert %ind_var : (index) -> i32
+///     fir.store %ind_var_conv to %i#1 : !fir.ref<i32>
+///     ...
+///   }
+/// ```
+///
+/// If \p op is the `fir.store` operation, then this function will return true
+/// since the IV is the "ultimate" opeerand to the `fir.store` op through the
+/// `%ind_var_conv` -> `%ind_var` conversion sequence.
+///
+/// For why this is useful, see its use in `findLoopIndVarMemDecl`.
+bool isIndVarUltimateOperand(mlir::Operation *op, fir::DoLoopOp doLoop) {
+  while (op != nullptr && op->getNumOperands() > 0) {
+    auto ivIt = llvm::find_if(op->getOperands(), [&](mlir::Value operand) {
+      return operand == doLoop.getInductionVar();
+    });
+
+    if (ivIt != op->getOperands().end())
+      return true;
+
+    op = op->getOperand(0).getDefiningOp();
+  }
+
+  return false;
+}
+
+/// For the \p doLoop parameter, find the operation that declares its iteration
+/// variable or allocates memory for it.
+///
+/// For example, give the following loop:
+/// ```
+///   ...
+///   %i:2 = hlfir.declare %0 {uniq_name = "_QFEi"} : ...
+///   ...
+///   fir.do_loop %ind_var = %lb to %ub step %s unordered {
+///     %ind_var_conv = fir.convert %ind_var : (index) -> i32
+///     fir.store %ind_var_conv to %i#1 : !fir.ref<i32>
+///     ...
+///   }
+/// ```
+///
+/// This function returns the `hlfir.declare` op for `%i`.
+mlir::Operation *findLoopIterationVarMemDecl(fir::DoLoopOp doLoop) {
+  mlir::Value result = nullptr;
+  mlir::visitUsedValuesDefinedAbove(
+      doLoop.getRegion(), [&](mlir::OpOperand *operand) {
+        if (result)
+          return;
+
+        if (isIndVarUltimateOperand(operand->getOwner(), doLoop)) {
+          assert(result == nullptr &&
+                 "loop can have only one induction variable");
+          result = operand->get();
+        }
+      });
+
+  assert(result != nullptr && result.getDefiningOp() != nullptr);
+  return result.getDefiningOp();
+}
 
+/// Collects the op(s) responsible for updating a loop's iteration variable with
+/// the current iteration number. For example, for the input IR:
+/// ```
+/// %i = fir.alloca i32 {bindc_name = "i"}
+/// %i_decl:2 = hlfir.declare %i ...
+/// ...
+/// fir.do_loop %i_iv = %lb to %ub step %step unordered {
+///   %1 = fir.convert %i_iv : (index) -> i32
+///   fir.store %1 to %i_decl#1 : !fir.ref<i32>
+///   ...
+/// }
+/// ```
+/// this function would return the first 2 ops in the `fir.do_loop`'s region.
+llvm::SetVector<mlir::Operation *>
+extractIndVarUpdateOps(fir::DoLoopOp doLoop) {
+  mlir::Value indVar = doLoop.getInductionVar();
+  llvm::SetVector<mlir::Operation *> indVarUpdateOps;
+
+  llvm::SmallVector<mlir::Value> toProcess;
+  toProcess.push_back(indVar);
+
+  llvm::DenseSet<mlir::Value> done;
+
+  while (!toProcess.empty()) {
+    mlir::Value val = toProcess.back();
+    toProcess.pop_back();
+
+    if (!done.insert(val).second)
+      continue;
+
+    for (mlir::Operation *user : val.getUsers()) {
+      indVarUpdateOps.insert(user);
+
+      for (mlir::Value result : user->getResults())
+        toProcess.push_back(result);
+    }
+  }
+
+  return std::move(indVarUpdateOps);
+}
 /// Loop \p innerLoop is considered perfectly-nested inside \p outerLoop iff
 /// there are no operations in \p outerloop's body other than:
 ///
@@ -93,11 +212,16 @@ bool isPerfectlyNested(fir::DoLoopOp outerLoop, fir::DoLoopOp innerLoop) {
 /// recognize a certain nested loop as part of the nest it just returns the
 /// parent loops it discovered before.
 mlir::LogicalResult collectLoopNest(fir::DoLoopOp currentLoop,
-                                    LoopNest &loopNest) {
+                                    LoopNestToIndVarMap &loopNest) {
   assert(currentLoop.getUnordered());
 
   while (true) {
-    loopNest.insert(currentLoop);
+    loopNest.try_emplace(
+        currentLoop,
+        InductionVariableInfo{
+            findLoopIterationVarMemDecl(currentLoop),
+            std::move(looputils::extractIndVarUpdateOps(currentLoop))});
+
     auto directlyNestedLoops = currentLoop.getRegion().getOps<fir::DoLoopOp>();
     llvm::SmallVector<fir::DoLoopOp> unorderedLoops;
 
@@ -127,13 +251,15 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
 public:
   using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
 
-  DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice)
-      : OpConversionPattern(context), mapToDevice(mapToDevice) {}
+  DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice,
+                         llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip)
+      : OpConversionPattern(context), mapToDevice(mapToDevice),
+        concurrentLoopsToSkip(concurrentLoopsToSkip) {}
 
   mlir::LogicalResult
   matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
                   mlir::ConversionPatternRewriter &rewriter) const override {
-    looputils::LoopNest loopNest;
+    looputils::LoopNestToIndVarMap loopNest;
     bool hasRemainingNestedLoops =
         failed(looputils::collectLoopNest(doLoop, loopNest));
     if (hasRemainingNestedLoops)
@@ -141,12 +267,120 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
                         "Some `do concurent` loops are not perfectly-nested. "
                         "These will be serialzied.");
 
-    // TODO This will be filled in with the next PRs that upstreams the rest of
-    // the ROCm implementaion.
+    mlir::IRMapping mapper;
+    genParallelOp(doLoop.getLoc(), rewriter, loopNest, mapper);
+    mlir::omp::LoopNestOperands loopNestClauseOps;
+    genLoopNestClauseOps(doLoop.getLoc(), rewriter, loopNest, mapper,
+                         loopNestClauseOps);
+
+    mlir::omp::LoopNestOp ompLoopNest =
+        genWsLoopOp(rewriter, loopNest.back().first, mapper, loopNestClauseOps,
+                    /*isComposite=*/mapToDevice);
+
+    rewriter.eraseOp(doLoop);
+
+    // Mark `unordered` loops that are not perfectly nested to be skipped from
+    // the legality check of the `ConversionTarget` since we are not interested
+    // in mapping them to OpenMP.
+    ompLoopNest->walk([&](fir::DoLoopOp doLoop) {
+      if (doLoop.getUnordered()) {
+        concurrentLoopsToSkip.insert(doLoop);
+      }
+    });
+
     return mlir::success();
   }
 
+private:
+  mlir::omp::ParallelOp genParallelOp(mlir::Location loc,
+                                      mlir::ConversionPatternRewriter &rewriter,
+                                      looputils::LoopNestToIndVarMap &loopNest,
+                                      mlir::IRMapping &mapper) const {
+    auto parallelOp = rewriter.create<mlir::omp::ParallelOp>(loc);
+    rewriter.createBlock(&parallelOp.getRegion());
+    rewriter.setInsertionPoint(rewriter.create<mlir::omp::TerminatorOp>(loc));
+
+    genLoopNestIndVarAllocs(rewriter, loopNest, mapper);
+    return parallelOp;
+  }
+
+  void genLoopNestIndVarAllocs(mlir::ConversionPatternRewriter &rewriter,
+                               looputils::LoopNestToIndVarMap &loopNest,
+                               mlir::IRMapping &mapper) const {
+
+    for (auto &[_, indVarInfo] : loopNest)
+      genInductionVariableAlloc(rewriter, indVarInfo.iterVarMemDef, mapper);
+  }
+
+  mlir::Operation *
+  genInductionVariableAlloc(mlir::ConversionPatternRewriter &rewriter,
+                            mlir::Operation *indVarMemDef,
+                            mlir::IRMapping &mapper) const {
+    assert(
+        indVarMemDef != nullptr &&
+        "Induction variable memdef is expected to have a defining operation.");
+
+    llvm::SmallSetVector<mlir::Operation *, 2> indVarDeclareAndAlloc;
+    for (auto operand : indVarMemDef->getOperands())
+      indVarDeclareAndAlloc.insert(operand.getDefiningOp());
+    indVarDeclareAndAlloc.insert(indVarMemDef);
+
+    mlir::Operation *result;
+    for (mlir::Operation *opToClone : indVarDeclareAndAlloc)
+      result = rewriter.clone(*opToClone, mapper);
+
+    return result;
+  }
+
+  void genLoopNestClauseOps(
+      mlir::Location loc, mlir::ConversionPatternRewriter &rewriter,
+      looputils::LoopNestToIndVarMap &loopNest, mlir::IRMapping &mapper,
+      mlir::omp::LoopNestOperands &loopNestClauseOps) const {
+    assert(loopNestClauseOps.loopLowerBounds.empty() &&
+           "Loop nest bounds were already emitted!");
+
+    auto populateBounds = [&](mlir::Value var,
+                              llvm::SmallVectorImpl<mlir::Value> &bounds) {
+      bounds.push_back(var.getDefiningOp()->getResult(0));
+    };
+
+    for (auto &[doLoop, _] : loopNest) {
+      populateBounds(doLoop.getLowerBound(), loopNestClauseOps.loopLowerBounds);
+      populateBounds(doLoop.getUpperBound(), loopNestClauseOps.loopUpperBounds);
+      populateBounds(doLoop.getStep(), loopNestClauseOps.loopSteps);
+    }
+
+    loopNestClauseOps.loopInclusive = rewriter.getUnitAttr();
+  }
+
+  mlir::omp::LoopNestOp
+  genWsLoopOp(mlir::ConversionPatternRewriter &rewriter, fir::DoLoopOp doLoop,
+              mlir::IRMapping &mapper,
+              const mlir::omp::LoopNestOperands &clauseOps,
+              bool isComposite) const {
+
+    auto wsloopOp = rewriter.create<mlir::omp::WsloopOp>(doLoop.getLoc());
+    wsloopOp.setComposite(isComposite);
+    rewriter.createBlock(&wsloopOp.getRegion());
+
+    auto loopNestOp =
+        rewriter.create<mlir::omp::LoopNestOp>(doLoop.getLoc(), clauseOps);
+
+    // Clone the loop's body inside the loop nest construct using the
+    // mapped values.
+    rewriter.cloneRegionBefore(doLoop.getRegion(), loopNestOp.getRegion(),
+                               loopNestOp.getRegion().begin(), mapper);
+
+    mlir::Operation *terminator = loopNestOp.getRegion().back().getTerminator();
+    rewriter.setInsertionPointToEnd(&loopNestOp.getRegion().back());
+    rewriter.create<mlir::omp::YieldOp>(terminator->getLoc());
+    rewriter.eraseOp(terminator);
+
+    return loopNestOp;
+  }
+
   bool mapToDevice;
+  llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip;
 };
 
 class DoConcurrentConversionPass
@@ -175,16 +409,18 @@ class DoConcurrentConversionPass
       return;
     }
 
+    llvm::DenseSet<fir::DoLoopOp> concurrentLoopsToSkip;
     mlir::RewritePatternSet patterns(context);
     patterns.insert<DoConcurrentConversion>(
-        context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device);
+        context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device,
+        concurrentLoopsToSkip);
     mlir::ConversionTarget target(*context);
     target.addDynamicallyLegalOp<fir::DoLoopOp>([&](fir::DoLoopOp op) {
       // The goal is to handle constructs that eventually get lowered to
       // `fir.do_loop` with the `unordered` attribute (e.g. array expressions).
       // Currently, this is only enabled for the `do concurrent` construct since
       // the pass runs early in the pipeline.
-      return !op.getUnordered();
+      return !op.getUnordered() || concurrentLoopsToSkip.contains(op);
     });
     target.markUnknownOpDynamicallyLegal(
         [](mlir::Operation *) { return true; });
diff --git a/flang/test/Transforms/DoConcurrent/basic_host.f90 b/flang/test/Transforms/DoConcurrent/basic_host.f90
index b569668ab0f0e..b80bc2f2dd038 100644
--- a/flang/test/Transforms/DoConcurrent/basic_host.f90
+++ b/flang/test/Transforms/DoConcurrent/basic_host.f90
@@ -1,7 +1,3 @@
-! Mark as xfail for now until we upstream the relevant part. This is just for
-! demo purposes at this point. Upstreaming this is the next step.
-! XFAIL: *
-
 ! Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
 
 ! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %s -o - \
@@ -19,17 +15,17 @@ program do_concurrent_basic
 
     ! CHECK-NOT: fir.do_loop
 
-    ! CHECK: omp.parallel {
-
-    ! CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
-    ! CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
-
     ! CHECK: %[[C1:.*]] = arith.constant 1 : i32
     ! CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
     ! CHECK: %[[C10:.*]] = arith.constant 10 : i32
     ! CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
     ! CHECK: %[[STEP:.*]] = arith.constant 1 : index
 
+    ! CHECK: omp.parallel {
+
+    ! CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+    ! CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
     ! CHECK: omp.wsloop {
     ! CHECK-NEXT: omp.loop_nest (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
     ! CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
diff --git a/flang/test/Transforms/DoConcurrent/basic_host.mlir b/flang/test/Transforms/DoConcurrent/basic_host.mlir
new file mode 100644
index 0000000000000..b53ecd687c039
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/basic_host.mlir
@@ -0,0 +1,62 @@
+// Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
+
+// RUN: fir-opt --omp-do-concurrent-conversion="map-to=host" %s | FileCheck %s
+
+// CHECK-LABEL: func.func @do_concurrent_basic
+func.func @do_concurrent_basic() attributes {fir.bindc_name = "do_concurrent_basic"} {
+    // CHECK: %[[ARR:.*]]:2 = hlfir.declare %{{.*}}(%{{.*}}) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+
+    %0 = fir.alloca i32 {bindc_name = "i"}
+    %1:2 = hlfir.declare %0 {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+    %2 = fir.address_of(@_QFEa) : !fir.ref<!fir.array<10xi32>>
+    %c10 = arith.constant 10 : index
+    %3 = fir.shape %c10 : (index) -> !fir.shape<1>
+    %4:2 = hlfir.declare %2(%3) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+    %c1_i32 = arith.constant 1 : i32
+    %7 = fir.convert %c1_i32 : (i32) -> index
+    %c10_i32 = arith.constant 10 : i32
+    %8 = fir.convert %c10_i32 : (i32) -> index
+    %c1 = arith.constant 1 : index
+
+    // CHECK-NOT: fir.do_loop
+
+    // CHECK: %[[C1:.*]] = arith.constant 1 : i32
+    // CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
+    // CHECK: %[[C10:.*]] = arith.constant 10 : i32
+    // CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
+    // CHECK: %[[STEP:.*]] = arith.constant 1 : index
+
+    // CHECK: omp.parallel {
+
+    // CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+    // CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
+    // CHECK: omp.wsloop {
+    // CHECK-NEXT: omp.loop_nest (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+    // CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
+    // CHECK-NEXT: fir.store %[[IV_IDX]] to %[[BINDING]]#1 : !fir.ref<i32>
+    // CHECK-NEXT: %[[IV_VAL1:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+    // CHECK-NEXT: %[[IV_VAL2:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+    // CHECK-NEXT: %[[IV_VAL_I64:.*]] = fir.convert %[[IV_VAL2]] : (i32) -> i64
+    // CHECK-NEXT: %[[ARR_ACCESS:.*]] = hlfir.designate %[[ARR]]#0 (%[[IV_VAL_I64]])  : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+    // CHECK-NEXT: hlfir.assign %[[IV_VAL1]] to %[[ARR_ACCESS]] : i32, !fir.ref<i32>
+    // CHECK-NEXT: omp.yield
+    // CHECK-NEXT: }
+    // CHECK-NEXT: }
+
+    // CHECK-NEXT: omp.terminator
+    // CHECK-NEXT: }
+    fir.do_loop %arg0 = %7 to %8 step %c1 unordered {
+      %13 = fir.convert %arg0 : (index) -> i32
+      fir.store %13 to %1#1 : !fir.ref<i32>
+      %14 = fir.load %1#0 : !fir.ref<i32>
+      %15 = fir.load %1#0 : !fir.ref<i32>
+      %16 = fir.convert %15 : (i32) -> i64
+      %17 = hlfir.designate %4#0 (%16)  : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+      hlfir.assign %14 to %17 : i32, !fir.ref<i32>
+    }
+
+    // CHECK-NOT: fir.do_loop
+
+    return
+  }
diff --git a/flang/test/Transforms/DoConcurrent/non_const_bounds.f90 b/flang/test/Transforms/DoConcurrent/non_const_bounds.f90
new file mode 100644
index 0000000000000..48e0afe6752b6
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/non_const_bounds.f90
@@ -0,0 +1,45 @@
+! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %s -o - \
+! RUN:   | FileCheck %s
+
+program main
+    implicit none
+
+    call foo(10)
+
+    contains
+        subroutine foo(n)
+            implicit none
+            integer :: n
+            integer :: i
+            integer, dimension(n) :: a
+
+            do concurrent(i=1:n)
+                a(i) = i
+            end do
+        end subroutine
+
+end program main
+
+! CHECK: %[[N_DECL:.*]]:2 = hlfir.declare %{{.*}} dummy_scope %{{.*}} {uniq_name = "_QFFfooEn"}
+
+! CHECK: fir.load
+
+! CHECK: %[[LB:.*]] = fir.convert %{{c1_.*}} : (i32) -> index
+! CHECK: %[[N_VAL:.*]] = fir.load %[[N_DECL]]#0 : !fir.ref<i32>
+! CHECK: %[[UB:.*]] = fir.convert %[[N_VAL]] : (i32) -> index
+! CHECK: %[[C1:.*]] = arith.constant 1 : index
+
+! CHECK: omp.parallel {
+
+
+! Verify that we restort to using the outside val...
[truncated]

skatrak

Thank you, a couple of comments from me. But the general approach looks good.

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

skatrak · 2025-02-20T13:31:38Z

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

+/// `%ind_var_conv` -> `%ind_var` conversion sequence.
+///
+/// For why this is useful, see its use in `findLoopIndVarMemDecl`.
+bool isIndVarUltimateOperand(mlir::Operation *op, fir::DoLoopOp doLoop) {


I'm not too sure about this function. What it actually checks is that either the induction variable of doLoop is one of the operands of op, that the operation defining specifically the first operand of op has the induction variable as one of its operands, that the operation defining its own first operand has it, etc.

Rather than checking whether the induction variable is ultimately one of its operands, it seems to check that one of its operands is or has been obtained from (among potentially other values) the induction variable, with the seemingly arbitrary condition that, if it's obtained indirectly, it must come from a chain of "first-operands".

If I understand it correctly, the first example would meet that condition, whereas the second wouldn't. I think that these should both be treated the same:

fir.do_loop %ind_var = ... unordered { %ind_var_conv = fir.convert %ind_var : (index) -> i32 // isIndVarUltimateOperand() returns true %add = arith.addi %ind_var_conv, %x : i32 ... } fir.do_loop %ind_var = ... unordered { %ind_var_conv = fir.convert %ind_var : (index) -> i32 // isIndVarUltimateOperand() returns false %add = arith.addi %x, %ind_var_conv : i32 ... }

If this is intended to handle fir.convert and fir.store, I think it makes more sense to actually check for these specific operation types and usage patterns rather than creating a pseudo-generic check that can easily break. I don't think the solution is to check all operands, because at that point we're just checking whether the induction variable participated in the expression that was passed in, which doesn't seem to be what we want here.

Simplified the logic to match what flang currently emits.

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

ergawy · 2025-03-04T13:17:10Z

This PR is ready for another round of reviews. Please take a look when you have time 🙏

ergawy · 2025-03-06T16:17:33Z

Ping! Please have a look when you have time.

ergawy · 2025-03-21T05:57:46Z

The same issue (requesting reviews from an enormous amount of people) happened here as well. No idea why. I just did a simple rebase on main. Apologies for the noise once more.

…126026) This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. This replaces llvm#77285 since we extended this WIP quite a bit on our fork over the past year. An important part of this PR is a document that describes the current status downstream, the upstreaming status, and next steps to make this pass much more useful. In addition to this document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command line options. This looks like a huge PR but a lot of the added stuff is documentation. It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived performance speed-ups that match pure OpenMP, for GPU mapping we are still working on extending our support for implicit memory mapping and locality specifiers. PR stack: - llvm#126026 (this PR) - llvm#127595 - llvm#127633 - llvm#127634 - llvm#127635

…27595) Upstreams the next part of do concurrent to OpenMP mapping pass (from AMD's ROCm implementation). See llvm#126026 for more context. This PR add loop nest detection logic. This enables us to discover muli-range do concurrent loops and then map them as "collapsed" loop nests to OpenMP. This is a follow up for llvm#126026, only the latest commit is relevant. This is a replacement for llvm#127478 using a `/user/<username>/<branchname>` branch. PR stack: - llvm#126026 - llvm#127595 (this PR) - llvm#127633 - llvm#127634 - llvm#127635

…ructs (llvm#127633) Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass. This PR add support for converting simple loops to the equivalent OpenMP constructs on the host: `omp parallel do`. Towards that end, we have to collect more information about loop nests for which we add new utils in the `looputils` name space. PR stack: - llvm#126026 - llvm#127595 - llvm#127633 (this PR) - llvm#127634 - llvm#127635

…lvm#127634) Adds support for converting mulit-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest. PR stack: - llvm#126026 - llvm#127595 - llvm#127633 - llvm#127634 (this PR) - llvm#127635

…lvm#127635) Extends `do concurrent` mapping to handle "loop-local values". A loop-local value is one that is used exclusively inside the loop but allocated outside of it. This usually corresponds to temporary values that are used inside the loop body for initialzing other variables for example. After collecting these values, the pass localizes them to the loop nest by moving their allocations. PR stack: - llvm#126026 - llvm#127595 - llvm#127633 - llvm#127634 - llvm#127635 (this PR)

This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. This replaces #77285 since we extended this WIP quite a bit on our fork over the past year. An important part of this PR is a document that describes the current status downstream, the upstreaming status, and next steps to make this pass much more useful. In addition to this document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command line options. This looks like a huge PR but a lot of the added stuff is documentation. It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived performance speed-ups that match pure OpenMP, for GPU mapping we are still working on extending our support for implicit memory mapping and locality specifiers. PR stack: - #126026 (this PR) - #127595 - #127633 - #127634 - #127635

…ping (#126026) This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. This replaces #77285 since we extended this WIP quite a bit on our fork over the past year. An important part of this PR is a document that describes the current status downstream, the upstreaming status, and next steps to make this pass much more useful. In addition to this document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command line options. This looks like a huge PR but a lot of the added stuff is documentation. It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achived performance speed-ups that match pure OpenMP, for GPU mapping we are still working on extending our support for implicit memory mapping and locality specifiers. PR stack: - llvm/llvm-project#126026 (this PR) - llvm/llvm-project#127595 - llvm/llvm-project#127633 - llvm/llvm-project#127634 - llvm/llvm-project#127635

Upstreams the next part of do concurrent to OpenMP mapping pass (from AMD's ROCm implementation). See #126026 for more context. This PR add loop nest detection logic. This enables us to discover muli-range do concurrent loops and then map them as "collapsed" loop nests to OpenMP. This is a follow up for #126026, only the latest commit is relevant. This is a replacement for #127478 using a `/user/<username>/<branchname>` branch. PR stack: - #126026 - #127595 (this PR) - #127633 - #127634 - #127635

…ructs Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass. This PR add support for converting simple loops to the equivalent OpenMP constructs on the host: `omp parallel do`. Towards that end, we have to collect more information about loop nests for which we add new utils in the `looputils` name space.

…on. (#127595) Upstreams the next part of do concurrent to OpenMP mapping pass (from AMD's ROCm implementation). See llvm/llvm-project#126026 for more context. This PR add loop nest detection logic. This enables us to discover muli-range do concurrent loops and then map them as "collapsed" loop nests to OpenMP. This is a follow up for llvm/llvm-project#126026, only the latest commit is relevant. This is a replacement for llvm/llvm-project#127478 using a `/user/<username>/<branchname>` branch. PR stack: - llvm/llvm-project#126026 - llvm/llvm-project#127595 (this PR) - llvm/llvm-project#127633 - llvm/llvm-project#127634 - llvm/llvm-project#127635

… host constructs (#127633) Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass. This PR add support for converting simple loops to the equivalent OpenMP constructs on the host: `omp parallel do`. Towards that end, we have to collect more information about loop nests for which we add new utils in the `looputils` name space. PR stack: - llvm/llvm-project#126026 - llvm/llvm-project#127595 - llvm/llvm-project#127633 (this PR) - llvm/llvm-project#127634 - llvm/llvm-project#127635

…127634) Adds support for converting mulit-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest. PR stack: - #126026 - #127595 - #127633 - #127634 (this PR) - #127635

…nge loops (#127634) Adds support for converting mulit-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest. PR stack: - llvm/llvm-project#126026 - llvm/llvm-project#127595 - llvm/llvm-project#127633 - llvm/llvm-project#127634 (this PR) - llvm/llvm-project#127635

…127635) Extends `do concurrent` mapping to handle "loop-local values". A loop-local value is one that is used exclusively inside the loop but allocated outside of it. This usually corresponds to temporary values that are used inside the loop body for initialzing other variables for example. After collecting these values, the pass localizes them to the loop nest by moving their allocations. PR stack: - #126026 - #127595 - #127633 - #127634 - #127635 (this PR)

…nt` nests (#127635) Extends `do concurrent` mapping to handle "loop-local values". A loop-local value is one that is used exclusively inside the loop but allocated outside of it. This usually corresponds to temporary values that are used inside the loop body for initialzing other variables for example. After collecting these values, the pass localizes them to the loop nest by moving their allocations. PR stack: - llvm/llvm-project#126026 - llvm/llvm-project#127595 - llvm/llvm-project#127633 - llvm/llvm-project#127634 - llvm/llvm-project#127635 (this PR)

ergawy changed the title ~~[flang][OpenMP] Map simple do concurrent loops to OpenMP host const…~~ [flang][OpenMP] Map simple do concurrent loops to OpenMP host constructs Feb 18, 2025

llvmbot added flang Flang issues not falling into any other category flang:fir-hlfir labels Feb 18, 2025

ergawy mentioned this pull request Feb 18, 2025

[flang][OpenMP] Extend do concurrent mapping to multi-range loops #127634

Merged

ergawy force-pushed the users/ergawy/upstream_do_concurrent_3_basic_host_support branch from 0ecf2e2 to 06bf9bc Compare February 18, 2025 14:18

ergawy requested review from clementval, skatrak, kiranchandramohan, tarunprabhu, mjklemm, jeanPerier, rouson, dpalermo and klausler February 18, 2025 15:38

ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from 41f77da to 2561e21 Compare February 19, 2025 12:34

skatrak reviewed Feb 20, 2025

View reviewed changes

ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from 2561e21 to 6ef2fae Compare February 20, 2025 15:34

ergawy force-pushed the users/ergawy/upstream_do_concurrent_3_basic_host_support branch from 06bf9bc to a615d77 Compare February 20, 2025 15:46

skatrak reviewed Feb 21, 2025

View reviewed changes

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp Show resolved Hide resolved

ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch 2 times, most recently from 92d976d to ca40210 Compare March 4, 2025 07:18

ergawy force-pushed the users/ergawy/upstream_do_concurrent_3_basic_host_support branch from a615d77 to d2e3c77 Compare March 4, 2025 07:27

This was referenced Mar 4, 2025

[flang][OpenMP] Upstream first part of do concurrent mapping #126026

Merged

[flang][OpenMP] Upstream do concurrent loop-nest detection. #127595

Merged

[flang][OpenMP] Handle "loop-local values" in do concurrent nests #127635

Merged

klausler removed their request for review March 6, 2025 16:27

ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from ca40210 to 9339d07 Compare March 10, 2025 04:52

ergawy removed request for JDevlieghere, lanza, bcardosolopes, Endilll, nikic and a team March 21, 2025 05:54

ergawy force-pushed the users/ergawy/upstream_do_concurrent_2 branch from 2c3ed0e to e15842d Compare April 2, 2025 07:26

Base automatically changed from users/ergawy/upstream_do_concurrent_2 to main April 2, 2025 08:12

ergawy added 3 commits April 2, 2025 03:14

Handle review comments

2720590

handle review comments

0243c4f

ergawy force-pushed the users/ergawy/upstream_do_concurrent_3_basic_host_support branch from 25b36c6 to 0243c4f Compare April 2, 2025 08:14

fix test

ded9b54

ergawy merged commit 3f8bfc9 into main Apr 2, 2025
12 checks passed

ergawy deleted the users/ergawy/upstream_do_concurrent_3_basic_host_support branch April 2, 2025 09:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[flang][OpenMP] Map simple `do concurrent` loops to OpenMP host constructs #127633

[flang][OpenMP] Map simple `do concurrent` loops to OpenMP host constructs #127633

ergawy commented Feb 18, 2025 •

edited

Loading

llvmbot commented Feb 18, 2025

skatrak left a comment

skatrak Feb 20, 2025

ergawy Mar 4, 2025

ergawy commented Mar 4, 2025

ergawy commented Mar 6, 2025

ergawy commented Mar 21, 2025 •

edited

Loading

[flang][OpenMP] Map simple do concurrent loops to OpenMP host constructs #127633

[flang][OpenMP] Map simple do concurrent loops to OpenMP host constructs #127633

Conversation

ergawy commented Feb 18, 2025 • edited Loading

llvmbot commented Feb 18, 2025

skatrak left a comment

Choose a reason for hiding this comment

skatrak Feb 20, 2025

Choose a reason for hiding this comment

ergawy Mar 4, 2025

Choose a reason for hiding this comment

ergawy commented Mar 4, 2025

ergawy commented Mar 6, 2025

ergawy commented Mar 21, 2025 • edited Loading

[flang][OpenMP] Map simple `do concurrent` loops to OpenMP host constructs #127633

[flang][OpenMP] Map simple `do concurrent` loops to OpenMP host constructs #127633

ergawy commented Feb 18, 2025 •

edited

Loading

ergawy commented Mar 21, 2025 •

edited

Loading