[flang][OpenMP] Map simple `do concurrent` loops to OpenMP host constructs #127633
Conversation
@llvm/pr-subscribers-flang-fir-hlfir

Author: Kareem Ergawy (ergawy)

Changes: Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass.

Patch is 21.94 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/127633.diff

6 Files Affected:
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index de2525dd8b57d..914ace0813f0e 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -126,6 +126,53 @@ see the "Data environment" section below.
See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
of what is and is not detected as a perfect loop nest.
+### Single-range loops
+
+Given the following loop:
+```fortran
+ do concurrent(i=1:n)
+ a(i) = i * i
+ end do
+```
+
+#### Mapping to `host`
+
+Mapping this loop to the `host` generates MLIR operations of the following
+structure:
+
+```
+%4 = fir.address_of(@_QFEa) ...
+%6:2 = hlfir.declare %4 ...
+
+omp.parallel {
+ // Allocate private copy for `i`.
+ // TODO Use delayed privatization.
+ %19 = fir.alloca i32 {bindc_name = "i"}
+ %20:2 = hlfir.declare %19 {uniq_name = "_QFEi"} ...
+
+ omp.wsloop {
+ omp.loop_nest (%arg0) : index = (%21) to (%22) inclusive step (%c1_2) {
+ %23 = fir.convert %arg0 : (index) -> i32
+ // Use the privatized version of `i`.
+ fir.store %23 to %20#1 : !fir.ref<i32>
+ ...
+
+ // Use "shared" SSA value of `a`.
+ %42 = hlfir.designate %6#0
+ hlfir.assign %35 to %42
+ ...
+ omp.yield
+ }
+ omp.terminator
+ }
+ omp.terminator
+}
+```
+
+#### Mapping to `device`
+
+<!-- TODO -->
+
<!--
More details about current status will be added along with relevant parts of the
implementation in later upstreaming patches.
diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
index 19048a03e254e..dc797877ac87b 100644
--- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -11,6 +11,7 @@
#include "flang/Optimizer/OpenMP/Utils.h"
#include "mlir/Analysis/SliceAnalysis.h"
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
+#include "mlir/IR/IRMapping.h"
#include "mlir/Transforms/DialectConversion.h"
#include "mlir/Transforms/RegionUtils.h"
@@ -24,8 +25,126 @@ namespace flangomp {
namespace {
namespace looputils {
-using LoopNest = llvm::SetVector<fir::DoLoopOp>;
+/// Stores info needed about the induction/iteration variable for each `do
+/// concurrent` in a loop nest. This includes:
+/// * the operation allocating memory for the iteration variable,
+/// * the operation(s) updating the iteration variable with the current
+/// iteration number.
+struct InductionVariableInfo {
+ mlir::Operation *iterVarMemDef;
+ llvm::SetVector<mlir::Operation *> indVarUpdateOps;
+};
+
+using LoopNestToIndVarMap =
+ llvm::MapVector<fir::DoLoopOp, InductionVariableInfo>;
+
+/// Given an operation `op`, this returns true if one of `op`'s operands is
+/// "ultimately" the loop's induction variable. This helps in cases where the
+/// induction variable's use is "hidden" behind a convert/cast.
+///
+/// For example, given the following loop:
+/// ```
+/// fir.do_loop %ind_var = %lb to %ub step %s unordered {
+/// %ind_var_conv = fir.convert %ind_var : (index) -> i32
+/// fir.store %ind_var_conv to %i#1 : !fir.ref<i32>
+/// ...
+/// }
+/// ```
+///
+/// If \p op is the `fir.store` operation, then this function will return true
+/// since the IV is the "ultimate" operand to the `fir.store` op through the
+/// `%ind_var_conv` -> `%ind_var` conversion sequence.
+///
+/// For why this is useful, see its use in `findLoopIndVarMemDecl`.
+bool isIndVarUltimateOperand(mlir::Operation *op, fir::DoLoopOp doLoop) {
+ while (op != nullptr && op->getNumOperands() > 0) {
+ auto ivIt = llvm::find_if(op->getOperands(), [&](mlir::Value operand) {
+ return operand == doLoop.getInductionVar();
+ });
+
+ if (ivIt != op->getOperands().end())
+ return true;
+
+ op = op->getOperand(0).getDefiningOp();
+ }
+
+ return false;
+}
+
+/// For the \p doLoop parameter, find the operation that declares its iteration
+/// variable or allocates memory for it.
+///
+/// For example, given the following loop:
+/// ```
+/// ...
+/// %i:2 = hlfir.declare %0 {uniq_name = "_QFEi"} : ...
+/// ...
+/// fir.do_loop %ind_var = %lb to %ub step %s unordered {
+/// %ind_var_conv = fir.convert %ind_var : (index) -> i32
+/// fir.store %ind_var_conv to %i#1 : !fir.ref<i32>
+/// ...
+/// }
+/// ```
+///
+/// This function returns the `hlfir.declare` op for `%i`.
+mlir::Operation *findLoopIterationVarMemDecl(fir::DoLoopOp doLoop) {
+ mlir::Value result = nullptr;
+ mlir::visitUsedValuesDefinedAbove(
+ doLoop.getRegion(), [&](mlir::OpOperand *operand) {
+ if (result)
+ return;
+
+ if (isIndVarUltimateOperand(operand->getOwner(), doLoop)) {
+ assert(result == nullptr &&
+ "loop can have only one induction variable");
+ result = operand->get();
+ }
+ });
+
+ assert(result != nullptr && result.getDefiningOp() != nullptr);
+ return result.getDefiningOp();
+}
+/// Collects the op(s) responsible for updating a loop's iteration variable with
+/// the current iteration number. For example, for the input IR:
+/// ```
+/// %i = fir.alloca i32 {bindc_name = "i"}
+/// %i_decl:2 = hlfir.declare %i ...
+/// ...
+/// fir.do_loop %i_iv = %lb to %ub step %step unordered {
+/// %1 = fir.convert %i_iv : (index) -> i32
+/// fir.store %1 to %i_decl#1 : !fir.ref<i32>
+/// ...
+/// }
+/// ```
+/// this function would return the first 2 ops in the `fir.do_loop`'s region.
+llvm::SetVector<mlir::Operation *>
+extractIndVarUpdateOps(fir::DoLoopOp doLoop) {
+ mlir::Value indVar = doLoop.getInductionVar();
+ llvm::SetVector<mlir::Operation *> indVarUpdateOps;
+
+ llvm::SmallVector<mlir::Value> toProcess;
+ toProcess.push_back(indVar);
+
+ llvm::DenseSet<mlir::Value> done;
+
+ while (!toProcess.empty()) {
+ mlir::Value val = toProcess.back();
+ toProcess.pop_back();
+
+ if (!done.insert(val).second)
+ continue;
+
+ for (mlir::Operation *user : val.getUsers()) {
+ indVarUpdateOps.insert(user);
+
+ for (mlir::Value result : user->getResults())
+ toProcess.push_back(result);
+ }
+ }
+
+ return std::move(indVarUpdateOps);
+}
/// Loop \p innerLoop is considered perfectly-nested inside \p outerLoop iff
/// there are no operations in \p outerloop's body other than:
///
@@ -93,11 +212,16 @@ bool isPerfectlyNested(fir::DoLoopOp outerLoop, fir::DoLoopOp innerLoop) {
/// recognize a certain nested loop as part of the nest it just returns the
/// parent loops it discovered before.
mlir::LogicalResult collectLoopNest(fir::DoLoopOp currentLoop,
- LoopNest &loopNest) {
+ LoopNestToIndVarMap &loopNest) {
assert(currentLoop.getUnordered());
while (true) {
- loopNest.insert(currentLoop);
+ loopNest.try_emplace(
+ currentLoop,
+ InductionVariableInfo{
+ findLoopIterationVarMemDecl(currentLoop),
+ std::move(looputils::extractIndVarUpdateOps(currentLoop))});
+
auto directlyNestedLoops = currentLoop.getRegion().getOps<fir::DoLoopOp>();
llvm::SmallVector<fir::DoLoopOp> unorderedLoops;
@@ -127,13 +251,15 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
public:
using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
- DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice)
- : OpConversionPattern(context), mapToDevice(mapToDevice) {}
+ DoConcurrentConversion(mlir::MLIRContext *context, bool mapToDevice,
+ llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip)
+ : OpConversionPattern(context), mapToDevice(mapToDevice),
+ concurrentLoopsToSkip(concurrentLoopsToSkip) {}
mlir::LogicalResult
matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
mlir::ConversionPatternRewriter &rewriter) const override {
- looputils::LoopNest loopNest;
+ looputils::LoopNestToIndVarMap loopNest;
bool hasRemainingNestedLoops =
failed(looputils::collectLoopNest(doLoop, loopNest));
if (hasRemainingNestedLoops)
@@ -141,12 +267,120 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
"Some `do concurent` loops are not perfectly-nested. "
"These will be serialzied.");
- // TODO This will be filled in with the next PRs that upstreams the rest of
- // the ROCm implementaion.
+ mlir::IRMapping mapper;
+ genParallelOp(doLoop.getLoc(), rewriter, loopNest, mapper);
+ mlir::omp::LoopNestOperands loopNestClauseOps;
+ genLoopNestClauseOps(doLoop.getLoc(), rewriter, loopNest, mapper,
+ loopNestClauseOps);
+
+ mlir::omp::LoopNestOp ompLoopNest =
+ genWsLoopOp(rewriter, loopNest.back().first, mapper, loopNestClauseOps,
+ /*isComposite=*/mapToDevice);
+
+ rewriter.eraseOp(doLoop);
+
+ // Mark `unordered` loops that are not perfectly nested to be skipped from
+ // the legality check of the `ConversionTarget` since we are not interested
+ // in mapping them to OpenMP.
+ ompLoopNest->walk([&](fir::DoLoopOp doLoop) {
+ if (doLoop.getUnordered()) {
+ concurrentLoopsToSkip.insert(doLoop);
+ }
+ });
+
return mlir::success();
}
+private:
+ mlir::omp::ParallelOp genParallelOp(mlir::Location loc,
+ mlir::ConversionPatternRewriter &rewriter,
+ looputils::LoopNestToIndVarMap &loopNest,
+ mlir::IRMapping &mapper) const {
+ auto parallelOp = rewriter.create<mlir::omp::ParallelOp>(loc);
+ rewriter.createBlock(&parallelOp.getRegion());
+ rewriter.setInsertionPoint(rewriter.create<mlir::omp::TerminatorOp>(loc));
+
+ genLoopNestIndVarAllocs(rewriter, loopNest, mapper);
+ return parallelOp;
+ }
+
+ void genLoopNestIndVarAllocs(mlir::ConversionPatternRewriter &rewriter,
+ looputils::LoopNestToIndVarMap &loopNest,
+ mlir::IRMapping &mapper) const {
+
+ for (auto &[_, indVarInfo] : loopNest)
+ genInductionVariableAlloc(rewriter, indVarInfo.iterVarMemDef, mapper);
+ }
+
+ mlir::Operation *
+ genInductionVariableAlloc(mlir::ConversionPatternRewriter &rewriter,
+ mlir::Operation *indVarMemDef,
+ mlir::IRMapping &mapper) const {
+ assert(
+ indVarMemDef != nullptr &&
+ "Induction variable memdef is expected to have a defining operation.");
+
+ llvm::SmallSetVector<mlir::Operation *, 2> indVarDeclareAndAlloc;
+ for (auto operand : indVarMemDef->getOperands())
+ indVarDeclareAndAlloc.insert(operand.getDefiningOp());
+ indVarDeclareAndAlloc.insert(indVarMemDef);
+
+ mlir::Operation *result;
+ for (mlir::Operation *opToClone : indVarDeclareAndAlloc)
+ result = rewriter.clone(*opToClone, mapper);
+
+ return result;
+ }
+
+ void genLoopNestClauseOps(
+ mlir::Location loc, mlir::ConversionPatternRewriter &rewriter,
+ looputils::LoopNestToIndVarMap &loopNest, mlir::IRMapping &mapper,
+ mlir::omp::LoopNestOperands &loopNestClauseOps) const {
+ assert(loopNestClauseOps.loopLowerBounds.empty() &&
+ "Loop nest bounds were already emitted!");
+
+ auto populateBounds = [&](mlir::Value var,
+ llvm::SmallVectorImpl<mlir::Value> &bounds) {
+ bounds.push_back(var.getDefiningOp()->getResult(0));
+ };
+
+ for (auto &[doLoop, _] : loopNest) {
+ populateBounds(doLoop.getLowerBound(), loopNestClauseOps.loopLowerBounds);
+ populateBounds(doLoop.getUpperBound(), loopNestClauseOps.loopUpperBounds);
+ populateBounds(doLoop.getStep(), loopNestClauseOps.loopSteps);
+ }
+
+ loopNestClauseOps.loopInclusive = rewriter.getUnitAttr();
+ }
+
+ mlir::omp::LoopNestOp
+ genWsLoopOp(mlir::ConversionPatternRewriter &rewriter, fir::DoLoopOp doLoop,
+ mlir::IRMapping &mapper,
+ const mlir::omp::LoopNestOperands &clauseOps,
+ bool isComposite) const {
+
+ auto wsloopOp = rewriter.create<mlir::omp::WsloopOp>(doLoop.getLoc());
+ wsloopOp.setComposite(isComposite);
+ rewriter.createBlock(&wsloopOp.getRegion());
+
+ auto loopNestOp =
+ rewriter.create<mlir::omp::LoopNestOp>(doLoop.getLoc(), clauseOps);
+
+ // Clone the loop's body inside the loop nest construct using the
+ // mapped values.
+ rewriter.cloneRegionBefore(doLoop.getRegion(), loopNestOp.getRegion(),
+ loopNestOp.getRegion().begin(), mapper);
+
+ mlir::Operation *terminator = loopNestOp.getRegion().back().getTerminator();
+ rewriter.setInsertionPointToEnd(&loopNestOp.getRegion().back());
+ rewriter.create<mlir::omp::YieldOp>(terminator->getLoc());
+ rewriter.eraseOp(terminator);
+
+ return loopNestOp;
+ }
+
bool mapToDevice;
+ llvm::DenseSet<fir::DoLoopOp> &concurrentLoopsToSkip;
};
class DoConcurrentConversionPass
@@ -175,16 +409,18 @@ class DoConcurrentConversionPass
return;
}
+ llvm::DenseSet<fir::DoLoopOp> concurrentLoopsToSkip;
mlir::RewritePatternSet patterns(context);
patterns.insert<DoConcurrentConversion>(
- context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device);
+ context, mapTo == flangomp::DoConcurrentMappingKind::DCMK_Device,
+ concurrentLoopsToSkip);
mlir::ConversionTarget target(*context);
target.addDynamicallyLegalOp<fir::DoLoopOp>([&](fir::DoLoopOp op) {
// The goal is to handle constructs that eventually get lowered to
// `fir.do_loop` with the `unordered` attribute (e.g. array expressions).
// Currently, this is only enabled for the `do concurrent` construct since
// the pass runs early in the pipeline.
- return !op.getUnordered();
+ return !op.getUnordered() || concurrentLoopsToSkip.contains(op);
});
target.markUnknownOpDynamicallyLegal(
[](mlir::Operation *) { return true; });
diff --git a/flang/test/Transforms/DoConcurrent/basic_host.f90 b/flang/test/Transforms/DoConcurrent/basic_host.f90
index b569668ab0f0e..b80bc2f2dd038 100644
--- a/flang/test/Transforms/DoConcurrent/basic_host.f90
+++ b/flang/test/Transforms/DoConcurrent/basic_host.f90
@@ -1,7 +1,3 @@
-! Mark as xfail for now until we upstream the relevant part. This is just for
-! demo purposes at this point. Upstreaming this is the next step.
-! XFAIL: *
-
! Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %s -o - \
@@ -19,17 +15,17 @@ program do_concurrent_basic
! CHECK-NOT: fir.do_loop
- ! CHECK: omp.parallel {
-
- ! CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
- ! CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
-
! CHECK: %[[C1:.*]] = arith.constant 1 : i32
! CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
! CHECK: %[[C10:.*]] = arith.constant 10 : i32
! CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
! CHECK: %[[STEP:.*]] = arith.constant 1 : index
+ ! CHECK: omp.parallel {
+
+ ! CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+ ! CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
! CHECK: omp.wsloop {
! CHECK-NEXT: omp.loop_nest (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
! CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
diff --git a/flang/test/Transforms/DoConcurrent/basic_host.mlir b/flang/test/Transforms/DoConcurrent/basic_host.mlir
new file mode 100644
index 0000000000000..b53ecd687c039
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/basic_host.mlir
@@ -0,0 +1,62 @@
+// Tests mapping of a basic `do concurrent` loop to `!$omp parallel do`.
+
+// RUN: fir-opt --omp-do-concurrent-conversion="map-to=host" %s | FileCheck %s
+
+// CHECK-LABEL: func.func @do_concurrent_basic
+func.func @do_concurrent_basic() attributes {fir.bindc_name = "do_concurrent_basic"} {
+ // CHECK: %[[ARR:.*]]:2 = hlfir.declare %{{.*}}(%{{.*}}) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+
+ %0 = fir.alloca i32 {bindc_name = "i"}
+ %1:2 = hlfir.declare %0 {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+ %2 = fir.address_of(@_QFEa) : !fir.ref<!fir.array<10xi32>>
+ %c10 = arith.constant 10 : index
+ %3 = fir.shape %c10 : (index) -> !fir.shape<1>
+ %4:2 = hlfir.declare %2(%3) {uniq_name = "_QFEa"} : (!fir.ref<!fir.array<10xi32>>, !fir.shape<1>) -> (!fir.ref<!fir.array<10xi32>>, !fir.ref<!fir.array<10xi32>>)
+ %c1_i32 = arith.constant 1 : i32
+ %7 = fir.convert %c1_i32 : (i32) -> index
+ %c10_i32 = arith.constant 10 : i32
+ %8 = fir.convert %c10_i32 : (i32) -> index
+ %c1 = arith.constant 1 : index
+
+ // CHECK-NOT: fir.do_loop
+
+ // CHECK: %[[C1:.*]] = arith.constant 1 : i32
+ // CHECK: %[[LB:.*]] = fir.convert %[[C1]] : (i32) -> index
+ // CHECK: %[[C10:.*]] = arith.constant 10 : i32
+ // CHECK: %[[UB:.*]] = fir.convert %[[C10]] : (i32) -> index
+ // CHECK: %[[STEP:.*]] = arith.constant 1 : index
+
+ // CHECK: omp.parallel {
+
+ // CHECK-NEXT: %[[ITER_VAR:.*]] = fir.alloca i32 {bindc_name = "i"}
+ // CHECK-NEXT: %[[BINDING:.*]]:2 = hlfir.declare %[[ITER_VAR]] {uniq_name = "_QFEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
+
+ // CHECK: omp.wsloop {
+ // CHECK-NEXT: omp.loop_nest (%[[ARG0:.*]]) : index = (%[[LB]]) to (%[[UB]]) inclusive step (%[[STEP]]) {
+ // CHECK-NEXT: %[[IV_IDX:.*]] = fir.convert %[[ARG0]] : (index) -> i32
+ // CHECK-NEXT: fir.store %[[IV_IDX]] to %[[BINDING]]#1 : !fir.ref<i32>
+ // CHECK-NEXT: %[[IV_VAL1:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+ // CHECK-NEXT: %[[IV_VAL2:.*]] = fir.load %[[BINDING]]#0 : !fir.ref<i32>
+ // CHECK-NEXT: %[[IV_VAL_I64:.*]] = fir.convert %[[IV_VAL2]] : (i32) -> i64
+ // CHECK-NEXT: %[[ARR_ACCESS:.*]] = hlfir.designate %[[ARR]]#0 (%[[IV_VAL_I64]]) : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+ // CHECK-NEXT: hlfir.assign %[[IV_VAL1]] to %[[ARR_ACCESS]] : i32, !fir.ref<i32>
+ // CHECK-NEXT: omp.yield
+ // CHECK-NEXT: }
+ // CHECK-NEXT: }
+
+ // CHECK-NEXT: omp.terminator
+ // CHECK-NEXT: }
+ fir.do_loop %arg0 = %7 to %8 step %c1 unordered {
+ %13 = fir.convert %arg0 : (index) -> i32
+ fir.store %13 to %1#1 : !fir.ref<i32>
+ %14 = fir.load %1#0 : !fir.ref<i32>
+ %15 = fir.load %1#0 : !fir.ref<i32>
+ %16 = fir.convert %15 : (i32) -> i64
+ %17 = hlfir.designate %4#0 (%16) : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
+ hlfir.assign %14 to %17 : i32, !fir.ref<i32>
+ }
+
+ // CHECK-NOT: fir.do_loop
+
+ return
+ }
diff --git a/flang/test/Transforms/DoConcurrent/non_const_bounds.f90 b/flang/test/Transforms/DoConcurrent/non_const_bounds.f90
new file mode 100644
index 0000000000000..48e0afe6752b6
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/non_const_bounds.f90
@@ -0,0 +1,45 @@
+! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %s -o - \
+! RUN: | FileCheck %s
+
+program main
+ implicit none
+
+ call foo(10)
+
+ contains
+ subroutine foo(n)
+ implicit none
+ integer :: n
+ integer :: i
+ integer, dimension(n) :: a
+
+ do concurrent(i=1:n)
+ a(i) = i
+ end do
+ end subroutine
+
+end program main
+
+! CHECK: %[[N_DECL:.*]]:2 = hlfir.declare %{{.*}} dummy_scope %{{.*}} {uniq_name = "_QFFfooEn"}
+
+! CHECK: fir.load
+
+! CHECK: %[[LB:.*]] = fir.convert %{{c1_.*}} : (i32) -> index
+! CHECK: %[[N_VAL:.*]] = fir.load %[[N_DECL]]#0 : !fir.ref<i32>
+! CHECK: %[[UB:.*]] = fir.convert %[[N_VAL]] : (i32) -> index
+! CHECK: %[[C1:.*]] = arith.constant 1 : index
+
+! CHECK: omp.parallel {
+
+
! Verify that we resort to using the outside val...
[truncated]
Thank you, a couple of comments from me. But the general approach looks good.
```
/// `%ind_var_conv` -> `%ind_var` conversion sequence.
///
/// For why this is useful, see its use in `findLoopIndVarMemDecl`.
bool isIndVarUltimateOperand(mlir::Operation *op, fir::DoLoopOp doLoop) {
```
I'm not too sure about this function. What it actually checks is that either the induction variable of `doLoop` is one of the operands of `op`, or that the operation defining specifically the first operand of `op` has the induction variable as one of its operands, or that the operation defining its own first operand has it, etc.

Rather than checking whether the induction variable is ultimately one of its operands, it seems to check that one of its operands is or has been obtained from (among potentially other values) the induction variable, with the seemingly arbitrary condition that, if it's obtained indirectly, it must come from a chain of "first-operands".

If I understand it correctly, the first example would meet that condition, whereas the second wouldn't. I think that these should both be treated the same:

```
fir.do_loop %ind_var = ... unordered {
  %ind_var_conv = fir.convert %ind_var : (index) -> i32
  // isIndVarUltimateOperand() returns true
  %add = arith.addi %ind_var_conv, %x : i32
  ...
}
```

```
fir.do_loop %ind_var = ... unordered {
  %ind_var_conv = fir.convert %ind_var : (index) -> i32
  // isIndVarUltimateOperand() returns false
  %add = arith.addi %x, %ind_var_conv : i32
  ...
}
```

If this is intended to handle `fir.convert` and `fir.store`, I think it makes more sense to actually check for these specific operation types and usage patterns rather than creating a pseudo-generic check that can easily break. I don't think the solution is to check all operands, because at that point we're just checking whether the induction variable participated in the expression that was passed in, which doesn't seem to be what we want here.
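For concreteness, a minimal sketch of the kind of pattern-specific check being suggested (hypothetical code, not part of this patch; it assumes the usual `fir::StoreOp` and `fir::ConvertOp` accessors and the convert-then-store pattern flang emits for `do concurrent` iteration variables):

```cpp
// Hypothetical alternative to isIndVarUltimateOperand(): match the exact
// convert-then-store pattern instead of walking a chain of first operands.
static bool storesLoopIndVar(mlir::Operation *op, fir::DoLoopOp doLoop) {
  auto storeOp = mlir::dyn_cast<fir::StoreOp>(op);
  if (!storeOp)
    return false;

  mlir::Value stored = storeOp.getValue();
  // Look through the single `fir.convert` that casts the `index`-typed
  // induction variable to the iteration variable's declared type.
  if (auto convertOp = stored.getDefiningOp<fir::ConvertOp>())
    stored = convertOp.getValue();

  return stored == doLoop.getInductionVar();
}
```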
Simplified the logic to match what flang currently emits.
This PR is ready for another round of reviews. Please take a look when you have time 🙏
Ping! Please have a look when you have time.
The same issue (requesting reviews from an enormous number of people) happened here as well. No idea why. I just did a simple rebase on …
…126026) This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. This replaces llvm#77285 since we extended this WIP quite a bit on our fork over the past year. An important part of this PR is a document that describes the current status downstream, the upstreaming status, and next steps to make this pass much more useful. In addition to this document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command line options. This looks like a huge PR but a lot of the added stuff is documentation. It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achieved performance speed-ups that match pure OpenMP; for GPU mapping we are still working on extending our support for implicit memory mapping and locality specifiers. PR stack: - llvm#126026 (this PR) - llvm#127595 - llvm#127633 - llvm#127634 - llvm#127635
…27595) Upstreams the next part of the do concurrent to OpenMP mapping pass (from AMD's ROCm implementation). See llvm#126026 for more context. This PR adds loop nest detection logic. This enables us to discover multi-range do concurrent loops and then map them as "collapsed" loop nests to OpenMP. This is a follow-up for llvm#126026; only the latest commit is relevant. This is a replacement for llvm#127478 using a `/user/<username>/<branchname>` branch. PR stack: - llvm#126026 - llvm#127595 (this PR) - llvm#127633 - llvm#127634 - llvm#127635
…ructs (llvm#127633) Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass. This PR adds support for converting simple loops to the equivalent OpenMP constructs on the host: `omp parallel do`. Towards that end, we have to collect more information about loop nests, for which we add new utils in the `looputils` namespace. PR stack: - llvm#126026 - llvm#127595 - llvm#127633 (this PR) - llvm#127634 - llvm#127635
…lvm#127634) Adds support for converting multi-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest. PR stack: - llvm#126026 - llvm#127595 - llvm#127633 - llvm#127634 (this PR) - llvm#127635
…lvm#127635) Extends `do concurrent` mapping to handle "loop-local values". A loop-local value is one that is used exclusively inside the loop but allocated outside of it. This usually corresponds to temporary values that are used inside the loop body for initializing other variables, for example. After collecting these values, the pass localizes them to the loop nest by moving their allocations. PR stack: - llvm#126026 - llvm#127595 - llvm#127633 - llvm#127634 - llvm#127635 (this PR)
Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass. This PR adds support for converting simple loops to the equivalent OpenMP constructs on the host: `omp parallel do`. Towards that end, we have to collect more information about loop nests, for which we add new utils in the `looputils` namespace.

PR stack:
- `do concurrent` mapping #126026
- `do concurrent` loop-nest detection #127595
- `do concurrent` loops to OpenMP host constructs #127633 (this PR)
- `do concurrent` mapping to multi-range loops #127634
- `do concurrent` nests #127635
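For readers skimming the thread, the core of the new conversion, condensed from the `matchAndRewrite` hunk in the diff above (a control-flow sketch reusing the names from the patch, not verbatim compilable code):

```cpp
// Condensed sketch of DoConcurrentConversion::matchAndRewrite from this PR.
// Helper signatures are abbreviated from the hunks shown in the diff above.
looputils::LoopNestToIndVarMap loopNest;
if (failed(looputils::collectLoopNest(doLoop, loopNest)))
  mlir::emitWarning(doLoop.getLoc(),
                    "Some `do concurrent` loops are not perfectly-nested. "
                    "These will be serialized.");

mlir::IRMapping mapper;
// 1. Wrap the nest in `omp.parallel` and clone private allocs for each IV.
genParallelOp(doLoop.getLoc(), rewriter, loopNest, mapper);
// 2. Gather the bounds/steps of every collected loop as loop-nest clauses.
mlir::omp::LoopNestOperands loopNestClauseOps;
genLoopNestClauseOps(doLoop.getLoc(), rewriter, loopNest, mapper,
                     loopNestClauseOps);
// 3. Emit `omp.wsloop` wrapping `omp.loop_nest`, cloning the innermost
//    loop's body; the wsloop is composite only when mapping to the device.
genWsLoopOp(rewriter, loopNest.back().first, mapper, loopNestClauseOps,
            /*isComposite=*/mapToDevice);
// 4. Finally, erase the original `fir.do_loop` nest.
rewriter.eraseOp(doLoop);
```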