[flang][OpenMP] Extend `do concurrent` mapping to multi-range loops #127634
@llvm/pr-subscribers-flang-fir-hlfir

Author: Kareem Ergawy (ergawy)

Changes

Adds support for converting multi-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest.

Full diff: https://github.com/llvm/llvm-project/pull/127634.diff

3 Files Affected:
diff --git a/flang/docs/DoConcurrentConversionToOpenMP.md b/flang/docs/DoConcurrentConversionToOpenMP.md
index 914ace0813f0e..e7665a7751035 100644
--- a/flang/docs/DoConcurrentConversionToOpenMP.md
+++ b/flang/docs/DoConcurrentConversionToOpenMP.md
@@ -173,6 +173,35 @@ omp.parallel {
<!-- TODO -->
+### Multi-range loops
+
+The pass currently supports multi-range loops as well. Given the following
+example:
+
+```fortran
+ do concurrent(i=1:n, j=1:m)
+ a(i,j) = i * j
+ end do
+```
+
+The generated `omp.loop_nest` operation look like:
+
+```
+omp.loop_nest (%arg0, %arg1)
+ : index = (%17, %19) to (%18, %20)
+ inclusive step (%c1_2, %c1_4) {
+ fir.store %arg0 to %private_i#1 : !fir.ref<i32>
+ fir.store %arg1 to %private_j#1 : !fir.ref<i32>
+ ...
+ omp.yield
+}
+```
+
+It is worth noting that we have privatized versions for both iteration
+variables: `i` and `j`. These are locally allocated inside the parallel/target
+OpenMP region similar to what the single-range example in previous section
+shows.
+
<!--
More details about current status will be added along with relevant parts of the
implementation in later upstreaming patches.
diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
index dc797877ac87b..d86b9f822932d 100644
--- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -245,6 +245,96 @@ mlir::LogicalResult collectLoopNest(fir::DoLoopOp currentLoop,
return mlir::success();
}
+
+/// Prepares the `fir.do_loop` nest to be easily mapped to OpenMP. In
+/// particular, this function would take this input IR:
+/// ```
+/// fir.do_loop %i_iv = %i_lb to %i_ub step %i_step unordered {
+/// fir.store %i_iv to %i#1 : !fir.ref<i32>
+/// %j_lb = arith.constant 1 : i32
+/// %j_ub = arith.constant 10 : i32
+/// %j_step = arith.constant 1 : index
+///
+/// fir.do_loop %j_iv = %j_lb to %j_ub step %j_step unordered {
+/// fir.store %j_iv to %j#1 : !fir.ref<i32>
+/// ...
+/// }
+/// }
+/// ```
+///
+/// into the following form (using generic op form since the result is
+/// technically an invalid `fir.do_loop` op:
+///
+/// ```
+/// "fir.do_loop"(%i_lb, %i_ub, %i_step) <{unordered}> ({
+/// ^bb0(%i_iv: index):
+/// %j_lb = "arith.constant"() <{value = 1 : i32}> : () -> i32
+/// %j_ub = "arith.constant"() <{value = 10 : i32}> : () -> i32
+/// %j_step = "arith.constant"() <{value = 1 : index}> : () -> index
+///
+/// "fir.do_loop"(%j_lb, %j_ub, %j_step) <{unordered}> ({
+/// ^bb0(%new_i_iv: index, %new_j_iv: index):
+/// "fir.store"(%new_i_iv, %i#1) : (i32, !fir.ref<i32>) -> ()
+/// "fir.store"(%new_j_iv, %j#1) : (i32, !fir.ref<i32>) -> ()
+/// ...
+/// })
+/// ```
+///
+/// What happened to the loop nest is the following:
+///
+/// * the innermost loop's entry block was updated from having one operand to
+/// having `n` operands where `n` is the number of loops in the nest,
+///
+/// * the outer loop(s)' ops that update the IVs were sank inside the innermost
+/// loop (see the `"fir.store"(%new_i_iv, %i#1)` op above),
+///
+/// * the innermost loop's entry block's arguments were mapped in order from the
+/// outermost to the innermost IV.
+///
+/// With this IR change, we can directly inline the innermost loop's region into
+/// the newly generated `omp.loop_nest` op.
+///
+/// Note that this function has a pre-condition that \p loopNest consists of
+/// perfectly nested loops; i.e. there are no in-between ops between 2 nested
+/// loops except for the ops to setup the inner loop's LB, UB, and step. These
+/// ops are handled/cloned by `genLoopNestClauseOps(..)`.
+void sinkLoopIVArgs(mlir::ConversionPatternRewriter &rewriter,
+ looputils::LoopNestToIndVarMap &loopNest) {
+ if (loopNest.size() <= 1)
+ return;
+
+ fir::DoLoopOp innermostLoop = loopNest.back().first;
+ mlir::Operation &innermostFirstOp = innermostLoop.getRegion().front().front();
+
+ llvm::SmallVector<mlir::Type> argTypes;
+ llvm::SmallVector<mlir::Location> argLocs;
+
+ for (auto &[doLoop, indVarInfo] : llvm::drop_end(loopNest)) {
+ // Sink the IV update ops to the innermost loop. We need to do for all loops
+ // except for the innermost one, hence the `drop_end` usage above.
+ for (mlir::Operation *op : indVarInfo.indVarUpdateOps)
+ op->moveBefore(&innermostFirstOp);
+
+ argTypes.push_back(doLoop.getInductionVar().getType());
+ argLocs.push_back(doLoop.getInductionVar().getLoc());
+ }
+
+ mlir::Region &innermmostRegion = innermostLoop.getRegion();
+ // Extend the innermost entry block with arguments to represent the outer IVs.
+ innermmostRegion.addArguments(argTypes, argLocs);
+
+ unsigned idx = 1;
+ // In reverse, remap the IVs of the loop nest from the old values to the new
+ // ones. We do that in reverse since the first argument before this loop is
+ // the old IV for the innermost loop. Therefore, we want to replace it first
+ // before the old value (1st argument in the block) is remapped to be the IV
+ // of the outermost loop in the nest.
+ for (auto &[doLoop, _] : llvm::reverse(loopNest)) {
+ doLoop.getInductionVar().replaceAllUsesWith(
+ innermmostRegion.getArgument(innermmostRegion.getNumArguments() - idx));
+ ++idx;
+ }
+}
} // namespace looputils
class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
@@ -267,6 +357,7 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
"Some `do concurent` loops are not perfectly-nested. "
"These will be serialzied.");
+ looputils::sinkLoopIVArgs(rewriter, loopNest);
mlir::IRMapping mapper;
genParallelOp(doLoop.getLoc(), rewriter, loopNest, mapper);
mlir::omp::LoopNestOperands loopNestClauseOps;
diff --git a/flang/test/Transforms/DoConcurrent/multiple_iteration_ranges.f90 b/flang/test/Transforms/DoConcurrent/multiple_iteration_ranges.f90
new file mode 100644
index 0000000000000..232420fb07a75
--- /dev/null
+++ b/flang/test/Transforms/DoConcurrent/multiple_iteration_ranges.f90
@@ -0,0 +1,72 @@
+! Tests mapping of a `do concurrent` loop with multiple iteration ranges.
+
+! RUN: split-file %s %t
+
+! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %t/multi_range.f90 -o - \
+! RUN: | FileCheck %s
+
+!--- multi_range.f90
+program main
+ integer, parameter :: n = 20
+ integer, parameter :: m = 40
+ integer, parameter :: l = 60
+ integer :: a(n, m, l)
+
+ do concurrent(i=3:n, j=5:m, k=7:l)
+ a(i,j,k) = i * j + k
+ end do
+end
+
+! CHECK: func.func @_QQmain
+
+! CHECK: %[[C3:.*]] = arith.constant 3 : i32
+! CHECK: %[[LB_I:.*]] = fir.convert %[[C3]] : (i32) -> index
+! CHECK: %[[C20:.*]] = arith.constant 20 : i32
+! CHECK: %[[UB_I:.*]] = fir.convert %[[C20]] : (i32) -> index
+! CHECK: %[[STEP_I:.*]] = arith.constant 1 : index
+
+! CHECK: %[[C5:.*]] = arith.constant 5 : i32
+! CHECK: %[[LB_J:.*]] = fir.convert %[[C5]] : (i32) -> index
+! CHECK: %[[C40:.*]] = arith.constant 40 : i32
+! CHECK: %[[UB_J:.*]] = fir.convert %[[C40]] : (i32) -> index
+! CHECK: %[[STEP_J:.*]] = arith.constant 1 : index
+
+! CHECK: %[[C7:.*]] = arith.constant 7 : i32
+! CHECK: %[[LB_K:.*]] = fir.convert %[[C7]] : (i32) -> index
+! CHECK: %[[C60:.*]] = arith.constant 60 : i32
+! CHECK: %[[UB_K:.*]] = fir.convert %[[C60]] : (i32) -> index
+! CHECK: %[[STEP_K:.*]] = arith.constant 1 : index
+
+! CHECK: omp.parallel {
+
+! CHECK-NEXT: %[[ITER_VAR_I:.*]] = fir.alloca i32 {bindc_name = "i"}
+! CHECK-NEXT: %[[BINDING_I:.*]]:2 = hlfir.declare %[[ITER_VAR_I]] {uniq_name = "_QFEi"}
+
+! CHECK-NEXT: %[[ITER_VAR_J:.*]] = fir.alloca i32 {bindc_name = "j"}
+! CHECK-NEXT: %[[BINDING_J:.*]]:2 = hlfir.declare %[[ITER_VAR_J]] {uniq_name = "_QFEj"}
+
+! CHECK-NEXT: %[[ITER_VAR_K:.*]] = fir.alloca i32 {bindc_name = "k"}
+! CHECK-NEXT: %[[BINDING_K:.*]]:2 = hlfir.declare %[[ITER_VAR_K]] {uniq_name = "_QFEk"}
+
+! CHECK: omp.wsloop {
+! CHECK-NEXT: omp.loop_nest
+! CHECK-SAME: (%[[ARG0:[^[:space:]]+]], %[[ARG1:[^[:space:]]+]], %[[ARG2:[^[:space:]]+]])
+! CHECK-SAME: : index = (%[[LB_I]], %[[LB_J]], %[[LB_K]])
+! CHECK-SAME: to (%[[UB_I]], %[[UB_J]], %[[UB_K]]) inclusive
+! CHECK-SAME: step (%[[STEP_I]], %[[STEP_J]], %[[STEP_K]]) {
+
+! CHECK-NEXT: %[[IV_IDX_I:.*]] = fir.convert %[[ARG0]]
+! CHECK-NEXT: fir.store %[[IV_IDX_I]] to %[[BINDING_I]]#1
+
+! CHECK-NEXT: %[[IV_IDX_J:.*]] = fir.convert %[[ARG1]]
+! CHECK-NEXT: fir.store %[[IV_IDX_J]] to %[[BINDING_J]]#1
+
+! CHECK-NEXT: %[[IV_IDX_K:.*]] = fir.convert %[[ARG2]]
+! CHECK-NEXT: fir.store %[[IV_IDX_K]] to %[[BINDING_K]]#1
+
+! CHECK: omp.yield
+! CHECK-NEXT: }
+! CHECK-NEXT: }
+
+! CHECK-NEXT: omp.terminator
+! CHECK-NEXT: }
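
The argument remapping in `sinkLoopIVArgs` is easier to follow on a toy model. The sketch below is not the MLIR implementation: `ValueId`, `StoreOp`, `LoopModel`, and `Nest` are hypothetical stand-ins for SSA values, `fir.store` ops, `fir.do_loop` ops, and the loop nest, and the function only mirrors the three steps described in the doc comment (sink the outer IV updates, append block arguments, remap IVs in reverse).

```cpp
#include <cstddef>
#include <string>
#include <vector>

using ValueId = int; // hypothetical stand-in for an SSA value

// Simplified model of `fir.store %value to %dest`.
struct StoreOp {
  ValueId value;
  std::string dest;
};

// Simplified model of one `fir.do_loop` in a perfectly nested loop nest.
struct LoopModel {
  ValueId iv;                       // the value currently acting as this loop's IV
  std::vector<StoreOp> ivUpdateOps; // IV-update ops located in this loop's body
};

struct Nest {
  std::vector<LoopModel> loops;       // outermost first
  std::vector<ValueId> innermostArgs; // entry block arguments of the innermost loop
  std::vector<StoreOp> innermostBody; // ops in the innermost loop's body
};

void replaceAllUses(std::vector<StoreOp> &ops, ValueId from, ValueId to) {
  for (StoreOp &op : ops)
    if (op.value == from)
      op.value = to;
}

// Toy version of the transformation; `nextId` supplies fresh value ids.
void sinkLoopIVArgs(Nest &nest, ValueId &nextId) {
  if (nest.loops.size() <= 1)
    return;
  const std::size_t numOuter = nest.loops.size() - 1;

  // 1. Sink the outer loops' IV-update ops to the front of the innermost
  //    body, keeping outermost-to-innermost order.
  std::vector<StoreOp> sunk;
  for (std::size_t i = 0; i < numOuter; ++i) {
    for (const StoreOp &op : nest.loops[i].ivUpdateOps)
      sunk.push_back(op);
    nest.loops[i].ivUpdateOps.clear();
  }
  nest.innermostBody.insert(nest.innermostBody.begin(), sunk.begin(),
                            sunk.end());

  // 2. Append one new block argument per outer loop.
  for (std::size_t i = 0; i < numOuter; ++i)
    nest.innermostArgs.push_back(nextId++);

  // 3. Remap IVs in reverse: the innermost loop's old IV is argument 0, so
  //    its uses must move to the last argument before argument 0 is reused
  //    as the outermost loop's IV.
  std::size_t idx = 1;
  for (auto it = nest.loops.rbegin(); it != nest.loops.rend(); ++it, ++idx) {
    ValueId newIv = nest.innermostArgs[nest.innermostArgs.size() - idx];
    replaceAllUses(nest.innermostBody, it->iv, newIv);
    it->iv = newIv;
  }
}
```

For a two-loop nest with `i` outside `j`, the innermost block ends up with two arguments: the first is stored to `i` and the second to `j`, matching the "outermost to innermost" ordering in the doc comment. Remapping in forward order in this model would first rewrite the `i` store to use argument 0 and then move both stores to the last argument, which is the clash the reverse iteration avoids.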
Thank you Kareem, some small comments from me.
/// Collects the op(s) responsible for updating a loop's iteration variable with
/// the current iteration number. For example, for the input IR:
This function seems to do something more generic than that: it collects all of the ops that either take the loop's induction variable as argument or take a value as argument that has been calculated based on the result of another operation that directly or indirectly took the loop's induction variable as argument.
I guess that, similarly to another comment I left at a previous PR in the stack #127633 (comment), it's doing something more general than it states. If, like the other case, the idea is to just store the associated `fir.convert` and `fir.store` operations, perhaps it makes more sense to match that pattern specifically.
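
The difference between the two approaches discussed in this comment can be illustrated with a small model. This is only a sketch under assumptions: `Op` is a hypothetical flat representation of operations in a single block listed in definition order, and neither function is taken from the PR. `collectTransitiveUsers` gathers every op that depends on the induction variable directly or indirectly, while `matchConvertStore` accepts only a `fir.convert` of the induction variable followed by a `fir.store` of the converted value.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical flat model of an operation: name, operand value ids, and the
// id of its single result (-1 when the op produces no result).
struct Op {
  std::string name;
  std::vector<int> operands;
  int result;
};

// Generic approach: indices of all ops that use `iv` directly or through the
// result of another op that depends on `iv`. Assumes definitions precede uses.
std::vector<std::size_t> collectTransitiveUsers(const std::vector<Op> &ops,
                                                int iv) {
  std::set<int> dependsOnIv{iv};
  std::vector<std::size_t> users;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    for (int operand : ops[i].operands) {
      if (dependsOnIv.count(operand)) {
        users.push_back(i);
        if (ops[i].result >= 0)
          dependsOnIv.insert(ops[i].result);
        break;
      }
    }
  }
  return users;
}

// Specific approach: match only `fir.convert %iv` followed by a `fir.store`
// whose stored value is the convert's result. Returns {convert, store}
// indices, or an empty vector when the pattern is absent.
std::vector<std::size_t> matchConvertStore(const std::vector<Op> &ops, int iv) {
  for (std::size_t i = 0; i < ops.size(); ++i) {
    const Op &convert = ops[i];
    if (convert.name != "fir.convert" || convert.operands.size() != 1 ||
        convert.operands[0] != iv)
      continue;
    for (std::size_t j = i + 1; j < ops.size(); ++j) {
      const Op &store = ops[j];
      if (store.name == "fir.store" && !store.operands.empty() &&
          store.operands[0] == convert.result)
        return {i, j};
    }
  }
  return {};
}
```

On a body that converts and stores the induction variable and then also multiplies the converted value and stores the product, the generic walk returns all four ops, whereas the pattern match returns only the convert/store pair that updates the iteration variable.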
Simplified the function to match the current flang pattern. I will mark the above comments as resolved since they don't apply anymore.
Thank you Kareem, LGTM!
/// ```
///
/// into the following form (using generic op form since the result is
/// technically an invalid `fir.do_loop` op:
- /// technically an invalid `fir.do_loop` op:
+ /// technically an invalid `fir.do_loop` op):
/// The operation allocating memory for iteration variable.
mlir::Operation *iterVarMemDef;
};
/// the operation(s) updating the iteration variable with the current
- /// the operation(s) updating the iteration variable with the current
+ /// The operation(s) updating the iteration variable with the current
Again sorry, GH is acting weird!!
…126026) This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. This replaces llvm#77285 since we extended this WIP quite a bit on our fork over the past year. An important part of this PR is a document that describes the current status downstream, the upstreaming status, and next steps to make this pass much more useful. In addition to this document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command line options. This looks like a huge PR but a lot of the added stuff is documentation. It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For the CPU mapping, this achieved performance speed-ups that match pure OpenMP; for GPU mapping we are still working on extending our support for implicit memory mapping and locality specifiers.
PR stack:
- llvm#126026 (this PR)
- llvm#127595
- llvm#127633
- llvm#127634
- llvm#127635

…27595) Upstreams the next part of do concurrent to OpenMP mapping pass (from AMD's ROCm implementation). See llvm#126026 for more context. This PR adds loop nest detection logic. This enables us to discover multi-range do concurrent loops and then map them as "collapsed" loop nests to OpenMP. This is a follow-up for llvm#126026; only the latest commit is relevant. This is a replacement for llvm#127478 using a `/user/<username>/<branchname>` branch.
PR stack:
- llvm#126026
- llvm#127595 (this PR)
- llvm#127633
- llvm#127634
- llvm#127635

…ructs (llvm#127633) Upstreams one more part of the ROCm `do concurrent` to OpenMP mapping pass. This PR adds support for converting simple loops to the equivalent OpenMP constructs on the host: `omp parallel do`. Towards that end, we have to collect more information about loop nests, for which we add new utils in the `looputils` namespace.
PR stack:
- llvm#126026
- llvm#127595
- llvm#127633 (this PR)
- llvm#127634
- llvm#127635

…lvm#127634) Adds support for converting multi-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest.
PR stack:
- llvm#126026
- llvm#127595
- llvm#127633
- llvm#127634 (this PR)
- llvm#127635

…lvm#127635) Extends `do concurrent` mapping to handle "loop-local values". A loop-local value is one that is used exclusively inside the loop but allocated outside of it. This usually corresponds to temporary values that are used inside the loop body, for example to initialize other variables. After collecting these values, the pass localizes them to the loop nest by moving their allocations.
PR stack:
- llvm#126026
- llvm#127595
- llvm#127633
- llvm#127634
- llvm#127635 (this PR)
Merging since the only remaining check is the Windows pre-merge check and this has been stuck for a long time (tried restarting the check and still gets stuck).
Adds support for converting multi-range loops to OpenMP (on the host only for now). The changes here "prepare" a loop nest for collapsing by sinking iteration variables to the innermost `fir.do_loop` op in the nest.

PR stack:
- … `do concurrent` mapping #126026
- … `do concurrent` loop-nest detection. #127595
- … `do concurrent` loops to OpenMP host constructs #127633
- … `do concurrent` mapping to multi-range loops #127634 (this PR)
- … `do concurrent` nests #127635