Commit 9339d07

[flang][OpenMP] Upstream do concurrent loop-nest detection.
Upstreams the next part of the `do concurrent` to OpenMP mapping pass (from AMD's ROCm implementation). See #126026 for more context. This PR adds loop nest detection logic, which enables us to discover multi-range `do concurrent` loops and then map them as "collapsed" loop nests to OpenMP.
1 parent 178f525 commit 9339d07

3 files changed: +309 -0 lines changed


flang/docs/DoConcurrentConversionToOpenMP.md

+85
@@ -53,6 +53,79 @@ that:
* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.

### Loop nest detection

On the `FIR` dialect level, the following loop:
```fortran
do concurrent(i=1:n, j=1:m, k=1:o)
  a(i,j,k) = i + j + k
end do
```
is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
contains **only** the following:
1. The operations needed to assign/update the outer loop's induction variable.
1. The inner loop itself.

So the MLIR structure for the above example looks similar to the following:
```
fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
  %i_idx_2 = fir.convert %i_idx : (index) -> i32
  fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>

  fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
    %j_idx_2 = fir.convert %j_idx : (index) -> i32
    fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>

    fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
      %k_idx_2 = fir.convert %k_idx : (index) -> i32
      fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>

      ... loop nest body goes here ...
    }
  }
}
```
This applies to multi-range loops in general; they are represented in the IR as
a nest of `fir.do_loop` ops with the above nesting structure.

Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
loops and map them as "collapsed" loops in OpenMP.
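
To make "collapsed" concrete: conceptually, collapsing a loop nest fuses its ranges into a single linear iteration space from which each iteration's original indices can be recovered. The C++ sketch below illustrates only that idea and is not part of the pass. `delinearize`, `sumNested`, and `sumCollapsed` are hypothetical helper names, indices are zero-based for simplicity, and how a collapsed nest is actually scheduled is up to the OpenMP implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: recover zero-based (i, j, k) from a linear index into
// an n x m x o iteration space, with k varying fastest (the innermost range).
std::array<std::size_t, 3> delinearize(std::size_t linear, std::size_t m,
                                       std::size_t o) {
  assert(m > 0 && o > 0 && "range extents must be non-zero");
  return {linear / (m * o), (linear / o) % m, linear % o};
}

// Reference version: the original three-level nest.
std::size_t sumNested(std::size_t n, std::size_t m, std::size_t o) {
  std::size_t sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < m; ++j)
      for (std::size_t k = 0; k < o; ++k)
        sum += i + j + k;
  return sum;
}

// Collapsed version: a single loop over all n * m * o iterations. This single
// loop is the unit a runtime could distribute across threads.
std::size_t sumCollapsed(std::size_t n, std::size_t m, std::size_t o) {
  std::size_t sum = 0;
  for (std::size_t linear = 0; linear < n * m * o; ++linear) {
    const std::array<std::size_t, 3> idx = delinearize(linear, m, o);
    sum += idx[0] + idx[1] + idx[2];
  }
  return sum;
}
```

Both functions visit the same set of `(i, j, k)` triples, so they compute the same sum. Collapsing therefore requires the nest to be perfect: any statement between the loops would have no place in the fused iteration space.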

#### Further info regarding loop nest detection

Loop nest detection currently covers only the scenario described in the previous
section; it can be extended in the future to cover more cases. At the moment, for
the following loop nest, even though both loops are perfectly nested, only the
outer loop is parallelized:
```fortran
do concurrent(i=1:n)
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

Similarly, for the following loop nest, even though the intervening statement `x = 41`
does not have any memory effects that would affect parallelization, the nest is
not detected as a perfect loop nest either; only the outer loop is parallelized.

```fortran
do concurrent(i=1:n)
  x = 41
  do concurrent(j=1:m)
    a(i,j) = i * j
  end do
end do
```

The above also has the consequence that the `j` variable will **not** be
privatized in the OpenMP parallel/target region. In other words, it will be
treated as if it were a `shared` variable. For more details about privatization,
see the "Data environment" section below.

See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
of what is and is not detected as a perfect loop nest.
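
The rule these examples exercise can be mimicked outside MLIR. The following C++ sketch is a toy model, not the pass's code: `ToyLoop`, its fields, and `perfectNestDepth` are hypothetical, and ops are represented by plain strings. It mirrors the idea that an outer loop qualifies only if its body holds nothing but its own induction-variable setup and exactly one nested concurrent loop. The real pass derives the setup ops from a forward slice of the induction variable rather than taking them as input.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a concurrent loop op.
struct ToyLoop {
  std::set<std::string> bodyOps;    // non-loop, non-terminator ops in the body
  std::set<std::string> ivSetupOps; // ops that only assign/update the IV
  std::vector<ToyLoop> nested;      // concurrent loops directly in the body
};

// The outer loop's body may contain only its IV setup ops plus exactly one
// nested concurrent loop.
bool isPerfectlyNested(const ToyLoop &outer) {
  return outer.nested.size() == 1 && outer.bodyOps == outer.ivSetupOps;
}

// Walks inward from `root` and counts how many loops form a perfect nest.
std::size_t perfectNestDepth(const ToyLoop &root) {
  std::size_t depth = 1;
  const ToyLoop *current = &root;
  while (isPerfectlyNested(*current)) {
    assert(!current->nested.empty());
    current = &current->nested.front();
    ++depth;
  }
  return depth;
}
```

In this model, a three-range loop whose outer bodies contain only their setup ops yields a depth of 3, while adding an intervening op such as `x = 41` to the outer body stops detection at the outer loop, giving a depth of 1.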

<!--
More details about current status will be added along with relevant parts of the
implementation in later upstreaming patches.
@@ -63,6 +136,17 @@ implementation in later upstreaming patches.
This section describes some of the open questions/issues that are not tackled yet
even in the downstream implementation.

### Separate MLIR op for `do concurrent`

At the moment, both increment and concurrent loops are represented by one MLIR
op, `fir.do_loop`, where concurrent loops are distinguished by the `unordered`
attribute. This is not ideal since the `fir.do_loop` op supports only single
iteration ranges. Consequently, to model multi-range `do concurrent` loops, flang
emits a nest of `fir.do_loop` ops which we have to detect in the OpenMP conversion
pass to handle multi-range loops. Instead, it would be better to model multi-range
concurrent loops using a separate op, which would make the IR more representative
of the input Fortran code and also easier to detect and transform.

### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
@@ -150,6 +234,7 @@ targeting OpenMP.
- [x] Command line options for `flang` and `bbc`.
- [x] Conversion pass skeleton (no transformations happen yet).
- [x] Status description and tracking document (this document).
- [x] Loop nest detection to identify multi-range loops.
- [ ] Basic host/CPU mapping support.
- [ ] Basic device/GPU mapping support.
- [ ] More advanced host and device support (expanded to multiple items as needed).

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

+135
@@ -9,8 +9,10 @@
#include "flang/Optimizer/Dialect/FIROps.h"
#include "flang/Optimizer/OpenMP/Passes.h"
#include "flang/Optimizer/OpenMP/Utils.h"
#include "mlir/Analysis/SliceAnalysis.h"
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
#include "mlir/Transforms/DialectConversion.h"
#include "mlir/Transforms/RegionUtils.h"

namespace flangomp {
#define GEN_PASS_DEF_DOCONCURRENTCONVERSIONPASS
@@ -21,6 +23,131 @@ namespace flangomp {
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE << "]: ")

namespace {
namespace looputils {
using LoopNest = llvm::SetVector<fir::DoLoopOp>;

/// Loop \p innerLoop is considered perfectly-nested inside \p outerLoop iff
/// there are no operations in \p outerLoop's body other than:
///
/// 1. the operations needed to assign/update \p outerLoop's induction variable.
/// 2. \p innerLoop itself.
///
/// \return true if \p innerLoop is perfectly nested inside \p outerLoop
/// according to the above definition.
bool isPerfectlyNested(fir::DoLoopOp outerLoop, fir::DoLoopOp innerLoop) {
  mlir::ForwardSliceOptions forwardSliceOptions;
  forwardSliceOptions.inclusive = true;
  // The following will be used as an example to clarify the internals of this
  // function:
  // ```
  // 1. fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
  // 2.   %i_idx_2 = fir.convert %i_idx : (index) -> i32
  // 3.   fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>
  //
  // 4.   fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
  // 5.     %j_idx_2 = fir.convert %j_idx : (index) -> i32
  // 6.     fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>
  //        ... loop nest body, possibly uses %i_idx ...
  //      }
  //    }
  // ```
  // In this example, the `j` loop is perfectly nested inside the `i` loop and
  // below is how we find that.

  // We don't care about the outer-loop's induction variable's uses within the
  // inner-loop, so we filter out these uses.
  //
  // This filter tells `getForwardSlice` (below) to only collect operations
  // which produce results defined above (i.e. outside) the inner-loop's body.
  //
  // Since `outerLoop.getInductionVar()` is a block argument (to the
  // outer-loop's body), the filter effectively collects uses of
  // `outerLoop.getInductionVar()` inside the outer-loop but outside the
  // inner-loop.
  forwardSliceOptions.filter = [&](mlir::Operation *op) {
    return mlir::areValuesDefinedAbove(op->getResults(), innerLoop.getRegion());
  };

  llvm::SetVector<mlir::Operation *> indVarSlice;
  // The forward slice of the `i` loop's IV will be the 2 ops in lines 2 & 3
  // above. Uses of `%i_idx` inside the `j` loop are not collected because of
  // the filter.
  mlir::getForwardSlice(outerLoop.getInductionVar(), &indVarSlice,
                        forwardSliceOptions);
  llvm::DenseSet<mlir::Operation *> indVarSet(indVarSlice.begin(),
                                              indVarSlice.end());

  llvm::DenseSet<mlir::Operation *> outerLoopBodySet;
  // The following walk collects ops inside `outerLoop` that are **not**:
  // * the outer-loop itself,
  // * or the inner-loop,
  // * or the `fir.result` op (the outer-loop's terminator).
  //
  // For the above example, this will also populate `outerLoopBodySet` with ops
  // in lines 2 & 3 since we skip the `i` loop, the `j` loop, and the terminator.
  outerLoop.walk<mlir::WalkOrder::PreOrder>([&](mlir::Operation *op) {
    if (op == outerLoop)
      return mlir::WalkResult::advance();

    if (op == innerLoop)
      return mlir::WalkResult::skip();

    if (mlir::isa<fir::ResultOp>(op))
      return mlir::WalkResult::advance();

    outerLoopBodySet.insert(op);
    return mlir::WalkResult::advance();
  });

  // If `outerLoopBodySet` ends up having the same ops as `indVarSet`, then
  // `outerLoop` only contains ops that set up its induction variable +
  // `innerLoop` + the `fir.result` terminator. In other words, `innerLoop` is
  // perfectly nested inside `outerLoop`.
  bool result = (outerLoopBodySet == indVarSet);
  mlir::Location loc = outerLoop.getLoc();
  LLVM_DEBUG(DBGS() << "Loop pair starting at location " << loc << " is"
                    << (result ? "" : " not") << " perfectly nested\n");

  return result;
}

/// Starting with `currentLoop` collect a perfectly nested loop nest, if any.
/// This function collects as many loops in the nest as possible; in case it
/// fails to recognize a certain nested loop as part of the nest, it just
/// returns the parent loops it discovered before.
mlir::LogicalResult collectLoopNest(fir::DoLoopOp currentLoop,
                                    LoopNest &loopNest) {
  assert(currentLoop.getUnordered());

  while (true) {
    loopNest.insert(currentLoop);
    llvm::SmallVector<fir::DoLoopOp> unorderedLoops;

    for (auto nestedLoop : currentLoop.getRegion().getOps<fir::DoLoopOp>())
      if (nestedLoop.getUnordered())
        unorderedLoops.push_back(nestedLoop);

    if (unorderedLoops.empty())
      break;

    // Having more than one unordered loop means that we are not dealing with a
    // perfect loop nest (i.e. a multi-range `do concurrent` loop); which is the
    // case we are after here.
    if (unorderedLoops.size() > 1)
      return mlir::failure();

    fir::DoLoopOp nestedUnorderedLoop = unorderedLoops.front();

    if (!isPerfectlyNested(currentLoop, nestedUnorderedLoop))
      return mlir::failure();

    currentLoop = nestedUnorderedLoop;
  }

  return mlir::success();
}
} // namespace looputils

class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
public:
  using mlir::OpConversionPattern<fir::DoLoopOp>::OpConversionPattern;
@@ -31,6 +158,14 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
  mlir::LogicalResult
  matchAndRewrite(fir::DoLoopOp doLoop, OpAdaptor adaptor,
                  mlir::ConversionPatternRewriter &rewriter) const override {
    looputils::LoopNest loopNest;
    bool hasRemainingNestedLoops =
        failed(looputils::collectLoopNest(doLoop, loopNest));
    if (hasRemainingNestedLoops)
      mlir::emitWarning(doLoop.getLoc(),
                        "Some `do concurrent` loops are not perfectly-nested. "
                        "These will be serialized.");

    // TODO This will be filled in with the next PRs that upstream the rest of
    // the ROCm implementation.
    return mlir::success();
@@ -0,0 +1,89 @@
! Tests loop-nest detection algorithm for do-concurrent mapping.

! REQUIRES: asserts

! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host \
! RUN:   -mmlir -debug %s -o - 2> %t.log || true

! RUN: FileCheck %s < %t.log

program main
  implicit none

contains

subroutine foo(n)
  implicit none
  integer :: n, m
  integer :: i, j, k
  integer :: x
  integer, dimension(n) :: a
  integer, dimension(n, n, n) :: b

  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is perfectly nested
  do concurrent(i=1:n, j=1:bar(n*m, n/m))
    a(i) = n
  end do

  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is perfectly nested
  do concurrent(i=bar(n, x):n, j=1:bar(n*m, n/m))
    a(i) = n
  end do

  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
  do concurrent(i=bar(n, x):n)
    do concurrent(j=1:bar(n*m, n/m))
      a(i) = n
    end do
  end do

  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
  do concurrent(i=1:n)
    x = 10
    do concurrent(j=1:m)
      b(i,j,k) = i * j + k
    end do
  end do

  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
  do concurrent(i=1:n)
    do concurrent(j=1:m)
      b(i,j,k) = i * j + k
    end do
    x = 10
  end do

  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is not perfectly nested
  do concurrent(i=1:n)
    do concurrent(j=1:m)
      b(i,j,k) = i * j + k
      x = 10
    end do
  end do

  ! Verify the (i,j) and (j,k) pairs of loops are detected as perfectly nested.
  !
  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 3]]:{{.*}}) is perfectly nested
  ! CHECK: Loop pair starting at location
  ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is perfectly nested
  do concurrent(i=bar(n, x):n, j=1:bar(n*m, n/m), k=1:bar(n*m, bar(n*m, n/m)))
    a(i) = n
  end do
end subroutine

pure function bar(n, m)
  implicit none
  integer, intent(in) :: n, m
  integer :: bar

  bar = n + m
end function

end program main
