Commit df59706

[flang][OpenMP] Upstream do concurrent loop-nest detection. (llvm#127595)
Upstreams the next part of the `do concurrent` to OpenMP mapping pass (from AMD's ROCm implementation). See llvm#126026 for more context.

This PR adds loop nest detection logic. This enables us to discover multi-range `do concurrent` loops and then map them as "collapsed" loop nests to OpenMP.

This is a follow-up to llvm#126026; only the latest commit is relevant.

This is a replacement for llvm#127478 using a `/user/<username>/<branchname>` branch.

PR stack:
- llvm#126026
- llvm#127595 (this PR)
- llvm#127633
- llvm#127634
- llvm#127635
1 parent 2f9d714 commit df59706

3 files changed

+149
-32
lines changed

flang/docs/DoConcurrentConversionToOpenMP.md

+85
@@ -53,6 +53,79 @@ that:
 * It has been tested in a very limited way so far.
 * It has been tested mostly on simple synthetic inputs.

+### Loop nest detection
+
+On the `FIR` dialect level, the following loop:
+```fortran
+  do concurrent(i=1:n, j=1:m, k=1:o)
+    a(i,j,k) = i + j + k
+  end do
+```
+is modelled as a nest of `fir.do_loop` ops such that an outer loop's region
+contains **only** the following:
+1. The operations needed to assign/update the outer loop's induction variable.
+1. The inner loop itself.
+
+So the MLIR structure for the above example looks similar to the following:
+```
+  fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
+    %i_idx_2 = fir.convert %i_idx : (index) -> i32
+    fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>
+
+    fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
+      %j_idx_2 = fir.convert %j_idx : (index) -> i32
+      fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>
+
+      fir.do_loop %k_idx = %40 to %42 step %c1_5 unordered {
+        %k_idx_2 = fir.convert %k_idx : (index) -> i32
+        fir.store %k_idx_2 to %k_iv#1 : !fir.ref<i32>
+
+        ... loop nest body goes here ...
+      }
+    }
+  }
+```
+This applies to multi-range loops in general; they are represented in the IR as
+a nest of `fir.do_loop` ops with the above nesting structure.
+
+Therefore, the pass detects such "perfectly" nested loop ops to identify multi-range
+loops and map them as "collapsed" loops in OpenMP.
+
+#### Further info regarding loop nest detection
+
+Loop nest detection is currently limited to the scenario described in the previous
+section. However, this is quite limited and can be extended in the future to cover
+more cases. At the moment, for the following loop nest, even though both loops are
+perfectly nested, only the outer loop is parallelized:
+```fortran
+do concurrent(i=1:n)
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+Similarly, for the following loop nest, even though the intervening statement `x = 41`
+does not have any memory effects that would affect parallelization, this nest is
+not parallelized either (only the outer loop is).
+
+```fortran
+do concurrent(i=1:n)
+  x = 41
+  do concurrent(j=1:m)
+    a(i,j) = i * j
+  end do
+end do
+```
+
+The above also has the consequence that the `j` variable will **not** be
+privatized in the OpenMP parallel/target region. In other words, it will be
+treated as if it was a `shared` variable. For more details about privatization,
+see the "Data environment" section below.
+
+See `flang/test/Transforms/DoConcurrent/loop_nest_test.f90` for more examples
+of what is and is not detected as a perfect loop nest.
+
 <!--
 More details about current status will be added along with relevant parts of the
 implementation in later upstreaming patches.
@@ -63,6 +136,17 @@ implementation in later upstreaming patches.
 This section describes some of the open questions/issues that are not tackled yet
 even in the downstream implementation.

+### Separate MLIR op for `do concurrent`
+
+At the moment, both increment and concurrent loops are represented by one MLIR
+op: `fir.do_loop`; where we differentiate concurrent loops with the `unordered`
+attribute. This is not ideal since the `fir.do_loop` op supports only single
+iteration ranges. Consequently, to model multi-range `do concurrent` loops, flang
+emits a nest of `fir.do_loop` ops which we have to detect in the OpenMP conversion
+pass to handle multi-range loops. Instead, it would be better to model multi-range
+concurrent loops using a separate op, which would make the IR more representative
+of the input Fortran code and also easier to detect and transform.
+
 ### Delayed privatization

 So far, we emit the privatization logic for IVs inline in the parallel/target
@@ -150,6 +234,7 @@ targeting OpenMP.
 - [x] Command line options for `flang` and `bbc`.
 - [x] Conversion pass skeleton (no transformations happen yet).
 - [x] Status description and tracking document (this document).
+- [x] Loop nest detection to identify multi-range loops.
 - [ ] Basic host/CPU mapping support.
 - [ ] Basic device/GPU mapping support.
 - [ ] More advanced host and device support (expanded to multiple items as needed).
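
For readers less familiar with the term, the "collapsed" mapping mentioned in the documentation above is conceptually close to what the `collapse` clause expresses in source-level OpenMP: several perfectly nested loops are treated as one combined iteration space. The C++ sketch below only illustrates that concept for the `a(i,j,k) = i + j + k` example. It is not output of the pass, which works on MLIR ops, and the function name `fillCollapsed` and the flat array layout are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: a source-level analogue of a three-range
// `do concurrent(i=1:n, j=1:m, k=1:o)` loop. `fillCollapsed` is a
// hypothetical name, not something defined by flang.
std::vector<int> fillCollapsed(int n, int m, int o) {
  std::vector<int> a(static_cast<std::size_t>(n) * m * o);
  int *data = a.data();
  // The three loops are perfectly nested (no statements between the loop
  // headers), which is what makes `collapse(3)` applicable. Without OpenMP
  // enabled, the pragma is ignored and the loops run sequentially.
#pragma omp parallel for collapse(3)
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < m; ++j)
      for (int k = 0; k < o; ++k) {
        // Fortran indices are 1-based, hence the +1 on each index.
        std::size_t flat = (static_cast<std::size_t>(i) * m + j) * o + k;
        data[flat] = (i + 1) + (j + 1) + (k + 1);
      }
  return a;
}
```

Each iteration writes a distinct element, so the result does not depend on whether the loop runs in parallel or sequentially.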

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

+60-30
@@ -316,39 +316,64 @@ void collectIndirectConstOpChain(mlir::Operation *link,
 }

 /// Loop \p innerLoop is considered perfectly-nested inside \p outerLoop iff
-/// there are no operations in \p outerloop's other than:
+/// there are no operations in \p outerLoop's body other than:
 ///
-/// 1. the operations needed to assing/update \p outerLoop's induction variable.
+/// 1. the operations needed to assign/update \p outerLoop's induction variable.
 /// 2. \p innerLoop itself.
 ///
 /// \p return true if \p innerLoop is perfectly nested inside \p outerLoop
 /// according to the above definition.
 bool isPerfectlyNested(fir::DoLoopOp outerLoop, fir::DoLoopOp innerLoop) {
-  mlir::BackwardSliceOptions backwardSliceOptions;
-  backwardSliceOptions.inclusive = true;
-  // We will collect the backward slices for innerLoop's LB, UB, and step.
-  // However, we want to limit the scope of these slices to the scope of
-  // outerLoop's region.
-  backwardSliceOptions.filter = [&](mlir::Operation *op) {
-    return !mlir::areValuesDefinedAbove(op->getResults(),
-                                        outerLoop.getRegion());
-  };
-
   mlir::ForwardSliceOptions forwardSliceOptions;
   forwardSliceOptions.inclusive = true;
+  // The following will be used as an example to clarify the internals of this
+  // function:
+  // ```
+  // 1. fir.do_loop %i_idx = %34 to %36 step %c1 unordered {
+  // 2.   %i_idx_2 = fir.convert %i_idx : (index) -> i32
+  // 3.   fir.store %i_idx_2 to %i_iv#1 : !fir.ref<i32>
+  //
+  // 4.   fir.do_loop %j_idx = %37 to %39 step %c1_3 unordered {
+  // 5.     %j_idx_2 = fir.convert %j_idx : (index) -> i32
+  // 6.     fir.store %j_idx_2 to %j_iv#1 : !fir.ref<i32>
+  //        ... loop nest body, possibly uses %i_idx ...
+  //      }
+  //    }
+  // ```
+  // In this example, the `j` loop is perfectly nested inside the `i` loop and
+  // below is how we find that.
+
   // We don't care about the outer-loop's induction variable's uses within the
   // inner-loop, so we filter out these uses.
+  //
+  // This filter tells `getForwardSlice` (below) to only collect operations
+  // which produce results defined above (i.e. outside) the inner-loop's body.
+  //
+  // Since `outerLoop.getInductionVar()` is a block argument (to the
+  // outer-loop's body), the filter effectively collects uses of
+  // `outerLoop.getInductionVar()` inside the outer-loop but outside the
+  // inner-loop.
   forwardSliceOptions.filter = [&](mlir::Operation *op) {
     return mlir::areValuesDefinedAbove(op->getResults(), innerLoop.getRegion());
   };

   llvm::SetVector<mlir::Operation *> indVarSlice;
+  // The forward slice of the `i` loop's IV will be the 2 ops in lines 2 & 3
+  // above. Uses of `%i_idx` inside the `j` loop are not collected because of
+  // the filter.
   mlir::getForwardSlice(outerLoop.getInductionVar(), &indVarSlice,
                         forwardSliceOptions);
-  llvm::DenseSet<mlir::Operation *> innerLoopSetupOpsSet(indVarSlice.begin(),
-                                                         indVarSlice.end());
-
-  llvm::DenseSet<mlir::Operation *> loopBodySet;
+  llvm::DenseSet<mlir::Operation *> indVarSet(indVarSlice.begin(),
+                                              indVarSlice.end());
+
+  llvm::DenseSet<mlir::Operation *> outerLoopBodySet;
+  // The following walk collects ops inside `outerLoop` that are **not**:
+  // * the outer-loop itself,
+  // * or the inner-loop,
+  // * or the `fir.result` op (the outer-loop's terminator).
+  //
+  // For the above example, this will also populate `outerLoopBodySet` with ops
+  // in lines 2 & 3 since we skip the `i` loop, the `j` loop, and the terminator.
   outerLoop.walk<mlir::WalkOrder::PreOrder>([&](mlir::Operation *op) {
     if (op == outerLoop)
       return mlir::WalkResult::advance();
@@ -359,43 +384,48 @@ bool isPerfectlyNested(fir::DoLoopOp outerLoop, fir::DoLoopOp innerLoop) {
     if (mlir::isa<fir::ResultOp>(op))
       return mlir::WalkResult::advance();

-    loopBodySet.insert(op);
+    outerLoopBodySet.insert(op);
     return mlir::WalkResult::advance();
   });

-  bool result = (loopBodySet == innerLoopSetupOpsSet);
+  // If `outerLoopBodySet` ends up having the same ops as `indVarSet`, then
+  // `outerLoop` only contains ops that set up its induction variable +
+  // `innerLoop` + the `fir.result` terminator. In other words, `innerLoop` is
+  // perfectly nested inside `outerLoop`.
+  bool result = (outerLoopBodySet == indVarSet);
   mlir::Location loc = outerLoop.getLoc();
   LLVM_DEBUG(DBGS() << "Loop pair starting at location " << loc << " is"
                     << (result ? "" : " not") << " perfectly nested\n");

   return result;
 }

-/// Starting with `outerLoop` collect a perfectly nested loop nest, if any. This
-/// function collects as much as possible loops in the nest; it case it fails to
-/// recognize a certain nested loop as part of the nest it just returns the
-/// parent loops it discovered before.
+/// Starting with `currentLoop` collect a perfectly nested loop nest, if any.
+/// This function collects as many loops as possible in the nest; in case it
+/// fails to recognize a certain nested loop as part of the nest it just returns
+/// the parent loops it discovered before.
 mlir::LogicalResult collectLoopNest(fir::DoLoopOp currentLoop,
                                     LoopNestToIndVarMap &loopNest) {
   assert(currentLoop.getUnordered());

   while (true) {
-    loopNest.try_emplace(
-        currentLoop,
-        InductionVariableInfo{
-            findLoopIndVarMemDecl(currentLoop),
-            std::move(looputils::extractIndVarUpdateOps(currentLoop))});
-
-    auto directlyNestedLoops = currentLoop.getRegion().getOps<fir::DoLoopOp>();
+    loopNest.insert(
+        {currentLoop,
+         InductionVariableInfo{
+             findLoopIndVarMemDecl(currentLoop),
+             std::move(looputils::extractIndVarUpdateOps(currentLoop))}});
     llvm::SmallVector<fir::DoLoopOp> unorderedLoops;

-    for (auto nestedLoop : directlyNestedLoops)
+    for (auto nestedLoop : currentLoop.getRegion().getOps<fir::DoLoopOp>())
       if (nestedLoop.getUnordered())
         unorderedLoops.push_back(nestedLoop);

     if (unorderedLoops.empty())
       break;

+    // Having more than one unordered loop means that we are not dealing with a
+    // perfect loop nest (i.e. a multi-range `do concurrent` loop); which is the
+    // case we are after here.
     if (unorderedLoops.size() > 1)
       return mlir::failure();
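
The set comparison at the heart of `isPerfectlyNested` can be hard to see among the MLIR plumbing, so here is a minimal standalone sketch of the same idea. It is a toy model under stated assumptions: `ToyOp` and `isPerfectlyNestedToy` are hypothetical names, the forward slice is replaced by a precomputed flag, and only ops directly in the outer loop's body are modelled. It does not use or reproduce the MLIR API.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an op directly inside the outer loop's body.
// None of these names exist in flang or MLIR.
struct ToyOp {
  std::string name;   // printable name, e.g. "fir.convert"
  bool inIndVarSlice; // assumed result of the IV forward-slice query
  bool isInnerLoop;   // true for the candidate inner loop
  bool isTerminator;  // true for the `fir.result`-like terminator
};

// Mirrors the comparison `outerLoopBodySet == indVarSet`: the nest counts as
// perfect when the ops left after skipping the inner loop and the terminator
// are exactly the ops in the induction variable's slice.
bool isPerfectlyNestedToy(const std::vector<ToyOp> &outerBody) {
  std::set<std::size_t> indVarSet;
  std::set<std::size_t> outerLoopBodySet;
  for (std::size_t idx = 0; idx < outerBody.size(); ++idx) {
    const ToyOp &op = outerBody[idx];
    if (op.inIndVarSlice)
      indVarSet.insert(idx);
    if (op.isInnerLoop || op.isTerminator)
      continue; // skipped, like the inner loop and `fir.result` in the walk
    outerLoopBodySet.insert(idx);
  }
  return outerLoopBodySet == indVarSet;
}
```

In this model, a body made of a convert, a store, the inner loop, and the terminator is accepted. Adding an unrelated op before the inner loop, like the intervening `x = 41` statement in the documentation, makes the two sets differ, so the nest is rejected.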

flang/test/Transforms/DoConcurrent/loop_nest_test.f90

+4-2
@@ -67,13 +67,15 @@ subroutine foo(n)
     end do
   end do

+  ! Verify the (i,j) and (j,k) pairs of loops are detected as perfectly nested.
+  !
+  ! CHECK: Loop pair starting at location
+  ! CHECK: loc("{{.*}}":[[# @LINE + 3]]:{{.*}}) is perfectly nested
   ! CHECK: Loop pair starting at location
   ! CHECK: loc("{{.*}}":[[# @LINE + 1]]:{{.*}}) is perfectly nested
   do concurrent(i=bar(n, x):n, j=1:bar(n*m, n/m), k=1:bar(n*m, bar(n*m, n/m)))
     a(i) = n
   end do
-
-
 end subroutine

 pure function bar(n, m)
