Skip to content

Commit 7db8234

Browse files
committed
[flang][OpenMP] Handle "loop-local values" in do concurrent nests
Extends `do concurrent` mapping to handle "loop-local values". A loop-local value is one that is used exclusively inside the loop but allocated outside of it. This usually corresponds to temporary values that are used inside the loop body for initialzing other variables for example. After collecting these values, the pass localizes them to the loop nest by moving their allocations.
1 parent ef56b53 commit 7db8234

File tree

3 files changed

+180
-1
lines changed

3 files changed

+180
-1
lines changed

flang/docs/DoConcurrentConversionToOpenMP.md

+51
Original file line numberDiff line numberDiff line change
@@ -202,6 +202,57 @@ variables: `i` and `j`. These are locally allocated inside the parallel/target
202202
OpenMP region similar to what the single-range example in previous section
203203
shows.
204204

205+
### Data environment
206+
207+
By default, variables that are used inside a `do concurrent` loop nest are
208+
either treated as `shared` in case of mapping to `host`, or mapped into the
209+
`target` region using a `map` clause in case of mapping to `device`. The only
210+
exceptions to this are:
211+
1. the loop's iteration variable(s) (IV) of **perfect** loop nests. In that
212+
case, for each IV, we allocate a local copy as shown by the mapping
213+
examples above.
214+
1. any values that are from allocations outside the loop nest and used
215+
exclusively inside of it. In such cases, a local privatized
216+
copy is created in the OpenMP region to prevent multiple teams of threads
217+
from accessing and destroying the same memory block, which causes runtime
218+
issues. For an example of such cases, see
219+
`flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90`.
220+
221+
Implicit mapping detection (for mapping to the target device) is still quite
222+
limited and work to make it smarter is underway for both OpenMP in general
223+
and `do concurrent` mapping.
224+
225+
#### Non-perfectly-nested loops' IVs
226+
227+
For non-perfectly-nested loops, the IVs are still treated as `shared` or
228+
`map` entries as pointed out above. This **might not** be consistent with what
229+
the Fortran specification tells us. In particular, taking the following
230+
snippets from the spec (version 2023) into account:
231+
232+
> § 3.35
233+
> ------
234+
> construct entity
235+
> entity whose identifier has the scope of a construct
236+
237+
> § 19.4
238+
> ------
239+
> A variable that appears as an index-name in a FORALL or DO CONCURRENT
240+
> construct [...] is a construct entity. A variable that has LOCAL or
241+
> LOCAL_INIT locality in a DO CONCURRENT construct is a construct entity.
242+
> [...]
243+
> The name of a variable that appears as an index-name in a DO CONCURRENT
244+
> construct, FORALL statement, or FORALL construct has a scope of the statement
245+
> or construct. A variable that has LOCAL or LOCAL_INIT locality in a DO
246+
> CONCURRENT construct has the scope of that construct.
247+
248+
From the above quotes, it seems there is an equivalence between the IV of a `do
249+
concurrent` loop and a variable with a `LOCAL` locality specifier (equivalent
250+
to OpenMP's `private` clause). Which means that we should probably
251+
localize/privatize a `do concurrent` loop's IV even if it is not perfectly
252+
nested in the nest we are parallelizing. For now, however, we **do not** do
253+
that as pointed out previously. In the near future, we propose a middle-ground
254+
solution (see the Next steps section for more details).
255+
205256
<!--
206257
More details about current status will be added along with relevant parts of the
207258
implementation in later upstreaming patches.

flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp

+67-1
Original file line numberDiff line numberDiff line change
@@ -313,6 +313,64 @@ void sinkLoopIVArgs(mlir::ConversionPatternRewriter &rewriter,
313313
++idx;
314314
}
315315
}
316+
317+
/// Collects values that are local to a loop: "loop-local values". A loop-local
318+
/// value is one that is used exclusively inside the loop but allocated outside
319+
/// of it. This usually corresponds to temporary values that are used inside the
320+
/// loop body for initialzing other variables for example.
321+
///
322+
/// See `flang/test/Transforms/DoConcurrent/locally_destroyed_temp.f90` for an
323+
/// example of why we need this.
324+
///
325+
/// \param [in] doLoop - the loop within which the function searches for values
326+
/// used exclusively inside.
327+
///
328+
/// \param [out] locals - the list of loop-local values detected for \p doLoop.
329+
void collectLoopLocalValues(fir::DoLoopOp doLoop,
330+
llvm::SetVector<mlir::Value> &locals) {
331+
doLoop.walk([&](mlir::Operation *op) {
332+
for (mlir::Value operand : op->getOperands()) {
333+
if (locals.contains(operand))
334+
continue;
335+
336+
bool isLocal = true;
337+
338+
if (!mlir::isa_and_present<fir::AllocaOp>(operand.getDefiningOp()))
339+
continue;
340+
341+
// Values defined inside the loop are not interesting since they do not
342+
// need to be localized.
343+
if (doLoop->isAncestor(operand.getDefiningOp()))
344+
continue;
345+
346+
for (auto *user : operand.getUsers()) {
347+
if (!doLoop->isAncestor(user)) {
348+
isLocal = false;
349+
break;
350+
}
351+
}
352+
353+
if (isLocal)
354+
locals.insert(operand);
355+
}
356+
});
357+
}
358+
359+
/// For a "loop-local" value \p local within a loop's scope, localizes that
360+
/// value within the scope of the parallel region the loop maps to. Towards that
361+
/// end, this function moves the allocation of \p local within \p allocRegion.
362+
///
363+
/// \param local - the value used exclusively within a loop's scope (see
364+
/// collectLoopLocalValues).
365+
///
366+
/// \param allocRegion - the parallel region where \p local's allocation will be
367+
/// privatized.
368+
///
369+
/// \param rewriter - builder used for updating \p allocRegion.
370+
static void localizeLoopLocalValue(mlir::Value local, mlir::Region &allocRegion,
371+
mlir::ConversionPatternRewriter &rewriter) {
372+
rewriter.moveOpBefore(local.getDefiningOp(), &allocRegion.front().front());
373+
}
316374
} // namespace looputils
317375

318376
class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
@@ -339,13 +397,21 @@ class DoConcurrentConversion : public mlir::OpConversionPattern<fir::DoLoopOp> {
339397
"Some `do concurent` loops are not perfectly-nested. "
340398
"These will be serialized.");
341399

400+
llvm::SetVector<mlir::Value> locals;
401+
looputils::collectLoopLocalValues(loopNest.back().first, locals);
342402
looputils::sinkLoopIVArgs(rewriter, loopNest);
403+
343404
mlir::IRMapping mapper;
344-
genParallelOp(doLoop.getLoc(), rewriter, loopNest, mapper);
405+
mlir::omp::ParallelOp parallelOp =
406+
genParallelOp(doLoop.getLoc(), rewriter, loopNest, mapper);
345407
mlir::omp::LoopNestOperands loopNestClauseOps;
346408
genLoopNestClauseOps(doLoop.getLoc(), rewriter, loopNest, mapper,
347409
loopNestClauseOps);
348410

411+
for (mlir::Value local : locals)
412+
looputils::localizeLoopLocalValue(local, parallelOp.getRegion(),
413+
rewriter);
414+
349415
mlir::omp::LoopNestOp ompLoopNest =
350416
genWsLoopOp(rewriter, loopNest.back().first, mapper, loopNestClauseOps,
351417
/*isComposite=*/mapToDevice);
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
! Tests that "loop-local values" are properly handled by localizing them to the
2+
! body of the loop nest. See `collectLoopLocalValues` and `localizeLoopLocalValue`
3+
! for a definition of "loop-local values" and how they are handled.
4+
5+
! RUN: %flang_fc1 -emit-hlfir -fopenmp -fdo-concurrent-to-openmp=host %s -o - \
6+
! RUN: | FileCheck %s
7+
module struct_mod
8+
type test_struct
9+
integer, allocatable :: x_
10+
end type
11+
12+
interface test_struct
13+
pure module function construct_from_components(x) result(struct)
14+
implicit none
15+
integer, intent(in) :: x
16+
type(test_struct) struct
17+
end function
18+
end interface
19+
end module
20+
21+
submodule(struct_mod) struct_sub
22+
implicit none
23+
24+
contains
25+
module procedure construct_from_components
26+
struct%x_ = x
27+
end procedure
28+
end submodule struct_sub
29+
30+
program main
31+
use struct_mod, only : test_struct
32+
33+
implicit none
34+
type(test_struct), dimension(10) :: a
35+
integer :: i
36+
integer :: total
37+
38+
do concurrent (i=1:10)
39+
a(i) = test_struct(i)
40+
end do
41+
42+
do i=1,10
43+
total = total + a(i)%x_
44+
end do
45+
46+
print *, "total =", total
47+
end program main
48+
49+
! CHECK: omp.parallel {
50+
! CHECK: %[[LOCAL_TEMP:.*]] = fir.alloca !fir.type<_QMstruct_modTtest_struct{x_:!fir.box<!fir.heap<i32>>}> {bindc_name = ".result"}
51+
! CHECK: omp.wsloop {
52+
! CHECK: omp.loop_nest {{.*}} {
53+
! CHECK: %[[TEMP_VAL:.*]] = fir.call @_QMstruct_modPconstruct_from_components
54+
! CHECK: fir.save_result %[[TEMP_VAL]] to %[[LOCAL_TEMP]]
55+
! CHECK: %[[EMBOXED_LOCAL:.*]] = fir.embox %[[LOCAL_TEMP]]
56+
! CHECK: %[[CONVERTED_LOCAL:.*]] = fir.convert %[[EMBOXED_LOCAL]]
57+
! CHECK: fir.call @_FortranADestroy(%[[CONVERTED_LOCAL]])
58+
! CHECK: omp.yield
59+
! CHECK: }
60+
! CHECK: }
61+
! CHECK: omp.terminator
62+
! CHECK: }

0 commit comments

Comments
 (0)