Commit 5d36448

[flang][OpenMP] Upstream first part of do concurrent mapping (#126026)
This PR starts the effort to upstream AMD's internal implementation of `do concurrent` to OpenMP mapping. It replaces #77285, since we extended this WIP quite a bit on our fork over the past year.

An important part of this PR is a document that describes the current status downstream, the upstreaming status, and the next steps to make this pass much more useful. In addition to that document, this PR also contains the skeleton of the pass (no useful transformations are done yet) and some testing for the added command-line options. This looks like a huge PR, but much of the added content is documentation.

It is also worth noting that the downstream pass has been validated on https://github.com/BerkeleyLab/fiats. For CPU mapping, this achieved performance speed-ups matching pure OpenMP; for GPU mapping, we are still working on extending our support for implicit memory mapping and locality specifiers.

PR stack:
- #126026 (this PR)
- #127595
- #127633
- #127634
- #127635
1 parent 730e8a4 commit 5d36448

File tree: 18 files changed, +506 −12 lines

clang/include/clang/Driver/Options.td (+4)

```diff
@@ -6976,6 +6976,10 @@ defm loop_versioning : BoolOptionWithoutMarshalling<"f", "version-loops-for-stri
 def fhermetic_module_files : Flag<["-"], "fhermetic-module-files">, Group<f_Group>,
   HelpText<"Emit hermetic module files (no nested USE association)">;
+def fdo_concurrent_to_openmp_EQ : Joined<["-"], "fdo-concurrent-to-openmp=">,
+  HelpText<"Try to map `do concurrent` loops to OpenMP [none|host|device]">,
+  Values<"none, host, device">;
 } // let Visibility = [FC1Option, FlangOption]

 def J : JoinedOrSeparate<["-"], "J">,
```

clang/lib/Driver/ToolChains/Flang.cpp (+2 −1)

```diff
@@ -158,7 +158,8 @@ void Flang::addCodegenOptions(const ArgList &Args,
     CmdArgs.push_back("-fversion-loops-for-stride");

   Args.addAllArgs(CmdArgs,
-                  {options::OPT_flang_experimental_hlfir,
+                  {options::OPT_fdo_concurrent_to_openmp_EQ,
+                   options::OPT_flang_experimental_hlfir,
                    options::OPT_flang_deprecated_no_hlfir,
                    options::OPT_fno_ppc_native_vec_elem_order,
                    options::OPT_fppc_native_vec_elem_order,
```
flang/docs/DoConcurrentConversionToOpenMP.md (new file, +155)

<!--===- docs/DoConcurrentMappingToOpenMP.md

Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
See https://llvm.org/LICENSE.txt for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

-->

# `DO CONCURRENT` mapping to OpenMP

```{contents}
---
local:
---
```
17+
This document seeks to describe the effort to parallelize `do concurrent` loops
18+
by mapping them to OpenMP worksharing constructs. The goals of this document
19+
are:
20+
* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP
21+
constructs.
22+
* Tracking the current status of such mapping.
23+
* Describing the limitations of the current implementation.
24+
* Describing next steps.
25+
* Tracking the current upstreaming status (from the AMD ROCm fork).
26+
27+
## Usage
28+
29+
In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new
30+
compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values:
31+
1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU.
32+
This maps such loops to the equivalent of `omp parallel do`.
33+
2. `device`: this maps `do concurrent` loops to run in parallel on a target device.
34+
This maps such loops to the equivalent of
35+
`omp target teams distribute parallel do`.
36+
3. `none`: this disables `do concurrent` mapping altogether. In that case, such
37+
loops are emitted as sequential loops.
38+
39+
The `-fdo-concurrent-to-openmp` compiler switch is currently available only when
40+
OpenMP is also enabled. So you need to provide the following options to flang in
41+
order to enable it:
42+
```
43+
flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ...
44+
```
45+
For mapping to device, the target device architecture must be specified as well.
46+
See `-fopenmp-targets` and `--offload-arch` for more info.
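
To make the mapping concrete, here is a hypothetical example (the loop and
array names are illustrative, not taken from the patch). Under
`-fopenmp -fdo-concurrent-to-openmp=host`, a loop like this is treated as the
equivalent of an `omp parallel do` loop:

```fortran
! Hypothetical example: with host mapping, this loop is treated as the
! equivalent of `!$omp parallel do`; with device mapping, as
! `!$omp target teams distribute parallel do`.
do concurrent(i = 1:n)
  a(i) = b(i) + c(i)
end do
```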
## Current status

Under the hood, `do concurrent` mapping is implemented in the
`DoConcurrentConversionPass`. This is still an experimental pass, which means
that:
* It has been tested in a very limited way so far.
* It has been tested mostly on simple synthetic inputs.

<!--
More details about the current status will be added along with relevant parts of
the implementation in later upstreaming patches.
-->

## Next steps

This section describes some of the open questions/issues that are not tackled
yet, even in the downstream implementation.

### Delayed privatization

So far, we emit the privatization logic for IVs inline in the parallel/target
region. This is enough for our purposes right now, since we do not
localize/privatize any sophisticated types of variables yet. Once we need more
advanced localization through `do concurrent`'s locality specifiers (see
below), delayed privatization will enable us to have much cleaner IR. Once the
upstream implementation of delayed privatization supports the constructs
required by the pass, we will move to it rather than inlined/early
privatization.

### Locality specifiers for `do concurrent`

Locality specifiers will enable the user to control the data environment of the
loop nest in a more fine-grained way. Implementing these specifiers at the
`FIR` dialect level is needed in order to support this in the
`DoConcurrentConversionPass`.

Such specifiers will also unlock a potential solution to the
non-perfectly-nested loops' IVs issue described above. In particular, for a
non-perfectly nested loop, one middle-ground proposal/solution would be to:
* Emit the loop's IV as shared/mapped, just like we do currently.
* Emit a warning that the IV of the loop is emitted as shared/mapped.
* Given support for `LOCAL`, recommend that the user explicitly
  localize/privatize the loop's IV if they choose to.
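
As a sketch of what that recommendation could look like (hypothetical code,
assuming `LOCAL` support is in place), the user would explicitly privatize the
inner IV instead of leaving it shared/mapped:

```fortran
! Hypothetical: `local(j)` privatizes the inner IV of a non-perfectly
! nested loop instead of it being emitted as shared/mapped.
do concurrent(i = 1:n) local(j)
  do j = 1, m
    a(i, j) = i + j
  end do
end do
```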
#### Sharing TableGen clause records from the OpenMP dialect

At the moment, the FIR dialect does not have a way to model locality specifiers
at the IR level. Instead, something similar to early/eager privatization in
OpenMP is done for the locality specifiers in `fir.do_loop` ops. Having
locality specifiers modeled in a way similar to delayed privatization (i.e. the
`omp.private` op) and reductions (i.e. the `omp.declare_reduction` op) can make
mapping `do concurrent` to OpenMP (and other parallel programming models) much
easier.

Therefore, one way to approach this problem is to extract the TableGen records
for the relevant OpenMP clauses into a shared dialect for "data environment
management" and use these shared records for OpenMP, `do concurrent`, and
possibly OpenACC as well.

#### Supporting reductions

Similar to locality specifiers, mapping reductions from `do concurrent` to
OpenMP is also still an open TODO. We can potentially extend the MLIR
infrastructure proposed in the previous section to share reduction records
among the different relevant dialects as well.
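
For reference, the kind of loop such reduction support would target looks like
the following (hypothetical example using the Fortran 2023 `reduce` locality
specifier):

```fortran
! Hypothetical: a sum reduction over `s` that would map to an OpenMP
! `reduction(+:s)` clause once reduction support lands in the pass.
s = 0.0
do concurrent(i = 1:n) reduce(+:s)
  s = s + a(i)
end do
```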
### More advanced detection of loop nests

As pointed out earlier, any intervening code between the headers of 2 nested
`do concurrent` loops prevents us from detecting this as a loop nest. In some
cases this is overly conservative. Therefore, more flexible loop-nest detection
logic needs to be implemented.
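
For example, a nest like the following (hypothetical) would currently not be
detected as a single loop nest because of the intervening statement between the
two loop headers:

```fortran
do concurrent(i = 1:n)
  x = i * 2  ! intervening code between the two `do concurrent` headers
  do concurrent(j = 1:m)
    a(i, j) = x + j
  end do
end do
```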
### Data-dependence analysis

Right now, we map loop nests without analyzing whether such mapping is safe to
do or not. We probably need to at least warn the user of unsafe loop nests due
to loop-carried dependencies.
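
A hypothetical example of such an unsafe loop: `do concurrent` asserts that
iterations may execute in any order, but a user can still write a loop-carried
dependence, which a parallel mapping would then silently miscompile:

```fortran
! Hypothetical: a(i) depends on a(i-1), so executing iterations in
! parallel is unsafe; a diagnostic would be helpful here.
do concurrent(i = 2:n)
  a(i) = a(i - 1) + 1.0
end do
```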
### Non-rectangular loop nests

So far, we did not need to use the pass for non-rectangular loop nests. For
example:
```fortran
do concurrent(i=1:n)
  do concurrent(j=i:n)
    ...
  end do
end do
```
We defer this to the (hopefully) near future, when we get the conversion into
good shape for the samples/projects at hand.

### Generalizing the pass to other parallel programming models

Once we have a stable and capable `do concurrent` to OpenMP mapping, we can
take this in a more generalized direction and allow the pass to target other
models, e.g. OpenACC. This goal should be kept in mind from the get-go, even
while only targeting OpenMP.

## Upstreaming status

- [x] Command line options for `flang` and `bbc`.
- [x] Conversion pass skeleton (no transformations happen yet).
- [x] Status description and tracking document (this document).
- [ ] Basic host/CPU mapping support.
- [ ] Basic device/GPU mapping support.
- [ ] More advanced host and device support (expanded to multiple items as needed).

flang/docs/index.md (+1)

```diff
@@ -51,6 +51,7 @@ on how to get in touch with us and to learn more about the current status.
    DebugGeneration
    Directives
    DoConcurrent
+   DoConcurrentConversionToOpenMP
    Extensions
    F202X
    FIRArrayOperations
```

flang/include/flang/Frontend/CodeGenOptions.def (+2)

```diff
@@ -43,5 +43,7 @@ ENUM_CODEGENOPT(DebugInfo, llvm::codegenoptions::DebugInfoKind, 4, llvm::codeg
 ENUM_CODEGENOPT(VecLib, llvm::driver::VectorLibrary, 3, llvm::driver::VectorLibrary::NoLibrary) ///< Vector functions library to use
 ENUM_CODEGENOPT(FramePointer, llvm::FramePointerKind, 2, llvm::FramePointerKind::None) ///< Enable the usage of frame pointers

+ENUM_CODEGENOPT(DoConcurrentMapping, DoConcurrentMappingKind, 2, DoConcurrentMappingKind::DCMK_None) ///< Map `do concurrent` to OpenMP
+
 #undef CODEGENOPT
 #undef ENUM_CODEGENOPT
```

flang/include/flang/Frontend/CodeGenOptions.h (+5)

```diff
@@ -15,6 +15,7 @@
 #ifndef FORTRAN_FRONTEND_CODEGENOPTIONS_H
 #define FORTRAN_FRONTEND_CODEGENOPTIONS_H

+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "llvm/Frontend/Debug/Options.h"
 #include "llvm/Frontend/Driver/CodeGenOptions.h"
 #include "llvm/Support/CodeGen.h"
@@ -143,6 +144,10 @@ class CodeGenOptions : public CodeGenOptionsBase {
   /// (-mlarge-data-threshold).
   uint64_t LargeDataThreshold;

+  /// Optionally map `do concurrent` loops to OpenMP. This is only valid if
+  /// OpenMP is enabled.
+  using DoConcurrentMappingKind = flangomp::DoConcurrentMappingKind;
+
   // Define accessors/mutators for code generation options of enumeration type.
 #define CODEGENOPT(Name, Bits, Default)
 #define ENUM_CODEGENOPT(Name, Type, Bits, Default) \
```

flang/include/flang/Optimizer/OpenMP/Passes.h (+2)

```diff
@@ -13,6 +13,7 @@
 #ifndef FORTRAN_OPTIMIZER_OPENMP_PASSES_H
 #define FORTRAN_OPTIMIZER_OPENMP_PASSES_H

+#include "flang/Optimizer/OpenMP/Utils.h"
 #include "mlir/Dialect/Func/IR/FuncOps.h"
 #include "mlir/IR/BuiltinOps.h"
 #include "mlir/Pass/Pass.h"
@@ -30,6 +31,7 @@ namespace flangomp {
 /// divided into units of work.
 bool shouldUseWorkshareLowering(mlir::Operation *op);

+std::unique_ptr<mlir::Pass> createDoConcurrentConversionPass(bool mapToDevice);
 } // namespace flangomp

 #endif // FORTRAN_OPTIMIZER_OPENMP_PASSES_H
```

flang/include/flang/Optimizer/OpenMP/Passes.td (+30)

```diff
@@ -50,6 +50,36 @@ def FunctionFilteringPass : Pass<"omp-function-filtering"> {
   ];
 }

+def DoConcurrentConversionPass : Pass<"omp-do-concurrent-conversion", "mlir::func::FuncOp"> {
+  let summary = "Map `DO CONCURRENT` loops to OpenMP worksharing loops.";
+
+  let description = [{ This is an experimental pass to map `DO CONCURRENT` loops
+     to their corresponding equivalent OpenMP worksharing constructs.
+
+     For now the following is supported:
+       - Mapping simple loops to `parallel do`.
+
+     Still TODO:
+       - More extensive testing.
+  }];
+
+  let dependentDialects = ["mlir::omp::OpenMPDialect"];
+
+  let options = [
+    Option<"mapTo", "map-to",
+           "flangomp::DoConcurrentMappingKind",
+           /*default=*/"flangomp::DoConcurrentMappingKind::DCMK_None",
+           "Try to map `do concurrent` loops to OpenMP [none|host|device]",
+           [{::llvm::cl::values(
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_None,
+                          "none", "Do not lower `do concurrent` to OpenMP"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Host,
+                          "host", "Lower to run in parallel on the CPU"),
+               clEnumValN(flangomp::DoConcurrentMappingKind::DCMK_Device,
+                          "device", "Lower to run in parallel on the GPU")
+           )}]>,
+  ];
+}

 // Needs to be scheduled on Module as we create functions in it
 def LowerWorkshare : Pass<"lower-workshare", "::mlir::ModuleOp"> {
```
flang/include/flang/Optimizer/OpenMP/Utils.h (new file, +26)

```cpp
//===-- Optimizer/OpenMP/Utils.h --------------------------------*- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Coding style: https://mlir.llvm.org/getting_started/DeveloperGuide/
//
//===----------------------------------------------------------------------===//

#ifndef FORTRAN_OPTIMIZER_OPENMP_UTILS_H
#define FORTRAN_OPTIMIZER_OPENMP_UTILS_H

namespace flangomp {

enum class DoConcurrentMappingKind {
  DCMK_None,  ///< Do not lower `do concurrent` to OpenMP.
  DCMK_Host,  ///< Lower to run in parallel on the CPU.
  DCMK_Device ///< Lower to run in parallel on the GPU.
};

} // namespace flangomp

#endif // FORTRAN_OPTIMIZER_OPENMP_UTILS_H
```

flang/include/flang/Optimizer/Passes/Pipelines.h (+15 −3)

```diff
@@ -128,16 +128,28 @@ void createHLFIRToFIRPassPipeline(
     mlir::PassManager &pm, bool enableOpenMP,
     llvm::OptimizationLevel optLevel = defaultOptLevel);

+struct OpenMPFIRPassPipelineOpts {
+  /// Whether code is being generated for a target device rather than the host
+  /// device.
+  bool isTargetDevice;
+
+  /// Controls how to map `do concurrent` loops; to device, host, or none at
+  /// all.
+  Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind
+      doConcurrentMappingKind;
+};
+
 /// Create a pass pipeline for handling certain OpenMP transformations needed
 /// prior to FIR lowering.
 ///
 /// WARNING: These passes must be run immediately after the lowering to ensure
 /// that the FIR is correct with respect to OpenMP operations/attributes.
 ///
 /// \param pm - MLIR pass manager that will hold the pipeline definition.
-/// \param isTargetDevice - Whether code is being generated for a target device
-/// rather than the host device.
-void createOpenMPFIRPassPipeline(mlir::PassManager &pm, bool isTargetDevice);
+/// \param opts - options to control OpenMP code-gen; see struct docs for more
+/// details.
+void createOpenMPFIRPassPipeline(mlir::PassManager &pm,
+                                 OpenMPFIRPassPipelineOpts opts);

 #if !defined(FLANG_EXCLUDE_CODEGEN)
 void createDebugPasses(mlir::PassManager &pm,
```

flang/lib/Frontend/CompilerInvocation.cpp (+28)

```diff
@@ -158,6 +158,32 @@ static bool parseDebugArgs(Fortran::frontend::CodeGenOptions &opts,
   return true;
 }

+static void parseDoConcurrentMapping(Fortran::frontend::CodeGenOptions &opts,
+                                     llvm::opt::ArgList &args,
+                                     clang::DiagnosticsEngine &diags) {
+  llvm::opt::Arg *arg =
+      args.getLastArg(clang::driver::options::OPT_fdo_concurrent_to_openmp_EQ);
+  if (!arg)
+    return;
+
+  using DoConcurrentMappingKind =
+      Fortran::frontend::CodeGenOptions::DoConcurrentMappingKind;
+  std::optional<DoConcurrentMappingKind> val =
+      llvm::StringSwitch<std::optional<DoConcurrentMappingKind>>(
+          arg->getValue())
+          .Case("none", DoConcurrentMappingKind::DCMK_None)
+          .Case("host", DoConcurrentMappingKind::DCMK_Host)
+          .Case("device", DoConcurrentMappingKind::DCMK_Device)
+          .Default(std::nullopt);
+
+  if (!val.has_value()) {
+    diags.Report(clang::diag::err_drv_invalid_value)
+        << arg->getAsString(args) << arg->getValue();
+    return;
+  }
+
+  opts.setDoConcurrentMapping(val.value());
+}
+
 static bool parseVectorLibArg(Fortran::frontend::CodeGenOptions &opts,
                               llvm::opt::ArgList &args,
                               clang::DiagnosticsEngine &diags) {
@@ -433,6 +459,8 @@ static void parseCodeGenArgs(Fortran::frontend::CodeGenOptions &opts,
                              clang::driver::options::OPT_funderscoring, false)) {
     opts.Underscoring = 0;
   }
+
+  parseDoConcurrentMapping(opts, args, diags);
 }

 /// Parses all target input arguments and populates the target
```

Note: an early `return` was added after reporting `err_drv_invalid_value`; without it, the subsequent `val.value()` call would assert on an invalid flag value.
