|
| 1 | +<!--===- docs/DoConcurrentMappingToOpenMP.md |
| 2 | +
|
| 3 | + Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. |
| 4 | + See https://llvm.org/LICENSE.txt for license information. |
| 5 | + SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception |
| 6 | +
|
| 7 | +--> |
| 8 | + |
| 9 | +# `DO CONCURRENT` mapping to OpenMP |
| 10 | + |
| 11 | +```{contents} |
| 12 | +--- |
| 13 | +local: |
| 14 | +--- |
| 15 | +``` |
| 16 | + |
| 17 | +This document seeks to describe the effort to parallelize `do concurrent` loops |
| 18 | +by mapping them to OpenMP worksharing constructs. The goals of this document |
| 19 | +are: |
| 20 | +* Describing how to instruct `flang` to map `DO CONCURRENT` loops to OpenMP |
| 21 | + constructs. |
| 22 | +* Tracking the current status of such mapping. |
| 23 | +* Describing the limitations of the current implementation. |
| 24 | +* Describing next steps. |
| 25 | +* Tracking the current upstreaming status (from the AMD ROCm fork). |
| 26 | + |
| 27 | +## Usage |
| 28 | + |
| 29 | +In order to enable `do concurrent` to OpenMP mapping, `flang` adds a new |
| 30 | +compiler flag: `-fdo-concurrent-to-openmp`. This flag has 3 possible values: |
| 31 | +1. `host`: this maps `do concurrent` loops to run in parallel on the host CPU. |
| 32 | + This maps such loops to the equivalent of `omp parallel do`. |
| 33 | +2. `device`: this maps `do concurrent` loops to run in parallel on a target device. |
| 34 | + This maps such loops to the equivalent of |
| 35 | + `omp target teams distribute parallel do`. |
| 36 | +3. `none`: this disables `do concurrent` mapping altogether. In that case, such |
| 37 | + loops are emitted as sequential loops. |
| 38 | + |
| 39 | +The `-fdo-concurrent-to-openmp` compiler switch is currently available only when |
| 40 | +OpenMP is also enabled. So you need to provide the following options to flang in |
| 41 | +order to enable it: |
| 42 | +``` |
| 43 | +flang ... -fopenmp -fdo-concurrent-to-openmp=[host|device|none] ... |
| 44 | +``` |
| 45 | +For mapping to device, the target device architecture must be specified as well. |
| 46 | +See `-fopenmp-targets` and `--offload-arch` for more info. |
| 47 | + |
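As a rough illustration of the `host` mapping described above, the sketch below places a `do concurrent` loop next to a hand-written `omp parallel do` loop over the same iteration space. The program name, array names, and sizes are invented for this example, and the second loop is only a conceptual equivalent, not the IR that the pass actually emits.

```fortran
program saxpy_example
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), y(n), y_ref(n), a
  integer :: i

  a = 2.0
  x = 1.0
  y = 3.0
  y_ref = 3.0

  ! Loop that -fdo-concurrent-to-openmp=host would consider for mapping.
  do concurrent (i = 1:n)
    y(i) = a * x(i) + y(i)
  end do

  ! Conceptual hand-written equivalent using an OpenMP worksharing loop
  ! (an assumption for illustration, not the output of the pass).
  !$omp parallel do
  do i = 1, n
    y_ref(i) = a * x(i) + y_ref(i)
  end do
  !$omp end parallel do

  ! Both loops perform the same element-wise update.
  if (all(y == y_ref)) then
    print *, "results match"
  else
    print *, "results differ"
  end if
end program saxpy_example
```

Assuming the file is saved under a hypothetical name such as `saxpy.f90`, it could be compiled with `flang -fopenmp -fdo-concurrent-to-openmp=host saxpy.f90`.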
| 48 | +## Current status |
| 49 | + |
| 50 | +Under the hood, `do concurrent` mapping is implemented in the |
| 51 | +`DoConcurrentConversionPass`. This is still an experimental pass, which means |
| 52 | +that: |
| 53 | +* It has been tested in a very limited way so far. |
| 54 | +* It has been tested mostly on simple synthetic inputs. |
| 55 | + |
| 56 | +<!-- |
| 57 | +More details about current status will be added along with relevant parts of the |
| 58 | +implementation in later upstreaming patches. |
| 59 | +--> |
| 60 | + |
| 61 | +## Next steps |
| 62 | + |
| 63 | +This section describes some of the open questions/issues that are not tackled yet |
| 64 | +even in the downstream implementation. |
| 65 | + |
| 66 | +### Delayed privatization |
| 67 | + |
| 68 | +So far, we emit the privatization logic for IVs inline in the parallel/target |
| 69 | +region. This is enough for our purposes right now since we don't |
| 70 | +localize/privatize any sophisticated types of variables yet. Once we have need |
| 71 | +for more advanced localization through `do concurrent`'s locality specifiers |
| 72 | +(see below), delayed privatization will enable us to have a much cleaner IR. |
| 73 | +Once the upstream implementation of delayed privatization supports the |
| 74 | +constructs required by the pass, we will move to it rather than inlined/early |
| 75 | +privatization. |
| 76 | + |
| 77 | +### Locality specifiers for `do concurrent` |
| 78 | + |
| 79 | +Locality specifiers will enable the user to control the data environment of the |
| 80 | +loop nest in a more fine-grained way. Implementing these specifiers on the |
| 81 | +`FIR` dialect level is needed in order to support this in the |
| 82 | +`DoConcurrentConversionPass`. |
| 83 | + |
| 84 | +Such specifiers will also unlock a potential solution to the issue of IVs in |
| 85 | +non-perfectly-nested loops. In particular, for a |
| 86 | +non-perfectly nested loop, one middle-ground proposal/solution would be to: |
| 87 | +* Emit the loop's IV as shared/mapped just like we do currently. |
| 88 | +* Emit a warning that the IV of the loop is emitted as shared/mapped. |
| 89 | +* Given support for `LOCAL`, we can recommend the user to explicitly |
| 90 | + localize/privatize the loop's IV if they choose to. |
| 91 | + |
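To make the `LOCAL` recommendation more concrete, here is a minimal sketch of how a user might localize an IV by hand. It assumes the IV in question belongs to an ordinary `do` loop nested inside the `do concurrent` loop; the program and variable names are hypothetical, and the locality specifiers require a compiler that supports Fortran 2018.

```fortran
program local_iv_example
  implicit none
  integer, parameter :: n = 4, m = 3
  integer :: a(n, m)
  integer :: i, j

  ! LOCAL(j) asks for a separate copy of j in every iteration of the
  ! outer loop, so the inner loop's IV is not shared between iterations.
  ! SHARED(a) states that a refers to the variable outside the construct.
  do concurrent (i = 1:n) local(j) shared(a)
    do j = 1, m
      a(i, j) = i * j
    end do
  end do

  print *, a
end program local_iv_example
```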
| 92 | +#### Sharing TableGen clause records from the OpenMP dialect |
| 93 | + |
| 94 | +At the moment, the FIR dialect does not have a way to model locality specifiers |
| 95 | +on the IR level. Instead, something similar to early/eager privatization in OpenMP |
| 96 | +is done for the locality specifiers in `fir.do_loop` ops. Having locality specifiers |
| 97 | +modelled in a way similar to delayed privatization (i.e. the `omp.private` op) and |
| 98 | +reductions (i.e. the `omp.declare_reduction` op) can make mapping `do concurrent` |
| 99 | +to OpenMP (and other parallel programming models) much easier. |
| 100 | + |
| 101 | +Therefore, one way to approach this problem is to extract the TableGen records |
| 102 | +for relevant OpenMP clauses in a shared dialect for "data environment management" |
| 103 | +and use these shared records for OpenMP, `do concurrent`, and possibly OpenACC |
| 104 | +as well. |
| 105 | + |
| 106 | +#### Supporting reductions |
| 107 | + |
| 108 | +Similar to locality specifiers, mapping reductions from `do concurrent` to OpenMP |
| 109 | +is also still an open TODO. We can potentially extend the MLIR infrastructure |
| 110 | +proposed in the previous section to share reduction records among the different |
| 111 | +relevant dialects as well. |
| 112 | + |
| 113 | +### More advanced detection of loop nests |
| 114 | + |
| 115 | +Any intervening code between the headers of 2 nested |
| 116 | +`do concurrent` loops prevents us from detecting this as a loop nest. In some |
| 117 | +cases this is overly conservative. Therefore, a more flexible detection logic |
| 118 | +of loop nests needs to be implemented. |
| 119 | + |
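The difference between the two situations can be sketched as follows. The first nest has nothing between the two loop headers, while the second has an assignment between them, which is the kind of intervening code this section refers to. The program, array names, and sizes are made up for illustration.

```fortran
program nest_example
  implicit none
  integer, parameter :: n = 4, m = 3
  integer :: a(n, m), b(n, m), scale(n)
  integer :: i, j

  ! Perfectly nested: nothing sits between the two loop headers.
  do concurrent (i = 1:n)
    do concurrent (j = 1:m)
      a(i, j) = i * j
    end do
  end do

  ! Not perfectly nested: the assignment to scale(i) is intervening code
  ! between the headers of the outer and inner loops.
  do concurrent (i = 1:n)
    scale(i) = 2 * i
    do concurrent (j = 1:m)
      b(i, j) = scale(i) * j
    end do
  end do

  print *, a
  print *, b
end program nest_example
```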
| 120 | +### Data-dependence analysis |
| 121 | + |
| 122 | +Right now, we map loop nests without analysing whether such mapping is safe to |
| 123 | +do or not. We probably need to at least warn the user of unsafe loop nests due |
| 124 | +to loop-carried dependencies. |
| 125 | + |
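For reference, a loop-carried dependency looks like the sketch below, where each iteration reads a value written by the previous one. The loop is deliberately written as an ordinary `do` loop, since expressing it as `do concurrent` would not conform to the Fortran standard; a data-dependence analysis would be expected to flag this pattern. All names are hypothetical.

```fortran
program dependence_example
  implicit none
  integer, parameter :: n = 8
  integer :: a(n), i

  a = 0
  a(1) = 1

  ! Each iteration reads the value written by the previous one, so the
  ! iterations cannot safely run in an arbitrary order or in parallel.
  do i = 2, n
    a(i) = a(i - 1) + 1
  end do

  print *, a
end program dependence_example
```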
| 126 | +### Non-rectangular loop nests |
| 127 | + |
| 128 | +So far, we have not needed to use the pass for non-rectangular loop nests. For |
| 129 | +example: |
| 130 | +```fortran |
| 131 | +do concurrent(i=1:n) |
| 132 | + do concurrent(j=i:n) |
| 133 | + ... |
| 134 | + end do |
| 135 | +end do |
| 136 | +``` |
| 137 | +We defer this to the (hopefully) near future when we get the conversion in a |
| 138 | +good shape for the samples/projects at hand. |
| 139 | + |
| 140 | +### Generalizing the pass to other parallel programming models |
| 141 | + |
| 142 | +Once we have a stable and capable `do concurrent` to OpenMP mapping, we can take |
| 143 | +this in a more generalized direction and allow the pass to target other models; |
| 144 | +e.g. OpenACC. This goal should be kept in mind from the get-go even while only |
| 145 | +targeting OpenMP. |
| 146 | + |
| 147 | + |
| 148 | +## Upstreaming status |
| 149 | + |
| 150 | +- [x] Command line options for `flang` and `bbc`. |
| 151 | +- [x] Conversion pass skeleton (no transformations happen yet). |
| 152 | +- [x] Status description and tracking document (this document). |
| 153 | +- [ ] Basic host/CPU mapping support. |
| 154 | +- [ ] Basic device/GPU mapping support. |
| 155 | +- [ ] More advanced host and device support (expanded to multiple items as needed). |