feat[next][dace]: Dace fieldview transformations #1594
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,130 @@ | ||
--- | ||
tags: [backend, dace, optimization] | ||
--- | ||
|
||
# Canonical Form of an SDFG in GT4Py (Especially for Optimizations) | ||
|
||
- **Status**: valid | ||
- **Authors**: Philip Müller (@philip-paul-mueller) | ||
- **Created**: 2024-08-27 | ||
|
||
In the context of the implementation of the new DaCe fieldview, we decided on a particular canonical form of the SDFG. | ||
Its main intent is to reduce the complexity of the GT4Py-specific transformations. | ||
|
||
## Context | ||
|
||
The canonical form outlined in this document was mainly designed from the perspective of the optimization pipeline. | ||
Thus it emphasizes a form that can be handled in a simple and efficient way by a transformation. | ||
In the pipeline we distinguish between: | ||
|
||
- Intrastate optimization: optimization of the data flow within states. | ||
- Interstate optimization: optimization between states; these are transformations that are _intended_ to _reduce_ the number of states. | ||
|
||
The current (GT4Py) pipeline mainly focuses on intrastate optimization and relies on DaCe, especially its simplify pass, for interstate optimizations. | ||
|
||
## Decision | ||
|
||
The canonical form is defined by several rules that affect different aspects of an SDFG and what a transformation can assume. | ||
This allows simplifying the implementation of certain transformations. | ||
|
||
#### General Aspects | ||
|
||
The following rules especially affect transformations and how they operate: | ||
|
||
1. Intrastate transformations and interstate transformations must run separately and cannot be mixed in the same (DaCe) pipeline. | ||
|
||
- [Rationale]: As a consequence the number of "interstate transients" (transients that are used in multiple states) remains constant during intrastate transformations. | ||
- [Note 1]: It is allowed to run them one after another, as long as they are strictly separated. | ||
- [Note 2]: It is allowed for an _intrastate_ transformation to act in a way that allows state fusion by later interstate transformations. | ||
- [Note 3]: The DaCe simplification pass violates this rule; for that reason this pass must always be called on its own, see also rule 2. | ||
|
||
2. It is invalid to call the simplification pass directly, i.e. the usage of `SDFG.simplify()` is not allowed. The only valid way to call _simplify()_ is to call the `gt_simplify()` function provided by GT4Py. | ||
Review comments on this line:
- Reviewer: I propose to apply this rule in the current PR; I mean you can replace the call to
- Author: The file you mentioned does not seem to exist. But I have removed all occurrences of simplify.
- Reviewer: It exists in the open PR. Ok, we can call the optimization in the next PR.
||
- [Rationale]: It was observed that some sub-passes in _simplify()_ have a negative impact and that additional passes might be needed in the future. | ||
By routing all calls through a single function, later modifications to _simplify()_ are easy to make. | ||
- [Note]: One issue is that the remove redundant array transformation is not able to handle all cases. | ||
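Rule 2 lends itself to a mechanical check. The following is a minimal sketch of a lint-style helper that flags direct `.simplify()` method calls in a piece of Python source. The helper name `find_direct_simplify_calls` is hypothetical and not part of GT4Py or DaCe. It is a heuristic: it also flags `.simplify()` calls on objects that are not SDFGs.

```python
import ast


def find_direct_simplify_calls(source: str) -> list[int]:
    """Return the line numbers of calls of the form `<expr>.simplify(...)`.

    Hypothetical helper for illustration; it only inspects the syntax tree
    and does not know whether `<expr>` is actually an SDFG.
    """
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "simplify"
        ):
            hits.append(node.lineno)
    return sorted(hits)


example = (
    "sdfg.simplify()\n"      # direct call, violates rule 2
    "gt_simplify(sdfg)\n"    # call through the GT4Py wrapper, allowed
)
print(find_direct_simplify_calls(example))
```

Calls through `gt_simplify()` are not reported because the attribute name must be exactly `simplify`.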
|
||
#### Global Memory | ||
|
||
The only restriction we impose on global memory is: | ||
|
||
3. The same global memory is allowed to be used as input and output at the same time, if and only if the output depends _elementwise_ on the input. | ||
- [Rationale 1]: This allows the removal of double buffering, that DaCe may not remove. See also rule 2. | ||
- [Rationale 2]: This formulation allows writing expressions such as `a += 1`, with only memory for `a`. | ||
Phrased more technically, using global memory for input and output is allowed if and only if the two computations `tmp = computation(global_memory); global_memory = tmp;` and `global_memory = computation(global_memory);` are equivalent. | ||
- [Note]: In the long term this rule will be changed to: Global memory (an array) is either used as input (only read from) or as output (only written to) but never for both. | ||
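The equivalence stated in rule 3 can be illustrated with plain Python lists. The sketch below is a toy model, not DaCe code; the names `run_double_buffered`, `run_in_place` and `is_safe_in_place` are hypothetical. It compares the double-buffered and the in-place variant on one sample input and one sequential iteration order, so it can expose a violation but does not prove that a kernel is elementwise.

```python
from typing import Callable, Sequence

# A kernel computes output element `i` from the (whole) input array.
Kernel = Callable[[Sequence[float], int], float]


def run_double_buffered(kernel: Kernel, data: Sequence[float]) -> list[float]:
    """Model of `tmp = computation(global_memory); global_memory = tmp`."""
    return [kernel(data, i) for i in range(len(data))]


def run_in_place(kernel: Kernel, data: Sequence[float]) -> list[float]:
    """Model of `global_memory = computation(global_memory)` without a buffer."""
    buf = list(data)
    for i in range(len(buf)):
        buf[i] = kernel(buf, i)  # reads may observe earlier writes
    return buf


def is_safe_in_place(kernel: Kernel, data: Sequence[float]) -> bool:
    """True if both variants agree on this particular input."""
    return run_double_buffered(kernel, data) == run_in_place(kernel, data)


def increment(a: Sequence[float], i: int) -> float:
    return a[i] + 1  # elementwise: output i depends only on input i


def shift(a: Sequence[float], i: int) -> float:
    return a[i - 1] if i > 0 else a[0]  # output i depends on input i - 1


sample = [1.0, 2.0, 3.0, 4.0]
print(is_safe_in_place(increment, sample))  # elementwise kernel
print(is_safe_in_place(shift, sample))      # non-elementwise kernel
```

The `increment` kernel corresponds to `a += 1` and gives the same result either way, while `shift` reads values that were already overwritten and therefore needs the temporary.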
|
||
#### State Machine | ||
|
||
For the SDFG state machine we assume that: | ||
|
||
4. An interstate edge can only access scalars, i.e. use them in their assignment or condition expressions, but not arrays, even if they have shape `(1,)`. | ||
|
||
- [Rationale]: If an array is also used in interstate edges it becomes very tedious to verify if the array could be removed or not. | ||
- [Note]: Running _simplify()_ might actually result in the violation of this rule, see note of rule 9. | ||
|
||
5. The state graph does not contain any cycles, i.e. the implementation of a for/while loop using states is not allowed; the new loop construct or serial maps must be used in that case. | ||
- [Rationale]: This is a simplification that makes it much simpler to define what "later in the computation" means, as we will never have a cycle. | ||
- [Note]: Currently the code generator does not support the `LoopRegion` construct and it is transformed to a state machine. | ||
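Rule 5 can be checked on an abstract state graph. The sketch below does not use the DaCe API; it assumes the state machine is given as a list of `(source_state, destination_state)` name pairs and uses the standard library's `graphlib` to either produce an ordering of the states or report a cycle. The function names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def state_order(edges: list[tuple[str, str]]) -> list[str]:
    """Return the states in an order compatible with all interstate edges.

    Raises `CycleError` if the state graph contains a cycle.
    """
    sorter = TopologicalSorter()
    for src, dst in edges:
        sorter.add(dst, src)  # `dst` can only run after `src`
    return list(sorter.static_order())


def is_acyclic(edges: list[tuple[str, str]]) -> bool:
    """True if the state graph satisfies rule 5."""
    try:
        state_order(edges)
    except CycleError:
        return False
    return True


# Branching (if/else) is fine: no state can be reached from itself.
branching = [("init", "then"), ("init", "else"), ("then", "join"), ("else", "join")]
# A loop implemented with states is not: `guard` is reachable from `body`.
state_loop = [("init", "guard"), ("guard", "body"), ("body", "guard")]

print(is_acyclic(branching))
print(is_acyclic(state_loop))
```

With an acyclic graph, "later in the computation" simply means "later in the returned order", which is the simplification the rationale refers to.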
|
||
#### Transients | ||
|
||
The rules we impose on transients are a bit more complicated; however, while sounding restrictive, they are very permissive. | ||
It is important to note that these rules only have to be met after _simplify()_ was called once on the SDFG: | ||
|
||
6. Downstream of a write access, i.e., in all states that follow the state where the access node is located, there are no other access nodes that are used to write to the same array. | ||
|
||
- [Rationale 1]: This rule, together with rules 7 and 8, essentially ensures that the assignment in the SDFG follows SSA style, while allowing for expressions such as: | ||
|
||
```python | ||
if cond: | ||
a = true_branch() | ||
else: | ||
a = false_branch() | ||
``` | ||
|
||
(**NOTE:** This could also be done with references, however, they are strongly discouraged.) | ||
|
||
- [Rationale 2]: This still allows reductions with WCR, as they write to the same access node, and loops whose body modifies a transient that outlives the loop body, as they use the same access node. | ||
|
||
7. It is _recommended_ that a write access node should only have one incoming edge. | ||
|
||
- [Rationale]: This case is handled poorly by some DaCe transformations, thus we should avoid it as much as possible. | ||
|
||
8. No two access nodes in a state can refer to the same array. | ||
|
||
- [Rationale]: Together with rule 5 this guarantees SSA style. | ||
- [Note]: An SDFG can still be constructed using different access nodes for the same underlying data; _simplify()_ will combine them. | ||
|
||
9. Every access node that reads from an array (having an outgoing edge) that was not written to in the same state must be a source node. | ||
|
||
- [Rationale]: Together with rules 1, 4, 5, 6, 7 and 8 this simplifies checking if a transient can be safely removed or if it is used somewhere else. | ||
These rules guarantee that the number of "interstate transients" remains constant and this set is given by the _set of source nodes and all access nodes that have an outgoing degree larger than one_. | ||
- [Note]: To prevent some issues caused by the violation of rule 4 by _simplify()_, this set is extended with the transient sink nodes and all scalars. | ||
Excess interstate transients, that will be kept alive that way, will be removed by later calls to _simplify()_. | ||
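The set described in rule 9 can be sketched on a simplified model of a state. The code below is not the GT4Py implementation and does not use the DaCe API. It assumes every access node is summarised by a hypothetical `AccessNodeInfo` record holding its degrees, and that only transients are of interest. With `extended=True` it also applies the widening described in the note (transient sink nodes and all scalars).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessNodeInfo:
    """Hypothetical summary of one access node inside a state."""

    data: str                 # name of the array the node refers to
    in_degree: int            # number of incoming edges
    out_degree: int           # number of outgoing edges
    is_transient: bool = True
    is_scalar: bool = False


def interstate_transient_candidates(
    nodes: list[AccessNodeInfo], extended: bool = True
) -> set[str]:
    """Names of transients that must be treated as possibly used elsewhere."""
    result = set()
    for node in nodes:
        if not node.is_transient:
            continue  # assumption: global memory is never a removal candidate
        is_source = node.in_degree == 0
        fans_out = node.out_degree > 1
        is_sink = node.out_degree == 0
        if is_source or fans_out:
            result.add(node.data)
        elif extended and (is_sink or node.is_scalar):
            result.add(node.data)
    return result


nodes = [
    AccessNodeInfo("t0", in_degree=0, out_degree=1),   # source node
    AccessNodeInfo("t1", in_degree=1, out_degree=1),   # purely local
    AccessNodeInfo("t2", in_degree=1, out_degree=2),   # out degree > 1
    AccessNodeInfo("t3", in_degree=1, out_degree=0),   # transient sink
    AccessNodeInfo("s", in_degree=1, out_degree=1, is_scalar=True),
    AccessNodeInfo("g", in_degree=0, out_degree=1, is_transient=False),
]
print(sorted(interstate_transient_candidates(nodes, extended=False)))
print(sorted(interstate_transient_candidates(nodes, extended=True)))
```

In this model only `t1` remains a candidate for removal by an intrastate transformation once the extended set is used.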
|
||
10. Every AccessNode within a map scope must refer to a data descriptor whose lifetime must be `dace.dtypes.AllocationLifetime.Scope` and its storage class should either be `dace.dtypes.StorageType.Default` or _preferably_ `dace.dtypes.StorageType.Register`. | ||
- [Rationale 1]: This makes optimizations operating inside maps/kernels simpler, as it guarantees that the AccessNode does not propagate outside. | ||
- [Rationale 2]: The storage type avoids the need to dynamically allocate memory inside a kernel. | ||
|
||
#### Maps | ||
|
||
For maps we assume the following: | ||
|
||
11. The names of map variables (iteration variables) adhere to the following pattern: | ||
|
||
- 11.1: All map variables iterating over the same dimension (disregarding the actual range) have the same deterministic name, that includes the `gtx.Dimension.value` string. | ||
- 11.2: The name of horizontal dimensions (`kind` attribute) always ends in `__gtx_horizontal`. | ||
- 11.3: The name of vertical dimensions (`kind` attribute) always ends in `__gtx_vertical`. | ||
- 11.4: The name of local dimensions always ends in `__gtx_localdim`. | ||
- 11.5: No transformation is allowed to modify the name of an iteration variable that follows rules 11.2, 11.3 or 11.4. | ||
- [Rationale]: Without this rule it is very hard to tell which map variable does what, this way we can transmit information from GT4Py to DaCe, see also rule 12. | ||
|
||
12. Two map ranges, i.e. the pair map/iteration variable and range, can only be fused if they have the same name _and_ cover the same range. | ||
- [Rationale 1]: Because of rule 11, we will only fuse maps for which fusion actually makes sense. | ||
- [Rationale 2]: This allows fusing maps without renaming the map variables. | ||
- [Note]: This rule might be dropped in the future. | ||
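Rules 11 and 12 can be sketched together in a few lines. The code below is illustrative only: the suffixes are taken from rules 11.2 to 11.4, while the `i_` prefix, the string keys used for the dimension kind, the `MapRange` record and both function names are assumptions made for this example and may differ from what GT4Py actually generates.

```python
from dataclasses import dataclass

# Suffixes as stated in rules 11.2, 11.3 and 11.4.
KIND_SUFFIX = {
    "horizontal": "__gtx_horizontal",
    "vertical": "__gtx_vertical",
    "local": "__gtx_localdim",
}


def map_variable_name(dim_value: str, kind: str) -> str:
    """Deterministic map variable name for a dimension (rule 11.1).

    The `i_` prefix is an assumption of this sketch.
    """
    return f"i_{dim_value}{KIND_SUFFIX[kind]}"


@dataclass(frozen=True)
class MapRange:
    """Hypothetical pair of iteration variable and range."""

    variable: str
    start: int
    stop: int
    step: int = 1


def can_fuse(a: MapRange, b: MapRange) -> bool:
    """Rule 12: same variable name and same range."""
    same_name = a.variable == b.variable
    same_range = (a.start, a.stop, a.step) == (b.start, b.stop, b.step)
    return same_name and same_range


vertex = map_variable_name("Vertex", "horizontal")
k = map_variable_name("K", "vertical")

print(vertex)
print(can_fuse(MapRange(vertex, 0, 10), MapRange(vertex, 0, 10)))  # same name and range
print(can_fuse(MapRange(vertex, 0, 10), MapRange(vertex, 1, 10)))  # different range
print(can_fuse(MapRange(vertex, 0, 10), MapRange(k, 0, 10)))       # different dimension
```

Because the name is derived from the dimension alone, two maps over the same dimension automatically share their iteration variable, which is what allows fusion without renaming.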
|
||
## Consequences | ||
|
||
The rules outlined above impose a certain form of an SDFG. | ||
Most of these rules are designed to ensure that the SDFG follows SSA style and to simplify transformations, especially making validation checks simple, while imposing a minimal number of restrictions. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# GT4Py - GridTools Framework | ||
# | ||
# Copyright (c) 2014-2024, ETH Zurich | ||
# All rights reserved. | ||
# | ||
# Please, refer to the LICENSE file in the root directory. | ||
# SPDX-License-Identifier: BSD-3-Clause | ||
|
||
"""Transformation and optimization pipeline for the DaCe backend in GT4Py. | ||
|
||
Please also see [ADR0018](https://github.com/GridTools/gt4py/tree/main/docs/development/ADRs/0018-Canonical_SDFG_in_GT4Py_Transformations.md) | ||
that explains the general structure and requirements on the SDFGs. | ||
""" | ||
|
||
from .auto_opt import gt_auto_optimize, gt_set_iteration_order, gt_simplify | ||
from .gpu_utils import GPUSetBlockSize, gt_gpu_transformation, gt_set_gpu_blocksize | ||
from .loop_blocking import LoopBlocking | ||
from .map_orderer import MapIterationOrder | ||
from .map_promoter import SerialMapPromoter | ||
from .map_serial_fusion import SerialMapFusion | ||
|
||
|
||
__all__ = [ | ||
"GPUSetBlockSize", | ||
"LoopBlocking", | ||
"MapIterationOrder", | ||
"SerialMapFusion", | ||
"SerialMapPromoter", | ||
"gt_auto_optimize", | ||
"gt_gpu_transformation", | ||
"gt_set_iteration_order", | ||
"gt_set_gpu_blocksize", | ||
"gt_simplify", | ||
] |