Add threadblock map transformation #2116
Conversation
Looks quite good overall. I have minor comments/questions.
@@ -0,0 +1,285 @@
# Copyright 2019-2023 ETH Zurich and the DaCe authors. All rights reserved.
Update year
Done
@@ -0,0 +1,285 @@
# Copyright 2019-2023 ETH Zurich and the DaCe authors. All rights reserved.
""" This module contains classes and functions that implement the grid-strided map tiling
Update docstring
Done
map_entry = self.map_entry

# Find the state that contains the map entry
state = next(state for node, state in sdfg.all_nodes_recursive() if node == map_entry)
Seems a bit wasteful. Would it make sense to follow the {graph, graph.parent_graph, ...}
chain until you find a state?
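The reviewer's suggestion can be sketched in plain Python. This is a minimal self-contained illustration with stand-in classes (`Graph`, `State`, and `find_enclosing_state` are illustrative names, not the real DaCe API): walk the `{graph, graph.parent_graph, ...}` chain upward until a state is reached.

```python
# Stand-in classes; in DaCe these would be control-flow graphs and SDFG states.
class Graph:
    def __init__(self, parent_graph=None):
        self.parent_graph = parent_graph

class State(Graph):
    """Stand-in for an SDFG state."""

def find_enclosing_state(graph):
    # Follow the parent-graph chain upward; stop at the first State (or None).
    while graph is not None and not isinstance(graph, State):
        graph = graph.parent_graph
    return graph
```

Compared to scanning `all_nodes_recursive()`, this walks only the (short) nesting chain instead of the whole SDFG.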
Yes, this is wasteful. It can be solved even more simply: I overlooked that the input graph is actually the state containing map_entry. I have corrected this and updated the function signature with type annotations (and renamed graph to state). Thank you!
return False  # Already has GPU-scheduled inner scope — does not apply

# Check if the map is nested inside another GPU-scheduled map
parent_map_tuple = helpers.get_parent_map(state, map_entry)
I do not think that we support a GPU-device map inside any other GPU-scheduled map, but I may not be up-to-date with developments. However, in such a case, you can omit the following check.
Actually, nested GPU_Device maps are supported and are (rarely) used (see tests/codegen/warp_specialization_test.py).
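For illustration, the parent-map check under discussion can be modeled in plain Python. This is a stand-in sketch (the schedule names and the `parent_of` mapping are assumptions standing in for `helpers.get_parent_map`), not the DaCe implementation:

```python
# Assumed GPU schedule names for this sketch.
GPU_SCHEDULES = {'GPU_Device', 'GPU_ThreadBlock'}

def has_gpu_parent_map(map_entry, parent_of):
    # Map entries are modeled as (name, schedule) tuples; parent_of maps each
    # entry to its parent map entry (or None), like get_parent_map does.
    current = parent_of.get(map_entry)
    while current is not None:
        if current[1] in GPU_SCHEDULES:
            return True  # Nested inside a GPU-scheduled map
        current = parent_of.get(current)
    return False
```

The point of the check is that the transformation should not fire for a GPU_Device map that already sits inside another GPU-scheduled scope.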
new_kernel_entry.map.gpu_block_size = gpu_block_size

# Catch any unexpected mismatches of inserted threadblock map's block size and the used block size
tb_size = to_3d_dims([symbolic.overapproximate(sz) for sz in thread_block_map_entry.map.range.size()[::-1]])
Is this a sanity/debug check, or can this actually fail? In the latter case, would it make sense to move this check to the can_be_applied method and just not apply the transformation?
This was only a sanity check for me; it never failed. I have removed it.
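For context, a helper like the to_3d_dims seen in the diff above typically pads a dimension list to a CUDA-style (x, y, z) triple. This is a hypothetical stand-in for illustration only; the real helper in the PR may differ:

```python
def to_3d_dims(dims):
    # Pad a list of up to three sizes with 1s so it always forms a
    # CUDA-style (x, y, z) block-dimension triple.
    dims = list(dims)
    assert len(dims) <= 3, 'more than three dimensions cannot map to a block'
    return dims + [1] * (3 - len(dims))
```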
Description
I implemented a transformation which adds an explicit GPU_ThreadBlock-scheduled map to a GPU_Device-scheduled map if it does not already have one (in certain cases). Note that thread blocks are always present and used in CUDA kernels. Making them explicit in the SDFG is good practice and promotes modularity.
I implemented this transformation as part of my Master's thesis work, aiming to closely replicate the previous behavior to ensure backwards compatibility. Since only minimal changes were needed to make it useful in combination with CUDACodeGen as well, I am submitting it as a pull request.
Example
Below is an example illustrating the effect of the transformation on a simple SDFG. The transformation adds an explicit GPU_ThreadBlock-scheduled map with default block sizes obtained from the configuration.
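The effect can be modeled in plain Python (a sketch of the assumed semantics, not DaCe code): making the implicit thread-block loop explicit splits a flat GPU_Device iteration range into a grid of blocks of a default block size, mirroring CUDA's blockIdx/threadIdx decomposition.

```python
def blocked_indices(n, block_size=32):
    # Enumerate iterations the way the transformed SDFG would:
    # an outer grid loop (GPU_Device map) over thread blocks and an
    # inner loop (GPU_ThreadBlock map) over threads within a block.
    covered = []
    num_blocks = (n + block_size - 1) // block_size  # ceil(n / block_size)
    for block in range(num_blocks):                  # GPU_Device map
        for thread in range(block_size):             # GPU_ThreadBlock map
            i = block * block_size + thread
            if i < n:                                # guard the remainder block
                covered.append(i)
    return covered
```

Every index of the original flat range is covered exactly once, including the partially filled last block.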