Conversation

aydogdub (Contributor)

Description

I implemented a transformation that adds an explicit GPU_ThreadBlock-scheduled map inside a GPU_Device-scheduled map if one is not already present (in certain cases). Note that thread blocks are always present and used in CUDA kernels; making them explicit in the SDFG is good practice and promotes modularity.

I implemented this transformation as part of my Master’s thesis work, aiming to closely replicate the previous behavior to ensure
backwards compatibility. Since only minimal changes were needed to make it useful in combination with CUDACodeGen as well,
I am submitting it as a pull request.

Example

Below is an example illustrating the effect of the transformation on a simple SDFG. The transformation adds an explicit GPU_ThreadBlock-scheduled map with default block sizes obtained from the configuration.

[Before/after SDFG images]
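The core decision can be modeled as a toy sketch (the Schedule enum, block size, and function below are stand-ins for illustration, not DaCe's actual API): the transformation applies only when a GPU_Device map's immediate inner scope is not already GPU-scheduled, and it then inserts an explicit GPU_ThreadBlock map below it.

```python
from enum import Enum, auto

class Schedule(Enum):
    GPU_Device = auto()
    GPU_ThreadBlock = auto()
    Sequential = auto()

DEFAULT_BLOCK_SIZE = (32, 1, 1)  # stand-in for the configured default block size

def add_explicit_threadblock_map(nest):
    """nest: map schedules from outermost to innermost.

    Insert a GPU_ThreadBlock schedule directly below each GPU_Device
    map whose inner scope is not already GPU-scheduled.
    """
    out = []
    for i, sched in enumerate(nest):
        out.append(sched)
        inner = nest[i + 1] if i + 1 < len(nest) else None
        if (sched is Schedule.GPU_Device
                and inner not in (Schedule.GPU_ThreadBlock, Schedule.GPU_Device)):
            out.append(Schedule.GPU_ThreadBlock)
    return out

# A device map over a sequential inner loop gains an explicit thread-block map;
# a map that already has one is left unchanged.
nest = add_explicit_threadblock_map([Schedule.GPU_Device, Schedule.Sequential])
```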

@ThrudPrimrose ThrudPrimrose marked this pull request as ready for review September 18, 2025 08:57
alexnick83 (Contributor) left a comment:
Looks quite good overall. I have minor comments/questions.

@@ -0,0 +1,285 @@
# Copyright 2019-2023 ETH Zurich and the DaCe authors. All rights reserved.
Reviewer (Contributor):

Update year

aydogdub (Author):

Done

""" This module contains classes and functions that implement the grid-strided map tiling
Reviewer (Contributor):

Update docstring

aydogdub (Author):

Done

map_entry = self.map_entry

# Find the state that contains the map entry
state = next(state for node, state in sdfg.all_nodes_recursive() if node == map_entry)
Reviewer (Contributor):

Seems a bit wasteful. Would it make sense to follow the {graph, graph.parent_graph, ...} chain until you find a state?

aydogdub (Author):

Yes, this is wasteful, and it can be solved even more simply: I had overlooked that the input graph is actually the state containing map_entry. I have corrected this and updated the function signature with type annotations (renaming graph to state). Thank you!
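To illustrate the difference (a toy sketch with stand-in State/SDFG classes, not DaCe's actual API): the original version scans every node in the SDFG recursively to recover the containing state, while the fix simply uses the state argument the transformation already receives.

```python
class State:
    def __init__(self, nodes):
        self.nodes = nodes

class SDFG:
    def __init__(self, states):
        self.states = states

    def all_nodes_recursive(self):
        # Yield (node, containing_state) pairs over the whole SDFG.
        for st in self.states:
            for n in st.nodes:
                yield n, st

def find_state_wasteful(sdfg, map_entry):
    # O(total nodes): searches the entire SDFG for the containing state.
    return next(st for node, st in sdfg.all_nodes_recursive() if node == map_entry)

def find_state_direct(state, map_entry):
    # O(1): the transformation is already handed the containing state.
    return state

entry = object()
s1, s2 = State([object()]), State([entry, object()])
sdfg = SDFG([s1, s2])
```

Both lookups return the same state, but the direct version avoids the recursive traversal entirely.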

return False # Already has GPU-scheduled inner scope — does not apply

# Check if the map is nested inside another GPU-scheduled map
parent_map_tuple = helpers.get_parent_map(state, map_entry)
Reviewer (Contributor):

I do not think that we support a GPU-device map inside any other GPU-scheduled map, but I may not be up-to-date with developments. However, in such a case, you can omit the following check.

aydogdub (Author), Sep 18, 2025:

Actually, nested GPU_Device maps are supported and (rarely) used (see tests/codegen/warp_specialization_test.py).

new_kernel_entry.map.gpu_block_size = gpu_block_size

# Catch any unexpected mismatch between the inserted thread-block map's block size and the block size actually used
tb_size = to_3d_dims([symbolic.overapproximate(sz) for sz in thread_block_map_entry.map.range.size()[::-1]])
Reviewer (Contributor):

Is this a sanity/debug check, or can this actually fail? In the latter case, would it make sense to move this check to the can_be_applied method and just not apply the transformation?

aydogdub (Author):

This was only a sanity check for me; it never failed. I have removed it.
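For context on the removed check: the range-size reversal (`[::-1]`) converts the map range's outermost-first dimension order into CUDA's x-fastest (x, y, z) order, and a helper like to_3d_dims pads the result to exactly three dimensions. The sketch below is a guess at that helper's behavior for illustration, not DaCe's actual implementation.

```python
def to_3d_dims(sizes):
    """Pad a list of one to three sizes with 1s to a full (x, y, z) shape."""
    sizes = list(sizes)
    assert 1 <= len(sizes) <= 3, "expected between one and three dimensions"
    return sizes + [1] * (3 - len(sizes))

# A 2D map range listed outermost-first as [y_size, x_size] is reversed
# to [x_size, y_size], then padded to the three CUDA block dimensions.
block = to_3d_dims([8, 64][::-1])  # [64, 8] -> [64, 8, 1]
```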

@phschaad phschaad added this pull request to the merge queue Sep 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Sep 23, 2025
@phschaad phschaad added this pull request to the merge queue Oct 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 2, 2025