Replace nested static_for lambdas with compile-time search helper by assistant-librarian[bot] · Pull Request #4287 · ROCm/rocm-libraries

assistant-librarian · 2026-02-03T21:49:37Z

Summary

- Add find_in_tuple_of_sequences compile-time search helper with O(1) template depth
- Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
- Replace generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
- Apply same optimizations to TensorAdaptor

Motivation

The TensorDescriptor and TensorAdaptor classes had excessive template instantiation from:

- Nested static_for loops with lambdas (918 applier::operator() instantiations)
- generate_tuple with lambdas (78+ instantiations per class)

Why It Works

Each lambda creates a unique closure type, causing separate instantiations at every call site. The find_in_tuple_of_sequences helper uses O(1) template depth via pack expansion instead of O(N) nested static_for recursion, and named functors share a single type across all uses.
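The closure-type point can be checked in isolation. The snippet below is a minimal sketch, not code from this PR: `add_one` and `apply_to_41` are hypothetical names, with `apply_to_41` standing in for any helper templated on its callable (as static_for and generate_tuple are). It assumes C++17.

```cpp
#include <type_traits>

// Two textually identical lambdas still get two distinct closure types.
inline constexpr auto lambda_a = [](int x) { return x + 1; };
inline constexpr auto lambda_b = [](int x) { return x + 1; };

// A named functor (hypothetical) has exactly one type, wherever it is used.
struct add_one
{
    constexpr int operator()(int x) const { return x + 1; }
};
inline constexpr auto functor_a = add_one{};
inline constexpr auto functor_b = add_one{};

// Hypothetical stand-in for a helper templated on its callable:
// the compiler instantiates it once per distinct F.
template <typename F>
constexpr int apply_to_41(F f)
{
    return f(41);
}

// Distinct closure types mean two instantiations of apply_to_41 ...
static_assert(!std::is_same_v<decltype(lambda_a), decltype(lambda_b)>);
// ... while the named functor shares one type, hence one instantiation.
static_assert(std::is_same_v<decltype(functor_a), decltype(functor_b)>);
static_assert(apply_to_41(lambda_a) == 42 && apply_to_41(functor_a) == 42);
```

Every call site that passes a fresh lambda therefore costs a new instantiation of the helper, even when the lambda bodies are identical.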

Results (example_grouped_conv_fwd_xdl_fp16)

Metric	Before	After	Improvement
Template instantiation time	23.4s	19.1s	18% reduction
`applier` instantiations	1132	127	89% reduction
`generate_tuple` lambdas	178	96	46% reduction

Test Plan

Added 11 unit tests:
- 5 tests for sequence_find_value
- 6 tests for find_in_tuple_of_sequences
Waiting for full CI

PR Stack

This PR is part of the build time optimization effort (issue #4229). All PRs now target develop independently:

#	PR	Description	Status
1	ROCm/composable_kernel#3585	sequence_gen with `__make_integer_seq`	Independent
2	#4283	generate_identity_sequences + named functors	New (replaces ROCm/composable_kernel#3588, ROCm/composable_kernel#3589)
3	#4290	container_concat optimization	Independent
4	#4288	O(1) pack expansion rewrites	Independent
5	#4287	TensorDescriptor/TensorAdaptor lambda elimination	This PR

Tracking issue: #4229

🔁 Imported from ROCm/composable_kernel#3600
🧑‍💻 Originally authored by @tenpercent

The GetTransformAndItsUpperDimension function used nested static_for loops with lambdas to search for a hidden dimension in UpperDimensionIdss. This caused 918 applier::operator() instantiations (81% of all applier instantiations).

Replace with find_in_tuple_of_sequences helper that uses constexpr array lookup and if-constexpr recursion, eliminating the lambda instantiation overhead.

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)
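For illustration, a search over a tuple of sequences can be written without nested static_for lambdas roughly as follows. This is a sketch under stated assumptions, not the helper added by this PR: it uses std::tuple and std::integer_sequence as stand-ins for CK's Tuple and Sequence, and the `FindResult` struct, the `find_in_sequence` helper, and both signatures are hypothetical.

```cpp
#include <tuple>
#include <utility>

// Stand-in for CK's Sequence (assumption: the real helper works on ck::Sequence).
template <int... Is>
using Seq = std::integer_sequence<int, Is...>;

// Hypothetical result type: which sequence matched, and where inside it.
struct FindResult
{
    int tuple_index; // index of the sequence containing the value, -1 if none
    int pos;         // position inside that sequence, -1 if none
};

// Position of `value` in one sequence, or -1. A plain constexpr loop over an array.
template <int... Is>
constexpr int find_in_sequence(int value, Seq<Is...>)
{
    // One extra slot keeps the array non-empty for an empty sequence.
    const int values[sizeof...(Is) + 1] = {Is...};
    for(int i = 0; i < static_cast<int>(sizeof...(Is)); ++i)
        if(values[i] == value)
            return i;
    return -1;
}

// Search every sequence through a single pack expansion, then scan the results.
template <typename... Seqs>
constexpr FindResult find_in_tuple_of_sequences(int value, const std::tuple<Seqs...>&)
{
    const int positions[sizeof...(Seqs) + 1] = {find_in_sequence(value, Seqs{})...};
    for(int i = 0; i < static_cast<int>(sizeof...(Seqs)); ++i)
        if(positions[i] >= 0)
            return FindResult{i, positions[i]};
    return FindResult{-1, -1};
}

// Usage: look up hidden dimension 4 in a tuple shaped like UpperDimensionIdss.
using UpperDimIdss = std::tuple<Seq<0, 1>, Seq<2>, Seq<3, 4, 5>>;
constexpr FindResult hit = find_in_tuple_of_sequences(4, UpperDimIdss{});
static_assert(hit.tuple_index == 2 && hit.pos == 1);
constexpr FindResult miss = find_in_tuple_of_sequences(9, UpperDimIdss{});
static_assert(miss.tuple_index == -1 && miss.pos == -1);
```

In this sketch no closure types are involved, and `find_in_sequence` is instantiated once per distinct sequence type rather than once per loop iteration.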

…tSize

The InitializeElementSize function used generate_tuple with a lambda to compute visible dimension lengths. Each TensorDescriptor type created a unique lambda type, causing 78 instantiations (385ms).

Replace with direct pack expansion using helper functions, eliminating the lambda instantiation overhead entirely.

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s
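The general shape of "pack expansion with helper functions instead of a generate_tuple lambda" is sketched below. It is not the actual InitializeElementSize code: the names `element_size` and `element_size_impl` are hypothetical, std::tuple stands in for CK's Tuple, and the lengths are passed in directly rather than derived from the transforms. It assumes C++17 for the fold expression.

```cpp
#include <cstddef>
#include <tuple>
#include <utility>

// Named helper: one function template shared by all descriptor types,
// instead of a fresh lambda type per descriptor.
template <typename Lengths, std::size_t... Is>
constexpr long element_size_impl(const Lengths& lengths, std::index_sequence<Is...>)
{
    // The fold expression expands the index pack in place; no per-index functor call.
    return (1L * ... * static_cast<long>(std::get<Is>(lengths)));
}

// Product of all lengths in the tuple; an empty tuple yields 1.
template <typename... Ts>
constexpr long element_size(const std::tuple<Ts...>& lengths)
{
    return element_size_impl(lengths, std::index_sequence_for<Ts...>{});
}

static_assert(element_size(std::tuple<int, int, int>{2, 3, 4}) == 24);
static_assert(element_size(std::tuple<>{}) == 1);
```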

TensorAdaptor has the same InitializeElementSize and GetTransformAndItsUpperDimension patterns as TensorDescriptor. Apply the same optimization:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17ms eliminated)

Detailed comments explain:
- sequence_find_value: Constexpr loop with O(1) template depth vs O(N) recursive
- find_in_tuple_of_sequences: Pack expansion instead of nested static_for loops
- Why constexpr search reduces template instantiations dramatically
- When to apply constexpr search patterns for compile-time operations
- Implementation details for each optimization approach

This documentation helps maintainers understand the compile-time search optimization strategy without relying on specific benchmark numbers that may vary by use case.
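The O(1) versus O(N) template depth contrast for a single-sequence search can be sketched as follows. Both versions are illustrative only: `find_recursive` is a hypothetical example of the recursive style, and the `sequence_find_value` signature shown here is an assumption (using std::integer_sequence in place of CK's Sequence), not the one in the PR.

```cpp
#include <utility>

// Stand-in for CK's Sequence (assumption).
template <int... Is>
using Seq = std::integer_sequence<int, Is...>;

// O(N) template depth: every element instantiates one more specialization.
template <int Value, int Pos, int... Is>
struct find_recursive;

template <int Value, int Pos>
struct find_recursive<Value, Pos>
{
    static constexpr int value = -1; // ran out of elements: not found
};

template <int Value, int Pos, int Head, int... Tail>
struct find_recursive<Value, Pos, Head, Tail...>
{
    static constexpr int value =
        Head == Value ? Pos : find_recursive<Value, Pos + 1, Tail...>::value;
};

// O(1) template depth: one instantiation, then an ordinary constexpr loop.
template <int Value, int... Is>
constexpr int sequence_find_value(Seq<Is...>)
{
    // One extra slot keeps the array non-empty for an empty sequence.
    const int values[sizeof...(Is) + 1] = {Is...};
    for(int i = 0; i < static_cast<int>(sizeof...(Is)); ++i)
        if(values[i] == Value)
            return i;
    return -1;
}

static_assert(find_recursive<7, 0, 5, 6, 7, 8>::value == 2);
static_assert(sequence_find_value<7>(Seq<5, 6, 7, 8>{}) == 2);
static_assert(sequence_find_value<1>(Seq<>{}) == -1);
```

The recursive form instantiates a chain of specializations whose length grows with the sequence, while the loop form is evaluated by the constant evaluator inside a single instantiation.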

…evelop/ROCm_composable_kernel/pr-3600

tenpercent and others added 7 commits January 22, 2026 01:11

Add unit tests for sequence_find_value and find_in_tuple_of_sequences

83a76d7

Apply clang-format with -style=file

19a156a

Merge commit '19a156aa0a609a14e16d8efeb21708fd6968fafd' into import/d…

5872eef

…evelop/ROCm_composable_kernel/pr-3600

assistant-librarian bot added the imported pr label Feb 3, 2026

github-actions bot added the project: composablekernel label Feb 3, 2026

assistant-librarian bot added the external contribution label Feb 3, 2026

DDEle mentioned this pull request Feb 10, 2026

Add container and tuple optimization helpers #4290

Draft

2 tasks

illsilin assigned cgmillette Feb 17, 2026

assistant-librarian[bot] wants to merge 7 commits into develop from import/develop/ROCm_composable_kernel/pr-3600

assistant-librarian bot commented Feb 3, 2026 • edited by DDEle
