Replace nested static_for lambdas with compile-time search helper#4287

Draft
assistant-librarian[bot] wants to merge 7 commits into develop from import/develop/ROCm_composable_kernel/pr-3600

@assistant-librarian assistant-librarian bot commented Feb 3, 2026

Summary

  • Add find_in_tuple_of_sequences compile-time search helper with O(1) template depth
  • Replace nested static_for lambdas in TensorDescriptor::GetTransformAndItsUpperDimension
  • Replace generate_tuple lambda in TensorDescriptor::InitializeElementSize with pack expansion
  • Apply same optimizations to TensorAdaptor

Motivation

The TensorDescriptor and TensorAdaptor classes had excessive template instantiation from:

  1. Nested static_for loops with lambdas (918 applier::operator() instantiations)
  2. generate_tuple with lambdas (78+ instantiations per class)

Why It Works

Each lambda creates a unique closure type, causing separate instantiations at every call site. The find_in_tuple_of_sequences helper uses O(1) template depth via pack expansion instead of O(N) nested static_for recursion, and named functors share a single type across all uses.
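
The closure-type point can be checked directly. The following is a minimal sketch, not code from this PR: `applier_sketch`, `make_applier`, and `add_one` are hypothetical names standing in for a helper that is templated on its functor type.

```cpp
#include <type_traits>

// Hypothetical stand-in for a helper templated on its functor type:
// the compiler creates one instantiation per distinct F.
template <typename F>
struct applier_sketch
{
    F f;
    constexpr int operator()(int i) const { return f(i); }
};

template <typename F>
constexpr applier_sketch<F> make_applier(F f)
{
    return {f};
}

// Named functor: the same type at every call site.
struct add_one
{
    constexpr int operator()(int i) const { return i + 1; }
};

// Two textually identical lambdas still have two distinct closure types.
inline constexpr auto lambda_a = [](int i) { return i + 1; };
inline constexpr auto lambda_b = [](int i) { return i + 1; };

// Distinct closure types force two separate applier_sketch instantiations.
static_assert(!std::is_same_v<decltype(make_applier(lambda_a)),
                              decltype(make_applier(lambda_b))>);

// The named functor always maps to the single type applier_sketch<add_one>.
static_assert(std::is_same_v<decltype(make_applier(add_one{})),
                             applier_sketch<add_one>>);
```

Under these assumptions, every call site that passes a fresh lambda pays for a new instantiation, while call sites that pass `add_one` reuse one.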

Results (example_grouped_conv_fwd_xdl_fp16)

| Metric | Before | After | Improvement |
|---|---|---|---|
| Template instantiation time | 23.4s | 19.1s | 18% reduction |
| applier instantiations | 1132 | 127 | 89% reduction |
| generate_tuple lambdas | 178 | 96 | 46% reduction |

Test Plan

  • Added 11 unit tests:
    • 5 tests for sequence_find_value
    • 6 tests for find_in_tuple_of_sequences
  • Waiting for full CI

PR Stack

This PR is part of the build time optimization effort (issue #4229). All PRs now target develop independently:

| # | PR | Description | Status |
|---|---|---|---|
| 1 | ROCm/composable_kernel#3585 | sequence_gen with __make_integer_seq | Independent |
| 2 | #4283 | generate_identity_sequences + named functors | New (replaces ROCm/composable_kernel#3588, ROCm/composable_kernel#3589) |
| 3 | #4290 | container_concat optimization | Independent |
| 4 | #4288 | O(1) pack expansion rewrites | Independent |
| 5 | #4287 | TensorDescriptor/TensorAdaptor lambda elimination | This PR |

Tracking issue: #4229


🔁 Imported from ROCm/composable_kernel#3600
🧑‍💻 Originally authored by @tenpercent

tenpercent and others added 7 commits January 22, 2026 01:11
The GetTransformAndItsUpperDimension function used nested static_for
loops with lambdas to search for a hidden dimension in UpperDimensionIdss.
This caused 918 applier::operator() instantiations (81% of all applier
instantiations).

Replace with find_in_tuple_of_sequences helper that uses constexpr
array lookup and if-constexpr recursion, eliminating the lambda
instantiation overhead.
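
A rough sketch of what such a search could look like follows. It is an illustration under stated assumptions, not the helper added by this commit: `Sequence`, `Tuple`, `find_result`, `find_in_sequence`, and `find_in_tuple_of_sequences_sketch` are hypothetical stand-ins, and the sketch uses only a constexpr array with a plain loop, whereas the real helper may combine array lookup with if-constexpr recursion as described above.

```cpp
// Hypothetical stand-ins for the library's Sequence and Tuple types.
template <int... Is>
struct Sequence {};

template <typename... Ts>
struct Tuple {};

// Assumed result shape: which sequence matched, and where inside it.
struct find_result
{
    bool found;
    int tuple_index; // index of the sequence containing the target
    int pos;         // position of the target inside that sequence
};

// Search a single sequence with an ordinary constexpr loop.
template <int... Is>
constexpr int find_in_sequence(int target, Sequence<Is...>)
{
    constexpr int n = sizeof...(Is);
    const int values[n > 0 ? n : 1] = {Is...};
    for(int i = 0; i < n; ++i)
        if(values[i] == target)
            return i;
    return -1; // assumption: -1 signals "not found"
}

// One pack expansion evaluates every sequence; no nested static_for,
// no lambda, and no per-element template recursion.
template <int Target, typename... Seqs>
constexpr find_result find_in_tuple_of_sequences_sketch(Tuple<Seqs...>)
{
    constexpr int n = sizeof...(Seqs);
    const int positions[n > 0 ? n : 1] = {find_in_sequence(Target, Seqs{})...};
    for(int i = 0; i < n; ++i)
        if(positions[i] >= 0)
            return {true, i, positions[i]};
    return {false, -1, -1};
}

// Usage: hidden dimension 4 is element 1 of the third sequence.
using ExampleIds = Tuple<Sequence<0, 1>, Sequence<2>, Sequence<3, 4, 5>>;
constexpr find_result example = find_in_tuple_of_sequences_sketch<4>(ExampleIds{});
static_assert(example.found && example.tuple_index == 2 && example.pos == 1);
```

In this sketch the template instantiation count depends on the number of distinct sequence types, not on the number of loop iterations.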

Results on example_grouped_conv_fwd_xdl_fp16:
- applier instantiations: 1132 -> 127 (89% reduction)
- TensorDescriptor instantiations: 2503 -> 664 (73% reduction)
- Template instantiation time: 23.4s -> 19.4s (17% reduction)

…tSize

The InitializeElementSize function used generate_tuple with a lambda to
compute visible dimension lengths. Each TensorDescriptor type created
a unique lambda type, causing 78 instantiations (385ms).

Replace with direct pack expansion using helper functions, eliminating
the lambda instantiation overhead entirely.
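
The idea can be sketched as below. This is a simplified illustration, not the code in this commit: it assumes the hidden dimension lengths live in a `std::array` and that the visible dimension ids form a `Sequence`, and the names `length_of` and `element_size_sketch` are hypothetical.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the library's Sequence type.
template <int... Is>
struct Sequence {};

// Hypothetical named helper: look up the length of one hidden dimension.
// Being a plain function template, it has no per-call-site closure type.
template <std::size_t N>
constexpr long length_of(const std::array<long, N>& hidden_lengths, int id)
{
    return hidden_lengths[static_cast<std::size_t>(id)];
}

// Element size as the product of the visible dimension lengths,
// computed with a single fold over the pack of visible ids.
template <std::size_t N, int... VisibleIds>
constexpr long element_size_sketch(const std::array<long, N>& hidden_lengths,
                                   Sequence<VisibleIds...>)
{
    return (1L * ... * length_of(hidden_lengths, VisibleIds));
}

// Usage: dimensions 0 and 2 are visible, so the size is 2 * 5.
inline constexpr std::array<long, 4> example_lengths{2, 3, 5, 7};
static_assert(element_size_sketch(example_lengths, Sequence<0, 2>{}) == 10);
```

The fold expression expands once per visible id without introducing a lambda, so no additional closure type is created per descriptor type.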

Results on example_grouped_conv_fwd_xdl_fp16:
- generate_tuple lambdas: 178 -> 100 (44% reduction)
- Template instantiation time: 19.5s -> 19.0s

TensorAdaptor has the same InitializeElementSize and
GetTransformAndItsUpperDimension patterns as TensorDescriptor.
Apply the same optimization:
- Replace nested static_for lambdas with find_in_tuple_of_sequences
- Replace generate_tuple lambda with pack expansion

Results: generate_tuple lambdas 100 -> 96 (4 events, 17ms eliminated)

Detailed comments explain:
- sequence_find_value: Constexpr loop with O(1) template depth vs O(N) recursive
- find_in_tuple_of_sequences: Pack expansion instead of nested static_for loops
- Why constexpr search reduces template instantiations dramatically
- When to apply constexpr search patterns for compile-time operations
- Implementation details for each optimization approach

This documentation helps maintainers understand the compile-time search optimization
strategy without relying on specific benchmark numbers that may vary by use case.
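
To make the O(1) versus O(N) contrast concrete, here is a hedged sketch of both styles. The names `find_recursive` and `sequence_find_value_sketch`, the `Sequence` stand-in, and the choice of -1 for "not found" are assumptions for illustration; the real sequence_find_value may have a different signature.

```cpp
// Hypothetical stand-in for the library's Sequence type.
template <int... Is>
struct Sequence {};

// Recursive style: one class template instantiation per element,
// so template depth grows linearly with the sequence length.
template <int Target, int Pos, typename Seq>
struct find_recursive;

template <int Target, int Pos>
struct find_recursive<Target, Pos, Sequence<>>
{
    static constexpr int value = -1; // assumption: -1 means "not found"
};

template <int Target, int Pos, int Head, int... Tail>
struct find_recursive<Target, Pos, Sequence<Head, Tail...>>
{
    static constexpr int value =
        (Head == Target) ? Pos
                         : find_recursive<Target, Pos + 1, Sequence<Tail...>>::value;
};

// Constexpr-loop style: a single function instantiation per sequence;
// the loop runs in the constant evaluator, not in the template machinery.
template <int Target, int... Is>
constexpr int sequence_find_value_sketch(Sequence<Is...>)
{
    constexpr int n = sizeof...(Is);
    const int values[n > 0 ? n : 1] = {Is...};
    for(int i = 0; i < n; ++i)
        if(values[i] == Target)
            return i;
    return -1;
}

// Both styles agree on the answer; only the instantiation cost differs.
static_assert(find_recursive<5, 0, Sequence<3, 4, 5>>::value == 2);
static_assert(sequence_find_value_sketch<5>(Sequence<3, 4, 5>{}) == 2);
```

Both produce the same index, but the recursive version instantiates a new specialization for each remaining tail, while the loop version instantiates only the one function.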