
Replace O(N) recursive sequence_map_inverse with O(1) pack expansion#4288

Closed
assistant-librarian[bot] wants to merge 8 commits into develop from
import/develop/ROCm_composable_kernel/pr-3596

@assistant-librarian
Contributor

@assistant-librarian assistant-librarian bot commented Feb 3, 2026

Summary

Replace the recursive sequence_map_inverse implementation, which has O(N) template instantiation depth, with a pack-expansion implementation that has O(1) depth, to reduce compile time (#4229).

Approach

  • Use a constexpr loop in find_source_index to locate each index of the inverse permutation
  • Expand the results via pack expansion, keeping template instantiation depth at O(1)
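
The two steps above can be sketched as follows. This is a minimal illustration, not the code from the PR: `sequence`, `inverse_impl`, and `inverse_of_t` are hypothetical stand-ins for the library's types, `index_t` is assumed to be a plain `int`, and the body of `find_source_index` is a guess based on its name. It assumes C++14 or later and a non-empty permutation.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

using index_t = int; // assumption: stand-in for the library's index type

// Hypothetical minimal stand-in for the library's sequence type.
template <index_t... Is>
struct sequence
{
};

// Constexpr loop: returns the position pos such that Is[pos] == target.
// Runs inside a single function template instantiation.
template <index_t... Is>
constexpr index_t find_source_index(index_t target)
{
    constexpr index_t input[] = {Is...};
    for(index_t pos = 0; pos < static_cast<index_t>(sizeof...(Is)); ++pos)
    {
        if(input[pos] == target)
            return pos;
    }
    return -1; // not reached when Is... is a valid permutation
}

template <typename Seq, typename Idx>
struct inverse_impl;

// Pack expansion over Js: every output element is computed in one flat
// expansion, so no instantiation depends on a previous one.
template <index_t... Is, std::size_t... Js>
struct inverse_impl<sequence<Is...>, std::index_sequence<Js...>>
{
    using type = sequence<find_source_index<Is...>(static_cast<index_t>(Js))...>;
};

template <index_t... Is>
using inverse_of_t =
    typename inverse_impl<sequence<Is...>, std::make_index_sequence<sizeof...(Is)>>::type;

// The permutation (2, 0, 1) sends 0->2, 1->0, 2->1, so its inverse is (1, 2, 0).
static_assert(std::is_same<inverse_of_t<2, 0, 1>, sequence<1, 2, 0>>::value,
              "inverse of (2,0,1) should be (1,2,0)");
```

Note that this sketch performs one linear search per output element, so it trades extra constexpr evaluation for a flat instantiation structure.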

Why It Works

Template recursion requires N template instantiations for N iterations, each with its own overhead. Constexpr loops execute within a single template instantiation, avoiding per-instantiation overhead.
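
To make the contrast concrete, here is a small sketch using a pack sum rather than the PR's actual code. Both `recursive_sum` and `loop_sum` are hypothetical names introduced only for illustration. The recursive form instantiates one class template specialization per element, while the loop form does all of its work inside a single function template instantiation (C++14 or later assumed).

```cpp
#include <cstddef>

// Recursive form: recursive_sum<1, 2, 3, 4> instantiates recursive_sum<2, 3, 4>,
// then recursive_sum<3, 4>, and so on, giving N + 1 specializations in a chain.
template <int... Is>
struct recursive_sum;

template <>
struct recursive_sum<>
{
    static constexpr int value = 0;
};

template <int Head, int... Tail>
struct recursive_sum<Head, Tail...>
{
    static constexpr int value = Head + recursive_sum<Tail...>::value;
};

// Loop form: a single instantiation of loop_sum<Is...>; the iteration happens
// during constant evaluation, not through further template instantiations.
template <int... Is>
constexpr int loop_sum()
{
    // The trailing 0 keeps the array non-empty when the pack is empty.
    constexpr int input[] = {Is..., 0};
    int total = 0;
    for(std::size_t i = 0; i < sizeof...(Is); ++i)
    {
        total += input[i];
    }
    return total;
}

static_assert(recursive_sum<1, 2, 3, 4>::value == loop_sum<1, 2, 3, 4>(),
              "both forms compute the same sum");
```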

Build Performance Impact

Template Instantiation Reduction (measured on device_grouped_conv3d_fwd_bias_bnorm_clamp_instance target, 248 files):

  • Baseline: 7,748,880 total instantiations
  • This change: 7,621,984 total instantiations
  • Improvement: -126,896 instantiations (1.6% reduction)

This confirms the optimization successfully reduces template instantiation overhead by eliminating recursive template patterns in favor of pack expansion.

Test Plan

  • Existing SequenceMapInverse.InverseMap and SequenceMapInverse.InverseIdentityMap tests validate correctness
  • CI

🔁 Imported from ROCm/composable_kernel#3596
🧑‍💻 Originally authored by @tenpercent

Use explicit for loop instead of fold expression to build the inverse
permutation array. This avoids potential issues with fold expression
depth limits while maintaining the same O(1) template instantiation
depth optimization.
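
A sketch of what such an explicit loop might look like, modeled on the fragment quoted in the review comment. The `inverse_array` struct and `build_inverse` function are hypothetical names, `index_t` is assumed to be `int`, and the loop body is an assumption about how the inverse is filled in. It requires C++14 or later (for mutation inside a constexpr function) and a non-empty, valid permutation.

```cpp
#include <cstddef>

using index_t = int; // assumption: stand-in for the library's index type

// Hypothetical constexpr-friendly array wrapper.
template <std::size_t N>
struct inverse_array
{
    index_t data[N];
};

// Builds the inverse permutation in one pass: if the input maps pos -> input[pos],
// the inverse maps input[pos] -> pos.
template <index_t... Is>
constexpr inverse_array<sizeof...(Is)> build_inverse()
{
    inverse_array<sizeof...(Is)> result{};
    constexpr index_t input[] = {Is...};
    for(index_t pos = 0; pos < static_cast<index_t>(sizeof...(Is)); ++pos)
    {
        result.data[input[pos]] = pos;
    }
    return result;
}

// (2, 0, 1) sends 0->2, 1->0, 2->1, so the inverse array is {1, 2, 0}.
static_assert(build_inverse<2, 0, 1>().data[0] == 1, "0 comes from position 1");
static_assert(build_inverse<2, 0, 1>().data[1] == 2, "1 comes from position 2");
static_assert(build_inverse<2, 0, 1>().data[2] == 0, "2 comes from position 0");
```

The single pass fills the whole array in O(N) operations, and the elements can then be read back through a pack expansion to form the resulting sequence type.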

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tenpercent tenpercent force-pushed the import/develop/ROCm_composable_kernel/pr-3596 branch from c7bcd1f to 027528f February 10, 2026 22:06
};
InverseArray result{};
constexpr index_t input[] = {Is...};
for(index_t pos = 0; pos < static_cast<index_t>(sizeof...(Is)); ++pos)

Discussed with @cgmillette: the for loop is more readable, and its impact on build time is not measurable.

@tenpercent tenpercent self-requested a review February 10, 2026 22:23
cgmillette pushed a commit that referenced this pull request Feb 11, 2026
Use @tenpercent's implementation from #4288.
This change uses a constexpr for loop to build the inverse in O(N)
operations with O(1) template instantiation depth.

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
@cgmillette
Contributor

cgmillette commented Feb 11, 2026

Merged into #4447. The new PR has proper CI running.

@tenpercent
Contributor

Merge with #4447

@tenpercent tenpercent closed this Feb 11, 2026
@cgmillette cgmillette deleted the import/develop/ROCm_composable_kernel/pr-3596 branch February 11, 2026 17:38