
Conversation

@LoserCheems
Collaborator

Introduces a new benchmark function that skips topk computation to isolate
pure kernel performance from selection overhead. This enables more accurate
measurement of the core attention mechanism by removing the topk bottleneck.

Expands benchmark configurations to include inference scenarios with varying
key lengths and updates output formatting to display both standard and
no-topk performance metrics with speedup comparisons.

Provides insight into the impact of topk overhead on overall performance.
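As a rough illustration, the no-topk variant might reduce to the sketch below. This is a hypothetical shape: dynamic_mask_attention_cuda_no_topk is the PR's actual function name, and apply_dynamic_mask_attention / keep_window_size appear in the review discussion, but the signature and wrapper structure here are assumed.

def dynamic_mask_attention_cuda_no_topk(query_states, key_states, value_states, **kwargs):
    # keep_window_size=0 skips topk selection entirely, so the measured
    # time reflects only the core attention kernel, not selection overhead.
    return apply_dynamic_mask_attention(
        query_states, key_states, value_states,
        keep_window_size=0,
        **kwargs,
    )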

Copilot AI review requested due to automatic review settings June 21, 2025 03:39
@LoserCheems self-assigned this Jun 21, 2025
@LoserCheems requested review from Evanwu1125, SNHuan and wubingheng111 and removed request for Copilot June 21, 2025 03:39
@LoserCheems added the feature (New feature request) label Jun 21, 2025


Changes keep_window_size from key_len to 0 when topk is disabled, preventing potential memory issues or incorrect attention calculations.

Also removes trailing whitespace from main function call.
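In other words (a hypothetical sketch of the parameter change; only the parameter name and values come from the commit message):

# before: keep_window_size=key_len  -> selection path still ran over the full key length
# after:  keep_window_size=0        -> topk selection is skipped entirely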
@LoserCheems requested a review from Copilot June 21, 2025 03:49
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces a new benchmark function that isolates kernel performance by skipping the topk computation. It expands benchmark configurations to capture both standard and no-topk performance and updates output formatting to include additional speedup comparisons.

  • Added a new function dynamic_mask_attention_cuda_no_topk to measure kernel performance without topk overhead.
  • Updated benchmark loops and result reporting to include no-topk metrics.
  • Extended the benchmark configuration with new inference scenarios.
Comments suppressed due to low confidence (1)

benchmarks/benchmark_forward_performance.py:248

  • Consider adding an inline comment explaining why keep_window_size is set to 0 in the subsequent call to apply_dynamic_mask_attention, clarifying its role in skipping the topk computation for kernel performance measurements.
    zero_hold_states = calculate_zero_hold_states(value_states, dt_proj, A, causal_mask)

torch.cuda.empty_cache()

# Warmup runs
for _ in range(warmup_runs):

Copilot AI Jun 21, 2025


[nitpick] The loops for warmup and benchmark runs for dynamic_mask_attention_no_topk are similar to those used for the standard dynamic_mask_attention. Consider refactoring these into a helper function to reduce duplication and improve maintainability.
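A minimal sketch of such a helper, assuming CUDA-event timing (the name run_timed and its signature are hypothetical, not part of the PR):

import torch

def run_timed(fn, *args, warmup_runs=3, benchmark_runs=10, **kwargs):
    # Warmup: amortize compilation and caching effects before measuring.
    for _ in range(warmup_runs):
        fn(*args, **kwargs)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(benchmark_runs):
        fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    # elapsed_time returns milliseconds between the two recorded events.
    return start.elapsed_time(end) / benchmark_runs

Both paths could then share it, e.g. run_timed(dynamic_mask_attention, ...) and run_timed(dynamic_mask_attention_no_topk, ...).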

Comment on lines +541 to +542
print(f"🔧 {'Configuration':<42} {'Flash (ms)':<12} 🚀 {'DMA (ms)':<12} 🚀 {'DMA-Skip-All (ms)':<22} 📈 {'Speedup':<12} 📈 {'Skip-All-Speedup':<20} 💾 {'Memory':<10}")
print("🔄" + "-" * 155 + "🔄")

Copilot AI Jun 21, 2025


[nitpick] The header printing uses hard-coded column widths. Consider extracting these widths as constants or using a formatting helper to improve flexibility and maintainability.

Suggested change
print(f"🔧 {'Configuration':<42}{'Flash (ms)':<12} 🚀 {'DMA (ms)':<12} 🚀 {'DMA-Skip-All (ms)':<22} 📈 {'Speedup':<12} 📈 {'Skip-All-Speedup':<20} 💾 {'Memory':<10}")
print("🔄" + "-" * 155 + "🔄")
print(f"🔧 {'Configuration':<{CONFIGURATION_WIDTH}}{'Flash (ms)':<{FLASH_WIDTH}} 🚀 {'DMA (ms)':<{DMA_WIDTH}} 🚀 {'DMA-Skip-All (ms)':<{DMA_SKIP_ALL_WIDTH}} 📈 {'Speedup':<{SPEEDUP_WIDTH}} 📈 {'Skip-All-Speedup':<{SKIP_ALL_SPEEDUP_WIDTH}} 💾 {'Memory':<{MEMORY_WIDTH}}")
print("🔄" + "-" * (CONFIGURATION_WIDTH + FLASH_WIDTH + DMA_WIDTH + DMA_SKIP_ALL_WIDTH + SPEEDUP_WIDTH + SKIP_ALL_SPEEDUP_WIDTH + MEMORY_WIDTH + 20) + "🔄")

Collaborator

@SNHuan left a comment


LGTM


@LoserCheems merged commit 477c47e into main Jun 21, 2025
@LoserCheems deleted the Update-benchmarks branch November 13, 2025 04:40