Adds no-topk variant for kernel performance analysis #20
Conversation
Introduces a new benchmark function that skips the topk computation to isolate pure kernel performance from selection overhead. This enables more accurate measurement of the core attention mechanism by removing the topk bottleneck. The PR also expands the benchmark configurations to include inference scenarios with varying key lengths, and updates the output formatting to display both standard and no-topk performance metrics with speedup comparisons, giving insight into the impact of topk overhead on overall performance.

Changes keep_window_size from key_len to 0 when topk is disabled, preventing potential memory issues and incorrect attention calculations. Also removes trailing whitespace from the main function call.
Pull Request Overview
This PR introduces a new benchmark function that isolates kernel performance by skipping the topk computation. It expands benchmark configurations to capture both standard and no-topk performance and updates output formatting to include additional speedup comparisons.
- Added a new function dynamic_mask_attention_cuda_no_topk to measure kernel performance without topk overhead.
- Updated benchmark loops and result reporting to include no-topk metrics.
- Extended the benchmark configuration with new inference scenarios.
Comments suppressed due to low confidence (1)
benchmarks/benchmark_forward_performance.py:248
- Consider adding an inline comment explaining why keep_window_size is set to 0 in the subsequent call to apply_dynamic_mask_attention, clarifying its role in skipping the topk computation for kernel performance measurements.
zero_hold_states = calculate_zero_hold_states(value_states, dt_proj, A, causal_mask)
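For illustration, the suggested inline comment might look like the sketch below; the exact argument list of apply_dynamic_mask_attention is not shown in this diff, so the call uses placeholder arguments rather than the actual code:

```python
zero_hold_states = calculate_zero_hold_states(value_states, dt_proj, A, causal_mask)

# keep_window_size=0 skips the topk selection entirely, so this call measures
# only the core attention kernel without the selection overhead.
attn_output = apply_dynamic_mask_attention(
    query_states, key_states, value_states,  # placeholder arguments for illustration
    zero_hold_states,
    keep_window_size=0,
)
```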
    torch.cuda.empty_cache()

    # Warmup runs
    for _ in range(warmup_runs):
Copilot AI commented on Jun 21, 2025
[nitpick] The loops for warmup and benchmark runs for dynamic_mask_attention_no_topk are similar to those used for the standard dynamic_mask_attention. Consider refactoring these into a helper function to reduce duplication and improve maintainability.
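A possible shape for such a helper, sketched below; the helper, its name, and the default run counts are not part of this PR and are only illustrative (only standard torch.cuda timing utilities are assumed):

```python
import torch

def time_attention_fn(attention_fn, *args, warmup_runs=3, benchmark_runs=10, **kwargs):
    """Time a CUDA attention callable: warmup first, then return mean runtime in ms."""
    torch.cuda.empty_cache()

    # Warmup runs (not timed)
    for _ in range(warmup_runs):
        attention_fn(*args, **kwargs)
    torch.cuda.synchronize()

    # Timed benchmark runs, measured with CUDA events
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(benchmark_runs):
        attention_fn(*args, **kwargs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / benchmark_runs
```

Both the standard and no-topk paths could then share this helper, e.g. time_attention_fn(dynamic_mask_attention_cuda, ...) and time_attention_fn(dynamic_mask_attention_cuda_no_topk, ...).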
    print(f"🔧 {'Configuration':<42} ⚡ {'Flash (ms)':<12} 🚀 {'DMA (ms)':<12} 🚀 {'DMA-Skip-All (ms)':<22} 📈 {'Speedup':<12} 📈 {'Skip-All-Speedup':<20} 💾 {'Memory':<10}")
    print("🔄" + "-" * 155 + "🔄")
Copilot AI commented on Jun 21, 2025
[nitpick] The header printing uses hard-coded column widths. Consider extracting these widths as constants or using a formatting helper to improve flexibility and maintainability.
Original:

    print(f"🔧 {'Configuration':<42} ⚡ {'Flash (ms)':<12} 🚀 {'DMA (ms)':<12} 🚀 {'DMA-Skip-All (ms)':<22} 📈 {'Speedup':<12} 📈 {'Skip-All-Speedup':<20} 💾 {'Memory':<10}")
    print("🔄" + "-" * 155 + "🔄")

Suggested change:

    print(f"🔧 {'Configuration':<{CONFIGURATION_WIDTH}} ⚡ {'Flash (ms)':<{FLASH_WIDTH}} 🚀 {'DMA (ms)':<{DMA_WIDTH}} 🚀 {'DMA-Skip-All (ms)':<{DMA_SKIP_ALL_WIDTH}} 📈 {'Speedup':<{SPEEDUP_WIDTH}} 📈 {'Skip-All-Speedup':<{SKIP_ALL_SPEEDUP_WIDTH}} 💾 {'Memory':<{MEMORY_WIDTH}}")
    print("🔄" + "-" * (CONFIGURATION_WIDTH + FLASH_WIDTH + DMA_WIDTH + DMA_SKIP_ALL_WIDTH + SPEEDUP_WIDTH + SKIP_ALL_SPEEDUP_WIDTH + MEMORY_WIDTH + 20) + "🔄")
SNHuan left a comment:
LGTM