A few fixes to the MFU/MBU code #1108
base: main
Conversation
Force-pushed from c2b547f to a68ba0d.
Added back all docstrings and inline comments that were removed during the sliding window implementation. These comments explain the assumptions, calculations, and design decisions in the FLOP and memory bandwidth estimation code.

Changes:
- Restored docstrings for all ModelDims methods (attn_flops, mlp_flops, prefill_flops, decode_flops, flops, weight_memory_bytes, kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes, decode_memory_bytes, memory_bytes)
- Restored inline comments explaining calculation details
- Kept all functionality changes (sliding window support, A100 bandwidth fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed attn_flops signature from using a boolean use_sliding_window flag
to accepting the sliding_window value directly as an Optional[int]. This
makes the API cleaner and more explicit.
Changes:
- attn_flops now takes sliding_window: Optional[int] = None instead of
use_sliding_window: bool = False
- Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle
None case elegantly
- Updated all call sites in prefill_flops and decode_flops to pass
sliding_window=None for full attention layers and
sliding_window=self.sliding_window for sliding window layers
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
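A minimal sketch of the new shape of the call (the `min(...)` line is quoted from the commit; the FLOP formula and the parameters other than `sliding_window` are illustrative assumptions, not the PR's exact code):

```python
from typing import Optional

def attn_flops(num_heads: int, head_dim: int, query_len: int, kv_len: int,
               sliding_window: Optional[int] = None) -> int:
    """Rough attention FLOPs for one layer: QK^T plus attn @ V."""
    # None means full attention; `or float("inf")` makes min() a no-op,
    # so one expression handles both the windowed and full cases.
    kv_len = min(kv_len, sliding_window or float("inf"))
    # Two matmuls, two FLOPs per multiply-add.
    return int(2 * 2 * num_heads * head_dim * query_len * kv_len)
```

Call sites then read `attn_flops(..., sliding_window=None)` for full-attention layers and `attn_flops(..., sliding_window=self.sliding_window)` for sliding-window layers, which is what the commit updates in `prefill_flops` and `decode_flops`.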
```python
prefill_flops = model_dims.flops([sequence_length], None)
decode_flops = total_flops - prefill_flops
decode_flops_in_gflops = decode_flops / 1e9
self.assertAlmostEqual(decode_flops_in_gflops, 27.92, delta=0.01)
```
This comes from some sanity checking that I did manually for the olmo3 paper.
```python
total_bytes *= 2

memory_in_gb = total_bytes / 1e9
self.assertAlmostEqual(memory_in_gb, 3.926, delta=0.01)
```
This comes from some sanity checking that I did manually for the olmo3 paper.
```python
actor_total_memory_bytes = model_dims.memory_bytes(
    prompt_lengths, response_lengths, samples_per_prompt=samples_per_prompt
)
num_inference_gpus = num_engines * num_gpus_per_engine
```
Seems like you don't use this other than for the calculate function below. Might want to just remove it and compute it inside calculate_actor_utilization.
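A hedged sketch of that suggestion (the function body, `total_time_s`, and the `device_memory_bandwidth` field are assumptions; only `memory_bytes` and the GPU-count product come from the diff):

```python
def calculate_actor_utilization(model_dims, prompt_lengths, response_lengths,
                                samples_per_prompt, total_time_s,
                                num_engines, num_gpus_per_engine):
    # Compute the GPU count here instead of at the call site, since it
    # isn't used anywhere else (hypothetical restructuring).
    num_inference_gpus = num_engines * num_gpus_per_engine
    total_bytes = model_dims.memory_bytes(
        prompt_lengths, response_lengths, samples_per_prompt=samples_per_prompt
    )
    achieved_bw_per_gpu = total_bytes / total_time_s / num_inference_gpus
    return achieved_bw_per_gpu / model_dims.device_memory_bandwidth  # assumed field
```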
```python
num_params: int | None = None
device_name: str | None = None
sliding_window: int | None = None
num_sliding_window_layers: int = 0
```
Will this work for GQA models as well? Will the group size just be the num_kv_heads?
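For reference, a hedged sketch of how GQA usually enters these estimates (standard accounting, not necessarily this PR's code): KV-cache size and KV-side memory traffic scale with `num_kv_heads`, while score/value FLOPs still count every query head, so the group size is `num_heads // num_kv_heads` and MHA is the special case `num_kv_heads == num_heads`.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_element: int = 2) -> int:
    # K and V each store num_kv_heads * head_dim elements per layer;
    # bytes_per_element=2 assumes bf16/fp16.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
```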
Previously, we would occasionally get >100%.
Fixes #1098. Also adds logging so we can reproduce calculations if there are future issues with MFU/MBU calculations.
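A minimal sketch of the kind of logging meant here (the helper name comes from the summary below; the signature and payload shape are assumptions):

```python
import json
import logging

logger = logging.getLogger(__name__)

def check_calculation(name: str, utilization: float, inputs: dict) -> None:
    """Warn with a reproducible JSON payload when a metric exceeds 100%."""
    if utilization > 1.0:
        logger.warning(
            "%s exceeded 100%% (%.2f%%); repro inputs: %s",
            name, utilization * 100, json.dumps(inputs, sort_keys=True),
        )
```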
Fixes were:
- Corrected `GPU_SPECS['a100'].memory_bandwidth` to `2.0e12` bytes/s.
- Accounted for sliding-window attention layers in the FLOP and memory estimates.
- Used the proper `num_kv_heads` and averaged memory traffic per engine/GPU rather than attributing it all to one device.
Note
Refactors MFU/MBU to account for vLLM engines/TP, sliding-window attention, and corrected GPU specs, and adds tests + fixtures to verify utilization never exceeds 100%.
- `utils.ModelDims` to compute FLOPs/bytes with sliding-window attention, proper `num_kv_heads`, and per-engine memory (`memory_bytes`) averaging.
- `calculate_mfu`, `calculate_mbu`, `calculate_actor_utilization`, `calculate_learner_utilization`, and `check_calculation` (warns with repro JSON).
- `GPU_SPECS['a100'].memory_bandwidth` to `2.0e12`.
- `from_vllm_config` to use `hf_text_config` head counts and sliding-window layer detection.
- `grpo_fast.calculate_utilization_metrics` now takes `num_engines`/`num_gpus_per_engine`, uses new `ModelDims` utilization helpers; call sites updated.
- `benchmark_generators.py` computes `mfu`/`mbu` via `ModelDims.calculate_mfu`/`mbu` with engine/TP parameters.
- `open_instruct/test_data/mbu_reproduction_cases.json` and new tests in `test_utils.py` for FLOPs/memory sanity, multi-engine utilization, vLLM config parity, and MBU reproduction (asserts <= 100%).

Written by Cursor Bugbot for commit f6ec329. This will update automatically on new commits.
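To illustrate the A100 bandwidth fix from the summary (field names and the FLOP figure are assumptions; the `2.0e12` value is the PR's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPUSpec:
    flops_per_second: float    # peak dense bf16 FLOP/s
    memory_bandwidth: float    # bytes/s

# A100 80GB SXM: ~312 TFLOP/s bf16; HBM bandwidth corrected to 2.0e12 B/s.
GPU_SPECS = {"a100": GPUSpec(flops_per_second=312e12, memory_bandwidth=2.0e12)}
```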