@finbarrtimbers finbarrtimbers commented Oct 22, 2025

Previously, we would occasionally get MFU/MBU values >100%.

Fixes #1098. Also adds logging so we can reproduce the numbers if there are future issues with the MFU/MBU calculations.

Fixes were:

  1. We now properly account for sliding window attention.
  2. Updated the A100 memory bandwidth to the 80GB version's (2.0e12 B/s), not the 40GB one's (see the worked check after this list).
  3. Fixed a bug in how we account for the number of heads (we took it from the vLLM parallel_config, which is wrong, as that value is already divided by the tensor-parallel sharding).
  4. Properly accounted for multiple devices.
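As a worked check of fix 2 (a sketch, not the PR's code; the 40GB figure is the published A100 HBM spec, and 2.0e12 is the value this PR sets in GPU_SPECS):

```python
# Hypothetical throughput for illustration: suppose we measured 1.8e12 bytes/s per GPU.
achieved_bytes_per_s = 1.8e12

a100_40gb_bandwidth = 1.555e12  # published peak HBM bandwidth of the 40GB A100
a100_80gb_bandwidth = 2.0e12    # value this PR sets for GPU_SPECS["a100"]

print(f"MBU vs 40GB spec: {achieved_bytes_per_s / a100_40gb_bandwidth:.0%}")  # ~116%, impossible
print(f"MBU vs 80GB spec: {achieved_bytes_per_s / a100_80gb_bandwidth:.0%}")  # 90%
```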

Note

Refactors MFU/MBU to account for vLLM engines/TP, sliding-window attention, and corrected GPU specs, and adds tests + fixtures to verify utilization never exceeds 100%.

  • Metrics/Utils:
    • Refactor utils.ModelDims to compute FLOPs/bytes with sliding-window attention, proper num_kv_heads, and per-engine memory (memory_bytes) averaging.
    • Add calculate_mfu, calculate_mbu, calculate_actor_utilization, calculate_learner_utilization, and check_calculation, which warns with a repro JSON (a sketch of that pattern follows this list).
    • Update GPU_SPECS['a100'].memory_bandwidth to 2.0e12.
    • Improve from_vllm_config to use hf_text_config head counts and sliding-window layer detection.
  • Training/Benchmark:
    • grpo_fast.calculate_utilization_metrics now takes num_engines/num_gpus_per_engine, uses new ModelDims utilization helpers; call sites updated.
    • benchmark_generators.py computes mfu/mbu via ModelDims.calculate_mfu/mbu with engine/TP parameters.
  • Tests & Data:
    • Add open_instruct/test_data/mbu_reproduction_cases.json and new tests in test_utils.py for FLOPs/memory sanity, multi-engine utilization, vLLM config parity, and MBU reproduction (asserts <= 100%).
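A minimal sketch of the check_calculation pattern referenced above: warn and dump a reproduction payload when a metric exceeds 100%. The signature and payload fields here are illustrative, not the PR's actual code:

```python
import json
import logging

logger = logging.getLogger(__name__)

def check_calculation(name: str, value: float, inputs: dict) -> None:
    """Warn with a JSON repro payload when a utilization metric exceeds 100%."""
    if value > 1.0:
        # Logging the raw inputs lets us replay the exact calculation offline.
        logger.warning("%s=%.1f%% exceeds 100%%; repro inputs: %s", name, value * 100, json.dumps(inputs))

check_calculation("mbu", 1.07, {"device_name": "a100", "num_engines": 2, "num_gpus_per_engine": 1})
```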

Written by Cursor Bugbot for commit f6ec329.

@finbarrtimbers finbarrtimbers marked this pull request as ready for review October 28, 2025 18:42

finbarrtimbers and others added 8 commits October 29, 2025 13:28
Added back all docstrings and inline comments that were removed during
the sliding window implementation. These comments explain the assumptions,
calculations, and design decisions in the FLOP and memory bandwidth
estimation code.

Changes:
- Restored docstrings for all ModelDims methods (attn_flops, mlp_flops,
  prefill_flops, decode_flops, flops, weight_memory_bytes,
  kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes,
  decode_memory_bytes, memory_bytes)
- Restored inline comments explaining calculation details
- Kept all functionality changes (sliding window support, A100 bandwidth fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed attn_flops signature from using a boolean use_sliding_window flag
to accepting the sliding_window value directly as an Optional[int]. This
makes the API cleaner and more explicit.

Changes:
- attn_flops now takes sliding_window: Optional[int] = None instead of
  use_sliding_window: bool = False
- Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle
  None case elegantly
- Updated all call sites in prefill_flops and decode_flops to pass
  sliding_window=None for full attention layers and
  sliding_window=self.sliding_window for sliding window layers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
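A sketch of the resulting signature (only the Optional[int] parameter and the min(...) clamp come from the commit message; the FLOP formula itself is illustrative):

```python
from typing import Optional

def attn_flops(num_heads: int, head_dim: int, query_len: int, kv_len: int,
               sliding_window: Optional[int] = None) -> float:
    """FLOPs for one attention layer, clamped to the sliding window if set."""
    # A sliding-window layer only attends to the last `sliding_window` keys;
    # `sliding_window or float("inf")` leaves full attention unchanged when None.
    kv_len = min(kv_len, sliding_window or float("inf"))
    # QK^T and the attention-weighted sum over V each cost
    # 2 * query_len * kv_len * head_dim multiply-adds per head.
    return 2 * 2 * num_heads * query_len * kv_len * head_dim
```

Per the commit, prefill_flops and decode_flops then pass sliding_window=None for full-attention layers and sliding_window=self.sliding_window for sliding-window layers.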

```python
prefill_flops = model_dims.flops([sequence_length], None)
decode_flops = total_flops - prefill_flops
decode_flops_in_gflops = decode_flops / 1e9
self.assertAlmostEqual(decode_flops_in_gflops, 27.92, delta=0.01)
```
finbarrtimbers (Collaborator, Author):

This comes from some sanity checking that I did manually for the olmo3 paper.

```python
total_bytes *= 2

memory_in_gb = total_bytes / 1e9
self.assertAlmostEqual(memory_in_gb, 3.926, delta=0.01)
```
finbarrtimbers (Collaborator, Author):

This comes from some sanity checking that I did manually for the olmo3 paper.


```python
actor_total_memory_bytes = model_dims.memory_bytes(
    prompt_lengths, response_lengths, samples_per_prompt=samples_per_prompt
)
num_inference_gpus = num_engines * num_gpus_per_engine
```
Contributor:

seems like you don't use this other than for the calculate function below. might want to just remove it and calculate it in the calculate_actor_util function
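A sketch of that suggestion, with the product folded into the helper (the name matches the PR's calculate_actor_utilization; the body is an illustrative MBU formula, not the actual implementation):

```python
def calculate_actor_utilization(total_memory_bytes: float, elapsed_seconds: float,
                                device_memory_bandwidth: float,
                                num_engines: int, num_gpus_per_engine: int) -> float:
    """Illustrative MBU: bytes moved per GPU-second over peak bandwidth."""
    # Computing the GPU count here means call sites don't carry a one-use local.
    num_inference_gpus = num_engines * num_gpus_per_engine
    return total_memory_bytes / (elapsed_seconds * num_inference_gpus * device_memory_bandwidth)
```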

```python
num_params: int | None = None
device_name: str | None = None
sliding_window: int | None = None
num_sliding_window_layers: int = 0
```
Contributor:

will this work for GQA models as well? will the group size just be the num_kv_heads?
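For context on the GQA question: standard KV-cache accounting uses num_kv_heads directly, so GQA falls out of the same formula with no special-casing (a sketch under that assumption; the PR's memory_bytes may differ in detail):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes written to the KV cache per generated token."""
    # K and V are each [num_kv_heads, head_dim] per layer; under GQA,
    # num_kv_heads is simply smaller than the number of query heads.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Example: 32 layers, 8 KV heads, head_dim 128, bf16 (2 bytes).
print(kv_cache_bytes_per_token(32, 8, 128))  # 131072 bytes ≈ 128 KiB per token
```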

Development

Successfully merging this pull request may close: Unsure: actor MBU might have a bug (#1098)