@finbarrtimbers finbarrtimbers commented Oct 22, 2025

Previously, we would occasionally get MFU/MBU values >100%.

Fixes #1098. Also adds logging so we can reproduce the numbers if there are future issues with the MFU/MBU calculations.

Fixes were:

  1. We now properly account for sliding window attention.
  2. Updated the A100 memory bandwidth to the 80GB version's (2.0e12 B/s), not the 40GB one's (see the worked check after this list).
  3. Fixed a bug in how we account for the number of heads (we took it from the vLLM parallel_config, which is wrong, as that value is already divided by the tensor-parallel sharding).
  4. Properly accounted for multiple devices.
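As a worked check of fix 2 (a sketch, not the PR's code; the 40GB figure is the published A100 HBM spec, and 2.0e12 is the value this PR sets in GPU_SPECS):

```python
# Hypothetical throughput for illustration: suppose we measured 1.8e12 bytes/s per GPU.
achieved_bytes_per_s = 1.8e12

a100_40gb_bandwidth = 1.555e12  # published peak HBM bandwidth of the 40GB A100
a100_80gb_bandwidth = 2.0e12    # value this PR sets for GPU_SPECS["a100"]

print(f"MBU vs 40GB spec: {achieved_bytes_per_s / a100_40gb_bandwidth:.0%}")  # ~116%, impossible
print(f"MBU vs 80GB spec: {achieved_bytes_per_s / a100_80gb_bandwidth:.0%}")  # 90%
```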

Note

Refactors MFU/MBU to account for vLLM engines/TP, sliding-window attention, and corrected GPU specs, and adds tests + fixtures to verify utilization never exceeds 100%.

  • Metrics/Utils:
    • Refactor utils.ModelDims to compute FLOPs/bytes with sliding-window attention, proper num_kv_heads, and per-engine memory (memory_bytes) averaging.
    • Add calculate_mfu, calculate_mbu, calculate_actor_utilization, calculate_learner_utilization, and check_calculation, which warns with a repro JSON (a sketch of that pattern follows this list).
    • Update GPU_SPECS['a100'].memory_bandwidth to 2.0e12.
    • Improve from_vllm_config to use hf_text_config head counts and sliding-window layer detection.
  • Training/Benchmark:
    • grpo_fast.calculate_utilization_metrics now takes num_engines/num_gpus_per_engine, uses new ModelDims utilization helpers; call sites updated.
    • benchmark_generators.py computes mfu/mbu via ModelDims.calculate_mfu/mbu with engine/TP parameters.
  • Tests & Data:
    • Add open_instruct/test_data/mbu_reproduction_cases.json and new tests in test_utils.py for FLOPs/memory sanity, multi-engine utilization, vLLM config parity, and MBU reproduction (asserts <= 100%).
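A minimal sketch of the check_calculation pattern referenced above: warn and dump a reproduction payload when a metric exceeds 100%. The signature and payload fields here are illustrative, not the PR's actual code:

```python
import json
import logging

logger = logging.getLogger(__name__)

def check_calculation(name: str, value: float, inputs: dict) -> None:
    """Warn with a JSON repro payload when a utilization metric exceeds 100%."""
    if value > 1.0:
        # Logging the raw inputs lets us replay the exact calculation offline.
        logger.warning("%s=%.1f%% exceeds 100%%; repro inputs: %s", name, value * 100, json.dumps(inputs))

check_calculation("mbu", 1.07, {"device_name": "a100", "num_engines": 2, "num_gpus_per_engine": 1})
```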

Written by Cursor Bugbot for commit f6ec329.

@finbarrtimbers finbarrtimbers marked this pull request as ready for review October 28, 2025 18:42

finbarrtimbers and others added 8 commits October 29, 2025 13:28
Added back all docstrings and inline comments that were removed during
the sliding window implementation. These comments explain the assumptions,
calculations, and design decisions in the FLOP and memory bandwidth
estimation code.

Changes:
- Restored docstrings for all ModelDims methods (attn_flops, mlp_flops,
  prefill_flops, decode_flops, flops, weight_memory_bytes,
  kv_cache_write_bytes, kv_cache_read_bytes, prefill_memory_bytes,
  decode_memory_bytes, memory_bytes)
- Restored inline comments explaining calculation details
- Kept all functionality changes (sliding window support, A100 bandwidth fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed attn_flops signature from using a boolean use_sliding_window flag
to accepting the sliding_window value directly as an Optional[int]. This
makes the API cleaner and more explicit.

Changes:
- attn_flops now takes sliding_window: Optional[int] = None instead of
  use_sliding_window: bool = False
- Uses kv_len = min(kv_len, sliding_window or float("inf")) to handle
  None case elegantly
- Updated all call sites in prefill_flops and decode_flops to pass
  sliding_window=None for full attention layers and
  sliding_window=self.sliding_window for sliding window layers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
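A sketch of the resulting signature (only the Optional[int] parameter and the min(...) clamp come from the commit message; the FLOP formula itself is illustrative):

```python
from typing import Optional

def attn_flops(num_heads: int, head_dim: int, query_len: int, kv_len: int,
               sliding_window: Optional[int] = None) -> float:
    """FLOPs for one attention layer, clamped to the sliding window if set."""
    # A sliding-window layer only attends to the last `sliding_window` keys;
    # `sliding_window or float("inf")` leaves full attention unchanged when None.
    kv_len = min(kv_len, sliding_window or float("inf"))
    # QK^T and the attention-weighted sum over V each cost
    # 2 * query_len * kv_len * head_dim multiply-adds per head.
    return 2 * 2 * num_heads * query_len * kv_len * head_dim
```

Per the commit, prefill_flops and decode_flops then pass sliding_window=None for full-attention layers and sliding_window=self.sliding_window for sliding-window layers.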

```python
prefill_flops = model_dims.flops([sequence_length], None)
decode_flops = total_flops - prefill_flops
decode_flops_in_gflops = decode_flops / 1e9
self.assertAlmostEqual(decode_flops_in_gflops, 27.92, delta=0.01)
```
finbarrtimbers (Collaborator, Author):

This comes from some sanity checking that I did manually for the olmo3 paper.

```python
total_bytes *= 2

memory_in_gb = total_bytes / 1e9
self.assertAlmostEqual(memory_in_gb, 3.926, delta=0.01)
```
finbarrtimbers (Collaborator, Author):

This comes from some sanity checking that I did manually for the olmo3 paper.


```python
actor_total_memory_bytes = model_dims.memory_bytes(
    prompt_lengths, response_lengths, samples_per_prompt=samples_per_prompt
)
num_inference_gpus = num_engines * num_gpus_per_engine
```
Contributor:

seems like you don't use this other than for the calculate function below. might want to just remove it and calculate it in the calculate_actor_util function
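A sketch of that suggestion, with the product folded into the helper (the name matches the PR's calculate_actor_utilization; the body is an illustrative MBU formula, not the actual implementation):

```python
def calculate_actor_utilization(total_memory_bytes: float, elapsed_seconds: float,
                                device_memory_bandwidth: float,
                                num_engines: int, num_gpus_per_engine: int) -> float:
    """Illustrative MBU: bytes moved per GPU-second over peak bandwidth."""
    # Computing the GPU count here means call sites don't carry a one-use local.
    num_inference_gpus = num_engines * num_gpus_per_engine
    return total_memory_bytes / (elapsed_seconds * num_inference_gpus * device_memory_bandwidth)
```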

```python
num_params: int | None = None
device_name: str | None = None
sliding_window: int | None = None
num_sliding_window_layers: int = 0
```
Contributor:

will this work for GQA models as well? will the group size just be the num_kv_heads?
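For context on the GQA question: standard KV-cache accounting uses num_kv_heads directly, so GQA falls out of the same formula with no special-casing (a sketch under that assumption; the PR's memory_bytes may differ in detail):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Bytes written to the KV cache per generated token."""
    # K and V are each [num_kv_heads, head_dim] per layer; under GQA,
    # num_kv_heads is simply smaller than the number of query heads.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Example: 32 layers, 8 KV heads, head_dim 128, bf16 (2 bytes).
print(kv_cache_bytes_per_token(32, 8, 128))  # 131072 bytes ≈ 128 KiB per token
```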

Development

Successfully merging this pull request may close: Unsure: actor MBU might have a bug (#1098)