[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates by maocheng23 · Pull Request #639 · sgl-project/SpecForge

maocheng23 · 2026-07-01T23:37:46Z

E1 — Acceptance-length eval harness: `EvalConfig` + `EvalCache` + gates

Implements E1 of the eval-and-breadth track (#630, docs/roadmap/eval-and-breadth.md). Stacked on #638 (E0 layout) so the eval code lives in its final home (specforge/eval/).

Landed split: the Evaluator itself — aggregate-before-geometric-sum, token-weighted avg_loss and token-weighted scalar accuracy (the ttt_length=1 degenerate case), DP-reduced metrics, durable best-checkpoint tracking — landed with Phase D (#637). This PR adds the remaining E1 pieces:

EvalConfig (specforge/eval/evaluator.py) — the frozen identity of an eval pass. Its identity_fields() are exactly the plan §4.4 cache key: eval-data path, target path + revision, tokenizer path, chat template, aux hidden-state layer ids, max seqlen. micro_batch_size is deliberately not identity — metrics are batch-size invariant, so re-batching the same eval set hits the same cache entry.
EvalCache (specforge/eval/cache.py) — maps an EvalConfig to a directory of materialized eval features. get_or_produce runs the producer into a staging dir published by atomic rename (the rename is the completeness marker; produce races collapse to one entry). Keys digest the JSON-encoded field tuple, so the encoding is injective (a | inside one field can never realign boundaries into another identity's key). Deliberately separate from the tokenization cache (different keys, different lifecycle, per plan §4.4).
Scope note: the launch builders still take a pre-produced hidden_states_path, so the cache is consumed programmatically (EvalCache.for_config + get_or_produce) until Phase E's typed config + CLI own eval-data production — noted in the module docstring and the roadmap.

Gates (`tests/test_runtime/test_eval_harness.py`)

Cache key: flipping any identity field changes the key; micro_batch_size does not; boundary-shift (|-bearing) identities do not collide.
Hit/produce-once/invalidate: the producer runs exactly once per identity; invalidation reproduces.
Eval-stream equivalence: identical metrics through a colocated LocalFeatureStore and a disagg SharedDirFeatureStore — with the shared leg constructed retain_on_release=True and evaluated twice, pinning that a repeat-eval stream survives (mem:// put() is consume-once by design and called out as such).
Loader-level batch-size invariance (GPU): the same eval set through a real eagle3 forward at micro-batch sizes 1/4/16 yields the same simulated_acc_len / avg_loss / avg_acc.

Validation

Full tests/test_runtime on 8×H200: 244 OK (2 skip, 1 xfail) = the E0 baseline 240 + 4 new gates. An adversarial review pass over the E1 diff surfaced 4 confirmed defects (scalar-path batch-size variance, non-injective cache key, absolute-symlink pointers, a consume-once store blessed by the equivalence gate) — all fixed here or in #637 where they belonged.

🤖 Generated with Claude Code

… checkpoint/eval) Brings the training loop to production parity (plan.md §D / roadmap domain-refactor.md §D): - Grad accumulation with no_sync(). TrainerCore now decides the optimizer boundary BEFORE backward and passes is_boundary to FSDPTrainingBackend.backward, which wraps the non-boundary micro-steps in self.module.no_sync() — the FSDP gradient reduction fires once per optimizer step, not once per micro-step. Single-rank / accumulation_steps=1 is unchanged. - Full resume. FSDPTrainingBackend.state_dict() now returns the full training state {model (FSDP FULL_STATE_DICT), optimizer (BF16Optimizer bundles the LR scheduler), rng (cpu+cuda)}; load_state_dict restores all three. save_checkpoint persists the export draft weights + optimizer + rng; the domain Trainer gains resume_from (restore draft weights before the FSDP wrap so the fp32 master is rebuilt from them, then optimizer+scheduler+rng, then start_step/epoch). - CheckpointManager (specforge/training/checkpoint.py) — {run_id}-step{step} layout, keep-last-N rotation that never drops the tracked best, latest/best symlinks + best_meta.json. Born in its S-home; TrainerController imports it lazily so the runtime seam stays leaf. - Evaluator (specforge/eval/evaluator.py) — aggregates per-position accept counts across the whole eval pass BEFORE the geometric sum, so simulated_acc_len is batch-size invariant. TrainerController.evaluate delegates to it; DFlash/Domino scalar accuracy degenerates gracefully. Gates: test_checkpoint_resume extended (loss-curve continuity across save->resume, tol for the bf16-master reconstruction), test_no_sync_equiv (2-rank: no_sync path == per-step reduction, bit-tight), test_evaluator_aggregation (aggregate-before- geometric-sum, batch-size invariance, CheckpointManager rotation/best/latest). Backend doubles + the resume test updated for the composite state_dict. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nly (move-only) Collapse execution code into one home per concern (plan.md §2.3, roadmap E0): - runtime/training/{trainer,backend,strategy,registry} -> training/{controller,backend,strategies/{base,registry}} - runtime/inference/{rollout_worker,capture} -> inference/; {sglang,dflash}_adapter -> inference/adapters/{eagle3,dflash} - modeling/target/{base,factory,eagle3,dflash}_target_model + sglang_backend/ -> inference/target_engine/ - runtime/launch.py -> specforge/launch.py - modeling/target keeps model defs only (target_head, target_utils, custom_backend) Zero functional change: pure moves + import-path rewrites; every old module path keeps a re-export shim for one release. Suite + Phase-B byte gate must pass unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…29-training-managers

…fe Evaluator, durable best tracking Post-review fixes for #637 (adversarial review vs the #630 roadmap): Multi-rank resume correctness: - CheckpointManager.save now writes each rank's optimizer/RNG to its own training_state_rank{r}.pt beside the rank0 shared payload (draft weights + counters + world_size); under FSDP use_orig_params the AdamW moments live on rank-local shard views, so persisting only rank0's copy and restoring it everywhere corrupted the other ranks' moments and collapsed their RNG streams. read_resume_state hands each rank back its own state and fails fast on a world-size mismatch (and on legacy single-file checkpoints at world_size>1). Resume repositions the data stream (plan.md G1 seek()-equivalent): - TrainerController tracks epoch_batch, persists it, and skips the consumed prefix of the interrupted epoch on resume (via the new FeatureDataLoader.seek — no feature materialization — or islice for plain iterables). The domain Trainer threads it for refs-mode runs; the online queue path keeps control-plane skip_ids reconciliation. - test_resume_loss_curve_continuity now trains on DISTINCT batches, so resuming on the wrong data cannot pass (the old fixed-batch form was structurally blind to the data position). Evaluator DP correctness: - The collective schedule is decided globally (SUM of scalar sums, MAX of the per-position length, then ONE stacked count reduce) so a rank with an empty or scalar-only shard issues the same collectives as its peers — no NCCL desync. Scalar accuracy (DFlash/Domino) is now reduced across ranks like everything else. Collectives use the local device via specforge.utils.get_local_device (CPU for gloo). Accumulation stays on-device (one host sync after the loop), and eval/per_position_acc is reported alongside the folded acc-len. - New 2-process gloo gate: cross-rank scalar reduction + ragged-shard schedule symmetry (test_evaluator_aggregation). Durable, decoupled best tracking: - CheckpointManager rehydrates best_score/best_step from best_meta.json (now also carrying "score") on construction, so a restarted process neither rotates away the on-disk best nor lets a worse score overwrite it; update_best is split into score()/is_better(). - fit() tracks the best on EVERY eval (when checkpointing is enabled), persisting a checkpoint on demand when the best eval lands off the save cadence — previously best only fired when eval_interval and save_interval coincided on the same step. Surface + cleanup: - launch: every builder now forwards resume_from/max_checkpoints, so resume and rotation are reachable from the build_* entry points. - TrainerController: drop the max_checkpoints/checkpoint_manager dual config (inject a configured manager; the domain Trainer does); remove the dead eval_step + mode plumbing (evaluate goes through Evaluator on raw forward_loss; validate_batch still runs inside every strategy's forward_loss); drop the did_eval/last_eval_metrics threading. - Trainer._load_resume_state removed: CheckpointManager.read_resume_state is the single checkpoint reader; the loaded dict is dropped after the weight copy instead of living through the FSDP wrap. - CheckpointManager._rotate uses shutil.rmtree; symlink creation is guarded for filesystems without symlink support (dirs + best_meta.json stay the source of truth); save_checkpoint filters draft weights on rank0 only. - test_no_sync_equiv now also counts no_sync deferrals (exactly accumulation_steps-1 per optimizer step) — the roadmap's "one all-reduce per optimizer step" gate, not just weight equality. - DESIGN.md updated (per-rank layout, seek, best semantics; fixed the inverted no_sync sentence). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…29-training-managers # Conflicts: # specforge/runtime/launch.py

…e position, bound comm device Self-review of the fix commit surfaced four defects, all fixed: - fit()'s on-demand best-save gated collective-bearing save_checkpoint on a PER-RANK is_better() whose best_score comes from rank-local filesystem reads (best_meta.json rehydration) — a divergent view (NFS attribute-cache lag, node-local dirs) would deadlock the group. rank0's verdict is now broadcast (_rank0_decision) and update_best gains force=True so every rank follows it. - The mid-epoch position was persisted in BATCH units only, so resuming with a different batch_size silently mis-seeked the stream. The position is now also tracked/persisted in SAMPLES (epoch_samples); the domain Trainer converts samples back to this run's batches and fails fast when the sample position does not divide by the resumed batch size — symmetric with the world-size guard. - Evaluator._comm_device used get_local_device() (LOCAL_RANK env, default 0): a non-torchrun launcher that doesn't export LOCAL_RANK would put every rank's reduction on cuda:0 and break the NCCL communicator. Now uses the rank's BOUND device (torch.cuda.current_device(), set by init_distributed on every path). - The docstring/DESIGN claim that a ragged shard "cannot desynchronize NCCL" over-promised: it holds for the evaluator's OWN reductions (shard content), but a collective forward (FSDP all-gathers per batch) still requires the same eval-batch COUNT per rank. The docs now state the requirement precisely. Also removed the stale TrainerCore.eval_step rows from the DESIGN.md/ARCHITECTURE.md endpoint tables (the method was removed in the previous commit). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

A shape-[1] loss tensor (e.g. a parameter-shaped scalar) failed to broadcast into the 0-dim float64 accumulator slot; .mean() normalizes exactly like the trainer's _scalar helper. Caught by the new misaligned-intervals best-tracking gate on the pod. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…0-layout Brings the C/D review-fix commits into the consolidated layout. The old-path shim files stay shims; the fixes are ported verbatim onto the moved homes (training/controller.py, training/backend.py, launch.py — verified byte-identical to the up-29 sources modulo E0's import-path rewrites). Same-path files (domain trainer, checkpoint, evaluator, data plane, control plane, tests) merged directly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…he + gates Completes roadmap E1 (docs/roadmap/eval-and-breadth.md) on top of the E0 layout. The aggregation-before-geometric-sum Evaluator, token-weighted avg_loss, batch-size-invariant counts, and durable best-checkpoint tracking landed in Phase D (#637) — this adds the remaining pieces: - EvalConfig (specforge/eval/evaluator.py): the frozen identity of an eval pass. Its identity_fields() are exactly the plan §4.4 cache key — eval-data path, target path + revision, tokenizer path, chat template, aux hidden-state layer ids, max seqlen. micro_batch_size is deliberately NOT identity (metrics are batch-size invariant, so re-batching must hit the same cache entry). - EvalCache (specforge/eval/cache.py): maps an EvalConfig to a directory of materialized eval features. get_or_produce runs the producer into a staging dir published by atomic rename (the rename IS the completeness marker; produce races collapse to one entry), so repeat evals over an unchanged tuple never recompute hidden states. Deliberately separate from the data/tokenization cache (different keys, different lifecycle, per plan §4.4). - Gates (tests/test_runtime/test_eval_harness.py): * every identity field flips the key; micro_batch_size does not; * hit/produce-once/invalidate semantics; * eval-stream equivalence: identical metrics through a colocated LocalFeatureStore and a disagg SharedDirFeatureStore loader; * loader-level batch-size invariance (GPU): the same eval set at micro-batch sizes 1/4/16 yields the same simulated_acc_len / avg_loss through a real eagle3 forward. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…able pointers, retained eval stream Self-review of the E0/E1 stack confirmed four defects, all fixed: - Scalar-accuracy strategies (DFlash/Domino) were NOT batch-size invariant: the Evaluator averaged per-batch means with equal weight, so a ragged last batch skewed eval/avg_acc and simulated_acc_len with eval batching — breaking the E1 target and best-checkpoint selection. The scalar path now token-weights (sum of correct over sum of tokens, the ttt_length=1 case of the sum/count rule); new invariance gate + DP-gate expectation updated to the weighted semantics. - EvalCache.key digested an unescaped "|"-join, a non-injective encoding (a delimiter inside one field could realign boundaries into another identity's key -> silently serving stale hidden states). Keys now digest the JSON-encoded field tuple; boundary-shift pair added to the key gate. - CheckpointManager pointers were absolute-target symlinks, so resume_from=<dir>/best broke (or silently followed the old path) after relocating output_dir. Targets are now relative (basename). - The eval-stream equivalence gate blessed consume-once stores: the loader releases each feature after cloning, so a second evaluate() over the same stream would die mid-training. The shared-dir leg now constructs retain_on_release=True and evaluates twice; the mem:// consume-once semantics are called out where a repeatable stream is needed. - EvalCache.for_config reads EvalConfig.cache_dir (previously a dangling field), and the module docstring carries an explicit scope note: run-surface wiring of the cache lands with Phase E's config+CLI. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…t/latest pointers Two follow-ups from the E0/E1 adversarial review that belong to this PR's code: - Scalar-accuracy strategies (DFlash/Domino) were not batch-size invariant: the Evaluator averaged per-batch means with equal weight, so a ragged last batch skewed eval/avg_acc and simulated_acc_len with eval batching. The scalar path now token-weights (sum of correct over sum of tokens — the ttt_length=1 case of the sum/count rule); new invariance gate, DP-gate expectation updated to the weighted semantics. - CheckpointManager wrote absolute-target symlinks, so resume_from=<dir>/best broke (or silently followed the old absolute path) after relocating output_dir; targets are now relative. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…0-layout

gemini-code-assist · 2026-07-01T23:37:49Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

…29-training-managers

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…0-layout # Conflicts: # specforge/runtime/training/trainer.py

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maocheng23 and others added 13 commits July 1, 2026 02:23

Merge branch 'dataflow-up-28-colocated-lightweight' into dataflow-up-…

b8ffa2a

…29-training-managers

Merge branch 'dataflow-up-28-colocated-lightweight' into dataflow-up-…

04bb60a

…29-training-managers # Conflicts: # specforge/runtime/launch.py

Merge branch 'dataflow-up-29-training-managers' into dataflow-up-30-e…

0e50c9a

…0-layout

Merge branch 'dataflow-up-30-e0-layout' into dataflow-e1-eval-harness

5c11720

maocheng23 and others added 6 commits July 2, 2026 00:52

Merge branch 'dataflow-up-28-colocated-lightweight' into dataflow-up-…

33a6bf1

…29-training-managers

style: apply pre-commit (black/isort/autoflake)

0d6a456

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'dataflow-up-29-training-managers' into dataflow-up-30-e…

ad2ee7e

…0-layout # Conflicts: # specforge/runtime/training/trainer.py

style: apply pre-commit (black/isort/autoflake)

80b906b

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'dataflow-up-30-e0-layout' into dataflow-e1-eval-harness

5297d46

style: apply pre-commit (black/isort/autoflake)

78822bd

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

maocheng23 force-pushed the dataflow-up-30-e0-layout branch 2 times, most recently from 1e3e516 to 3ae106c Compare July 4, 2026 07:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates#639

[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates#639
maocheng23 wants to merge 19 commits into
dataflow-up-30-e0-layoutfrom
dataflow-e1-eval-harness

maocheng23 commented Jul 1, 2026

Uh oh!

gemini-code-assist Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

maocheng23 commented Jul 1, 2026

E1 — Acceptance-length eval harness: EvalConfig + EvalCache + gates

Gates (tests/test_runtime/test_eval_harness.py)

Validation

Uh oh!

gemini-code-assist Bot commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

E1 — Acceptance-length eval harness: `EvalConfig` + `EvalCache` + gates

Gates (`tests/test_runtime/test_eval_harness.py`)