Skip to content

[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates#639

Draft
maocheng23 wants to merge 19 commits into
dataflow-up-30-e0-layoutfrom
dataflow-e1-eval-harness
Draft

[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates#639
maocheng23 wants to merge 19 commits into
dataflow-up-30-e0-layoutfrom
dataflow-e1-eval-harness

Conversation

@maocheng23

Copy link
Copy Markdown
Collaborator

E1 — Acceptance-length eval harness: EvalConfig + EvalCache + gates

Implements E1 of the eval-and-breadth track (#630, docs/roadmap/eval-and-breadth.md). Stacked on #638 (E0 layout) so the eval code lives in its final home (specforge/eval/).

Landed split: the Evaluator itself — aggregate-before-geometric-sum, token-weighted avg_loss and token-weighted scalar accuracy (the ttt_length=1 degenerate case), DP-reduced metrics, durable best-checkpoint tracking — landed with Phase D (#637). This PR adds the remaining E1 pieces:

  • EvalConfig (specforge/eval/evaluator.py) — the frozen identity of an eval pass. Its identity_fields() are exactly the plan §4.4 cache key: eval-data path, target path + revision, tokenizer path, chat template, aux hidden-state layer ids, max seqlen. micro_batch_size is deliberately not identity — metrics are batch-size invariant, so re-batching the same eval set hits the same cache entry.
  • EvalCache (specforge/eval/cache.py) — maps an EvalConfig to a directory of materialized eval features. get_or_produce runs the producer into a staging dir published by atomic rename (the rename is the completeness marker; produce races collapse to one entry). Keys digest the JSON-encoded field tuple, so the encoding is injective (a | inside one field can never realign boundaries into another identity's key). Deliberately separate from the tokenization cache (different keys, different lifecycle, per plan §4.4).
  • Scope note: the launch builders still take a pre-produced hidden_states_path, so the cache is consumed programmatically (EvalCache.for_config + get_or_produce) until Phase E's typed config + CLI own eval-data production — noted in the module docstring and the roadmap.

Gates (tests/test_runtime/test_eval_harness.py)

  • Cache key: flipping any identity field changes the key; micro_batch_size does not; boundary-shift (|-bearing) identities do not collide.
  • Hit/produce-once/invalidate: the producer runs exactly once per identity; invalidation reproduces.
  • Eval-stream equivalence: identical metrics through a colocated LocalFeatureStore and a disagg SharedDirFeatureStore — with the shared leg constructed retain_on_release=True and evaluated twice, pinning that a repeat-eval stream survives (mem:// put() is consume-once by design and called out as such).
  • Loader-level batch-size invariance (GPU): the same eval set through a real eagle3 forward at micro-batch sizes 1/4/16 yields the same simulated_acc_len / avg_loss / avg_acc.

Validation

Full tests/test_runtime on 8×H200: 244 OK (2 skip, 1 xfail) = the E0 baseline 240 + 4 new gates. An adversarial review pass over the E1 diff surfaced 4 confirmed defects (scalar-path batch-size variance, non-injective cache key, absolute-symlink pointers, a consume-once store blessed by the equivalence gate) — all fixed here or in #637 where they belonged.

🤖 Generated with Claude Code

maocheng23 and others added 13 commits July 1, 2026 02:23
… checkpoint/eval)

Brings the training loop to production parity (plan.md §D / roadmap
domain-refactor.md §D):

- Grad accumulation with no_sync(). TrainerCore now decides the optimizer
  boundary BEFORE backward and passes is_boundary to
  FSDPTrainingBackend.backward, which wraps the non-boundary micro-steps in
  self.module.no_sync() — the FSDP gradient reduction fires once per optimizer
  step, not once per micro-step. Single-rank / accumulation_steps=1 is unchanged.
- Full resume. FSDPTrainingBackend.state_dict() now returns the full training
  state {model (FSDP FULL_STATE_DICT), optimizer (BF16Optimizer bundles the LR
  scheduler), rng (cpu+cuda)}; load_state_dict restores all three. save_checkpoint
  persists the export draft weights + optimizer + rng; the domain Trainer gains
  resume_from (restore draft weights before the FSDP wrap so the fp32 master is
  rebuilt from them, then optimizer+scheduler+rng, then start_step/epoch).
- CheckpointManager (specforge/training/checkpoint.py) — {run_id}-step{step}
  layout, keep-last-N rotation that never drops the tracked best, latest/best
  symlinks + best_meta.json. Born in its S-home; TrainerController imports it
  lazily so the runtime seam stays leaf.
- Evaluator (specforge/eval/evaluator.py) — aggregates per-position accept
  counts across the whole eval pass BEFORE the geometric sum, so
  simulated_acc_len is batch-size invariant. TrainerController.evaluate delegates
  to it; DFlash/Domino scalar accuracy degenerates gracefully.

Gates: test_checkpoint_resume extended (loss-curve continuity across save->resume,
tol for the bf16-master reconstruction), test_no_sync_equiv (2-rank: no_sync path
== per-step reduction, bit-tight), test_evaluator_aggregation (aggregate-before-
geometric-sum, batch-size invariance, CheckpointManager rotation/best/latest).
Backend doubles + the resume test updated for the composite state_dict.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nly (move-only)

Collapse execution code into one home per concern (plan.md §2.3, roadmap E0):
- runtime/training/{trainer,backend,strategy,registry} -> training/{controller,backend,strategies/{base,registry}}
- runtime/inference/{rollout_worker,capture} -> inference/; {sglang,dflash}_adapter -> inference/adapters/{eagle3,dflash}
- modeling/target/{base,factory,eagle3,dflash}_target_model + sglang_backend/ -> inference/target_engine/
- runtime/launch.py -> specforge/launch.py
- modeling/target keeps model defs only (target_head, target_utils, custom_backend)

Zero functional change: pure moves + import-path rewrites; every old module path
keeps a re-export shim for one release. Suite + Phase-B byte gate must pass unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fe Evaluator, durable best tracking

Post-review fixes for #637 (adversarial review vs the #630 roadmap):

Multi-rank resume correctness:
- CheckpointManager.save now writes each rank's optimizer/RNG to its own
  training_state_rank{r}.pt beside the rank0 shared payload (draft
  weights + counters + world_size); under FSDP use_orig_params the AdamW
  moments live on rank-local shard views, so persisting only rank0's
  copy and restoring it everywhere corrupted the other ranks' moments
  and collapsed their RNG streams. read_resume_state hands each rank
  back its own state and fails fast on a world-size mismatch (and on
  legacy single-file checkpoints at world_size>1).

Resume repositions the data stream (plan.md G1 seek()-equivalent):
- TrainerController tracks epoch_batch, persists it, and skips the
  consumed prefix of the interrupted epoch on resume (via the new
  FeatureDataLoader.seek — no feature materialization — or islice for
  plain iterables). The domain Trainer threads it for refs-mode runs;
  the online queue path keeps control-plane skip_ids reconciliation.
- test_resume_loss_curve_continuity now trains on DISTINCT batches, so
  resuming on the wrong data cannot pass (the old fixed-batch form was
  structurally blind to the data position).

Evaluator DP correctness:
- The collective schedule is decided globally (SUM of scalar sums, MAX
  of the per-position length, then ONE stacked count reduce) so a rank
  with an empty or scalar-only shard issues the same collectives as its
  peers — no NCCL desync. Scalar accuracy (DFlash/Domino) is now
  reduced across ranks like everything else. Collectives use the local
  device via specforge.utils.get_local_device (CPU for gloo).
  Accumulation stays on-device (one host sync after the loop), and
  eval/per_position_acc is reported alongside the folded acc-len.
- New 2-process gloo gate: cross-rank scalar reduction + ragged-shard
  schedule symmetry (test_evaluator_aggregation).

Durable, decoupled best tracking:
- CheckpointManager rehydrates best_score/best_step from best_meta.json
  (now also carrying "score") on construction, so a restarted process
  neither rotates away the on-disk best nor lets a worse score
  overwrite it; update_best is split into score()/is_better().
- fit() tracks the best on EVERY eval (when checkpointing is enabled),
  persisting a checkpoint on demand when the best eval lands off the
  save cadence — previously best only fired when eval_interval and
  save_interval coincided on the same step.

Surface + cleanup:
- launch: every builder now forwards resume_from/max_checkpoints, so
  resume and rotation are reachable from the build_* entry points.
- TrainerController: drop the max_checkpoints/checkpoint_manager dual
  config (inject a configured manager; the domain Trainer does);
  remove the dead eval_step + mode plumbing (evaluate goes through
  Evaluator on raw forward_loss; validate_batch still runs inside every
  strategy's forward_loss); drop the did_eval/last_eval_metrics
  threading.
- Trainer._load_resume_state removed: CheckpointManager.read_resume_state
  is the single checkpoint reader; the loaded dict is dropped after the
  weight copy instead of living through the FSDP wrap.
- CheckpointManager._rotate uses shutil.rmtree; symlink creation is
  guarded for filesystems without symlink support (dirs + best_meta.json
  stay the source of truth); save_checkpoint filters draft weights on
  rank0 only.
- test_no_sync_equiv now also counts no_sync deferrals (exactly
  accumulation_steps-1 per optimizer step) — the roadmap's "one
  all-reduce per optimizer step" gate, not just weight equality.
- DESIGN.md updated (per-rank layout, seek, best semantics; fixed the
  inverted no_sync sentence).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…29-training-managers

# Conflicts:
#	specforge/runtime/launch.py
…e position, bound comm device

Self-review of the fix commit surfaced four defects, all fixed:

- fit()'s on-demand best-save gated collective-bearing save_checkpoint
  on a PER-RANK is_better() whose best_score comes from rank-local
  filesystem reads (best_meta.json rehydration) — a divergent view
  (NFS attribute-cache lag, node-local dirs) would deadlock the group.
  rank0's verdict is now broadcast (_rank0_decision) and update_best
  gains force=True so every rank follows it.
- The mid-epoch position was persisted in BATCH units only, so resuming
  with a different batch_size silently mis-seeked the stream. The
  position is now also tracked/persisted in SAMPLES (epoch_samples);
  the domain Trainer converts samples back to this run's batches and
  fails fast when the sample position does not divide by the resumed
  batch size — symmetric with the world-size guard.
- Evaluator._comm_device used get_local_device() (LOCAL_RANK env,
  default 0): a non-torchrun launcher that doesn't export LOCAL_RANK
  would put every rank's reduction on cuda:0 and break the NCCL
  communicator. Now uses the rank's BOUND device
  (torch.cuda.current_device(), set by init_distributed on every path).
- The docstring/DESIGN claim that a ragged shard "cannot desynchronize
  NCCL" over-promised: it holds for the evaluator's OWN reductions
  (shard content), but a collective forward (FSDP all-gathers per
  batch) still requires the same eval-batch COUNT per rank. The docs
  now state the requirement precisely. Also removed the stale
  TrainerCore.eval_step rows from the DESIGN.md/ARCHITECTURE.md
  endpoint tables (the method was removed in the previous commit).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A shape-[1] loss tensor (e.g. a parameter-shaped scalar) failed to
broadcast into the 0-dim float64 accumulator slot; .mean() normalizes
exactly like the trainer's _scalar helper. Caught by the new
misaligned-intervals best-tracking gate on the pod.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…0-layout

Brings the C/D review-fix commits into the consolidated layout. The
old-path shim files stay shims; the fixes are ported verbatim onto the
moved homes (training/controller.py, training/backend.py, launch.py —
verified byte-identical to the up-29 sources modulo E0's import-path
rewrites). Same-path files (domain trainer, checkpoint, evaluator,
data plane, control plane, tests) merged directly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…he + gates

Completes roadmap E1 (docs/roadmap/eval-and-breadth.md) on top of the
E0 layout. The aggregation-before-geometric-sum Evaluator, token-weighted
avg_loss, batch-size-invariant counts, and durable best-checkpoint
tracking landed in Phase D (#637) — this adds the remaining pieces:

- EvalConfig (specforge/eval/evaluator.py): the frozen identity of an
  eval pass. Its identity_fields() are exactly the plan §4.4 cache key
  — eval-data path, target path + revision, tokenizer path, chat
  template, aux hidden-state layer ids, max seqlen. micro_batch_size is
  deliberately NOT identity (metrics are batch-size invariant, so
  re-batching must hit the same cache entry).
- EvalCache (specforge/eval/cache.py): maps an EvalConfig to a directory
  of materialized eval features. get_or_produce runs the producer into
  a staging dir published by atomic rename (the rename IS the
  completeness marker; produce races collapse to one entry), so repeat
  evals over an unchanged tuple never recompute hidden states.
  Deliberately separate from the data/tokenization cache (different
  keys, different lifecycle, per plan §4.4).
- Gates (tests/test_runtime/test_eval_harness.py):
  * every identity field flips the key; micro_batch_size does not;
  * hit/produce-once/invalidate semantics;
  * eval-stream equivalence: identical metrics through a colocated
    LocalFeatureStore and a disagg SharedDirFeatureStore loader;
  * loader-level batch-size invariance (GPU): the same eval set at
    micro-batch sizes 1/4/16 yields the same simulated_acc_len /
    avg_loss through a real eagle3 forward.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…able pointers, retained eval stream

Self-review of the E0/E1 stack confirmed four defects, all fixed:

- Scalar-accuracy strategies (DFlash/Domino) were NOT batch-size
  invariant: the Evaluator averaged per-batch means with equal weight,
  so a ragged last batch skewed eval/avg_acc and simulated_acc_len with
  eval batching — breaking the E1 target and best-checkpoint selection.
  The scalar path now token-weights (sum of correct over sum of tokens,
  the ttt_length=1 case of the sum/count rule); new invariance gate +
  DP-gate expectation updated to the weighted semantics.
- EvalCache.key digested an unescaped "|"-join, a non-injective
  encoding (a delimiter inside one field could realign boundaries into
  another identity's key -> silently serving stale hidden states). Keys
  now digest the JSON-encoded field tuple; boundary-shift pair added to
  the key gate.
- CheckpointManager pointers were absolute-target symlinks, so
  resume_from=<dir>/best broke (or silently followed the old path)
  after relocating output_dir. Targets are now relative (basename).
- The eval-stream equivalence gate blessed consume-once stores: the
  loader releases each feature after cloning, so a second evaluate()
  over the same stream would die mid-training. The shared-dir leg now
  constructs retain_on_release=True and evaluates twice; the mem://
  consume-once semantics are called out where a repeatable stream is
  needed.
- EvalCache.for_config reads EvalConfig.cache_dir (previously a
  dangling field), and the module docstring carries an explicit scope
  note: run-surface wiring of the cache lands with Phase E's config+CLI.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t/latest pointers

Two follow-ups from the E0/E1 adversarial review that belong to this
PR's code:

- Scalar-accuracy strategies (DFlash/Domino) were not batch-size
  invariant: the Evaluator averaged per-batch means with equal weight,
  so a ragged last batch skewed eval/avg_acc and simulated_acc_len with
  eval batching. The scalar path now token-weights (sum of correct over
  sum of tokens — the ttt_length=1 case of the sum/count rule); new
  invariance gate, DP-gate expectation updated to the weighted
  semantics.
- CheckpointManager wrote absolute-target symlinks, so
  resume_from=<dir>/best broke (or silently followed the old absolute
  path) after relocating output_dir; targets are now relative.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

maocheng23 and others added 6 commits July 2, 2026 00:52
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…0-layout

# Conflicts:
#	specforge/runtime/training/trainer.py
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@maocheng23 maocheng23 force-pushed the dataflow-up-30-e0-layout branch 2 times, most recently from 1e3e516 to 3ae106c Compare July 4, 2026 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant