[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates#639
Draft
maocheng23 wants to merge 19 commits into
Draft
[Eval track] E1 — acceptance-length eval harness: EvalConfig + EvalCache + gates#639maocheng23 wants to merge 19 commits into
maocheng23 wants to merge 19 commits into
Conversation
… checkpoint/eval)
Brings the training loop to production parity (plan.md §D / roadmap
domain-refactor.md §D):
- Grad accumulation with no_sync(). TrainerCore now decides the optimizer
boundary BEFORE backward and passes is_boundary to
FSDPTrainingBackend.backward, which wraps the non-boundary micro-steps in
self.module.no_sync() — the FSDP gradient reduction fires once per optimizer
step, not once per micro-step. Single-rank / accumulation_steps=1 is unchanged.
- Full resume. FSDPTrainingBackend.state_dict() now returns the full training
state {model (FSDP FULL_STATE_DICT), optimizer (BF16Optimizer bundles the LR
scheduler), rng (cpu+cuda)}; load_state_dict restores all three. save_checkpoint
persists the export draft weights + optimizer + rng; the domain Trainer gains
resume_from (restore draft weights before the FSDP wrap so the fp32 master is
rebuilt from them, then optimizer+scheduler+rng, then start_step/epoch).
- CheckpointManager (specforge/training/checkpoint.py) — {run_id}-step{step}
layout, keep-last-N rotation that never drops the tracked best, latest/best
symlinks + best_meta.json. Born in its S-home; TrainerController imports it
lazily so the runtime seam stays leaf.
- Evaluator (specforge/eval/evaluator.py) — aggregates per-position accept
counts across the whole eval pass BEFORE the geometric sum, so
simulated_acc_len is batch-size invariant. TrainerController.evaluate delegates
to it; DFlash/Domino scalar accuracy degenerates gracefully.
Gates: test_checkpoint_resume extended (loss-curve continuity across save->resume,
tol for the bf16-master reconstruction), test_no_sync_equiv (2-rank: no_sync path
== per-step reduction, bit-tight), test_evaluator_aggregation (aggregate-before-
geometric-sum, batch-size invariance, CheckpointManager rotation/best/latest).
Backend doubles + the resume test updated for the composite state_dict.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nly (move-only)
Collapse execution code into one home per concern (plan.md §2.3, roadmap E0):
- runtime/training/{trainer,backend,strategy,registry} -> training/{controller,backend,strategies/{base,registry}}
- runtime/inference/{rollout_worker,capture} -> inference/; {sglang,dflash}_adapter -> inference/adapters/{eagle3,dflash}
- modeling/target/{base,factory,eagle3,dflash}_target_model + sglang_backend/ -> inference/target_engine/
- runtime/launch.py -> specforge/launch.py
- modeling/target keeps model defs only (target_head, target_utils, custom_backend)
Zero functional change: pure moves + import-path rewrites; every old module path
keeps a re-export shim for one release. Suite + Phase-B byte gate must pass unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…29-training-managers
…fe Evaluator, durable best tracking Post-review fixes for #637 (adversarial review vs the #630 roadmap): Multi-rank resume correctness: - CheckpointManager.save now writes each rank's optimizer/RNG to its own training_state_rank{r}.pt beside the rank0 shared payload (draft weights + counters + world_size); under FSDP use_orig_params the AdamW moments live on rank-local shard views, so persisting only rank0's copy and restoring it everywhere corrupted the other ranks' moments and collapsed their RNG streams. read_resume_state hands each rank back its own state and fails fast on a world-size mismatch (and on legacy single-file checkpoints at world_size>1). Resume repositions the data stream (plan.md G1 seek()-equivalent): - TrainerController tracks epoch_batch, persists it, and skips the consumed prefix of the interrupted epoch on resume (via the new FeatureDataLoader.seek — no feature materialization — or islice for plain iterables). The domain Trainer threads it for refs-mode runs; the online queue path keeps control-plane skip_ids reconciliation. - test_resume_loss_curve_continuity now trains on DISTINCT batches, so resuming on the wrong data cannot pass (the old fixed-batch form was structurally blind to the data position). Evaluator DP correctness: - The collective schedule is decided globally (SUM of scalar sums, MAX of the per-position length, then ONE stacked count reduce) so a rank with an empty or scalar-only shard issues the same collectives as its peers — no NCCL desync. Scalar accuracy (DFlash/Domino) is now reduced across ranks like everything else. Collectives use the local device via specforge.utils.get_local_device (CPU for gloo). Accumulation stays on-device (one host sync after the loop), and eval/per_position_acc is reported alongside the folded acc-len. - New 2-process gloo gate: cross-rank scalar reduction + ragged-shard schedule symmetry (test_evaluator_aggregation). Durable, decoupled best tracking: - CheckpointManager rehydrates best_score/best_step from best_meta.json (now also carrying "score") on construction, so a restarted process neither rotates away the on-disk best nor lets a worse score overwrite it; update_best is split into score()/is_better(). - fit() tracks the best on EVERY eval (when checkpointing is enabled), persisting a checkpoint on demand when the best eval lands off the save cadence — previously best only fired when eval_interval and save_interval coincided on the same step. Surface + cleanup: - launch: every builder now forwards resume_from/max_checkpoints, so resume and rotation are reachable from the build_* entry points. - TrainerController: drop the max_checkpoints/checkpoint_manager dual config (inject a configured manager; the domain Trainer does); remove the dead eval_step + mode plumbing (evaluate goes through Evaluator on raw forward_loss; validate_batch still runs inside every strategy's forward_loss); drop the did_eval/last_eval_metrics threading. - Trainer._load_resume_state removed: CheckpointManager.read_resume_state is the single checkpoint reader; the loaded dict is dropped after the weight copy instead of living through the FSDP wrap. - CheckpointManager._rotate uses shutil.rmtree; symlink creation is guarded for filesystems without symlink support (dirs + best_meta.json stay the source of truth); save_checkpoint filters draft weights on rank0 only. - test_no_sync_equiv now also counts no_sync deferrals (exactly accumulation_steps-1 per optimizer step) — the roadmap's "one all-reduce per optimizer step" gate, not just weight equality. - DESIGN.md updated (per-rank layout, seek, best semantics; fixed the inverted no_sync sentence). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…29-training-managers # Conflicts: # specforge/runtime/launch.py
…e position, bound comm device Self-review of the fix commit surfaced four defects, all fixed: - fit()'s on-demand best-save gated collective-bearing save_checkpoint on a PER-RANK is_better() whose best_score comes from rank-local filesystem reads (best_meta.json rehydration) — a divergent view (NFS attribute-cache lag, node-local dirs) would deadlock the group. rank0's verdict is now broadcast (_rank0_decision) and update_best gains force=True so every rank follows it. - The mid-epoch position was persisted in BATCH units only, so resuming with a different batch_size silently mis-seeked the stream. The position is now also tracked/persisted in SAMPLES (epoch_samples); the domain Trainer converts samples back to this run's batches and fails fast when the sample position does not divide by the resumed batch size — symmetric with the world-size guard. - Evaluator._comm_device used get_local_device() (LOCAL_RANK env, default 0): a non-torchrun launcher that doesn't export LOCAL_RANK would put every rank's reduction on cuda:0 and break the NCCL communicator. Now uses the rank's BOUND device (torch.cuda.current_device(), set by init_distributed on every path). - The docstring/DESIGN claim that a ragged shard "cannot desynchronize NCCL" over-promised: it holds for the evaluator's OWN reductions (shard content), but a collective forward (FSDP all-gathers per batch) still requires the same eval-batch COUNT per rank. The docs now state the requirement precisely. Also removed the stale TrainerCore.eval_step rows from the DESIGN.md/ARCHITECTURE.md endpoint tables (the method was removed in the previous commit). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A shape-[1] loss tensor (e.g. a parameter-shaped scalar) failed to broadcast into the 0-dim float64 accumulator slot; .mean() normalizes exactly like the trainer's _scalar helper. Caught by the new misaligned-intervals best-tracking gate on the pod. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…0-layout Brings the C/D review-fix commits into the consolidated layout. The old-path shim files stay shims; the fixes are ported verbatim onto the moved homes (training/controller.py, training/backend.py, launch.py — verified byte-identical to the up-29 sources modulo E0's import-path rewrites). Same-path files (domain trainer, checkpoint, evaluator, data plane, control plane, tests) merged directly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…he + gates Completes roadmap E1 (docs/roadmap/eval-and-breadth.md) on top of the E0 layout. The aggregation-before-geometric-sum Evaluator, token-weighted avg_loss, batch-size-invariant counts, and durable best-checkpoint tracking landed in Phase D (#637) — this adds the remaining pieces: - EvalConfig (specforge/eval/evaluator.py): the frozen identity of an eval pass. Its identity_fields() are exactly the plan §4.4 cache key — eval-data path, target path + revision, tokenizer path, chat template, aux hidden-state layer ids, max seqlen. micro_batch_size is deliberately NOT identity (metrics are batch-size invariant, so re-batching must hit the same cache entry). - EvalCache (specforge/eval/cache.py): maps an EvalConfig to a directory of materialized eval features. get_or_produce runs the producer into a staging dir published by atomic rename (the rename IS the completeness marker; produce races collapse to one entry), so repeat evals over an unchanged tuple never recompute hidden states. Deliberately separate from the data/tokenization cache (different keys, different lifecycle, per plan §4.4). - Gates (tests/test_runtime/test_eval_harness.py): * every identity field flips the key; micro_batch_size does not; * hit/produce-once/invalidate semantics; * eval-stream equivalence: identical metrics through a colocated LocalFeatureStore and a disagg SharedDirFeatureStore loader; * loader-level batch-size invariance (GPU): the same eval set at micro-batch sizes 1/4/16 yields the same simulated_acc_len / avg_loss through a real eagle3 forward. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…able pointers, retained eval stream Self-review of the E0/E1 stack confirmed four defects, all fixed: - Scalar-accuracy strategies (DFlash/Domino) were NOT batch-size invariant: the Evaluator averaged per-batch means with equal weight, so a ragged last batch skewed eval/avg_acc and simulated_acc_len with eval batching — breaking the E1 target and best-checkpoint selection. The scalar path now token-weights (sum of correct over sum of tokens, the ttt_length=1 case of the sum/count rule); new invariance gate + DP-gate expectation updated to the weighted semantics. - EvalCache.key digested an unescaped "|"-join, a non-injective encoding (a delimiter inside one field could realign boundaries into another identity's key -> silently serving stale hidden states). Keys now digest the JSON-encoded field tuple; boundary-shift pair added to the key gate. - CheckpointManager pointers were absolute-target symlinks, so resume_from=<dir>/best broke (or silently followed the old path) after relocating output_dir. Targets are now relative (basename). - The eval-stream equivalence gate blessed consume-once stores: the loader releases each feature after cloning, so a second evaluate() over the same stream would die mid-training. The shared-dir leg now constructs retain_on_release=True and evaluates twice; the mem:// consume-once semantics are called out where a repeatable stream is needed. - EvalCache.for_config reads EvalConfig.cache_dir (previously a dangling field), and the module docstring carries an explicit scope note: run-surface wiring of the cache lands with Phase E's config+CLI. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t/latest pointers Two follow-ups from the E0/E1 adversarial review that belong to this PR's code: - Scalar-accuracy strategies (DFlash/Domino) were not batch-size invariant: the Evaluator averaged per-batch means with equal weight, so a ragged last batch skewed eval/avg_acc and simulated_acc_len with eval batching. The scalar path now token-weights (sum of correct over sum of tokens — the ttt_length=1 case of the sum/count rule); new invariance gate, DP-gate expectation updated to the weighted semantics. - CheckpointManager wrote absolute-target symlinks, so resume_from=<dir>/best broke (or silently followed the old absolute path) after relocating output_dir; targets are now relative. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Contributor
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
…29-training-managers
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…0-layout # Conflicts: # specforge/runtime/training/trainer.py
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1e3e516 to
3ae106c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
E1 — Acceptance-length eval harness:
EvalConfig+EvalCache+ gatesImplements E1 of the eval-and-breadth track (#630,
docs/roadmap/eval-and-breadth.md). Stacked on #638 (E0 layout) so the eval code lives in its final home (specforge/eval/).Landed split: the Evaluator itself — aggregate-before-geometric-sum, token-weighted
avg_lossand token-weighted scalar accuracy (thettt_length=1degenerate case), DP-reduced metrics, durable best-checkpoint tracking — landed with Phase D (#637). This PR adds the remaining E1 pieces:EvalConfig(specforge/eval/evaluator.py) — the frozen identity of an eval pass. Itsidentity_fields()are exactly the plan §4.4 cache key: eval-data path, target path + revision, tokenizer path, chat template, aux hidden-state layer ids, max seqlen.micro_batch_sizeis deliberately not identity — metrics are batch-size invariant, so re-batching the same eval set hits the same cache entry.EvalCache(specforge/eval/cache.py) — maps anEvalConfigto a directory of materialized eval features.get_or_produceruns the producer into a staging dir published by atomic rename (the rename is the completeness marker; produce races collapse to one entry). Keys digest the JSON-encoded field tuple, so the encoding is injective (a|inside one field can never realign boundaries into another identity's key). Deliberately separate from the tokenization cache (different keys, different lifecycle, per plan §4.4).hidden_states_path, so the cache is consumed programmatically (EvalCache.for_config+get_or_produce) until Phase E's typed config + CLI own eval-data production — noted in the module docstring and the roadmap.Gates (
tests/test_runtime/test_eval_harness.py)micro_batch_sizedoes not; boundary-shift (|-bearing) identities do not collide.LocalFeatureStoreand a disaggSharedDirFeatureStore— with the shared leg constructedretain_on_release=Trueand evaluated twice, pinning that a repeat-eval stream survives (mem://put() is consume-once by design and called out as such).simulated_acc_len/avg_loss/avg_acc.Validation
Full
tests/test_runtimeon 8×H200: 244 OK (2 skip, 1 xfail) = the E0 baseline 240 + 4 new gates. An adversarial review pass over the E1 diff surfaced 4 confirmed defects (scalar-path batch-size variance, non-injective cache key, absolute-symlink pointers, a consume-once store blessed by the equivalence gate) — all fixed here or in #637 where they belonged.🤖 Generated with Claude Code