[DataFlow runtime] DFlash end-to-end on the composable launch (offline + online) by maocheng23 · Pull Request #628 · sgl-project/SpecForge

maocheng23 · 2026-06-30T19:07:28Z

What

DFlash trains end-to-end (offline + online) through the composable launch from the parent PR — via a StrategySpec entry + a DFlashAdapter, with ZERO launch.py changes.
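As a rough illustration of why a spec entry can carry a new strategy without touching the launcher, here is a minimal sketch of a name-keyed registry. The `StrategySpec` fields, `register`, and `launch` shown here are hypothetical stand-ins, not the actual SpecForge definitions; the builders return plain strings instead of real readers and adapters.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class StrategySpec:
    """Hypothetical spec: everything a generic launcher needs for one strategy."""

    name: str
    build_offline_reader: Callable[[], str]
    build_online_adapter: Optional[Callable[[], str]] = None
    supports_online: bool = False


REGISTRY: Dict[str, StrategySpec] = {}


def register(spec: StrategySpec) -> None:
    """Add a spec to the registry, refusing duplicate names."""
    if spec.name in REGISTRY:
        raise ValueError(f"strategy {spec.name!r} is already registered")
    REGISTRY[spec.name] = spec


def launch(strategy: str, mode: str) -> str:
    """Generic launcher: looks the spec up and never branches on the strategy name."""
    if mode not in ("offline", "online"):
        raise ValueError(f"unknown mode {mode!r}")
    spec = REGISTRY[strategy]
    if mode == "online":
        if not spec.supports_online or spec.build_online_adapter is None:
            raise ValueError(f"strategy {strategy!r} has no online path")
        return spec.build_online_adapter()
    return spec.build_offline_reader()


# Adding a strategy is a registry entry only; launch() itself is unchanged.
register(
    StrategySpec(
        name="dflash",
        build_offline_reader=lambda: "OfflineManifestReader",
        build_online_adapter=lambda: "DFlashAdapter",
        supports_online=True,
    )
)
```

Under these assumptions, `launch("dflash", "online")` resolves to the adapter builder and `launch("dflash", "offline")` to the reader builder, with no strategy-specific code in the launcher.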

Changes

registry.py: dflash spec — offline reader (OfflineManifestReader with dflash feature_keys, no aux/target swap), per-sample transform, padding collate; online via DFlashAdapter; supports_online=True.
specforge/runtime/inference/dflash_adapter.py (new): wraps generate_dflash_data, emits {input_ids, hidden_states, loss_mask}. verify_capture self-skips the eagle3 aux/target checks (different feature names + __aux_layer_ids__=None).
tests/_fixtures.py: write_offline_files_dflash + build_dflash (tiny Qwen3 target → DFlash draft + TargetEmbeddingsAndHead → OnlineDFlashModel).
tests/test_dflash_launch.py + test_dflash_online_launch.py (new, GPU): offline and online dflash train end-to-end through FSDP.
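The padding collate mentioned in the registry entry above can be pictured with a minimal sketch. It uses plain Python lists rather than tensors so it runs without any dependencies; the function name `pad_collate`, the `pad_id` parameter, and the choice to pad `loss_mask` with zeros are assumptions for illustration, not the code in this PR.

```python
from typing import Any, Dict, List


def pad_collate(samples: List[Dict[str, Any]], pad_id: int = 0) -> Dict[str, list]:
    """Right-pad variable-length samples to the longest sequence in the batch.

    Each sample is assumed to hold the three dflash features:
      input_ids:     list[int], length T
      hidden_states: list[list[float]], shape T x H
      loss_mask:     list[int], length T
    """
    if not samples:
        raise ValueError("cannot collate an empty batch")
    if any(len(s["input_ids"]) == 0 for s in samples):
        raise ValueError("every sample must contain at least one token")

    max_len = max(len(s["input_ids"]) for s in samples)
    hidden_dim = len(samples[0]["hidden_states"][0])

    batch: Dict[str, list] = {"input_ids": [], "hidden_states": [], "loss_mask": []}
    for s in samples:
        pad = max_len - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_id] * pad)
        # Padded positions get zero vectors so every row has shape max_len x H.
        batch["hidden_states"].append(
            s["hidden_states"] + [[0.0] * hidden_dim for _ in range(pad)]
        )
        # Padded positions are masked out of the loss.
        batch["loss_mask"].append(s["loss_mask"] + [0] * pad)
    return batch
```

Zeroing the mask on padded positions keeps them from contributing to the training loss, which is the usual reason a collate pads the mask alongside the ids.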

Note

DFlash is online-only in production today (no offline dumper exists — prepare_hidden_states.py is eagle3-only), so the offline path is exercised with synthetic fixtures while online is its real workflow.

Testing

Covered by the test suite run at the stack tip (197 tests OK, on sci-h200 / H200).

Stacked on the composable-launch PR. Part 2/3.

🤖 Generated with Claude Code

… W3′ naming

Review fixes (verified against the files):
- Status (confirmed): stop calling the in-review composable-launch stack (#627/#628/#629) "landed"/"DONE"/"done". Split the genuinely-merged spine from the in-review stack in §1; one consistent "in review" label in §1/Phase A/success table and across the roadmap (README, Phase A). Leave the spine's "landed" wording (it is merged).
- Module placement (confirmed): Evaluator/EvalCache are top-level domain managers (specforge/eval/), not specforge/runtime/eval/; fix the eval-and-breadth.md outlier to match plan.md §2.3 and domain-refactor.md.
- W3′ naming (confirmed): SGLangServerEngine is ONE engine with two feature transports (capture-into-FeatureStore for W3/O1.3, inline-HTTP for the light W3′); disambiguate in §2.2, the workload table and §G2 rather than overloading one name.
- O1.3 spike (reviewer's premise refuted: it is already an explicit 🔴 gate): added the valid narrow point instead. The spike scopes only the sglang_server slice of Phase B; the de-EAGLE3 extraction and domain Trainer carry no engine risk.

Additional contradictions found by a completeness sweep and fixed:
- StrategySpec registry: plan.md said it "stays in runtime/training unchanged" but §6 + Phase E move it; clarify that the per-step strategy seam stays and the registry converges into training/strategies/.
- TargetEngine source: extracted from modeling/target/*TargetModel (adapters wrap it), not "absorbs runtime/inference adapters".
- Draft package: models/drafts is the target layout; note today's modeling/draft/ + real filenames.
- Dependency graph: align domain-refactor (E depended on {C,D}) with README (D→E, C parallel).
- Drop the up-19/up-20 branch tags that only appeared in the online doc.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 · 2026-06-30T23:27:06Z

Code review

No high-confidence issues found. Checked for bugs in the DFlash adapter, offline transform/collate path, strategy registration, and online rollout-to-train wiring.

…e + online)

DFlash now trains through the runtime via a StrategySpec + a DFlashAdapter, with ZERO launch.py changes (the spec seam from the previous commit carries it).

- registry.py: dflash spec. Offline reader (OfflineManifestReader with dflash feature_keys, no aux/target swap), per-sample transform, padding collate; online via DFlashAdapter; supports_online=True.
- inference/dflash_adapter.py (new): wraps generate_dflash_data, emits {input_ids, hidden_states, loss_mask}; verify_capture self-skips the eagle3 aux/target checks (different feature names + __aux_layer_ids__=None).
- tests/_fixtures.py: write_offline_files_dflash + build_dflash (tiny Qwen3 target -> DFlash draft + TargetEmbeddingsAndHead -> OnlineDFlashModel).
- tests/test_dflash_launch.py + test_dflash_online_launch.py (new, GPU): offline and online dflash train end-to-end through FSDP.
- tests/test_strategy_registry.py: dflash-fully-wired assertions.

DFlash is online-only in production (no offline dumper exists yet; prepare_hidden_states.py is eagle3-only), so the offline path is exercised with synthetic fixtures while online is its real workflow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maocheng23 · 2026-07-01T00:30:49Z

Addressed review feedback (self-review pass).

dflash_adapter.py: dropped the redundant "__aux_layer_ids__": None emit (the RolloutWorker reads it via feats.pop(..., None), so an absent key behaves identically); made the loss_mask default lazy (the all-ones mask is allocated only when a task omits it, not eagerly per sample).
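Both changes can be illustrated with a small sketch. The helper names `read_aux_layer_ids` and `emit_sample` and the dict-based task layout are hypothetical; only the `feats.pop(..., None)` read and the lazy all-ones default come from the comment above.

```python
from typing import Any, Dict, List, Optional


def read_aux_layer_ids(feats: Dict[str, Any]) -> Optional[List[int]]:
    """Consumer side: pop with a default, so a missing key and an explicit None look the same."""
    return feats.pop("__aux_layer_ids__", None)


def emit_sample(task: Dict[str, Any]) -> Dict[str, Any]:
    """Producer side: build the all-ones mask only when the task did not supply one.

    The eager form, task.get("loss_mask", [1] * len(input_ids)), would evaluate the
    default expression (and allocate the list) for every sample, even when unused.
    """
    input_ids = task["input_ids"]
    loss_mask = task.get("loss_mask")
    if loss_mask is None:
        loss_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "hidden_states": task["hidden_states"],
        "loss_mask": loss_mask,
    }
```

Because `dict.pop(key, default)` returns the default when the key is absent, emitting `"__aux_layer_ids__": None` and omitting the key give the consumer the same value.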

Deferred: extracting the length-grouped batching shared with SGLangAdapter into a base helper — it touches the pre-existing SGLangAdapter / validated eagle3 online path, so noted for a follow-up.
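For readers unfamiliar with the idea, one common form of length-grouped batching is sketched below. This is an assumption about what such a shared helper could look like, not the logic currently in DFlashAdapter or SGLangAdapter; the name `group_by_length` and the exact-length bucketing rule are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_length(
    samples: List[Dict[str, Any]], batch_size: int
) -> List[List[Dict[str, Any]]]:
    """Bucket samples by exact sequence length, then cut each bucket into batches.

    Every batch then contains sequences of one length, so no padding is needed
    inside a batch. Batches are returned in order of increasing length.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    buckets: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for sample in samples:
        buckets[len(sample["input_ids"])].append(sample)

    batches: List[List[Dict[str, Any]]] = []
    for length in sorted(buckets):
        bucket = buckets[length]
        for start in range(0, len(bucket), batch_size):
            batches.append(bucket[start : start + batch_size])
    return batches
```

A helper of this shape has no dependency on either adapter, which is what would make it a candidate for a shared base class in a follow-up.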

Validated: full tests/test_runtime = 200 OK (2 skipped, 1 xfail), zero failures, on a 2-node H200 pod. Lint clean (black 24.10.0 / isort 5.13.2 / autoflake).

maocheng23 mentioned this pull request Jun 30, 2026

docs: reconciled SpecForge architecture plan (DataFlow runtime + domain layer) #630

Merged

maocheng23 marked this pull request as ready for review June 30, 2026 23:26

maocheng23 requested a review from FrankLeeeee as a code owner June 30, 2026 23:26

maocheng23 force-pushed the dataflow-up-21-composable-launch branch from 457b20a to 8faf111 Compare July 1, 2026 00:29

maocheng23 requested review from FlamingoPg, shuaills and sleepcoo as code owners July 1, 2026 00:29

maocheng23 force-pushed the dataflow-up-22-dflash branch from 6945bfe to a6b8b7d Compare July 1, 2026 00:29

Base automatically changed from dataflow-up-21-composable-launch to dataflow-up-16-zerocopy July 2, 2026 05:41

jiapingW approved these changes Jul 2, 2026

jiapingW merged commit 14a18ba into dataflow-up-16-zerocopy Jul 2, 2026
1 check passed

jiapingW deleted the dataflow-up-22-dflash branch July 2, 2026 05:47

maocheng23 mentioned this pull request Jul 4, 2026

Merge DataFlow runtime branch into main #648

Open
