[Full DTensor] Initial skeleton for full_dtensor mode #2049

fegin · 2025-11-17T22:19:02Z

Stack from ghstack (oldest at bottom):

This PR introduces an initial prototype and skeleton for fully DTensor-based training. The current code builds on SimpleFSDP, but we expect to develop our own parameterization to better serve this use case, because SimpleFSDP's parameterization is insufficient in several ways. For instance, the current `parallelize_buffers()` implementation in this PR will not work correctly once additional parallelization strategies are applied. Despite these limitations, this PR provides a starting point for experimenting with a full DTensor trainer.

Accuracy verification:
HSDP: SimpleFSDP vs. FSDP2

python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-train-file=torchtitan.experiments.full_dtensor.train  \
--steps=10 --assert-equal --no-seed-checkpoint

[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal
(__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal)
... ok

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Note that `--no-seed-checkpoint` is used because we observed an accuracy mismatch when the seed checkpoint was enabled.
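For readers unfamiliar with what the equality assertion checks, here is a minimal sketch of the idea: extract a per-step loss from each training log and require the two runs to match exactly. This is not the code in `scripts/loss_compare.py`; the function names `extract_losses` and `assert_losses_equal` as written here, and the assumed log line shape (`step: N ... loss: X`), are illustrative assumptions.

```python
import re

# Assumed log line shape, e.g. "step:  3  loss:  7.9021".
# The real training log format may differ; adjust the pattern accordingly.
STEP_LOSS = re.compile(r"step:\s*(\d+).*?loss:\s*(-?\d+(?:\.\d+)?)")


def extract_losses(log_text: str) -> dict:
    """Map each step number to the loss found on its log line."""
    losses = {}
    for line in log_text.splitlines():
        match = STEP_LOSS.search(line)
        if match:
            losses[int(match.group(1))] = float(match.group(2))
    return losses


def assert_losses_equal(baseline: dict, test: dict) -> None:
    """Raise AssertionError unless both runs cover the same steps with identical losses."""
    if baseline.keys() != test.keys():
        raise AssertionError(
            f"step sets differ: {sorted(baseline)} vs {sorted(test)}"
        )
    for step in sorted(baseline):
        # Exact comparison on purpose: deterministic runs should match bit for bit
        # at the precision the log prints.
        if baseline[step] != test[step]:
            raise AssertionError(
                f"step {step}: baseline={baseline[step]} test={test[step]}"
            )
```

Exact equality (rather than a tolerance) is the point of the check: with deterministic training, two numerically equivalent implementations are expected to log the same loss at every step.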

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):

* #2049
* __->__ #2029

## Summary

This PR adds `scripts/loss_compare.py` for comparing training losses between different git commits and/or training configurations.

## Key Features

- Commit Comparison: Compare losses between two different git commits with deterministic training
- Configuration Comparison: Compare different training configurations on the same commit
- Reproducibility: Automatically enables deterministic mode and seed checkpointing for reproducible comparisons
- Real-time Output: Streams training output to both console and log files during execution
- Statistical Analysis: Generates step-by-step loss comparisons and summary statistics
- CI Testing: Includes the `--assert-equal` flag for automated testing to verify identical losses

## Usage Examples

#### Compare two commits

```
python3 ./scripts/loss_compare.py main my_branch
```

#### Compare two commits with custom configuration

```
python3 ./scripts/loss_compare.py main my_branch \
    --baseline-config="./custom.toml" --baseline-options="--parallelism.tensor_parallel_degree=2"
```

#### Compare different parallelization strategies on same commit

```
python3 ./scripts/loss_compare.py . . \
    --baseline-config="./llama3_8b.toml" --baseline-options="--parallelism.tensor_parallel_degree=2" \
    --test-options="--parallelism.tensor_parallel_degree=1"
```

#### Assert equality for CI testing

```
python3 ./scripts/loss_compare.py main my_branch --assert-equal
```

## Real Use Cases

Compare full DTensor SimpleFSDP with FSDP2:

```
python3 scripts/loss_compare.py . . \
    --baseline-options='--activation_checkpoint.mode="none"' \
    --test-train-file='torchtitan.experiments.full_dtensor.train' \
    --test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none"' \
    --assert-equal --no-seed-checkpoint

[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok
```
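The "summary statistics" feature can be pictured with a small sketch like the one below. It is an assumption about what such a summary might contain (number of shared steps, maximum and mean absolute difference), not the script's actual output; `summarize_loss_diff` and its input shape (step number mapped to loss) are hypothetical.

```python
from statistics import mean


def summarize_loss_diff(baseline: dict, test: dict) -> dict:
    """Summarize per-step absolute loss differences over the steps both runs share.

    Both arguments are assumed to map step number -> loss (hypothetical shape).
    """
    shared_steps = sorted(baseline.keys() & test.keys())
    if not shared_steps:
        raise ValueError("the two runs have no steps in common")
    diffs = [abs(baseline[step] - test[step]) for step in shared_steps]
    return {
        "steps": len(shared_steps),
        "max_abs_diff": max(diffs),
        "mean_abs_diff": mean(diffs),
    }
```

For two runs that are expected to be identical, both difference values should be exactly zero; any nonzero maximum points at the first step worth inspecting in the step-by-step comparison.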

Update d5e8a97

fegin requested review from tianyu-l, wconstab and wwwjn as code owners November 17, 2025 22:19

fegin mentioned this pull request Nov 17, 2025

Add a loss comparison script #2029

Merged

meta-cla bot added the CLA Signed label Nov 17, 2025

fegin marked this pull request as draft November 17, 2025 22:25

Update 155d733

Update 84a4c65

Update ac48c4f

Update 0aad378

Update d3e001e
