Conversation

@fegin fegin commented Nov 17, 2025

Stack from ghstack (oldest at bottom):

This PR provides a skeleton for fully DTensor-based training

This PR introduces an initial prototype and skeleton for fully DTensor-based training. The current codebase builds upon SimpleFSDP, but we anticipate developing our own parameterization to better serve our specific use case. There are several reasons why SimpleFSDP's parameterization is insufficient. For instance, the current `parallelize_buffers()` implementation in this PR will not function correctly when additional parallelization strategies are applied. Despite these limitations, this PR provides a starting point for experimenting with a full DTensor trainer.
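
As a rough illustration of what "fully DTensor-based" means here, the sketch below converts plain parameters into DTensors on a 2D (replicate, shard) HSDP-style mesh, similar in spirit to SimpleFSDP. The helper name `parallelize_params_sketch` and the mesh layout are assumptions for illustration only, not the parameterization introduced in this PR.

```python
# Illustrative sketch only (not this PR's code): turn plain parameters into
# DTensors on a 2D (replicate, shard) mesh, similar in spirit to SimpleFSDP.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate, Shard, distribute_tensor


def parallelize_params_sketch(module: nn.Module, mesh) -> None:
    # Replicate over the "replicate" dim (HSDP) and shard dim 0 over "shard" (FSDP).
    for submodule in module.modules():
        for name, param in list(submodule.named_parameters(recurse=False)):
            dparam = distribute_tensor(param.detach(), mesh, [Replicate(), Shard(0)])
            submodule.register_parameter(name, nn.Parameter(dparam))


# Hypothetical usage with data_parallel_replicate_degree=2 on 4 ranks:
# mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("replicate", "shard"))
# parallelize_params_sketch(model, mesh)
```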

Accuracy verification (HSDP, SimpleFSDP vs. FSDP2):

```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none" --parallelism.data_parallel_replicate_degree=2' \
--test-train-file=torchtitan.experiments.full_dtensor.train \
--steps=10 --assert-equal --no-seed-checkpoint
```
```
[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK
```

Note that `--no-seed-checkpoint` is used because we observed an accuracy mismatch when a seed checkpoint was used.

fegin added a commit that referenced this pull request Nov 17, 2025
ghstack-source-id: 67cd703
Pull-Request: #2049
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 17, 2025
@fegin fegin marked this pull request as draft November 17, 2025 22:25
fegin added a commit that referenced this pull request Nov 17, 2025
ghstack-source-id: 6cf9b5e
Pull-Request: #2049
fegin added a commit that referenced this pull request Nov 18, 2025
ghstack-source-id: 0d3e3f0
Pull-Request: #2049
fegin added a commit that referenced this pull request Nov 19, 2025
ghstack-source-id: c177628
Pull-Request: #2049
fegin added a commit that referenced this pull request Nov 19, 2025
ghstack-source-id: 955a260
Pull-Request: #2049
fegin added a commit that referenced this pull request Nov 19, 2025
ghstack-source-id: 92d8e21
Pull-Request: #2049
fegin added a commit that referenced this pull request Nov 19, 2025
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2049
* __->__ #2029


## Summary
This PR adds `scripts/loss_compare.py` for comparing training losses
between different git commits and/or training configurations.

## Key Features

- Commit Comparison: Compare losses between two different git commits with deterministic training
- Configuration Comparison: Compare different training configurations on the same commit
- Reproducibility: Automatically enables deterministic mode and seed checkpointing for reproducible comparisons
- Real-time Output: Streams training output to both console and log files during execution
- Statistical Analysis: Generates step-by-step loss comparisons and summary statistics
- CI Testing: Includes the `--assert-equal` flag for automated testing to verify identical losses (see the sketch below)
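
For reference, a minimal sketch of the `--assert-equal` flow, assuming per-step losses can be extracted from the logs with a simple pattern; the regex and helper bodies below are illustrative guesses, not the script's actual internals.

```python
# Hypothetical sketch of the --assert-equal flow (not loss_compare.py's code):
# pull per-step losses out of the two logs and assert they match exactly.
import re
import unittest

# Assumed log format; the real pattern used by the script may differ.
STEP_LOSS_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")


def extract_losses(log_path: str) -> dict[int, float]:
    losses: dict[int, float] = {}
    with open(log_path) as f:
        for line in f:
            m = STEP_LOSS_RE.search(line)
            if m:
                losses[int(m.group(1))] = float(m.group(2))
    return losses


def assert_losses_equal(baseline_log: str, test_log: str) -> None:
    baseline, test = extract_losses(baseline_log), extract_losses(test_log)

    class LossEqualityTest(unittest.TestCase):
        def test_losses_equal(self):
            self.assertEqual(sorted(baseline), sorted(test))
            for step, loss in baseline.items():
                self.assertEqual(loss, test[step], f"loss mismatch at step {step}")

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LossEqualityTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```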

## Usage Examples

#### Compare two commits
```
python3 ./scripts/loss_compare.py main my_branch
```
#### Compare two commits with custom configuration 
```
python3 ./scripts/loss_compare.py main my_branch \
--baseline-config="./custom.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2"
```

#### Compare different parallelization strategies on same commit
```
python3 ./scripts/loss_compare.py . . \
--baseline-config="./llama3_8b.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2" \
--test-options="--parallelism.tensor_parallel_degree=1"
```

#### Assert equality for CI testing
```
python3 ./scripts/loss_compare.py main my_branch --assert-equal
```


## Real Use Cases
Compare full-DTensor SimpleFSDP with FSDP2:
```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none"' \
--test-train-file='torchtitan.experiments.full_dtensor.train' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none"' \
--assert-equal --no-seed-checkpoint
```

```
[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok
```