[WIP] zero bubble #546

Draft
wants to merge 7 commits into
base: gh/H-Huang/13/base

H-Huang (Member) commented Aug 20, 2024

Stack from ghstack (oldest at bottom):

To run zb test:
`python test_runner.py ./test_out --test pp_zb`

internal mast run:
`torchx run mast.py:train --additional_folders /home/howardhuang/local/torchtitan --twtask_bootstrap_script run_torchtitan.sh --h "grandteton" --nodes 8 train_configs/debug_model_3d_mast.toml`

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 20, 2024
ghstack-source-id: fd042c2482ffeac2a9b9bd53e29e803858875cca
Pull Request resolved: #546
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 20, 2024
@H-Huang H-Huang marked this pull request as draft August 20, 2024 19:38
To run zb test: 
`python test_runner.py ./test_out --test pp_zb`

TODO:
- Zero bubble fails on multiple hosts when activation checkpointing (AC) is turned off:
```
  File "/packages/torchtitan_additional_packages/torchtitan/torchtitan/parallelisms/pipelining/stage.py", line 668, in backward_weight_one_chunk
    dweights = self.dw_runner.pop(bwd_chunk_id)(
  File "/packages/torchtitan_additional_packages/torchtitan/torchtitan/parallelisms/pipelining/_backward.py", line 251, in stage_backward_weight
    dweight = all_dweights[grad_acc]
KeyError: <AccumulateGrad object at 0x7ff490125b10>
```
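
For readers unfamiliar with the pattern in the traceback, here is a minimal pure-Python sketch of a split backward pass. The input gradient is computed right away and the weight gradient is stored as a closure to run later. Only the names `dw_runner` and `backward_weight_one_chunk` come from the traceback. `ToyLinearStage`, `forward_one_chunk`, `backward_input_one_chunk`, and the hand-written math are assumptions for illustration. This is not the code in `stage.py` or `_backward.py`, and it does not reproduce the `KeyError`.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]
Vector = List[float]


class ToyLinearStage:
    """Hypothetical stand-in for a pipeline stage computing y = W @ x."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight
        self.weight_grad: Matrix = [[0.0] * len(row) for row in weight]
        self.saved_inputs: Dict[int, Vector] = {}
        # Deferred weight-gradient closures, keyed by microbatch (chunk) id.
        self.dw_runner: Dict[int, Callable[[], Matrix]] = {}

    def forward_one_chunk(self, chunk_id: int, x: Vector) -> Vector:
        self.saved_inputs[chunk_id] = x
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]

    def backward_input_one_chunk(self, chunk_id: int, grad_out: Vector) -> Vector:
        """Compute dx = W^T @ dy now; defer dW = outer(dy, x) for later."""
        x = self.saved_inputs.pop(chunk_id)
        n_out, n_in = len(self.weight), len(self.weight[0])
        grad_in = [
            sum(self.weight[i][j] * grad_out[i] for i in range(n_out))
            for j in range(n_in)
        ]
        self.dw_runner[chunk_id] = lambda: [[g * xi for xi in x] for g in grad_out]
        return grad_in

    def backward_weight_one_chunk(self, chunk_id: int) -> None:
        """Run the deferred closure and accumulate into weight_grad."""
        if chunk_id not in self.dw_runner:
            raise RuntimeError(f"no deferred weight backward for chunk {chunk_id}")
        dweight = self.dw_runner.pop(chunk_id)()
        for i, row in enumerate(dweight):
            for j, value in enumerate(row):
                self.weight_grad[i][j] += value
```

The schedule can run `backward_weight_one_chunk` in a slot that would otherwise be idle, because nothing downstream waits on the weight gradient. The explicit membership check in the sketch turns a missing entry into a descriptive error instead of a bare `KeyError`.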
  


[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Aug 28, 2024
ghstack-source-id: c6b2f05d19e207232306320decf506316c066347
Pull Request resolved: #546
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 5, 2024
ghstack-source-id: bdcb4b3eba1b3ab35bedb745a1f334101c259ee7
Pull Request resolved: #546
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 5, 2024
ghstack-source-id: d813ea1be1375567b1eda4f418bfae9d6fbdd84c
Pull Request resolved: #546
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 26, 2024
ghstack-source-id: 954580ed2484b02b2d0a0e205de7239b2cb8e3df
Pull Request resolved: #546
To run zb test: 
`python test_runner.py ./test_out --test pp_zb`

internal mast run:
`torchx run mast.py:train --additional_folders /home/howardhuang/local/torchtitan --twtask_bootstrap_script run_torchtitan.sh --h "grandteton" --nodes 8 train_configs/debug_model_3d_mast.toml`
  


[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 26, 2024
ghstack-source-id: ec662ad4cdf9b4e12d9b334e09d2658926b25ace
Pull Request resolved: #546
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Sep 30, 2024
ghstack-source-id: 9c557a4a76a019f3727985e777a1deb66c3ca941
Pull Request resolved: #546