forked from NVIDIA/Megatron-LM
[Bug] Fix crash when logging optimizer state to tensorboard #417
Merged
Conversation
loadams approved these changes on Aug 27, 2024.
loadams pushed a commit that referenced this pull request on Feb 7, 2025 (Signed-off-by: Logan Adams <loadams@microsoft.com>).
YJHMITWEB pushed two commits to YJHMITWEB/Megatron-DeepSpeed that referenced this pull request on Aug 9, 2025 (both Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>).
tjruwase pushed a commit that referenced this pull request on Aug 14, 2025:
…nabled (#479)

* pass batch_dim_idx to DeepSpeed sequence-parallel distributed attention to support batch sizes larger than 1
* add fused_rms_norm support on XPU device (#431)
* [LLaMa] add support for converting checkpoints from mds to hf (#432): add support for converting checkpoints from hf to mds; fix PP issue; update
* add device check when importing ipex (#436)
* fix TFLOPs calculation (#371): fix the TFLOPs calculation when GQA is used; we observe correct TFLOPs after this fix. When GQA is not used, the large difference in TFLOPs is resolved with selective recompute. Some other minor differences will also be observed because logits MACs are now included. Also add copyrights.
* fix NaN issue when running Megatron-DeepSpeed (#434)
* enable empty cache on XPU device (#438)
* [wandb] disable wandb more gracefully (#422) (Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>)
* [Bug] fix crash when logging optimizer state to tb (#417)
* add FPDT support; add Ulysses rotary position embedding support (four commits)
* remove unnecessary files
* set the warmup length to the FPDT chunk size if enabled
* enable Sequence Parallelism (#429)
* grad_wei cannot be NoneType when running with DeepSpeed, because ZeRO stage 3 will divide the gradient (#428)
* fix init issue for rms_norm in sequence_parallel (#448)
* enable profiler for specific ranks (#451)
* fix init issue that silently ignored the DeepSpeed config (#452)
* fix MoE TFLOPs (#445)
* [tool] GQA convert support (#454): GQA convert support; fix readme
* fix import error in `deepspeed_to_megatron.py` (#455): previously, `deepspeed_to_megatron.py` would raise an import error due to a relative import. This commit fixes the issue by changing the relative import to an absolute import, as in `deepspeed_to_transformers.py`.
* update references to the new GitHub org (deepspeedai) (#462) (Signed-off-by: Logan Adams <loadams@microsoft.com>)
* add sequence_parallel to the layernorm init so that 3D parallelism runs successfully with DeepSpeed (#468) (Signed-off-by: yisheng <yi.sheng@intel.com>)
* fix bug when FPDT is disabled but original Ulysses is used

---------

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: jinghan yao <yjhmitweb@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Co-authored-by: billishyahao <yahao.he@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <jvarma@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: ranzhejiang <zhejiang.ran@intel.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: hotsuyuki <hotsuyuki.kawanishi@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
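The `deepspeed_to_megatron.py` fix (#455) in the list above is a one-line import change. As a hedged illustration of the failure mode it describes (the module name `deepspeed_checkpoint` is assumed here for the sketch, not confirmed by this page):

```python
# Before: a relative import. Running the script directly
# (`python deepspeed_to_megatron.py`) fails with
# "ImportError: attempted relative import with no known parent package",
# because a script executed directly is not treated as part of a package.
# from .deepspeed_checkpoint import DeepSpeedCheckpoint

# After: an absolute import, matching the style already used in
# deepspeed_to_transformers.py, which resolves the module from the
# script's own directory when run directly.
from deepspeed_checkpoint import DeepSpeedCheckpoint  # module name illustrative
```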
This patch fixes a crash that occurs when logging of optimizer states to TensorBoard is enabled via the `--log-optimizer-states-to-tensorboard` flag.
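The diff itself is not reproduced on this page, so the following is only a minimal sketch of the general shape of such a fix, assuming the crash comes from optimizer state entries that are absent or are not floating-point tensors (for example, Adam's integer `step` counter). The function name and arguments are illustrative, not the upstream Megatron-DeepSpeed API:

```python
import torch
from torch.utils.tensorboard import SummaryWriter


def log_optimizer_states(writer: SummaryWriter, optimizer, iteration: int):
    """Log norms of per-parameter optimizer state to TensorBoard.

    Hedged sketch, not the verbatim patch from this PR: it guards against
    state entries that are missing or not floating-point tensors, either
    of which can crash a naive `value.norm()` call.
    """
    for group_idx, group in enumerate(optimizer.param_groups):
        for param in group["params"]:
            # State may be empty before the first optimizer step.
            state = optimizer.state.get(param, {})
            for key, value in state.items():
                # Skip anything that is not a floating-point tensor,
                # e.g. Adam's integer `step` counter.
                if not torch.is_tensor(value) or not value.is_floating_point():
                    continue
                writer.add_scalar(
                    f"optimizer_state/group{group_idx}/{key}_norm",
                    value.norm().item(),
                    iteration,
                )
```

A caller would construct a `SummaryWriter` once and invoke this at each logging interval; the real logging path in Megatron-DeepSpeed's training loop differs in detail.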