
Conversation

@billishyahao

This patch fixes a crash that occurs after enabling the logging of optimizer states to TensorBoard via the --log-optimizer-states-to-tensorboard flag.

[rank3]: Traceback (most recent call last):
[rank3]:   File "/yahao/Megatron-DeepSpeed/pretrain_gpt.py", line 356, in <module>
[rank3]:     pretrain(train_valid_test_datasets_provider,
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 222, in pretrain
[rank3]:     iteration = train(forward_step_func,
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 1264, in train
[rank3]:     report_memory_flag = training_log(loss_dict, total_loss_dict,
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 999, in training_log
[rank3]:     opt_stats[0] += (torch.norm(optimizer.state[param]['exp_avg_sq']).item())**2
[rank3]: AttributeError: 'BF16_Optimizer' object has no attribute 'state'
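
For reference, a minimal sketch of the kind of guard that resolves this crash; the helper name is hypothetical, and the actual change lives in megatron/training.py. It assumes the wrapper optimizer exposes the underlying torch optimizer as .optimizer, which DeepSpeed's BF16_Optimizer does:

def _get_optimizer_state(optimizer):
    # Hypothetical helper: DeepSpeed wrapper optimizers such as
    # BF16_Optimizer do not expose a `.state` dict themselves, so fall
    # back to the state of the wrapped torch optimizer.
    if hasattr(optimizer, 'state'):
        return optimizer.state
    return optimizer.optimizer.state

# training_log would then index into the guarded state, e.g.:
#   state = _get_optimizer_state(optimizer)
#   opt_stats[0] += (torch.norm(state[param]['exp_avg_sq']).item())**2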

loadams merged commit 1280f59 into deepspeedai:main Aug 27, 2024
loadams pushed a commit that referenced this pull request Feb 7, 2025
Signed-off-by: Logan Adams <loadams@microsoft.com>
YJHMITWEB pushed a commit to YJHMITWEB/Megatron-DeepSpeed that referenced this pull request Aug 9, 2025
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
YJHMITWEB pushed a commit to YJHMITWEB/Megatron-DeepSpeed that referenced this pull request Aug 9, 2025
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
tjruwase pushed a commit that referenced this pull request Aug 14, 2025
…nabled (#479)

* pass batch_dim_idx to deepspeed sequence parallel distributed attention to support batch sizes larger than 1

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add fused_rms_norm support on XPU device (#431)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [LLaMa] Adding support for converting checkpoints from mds to hf (#432)

* add support for converting checkpoints from hf to mds

* Fix PP issue

* update

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add device check when importing ipex (#436)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix TFLOPs calculation (#371)

* fix TFLOPs calculation

When GQA is used, we observe the correct TFLOPs after this fix.
When GQA is not used, the large difference in TFLOPs is resolved with
selective recompute. Some other minor differences will also be observed,
since logits MACs are now included as well. A rough illustration of the
GQA-aware count appears below.

* add copyrights

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
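
To make the GQA point concrete, here is a rough per-layer MACs sketch under assumed names (hidden_size, num_heads, num_kv_heads); it illustrates why GQA shrinks the K/V projection cost and why the logits projection matters, and is not the repo's exact formula:

def attn_proj_macs(batch, seq_len, hidden_size, num_heads, num_kv_heads):
    # With GQA, the K and V projections use num_kv_heads < num_heads, so
    # their weight matrices shrink by num_kv_heads / num_heads versus Q.
    head_dim = hidden_size // num_heads
    q_macs = batch * seq_len * hidden_size * hidden_size
    kv_macs = 2 * batch * seq_len * hidden_size * (num_kv_heads * head_dim)
    out_macs = batch * seq_len * hidden_size * hidden_size
    return q_macs + kv_macs + out_macs

def logits_macs(batch, seq_len, hidden_size, vocab_size):
    # The fix also counts the final LM-head (logits) projection.
    return batch * seq_len * hidden_size * vocab_size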

* fix NaN issue when running Megatron-DeepSpeed (#434)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* enable empty cache on XPU device (#438)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [wandb] disable wandb more gracefully (#422)

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [Bug] Fix crash when logging optimizer state to tb (#417)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add FPDT support; add Ulysses rotary position embedding support

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* remove unnecessary files

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* set the warmup length to the FPDT chunk size if enabled

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Enable Sequence Parallelism (#429)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* grad_wei can't be NoneType when running with DeepSpeed, since ZeRO-3 will divide the gradient (#428)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
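
A heavily hedged sketch of the usual shape of such a fix (function and argument names are assumptions, not the actual diff): return a zero tensor instead of None so that ZeRO-3's gradient partitioning has a tensor to divide:

import torch

def linear_weight_grad(grad_output, total_input, weight, skip_wgrad):
    # Hypothetical sketch: when the weight gradient would be skipped,
    # return zeros rather than None, since DeepSpeed ZeRO-3 partitions
    # (divides) every gradient tensor and cannot handle NoneType.
    if skip_wgrad:
        return torch.zeros_like(weight)
    # grad_output: [tokens, out], total_input: [tokens, in] -> [out, in]
    return grad_output.t().matmul(total_input)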

* fix init issue for rms_norm in sequence_parallel (#448)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* enable profiler for specific ranks (#451)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix init issue where the deepspeed config was silently ignored (#452)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix MoE TFLOPs (#445)

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* [tools] GQA convert support (#454)

* [tools] GQA convert support

* fix readme

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* Fix import error in `deepspeed_to_megatron.py` (#455)

Previously, `deepspeed_to_megatron.py` would raise an import error
due to a relative import.

This commit fixes the issue by changing the relative import to an
absolute import, as in `deepspeed_to_transformers.py`.

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
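
A minimal illustration of the described change; the module and class names are assumptions drawn from the repo's checkpoint-conversion tools:

# Before: a relative import, which raises ImportError when the script
# is executed directly rather than as part of a package.
# from .deepspeed_checkpoint import DeepSpeedCheckpoint

# After: an absolute import, matching deepspeed_to_transformers.py.
from deepspeed_checkpoint import DeepSpeedCheckpoint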

* Update references to new GitHub org (deepspeedai) (#462)

Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* add sequence_parallel in layernorm init so that 3D parallelism can run successfully with DeepSpeed (#468)

Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

* fix bug when FPDT is disabled but original Ulysses is used

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: jinghan yao yjhmitweb@gmail.com
Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>

---------

Signed-off-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: yisheng <yi.sheng@intel.com>
Signed-off-by: jinghan yao yjhmitweb@gmail.com
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn>
Co-authored-by: billishyahao <yahao.he@gmail.com>
Co-authored-by: Polisetty V R K Jyothendra Varma <jvarma@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: ranzhejiang <zhejiang.ran@intel.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: hotsuyuki <hotsuyuki.kawanishi@gmail.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>