Issues: microsoft/DeepSpeed
Issues list
[BUG] FAILED: multi_tensor_adam.cuda.o with
bug
Something isn't working
training
#6912
opened Dec 24, 2024 by
XueruiSu
[BUG]Convergence Issue: Training BERT for Embedding with Zero2 and 3 as compared to Torchrun
bug
Something isn't working
training
#6911
opened Dec 24, 2024 by
dawnik17
[BUG] triton kernel, loss 0, grad-norm nan
bug
Something isn't working
training
#6902
opened Dec 22, 2024 by
mdy666
DeepSpeed with ZeRO3 strategy cannot build 'fused_adam'
bug
Something isn't working
training
#6892
opened Dec 18, 2024 by
LeonardoZini
How do I know if stage-3 is a success by using deepspeed?
training
#6877
opened Dec 16, 2024 by
hwhyyds
[BUG] Cannot use --hostfile to start multi-node training in Docker.
bug
Something isn't working
training
#6875
opened Dec 16, 2024 by
Ind1x1
[BUG] Invalidate trace cache @ step 10: expected module 11, but got module 19
bug
Something isn't working
training
#6870
opened Dec 14, 2024 by
yafuly
[BUG] Mismatch of model parameters when using Sequence Parallel
bug
Something isn't working
training
#6868
opened Dec 13, 2024 by
chetwin-character
[BUG]When fine-tuning an LLM, the following error occurs after training for some time: self.optimizer.param_groups[param_group_id]['params'] = [] IndexError: list index out of range
bug
Something isn't working
training
#6857
opened Dec 12, 2024 by
tdtgi
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
bug
Something isn't working
training
#6811
opened Dec 1, 2024 by
NirSonnenschein
[BUG] Getting "SymIntArrayRef expected to contain only concrete integers" error when > 1 GPU
bug
Something isn't working
training
#6806
opened Nov 28, 2024 by
rileyhun
[BUG] [Fix-Suggested] ZeRO Stage 3 Overwrites Module ID Attribute Causing Incorrect Expert Placement on GPUs
bug
Something isn't working
training
#6772
opened Nov 20, 2024 by
traincheck-team
[BUG] clip_grad_norm for zero_optimization mode is not working
bug
Something isn't working
training
#6767
opened Nov 20, 2024 by
chengmengli06
[BUG]NCCL operation timeout when training with deepspeed_zero3_offload or deepspeed_zero3 on RTX4090
bug
Something isn't working
training
#6756
opened Nov 18, 2024 by
MLS2021
[BUG] [ROCm] Fine-tuning DeepSeek-Coder-V2-Lite-Instruct with 8 MI300X GPUs results in c10::DistBackendError
bug
Something isn't working
rocm
AMD/ROCm/HIP issues
training
#6725
opened Nov 8, 2024 by
nikhil-tensorwave
[BUG] Zero3 for torch.compile with compiled_autograd when running LayerNorm
bug
Something isn't working
training
#6719
opened Nov 6, 2024 by
yitingw1
[BUG] DeepSpeed accuracy issue for torch.compile if activation checkpoint function not compiler disabled
bug
Something isn't working
training
#6718
opened Nov 6, 2024 by
jerrychenhf
[BUG]Issue with Zero Optimization for Llama-2-7b Fine-Tuning on Intel GPUs
bug
Something isn't working
training
#6713
opened Nov 5, 2024 by
molang66
[BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch
bug
Something isn't working
training
#6691
opened Oct 30, 2024 by
purefall
[BUG] ZeRO++ sharding small parameter raise IndexError
bug
Something isn't working
training
#6659
opened Oct 23, 2024 by
wuxibin89
[BUG] Training batch size is not consistent with train_batch_size
bug
Something isn't working
training
#6657
opened Oct 23, 2024 by
tnnandi
[BUG] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
bug
Something isn't working
training
#6643
opened Oct 20, 2024 by
RickoNoNo3
[BUG] MOE: Loading experts parameters error when using expert parallel.
bug
Something isn't working
training
#6589
opened Sep 29, 2024 by
kakaxi-liu