Insights: deepspeedai/DeepSpeed
Overview
13 Pull requests merged by 8 people
-
fix(inference): Add missing dtype attribute to ParameterBase setter
#7378 merged
Jun 23, 2025 -
Add support for ws=1 scenario
#7379 merged
Jun 23, 2025 -
Fix dtype mismatch in `TestParamPartitioningSkipInit`
#7377 merged
Jun 23, 2025 -
fix wandb.log() call by removing `sync` kwarg
#7383 merged
Jun 23, 2025 -
Fix release of IPG buffer
#7376 merged
Jun 22, 2025 -
Update latest news with DeepNVMe
#7375 merged
Jun 20, 2025 -
Relax tolerances for FP8 unit test only for ROCm + FP16
#7373 merged
Jun 20, 2025 -
Flops profiler support for F.interpolate
#7353 merged
Jun 20, 2025 -
add Arctic Long Sequence Training paper reference
#7372 merged
Jun 20, 2025 -
Enable torch.autocast with ZeRO
#6993 merged
Jun 19, 2025 -
sequence parallel default dtype
#7364 merged
Jun 19, 2025 -
Fix(scheduler): WarmupLR inherits optimizer lr when not specified
#7360 merged
Jun 19, 2025 -
Restore real inputs for recompilation
#7356 merged
Jun 19, 2025
1 Pull request opened by 1 person
-
fix #7188
#7371 opened
Jun 19, 2025
9 Issues closed by 5 people
-
[REQUEST] Support for XLA/TPU
#6901 closed
Jun 24, 2025 -
[BUG] AttributeError: 'UnembedParameter' object has no attribute 'dtype'
#7260 closed
Jun 23, 2025 -
[BUG] WandbMonitor log() invocation broken with wandb 0.20.0
#7381 closed
Jun 23, 2025 -
[BUG]Training
#7319 closed
Jun 20, 2025 -
nv-sd CI test failure
#7310 closed
Jun 20, 2025 -
[BUG] FLOPS compute **FAILS** for `F.interpolate` when using `scale_factor`
#4504 closed
Jun 20, 2025 -
Bug when using optimizer and WarmupLR togather
#7303 closed
Jun 19, 2025 -
[BUG] Universal checkpoint conversion - "Cannot find layer_01* files in there"
#5776 closed
Jun 17, 2025 -
[BUG] No `universal_checkpoint_info` in the Accelerate+Deepspeed Checkpoint
#5430 closed
Jun 17, 2025
5 Issues opened by 5 people
-
[BUG] deepspeed v0.17.1 con't run well on NPU platform!
#7380 opened
Jun 23, 2025 -
[BUG] Memory leak when using adam_offload and save_checkpoint
#7370 opened
Jun 19, 2025 -
[BUG] init_inference loads qwen3-32b model very slow but train model loads it quickly
#7369 opened
Jun 18, 2025 -
FastPersist micro-benchmarks test results are inconsistent with expectations
#7368 opened
Jun 18, 2025
12 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
HF2UCP: Converting a `pytorch_model.bin` or `.safetensors` checkpoint to UCP
#7212 commented on
Jun 23, 2025 • 4 new comments -
[BUG] Qwen3: model loading failed when using meta device
#7275 commented on
Jun 18, 2025 • 0 new comments -
Functorch support: RuntimeError: In order to use an autograd.Function with functorch transforms
#7323 commented on
Jun 19, 2025 • 0 new comments -
AssertionError: no sync context manager is incompatible with gradientpartitioning logic of ZeRo stage 3
#6793 commented on
Jun 20, 2025 • 0 new comments -
Error when installing deepspeed with pip (Not sure if this is a bug or not)
#7358 commented on
Jun 23, 2025 • 0 new comments -
[BUG] DeepCompile in ZeRO-1 fails to do the forward pass
#7229 commented on
Jun 23, 2025 • 0 new comments -
nv-torch-nightly-v100 CI test failure
#7195 commented on
Jun 24, 2025 • 0 new comments -
nv-nightly CI test failure
#7140 commented on
Jun 24, 2025 • 0 new comments -
[BUG] Receiving CUDA error: invalid argument using pytorch 2.7 with deepspeed 0.16.4 with Cuda 12.8
#7150 commented on
Jun 24, 2025 • 0 new comments -
[BUG] Universal Checkpoint Conversion: Resumed Training Behaves as If Model Initialized from Scratch
#6691 commented on
Jun 24, 2025 • 0 new comments -
Avoid graph break by enabling compile of record module
#7362 commented on
Jun 23, 2025 • 0 new comments -
Fix ZeRO stage 1 and add stage 2 support with DeepCompile
#7366 commented on
Jun 23, 2025 • 0 new comments