Optimizer state loading fix for bitsandbytes 8-bit optimizers. #1582
Conversation
To add a bit more context: I also tried to solve this on the bnb side by casting the tensor to a list before the optimizer is saved, but I could not figure out which hook would trigger the tensor-to-list conversion. I expected DeepSpeed to call the optimizer's getstate or state_dict(), but neither worked for me. Which methods of the optimizer do get called when one saves a sharded optimizer to disk? |
For ZeRO we should be calling the optimizer's state_dict(). |
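As a rough illustration of the bnb-side idea mentioned above, here is a minimal sketch assuming the saver does go through state_dict() (which is exactly the open question in this thread). Adam8bitListQmaps is a hypothetical subclass, not part of bitsandbytes, and loading would need the reverse list-to-tensor conversion.

```python
# Hypothetical sketch: cast the qmap tensors to plain Python lists when the
# optimizer's state_dict() is produced, so a sharding-aware saver leaves them
# alone. This only helps if the saving path actually calls state_dict().
import torch
import bitsandbytes as bnb


class Adam8bitListQmaps(bnb.optim.Adam8bit):
    def state_dict(self):
        sd = super().state_dict()
        new_state = {}
        for idx, param_state in sd["state"].items():
            # shallow-copy so the optimizer's live state is not modified
            param_state = dict(param_state)
            for key in ("qmap1", "qmap2"):
                if torch.is_tensor(param_state.get(key)):
                    param_state[key] = param_state[key].tolist()
            new_state[idx] = param_state
        return {"state": new_state, "param_groups": sd["param_groups"]}
```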
That would be a problem. And this is precisely the problem with using BNB and Deepspeed together as of this moment. Once I start training with BNB, I'm stuck with the same DP until the end of the training. |
I have an idea. Instead of adding code complexity to make the saved checkpoint elastic, why not make the conversion elastic instead? That is, optimize/simplify the code to work with a fixed PP/DP/TP checkpoint and have a mechanism to reconvert the degrees in the checkpoint as a "middleware". We already have the foundation tools in https://github.com/microsoft/Megatron-DeepSpeed/tree/main/tools/convert_checkpoint and I have already asked @tjruwase for a new feature to re-mold the checkpoint to a desired PP/DP/TP config (albeit only for weights; here we need optimizer states as well). So the main difference I propose is that instead of saving the checkpoint in a generic form, we save it exactly as laid out, and then have a tool to prepare it for a new PP/DP/TP layout and launch from that newly converted checkpoint. Besides, we already have an issue with reconfiguring both TP and PP degrees anyway, albeit PP should be easier to fix, I think. |
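A rough sketch of that proposed workflow, purely hypothetical: reshape_checkpoint stands in for the re-molding tool requested above, which does not exist yet (the linked convert_checkpoint utilities cover weights only).

```python
# Hypothetical workflow for the "save exact, convert as middleware" proposal.
# reshape_checkpoint is a placeholder name for the requested tool that would
# re-mold a fixed PP/DP/TP checkpoint (weights *and* optimizer states) into a
# new parallelism layout before relaunching training.

def reshape_checkpoint(src_dir: str, dst_dir: str, pp: int, dp: int, tp: int) -> None:
    """Placeholder for the proposed offline conversion step."""
    raise NotImplementedError("middleware tool requested in this thread")


# 1. Training saves the checkpoint exactly as laid out for the current PP/DP/TP.
# 2. Before resuming with a different topology, convert it offline, e.g.:
#    reshape_checkpoint("checkpoints/step1000", "checkpoints/step1000_pp2_dp8_tp4",
#                       pp=2, dp=8, tp=4)
# 3. Relaunch training, loading from the converted directory.
```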
I was also thinking about another potential problem: the small quantization statistics are also split and communicated. This seems to make training slower compared to 32-bit Adam. One way to solve both this and the communication issue is to have a size-dependent "blocklist" for optimizer states. Stas told me that a blocklist already exists for weights. Would this be a reasonable solution? With that, all the issues with BNB optimizers could be fixed in one go. |
It's not an explicit listing of names, but by param size: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training |
Can one of the admins verify this patch? |
just curious as to the status of this... |
Hi @stas00 and @TimDettmers - any idea if this is still needed? If so, happy to fix the conflicts and review/merge. Otherwise closing out some older PRs here. Thanks for any context you have! |
Logan, did Deepspeed add a feature to allow users to configure a list of param names not to be sharded? If I'm not mistaken, based on Tim's OP, this is (was?) the main need for bitsandbytes to work with Deepspeed. Unless @TimDettmers has already figured out a different solution. It has been quite a while since this issue was encountered. |
This is a fix for loading the bitsandbytes 8-bit optimizer with optimizer sharding. By default, DeepSpeed automatically shards all tensors held by the optimizer state. However, 8-bit optimizers also hold the quantization datatype (qmap1 and qmap2 in the case of Adam), without which the 8-bit state is undefined. The full datatype is needed to perform quantization and restore the optimizer state, and if this tensor is sharded the optimizer state cannot be recovered. This is a quick fix for the BigScience team. See this PR for more info. A more general fix would be to allow specifying in the config which tensor keys should not be sharded. Alternatively, in this case, it would also suffice to disable sharding for tensors equal to or smaller than 256 elements (the size of the 8-bit data type).