
Optimizer state loading fix for bitsandbytes 8-bit optimizers. #1582

Open · wants to merge 2 commits into master

Conversation

@TimDettmers

This is a fix for loading bitsandbytes 8-bit optimizers with optimizer sharding. By default, DeepSpeed automatically shards all tensors held in the optimizer state. However, 8-bit optimizers also hold the quantization data type (qmap1 and qmap2 in the case of Adam), without which the 8-bit state is undefined. The full quantization map is needed to quantize and restore the optimizer state; if this tensor is sharded, the optimizer state cannot be recovered.

This is a quick fix for the BigScience team; see this PR for more info. A more general fix would be to allow specifying in the config that certain tensor keys should not be sharded. Alternatively, in this case, it would also suffice to disable sharding for tensors of 256 elements or fewer (the size of the 8-bit data type).
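
To make the failure mode concrete, below is a rough Python sketch of what the per-parameter state of an 8-bit optimizer looks like and why the quantization map must stay whole. The key names other than qmap1/qmap2, the block size, and the values are illustrative assumptions, not bitsandbytes' actual internals.

```python
import torch

# Illustrative layout of one parameter's 8-bit optimizer state (names other
# than qmap1/qmap2 are assumptions, not bitsandbytes' exact internals).
state = {
    "state1": torch.randint(0, 256, (1024,), dtype=torch.uint8),  # quantized first moment
    "state2": torch.randint(0, 256, (1024,), dtype=torch.uint8),  # quantized second moment
    "qmap1": torch.linspace(-1.0, 1.0, 256),   # 256-entry code -> value lookup table
    "qmap2": torch.linspace(0.0, 1.0, 256),
    "absmax1": torch.ones(4),                  # per-block scaling factors (block size 256 assumed)
    "absmax2": torch.ones(4),
}

# Restoring the full-precision moment indexes into the *entire* qmap.
# If ZeRO shards the 256-element qmap across ranks, no rank can decode
# its 8-bit codes any more.
exp_avg = state["qmap1"][state["state1"].long()] * state["absmax1"].repeat_interleave(256)
```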

@stas00 (Collaborator) commented Nov 22, 2021

To make this generic, we would then need some sort of user-configurable black-list of keys not to shard. Exact matching would probably be safer.

Additional info: this is ZeRO-1!

@tjruwase, @jeffra
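
A minimal sketch of the blocklist idea above, assuming a hypothetical user-configurable set of exact-match state keys that ZeRO would leave unsharded; no such config option exists in DeepSpeed at the time of this discussion.

```python
# Hypothetical config value: exact-match optimizer-state keys to keep whole.
NO_SHARD_STATE_KEYS = {"qmap1", "qmap2"}

def should_shard(state_key: str) -> bool:
    """Return False for state tensors that must be replicated on every rank."""
    return state_key not in NO_SHARD_STATE_KEYS
```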

@TimDettmers (Author)

To add a bit more context: I also tried to solve this on the bnb side by casting the tensor to a list before the optimizer is saved, but I could not figure out which hook triggers the tensor->list conversion. I expected DeepSpeed to go through the optimizer's __getstate__ or state_dict(), but neither worked for me. Which optimizer methods actually get called when a sharded optimizer is saved to disk?
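
For reference, here is a sketch of the bnb-side workaround described above: wrap the optimizer so that state_dict()/load_state_dict() convert the qmap tensors to plain Python lists and back, on the assumption that a list would be left alone by the sharding logic. Whether DeepSpeed actually routes through these methods when saving a sharded optimizer is exactly the open question here.

```python
import torch

class QmapAsListMixin:
    """Convert qmap tensors to lists on save and back to tensors on load."""

    def state_dict(self):
        sd = super().state_dict()
        new_state = {}
        for idx, pstate in sd["state"].items():
            pstate = dict(pstate)  # shallow copy; don't touch the live optimizer state
            for key in ("qmap1", "qmap2"):
                if key in pstate and torch.is_tensor(pstate[key]):
                    pstate[key] = pstate[key].tolist()
            new_state[idx] = pstate
        sd["state"] = new_state
        return sd

    def load_state_dict(self, sd):
        for pstate in sd["state"].values():
            for key in ("qmap1", "qmap2"):
                if key in pstate and isinstance(pstate[key], list):
                    pstate[key] = torch.tensor(pstate[key])
        super().load_state_dict(sd)

# Hypothetical usage:
#   class PatchedAdam8bit(QmapAsListMixin, bnb.optim.Adam8bit):
#       pass
```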

@TimDettmers changed the title from "Fix for bitsandbytes 8-bit optimizers." to "Optimizer state loading fix for bitsandbytes 8-bit optimizers." on Nov 22, 2021
@jeffra (Collaborator) commented Dec 1, 2021

> I expected DeepSpeed to go through the optimizer's __getstate__ or state_dict(), but neither worked for me. Which optimizer methods actually get called when a sharded optimizer is saved to disk?

For ZeRO we should be calling the optimizer's state_dict() method; the pending PR #1525 fixes this, since we've seen issues in certain cases when the optimizer's state_dict is not called like this. However, the downside of doing this is that ZeRO then does not support loading the checkpoint with a different data-parallel (DP) size than the one it was saved with. The code currently in master saves the checkpoint in an "elastic" way that strips the padding from the optimizer states so they can be restored with new DP sizes.
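
A conceptual sketch of the padding-stripping step mentioned above (not DeepSpeed's actual code): each rank's flattened state partition is padded to a uniform length, and removing that padding before saving is what allows the states to be re-split for a different DP size.

```python
import torch

def strip_padding(flat_partition: torch.Tensor, num_real_elements: int) -> torch.Tensor:
    """Drop alignment padding from a rank's flattened optimizer-state partition."""
    return flat_partition[:num_real_elements].clone()
```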

@stas00 (Collaborator) commented Dec 1, 2021

> However, the downside of doing this is that ZeRO then does not support loading the checkpoint with a different data-parallel (DP) size than the one it was saved with.

That would be a problem.

And this is precisely the problem with using BNB and Deepspeed together as of this moment. Once I start training with BNB, I'm stuck with the same DP until the end of the training.

@stas00 (Collaborator) commented Dec 1, 2021

I have an idea. Instead of adding code complexity to make the saved checkpoint itself elastic, why not make it elastic after the fact, via a conversion step?

That is, optimize/simplify the code to work with a fixed PP/DP/TP checkpoint, and have a mechanism that reconverts the degrees in the checkpoint as a "middleware" step. We already have the foundation tools in https://github.com/microsoft/Megatron-DeepSpeed/tree/main/tools/convert_checkpoint, and I have already asked @tjruwase for a new feature to re-mold the checkpoint into a desired PP/DP/TP config (albeit only for weights; here we need the optimizer states as well).

So the main difference I propose is that instead of saving the checkpoint in a generic form, we save it exactly as laid out, then have a tool prepare it for a new PP/DP/TP layout, and then launch from that newly converted checkpoint.

Besides, we already have an issue with reconfiguring both the TP and PP degrees anyway, though PP should be easier to fix, I think.
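
A hypothetical sketch of the "re-mold as middleware" step for the DP degree only: reassemble the exact per-rank shards saved with the old DP size and re-split them for the new one. A real tool would also have to handle padding metadata, param groups, and the PP/TP dimensions, which this ignores.

```python
import torch
import torch.nn.functional as F

def remold_dp(old_shards: list[torch.Tensor], new_dp: int) -> list[torch.Tensor]:
    full = torch.cat(old_shards)                      # reassemble the full flat state
    chunk = (full.numel() + new_dp - 1) // new_dp     # new per-rank partition size
    full = F.pad(full, (0, chunk * new_dp - full.numel()))
    return list(full.chunk(new_dp))                   # one shard per new DP rank
```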

@TimDettmers (Author) commented Dec 1, 2021

I was also thinking about another potential problem: the small quantization statistics are also split and communicated, which seems to make training slower than with 32-bit Adam. One way to solve both the sharding issue and the communication issue is a size-dependent "blocklist" for optimizer states. Stas told me that such a blocklist already exists for weights. Would that be a reasonable solution? With it, all the issues with BNB optimizers could be fixed in one go.
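
A sketch of the size-dependent variant, assuming a hypothetical threshold setting analogous to the one that exists for weights; neither the name nor the knob below exists in DeepSpeed.

```python
# Hypothetical knob: optimizer-state tensors at or below this many elements
# are replicated on every rank instead of being sharded and communicated.
OPTIMIZER_STATE_PERSISTENCE_THRESHOLD = 256  # covers the 256-entry 8-bit qmaps

def should_shard_state_tensor(tensor) -> bool:
    return tensor.numel() > OPTIMIZER_STATE_PERSISTENCE_THRESHOLD
```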

@stas00 (Collaborator) commented Dec 1, 2021

> Stas told me that such a blocklist already exists for weights.

It's not an explicit list of names, but a threshold on param size:

https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training

> stage3_param_persistence_threshold: integer
> Do not partition parameters smaller than this threshold. Smaller values use less memory, but can greatly increase communication (especially latency-bound messages).
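
For reference, this existing knob lives under zero_optimization in the DeepSpeed config and applies to ZeRO-3 parameters (weights), not to optimizer state; an analogous optimizer-state setting is what this thread is asking for. The threshold value below is illustrative.

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # don't partition parameters smaller than this many elements
        "stage3_param_persistence_threshold": 100000,
    }
}
```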

@rocm-mici

Can one of the admins verify this patch?

@Thomas-MMJ (Contributor)

Just curious as to the status of this...

@loadams (Contributor) commented Sep 14, 2023

Hi @stas00 and @TimDettmers - any idea if this is still needed? If so, happy to fix the conflicts and review/merge. Otherwise closing out some older PRs here. Thanks for any context you have!

@stas00 (Collaborator) commented Sep 14, 2023

Logan, did Deepspeed add a feature to allow users to configure a list of param names not to be sharded? If I'm not mistaken, based on Tim's OP, this is (was?) the main thing bitsandbytes needs to work with Deepspeed.

Unless @TimDettmers has already figured out a different solution; it has been quite a while since this issue was encountered.
