Optimizer state loading fix for bitsandbytes 8-bit optimizers. #1582
Conversation
To add a bit more context: I also tried to solve this on the bnb side by casting the tensor to a list before the optimizer is saved, but I could not figure out which hook would trigger the tensor-to-list conversion. I expected DeepSpeed to call the optimizer's getstate or state_dict(), but neither worked for me. Which methods of the optimizer do get called when one saves a sharded optimizer to disk? |
For ZeRO we should be calling the optimizer's state_dict(). |
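As a rough illustration of the bnb-side idea mentioned above, here is a minimal sketch assuming the saver does go through state_dict() (which is exactly the open question in this thread). Adam8bitListQmaps is a hypothetical subclass, not part of bitsandbytes, and loading would need the reverse list-to-tensor conversion.

```python
# Hypothetical sketch: cast the qmap tensors to plain Python lists when the
# optimizer's state_dict() is produced, so a sharding-aware saver leaves them
# alone. This only helps if the saving path actually calls state_dict().
import torch
import bitsandbytes as bnb


class Adam8bitListQmaps(bnb.optim.Adam8bit):
    def state_dict(self):
        sd = super().state_dict()
        new_state = {}
        for idx, param_state in sd["state"].items():
            # shallow-copy so the optimizer's live state is not modified
            param_state = dict(param_state)
            for key in ("qmap1", "qmap2"):
                if torch.is_tensor(param_state.get(key)):
                    param_state[key] = param_state[key].tolist()
            new_state[idx] = param_state
        return {"state": new_state, "param_groups": sd["param_groups"]}
```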
That would be a problem. And this is precisely the problem with using BNB and Deepspeed together as of this moment. Once I start training with BNB, I'm stuck with the same DP until the end of the training. |
I have an idea. Instead of adding code complexity to make the saved checkpoint elastic, why not make the conversion elastic instead? That is, optimize/simplify the code to work with a fixed PP/DP/TP checkpoint and have a mechanism to reconvert the degrees in the checkpoint as a "middleware". We already have the foundation tools in https://github.com/microsoft/Megatron-DeepSpeed/tree/main/tools/convert_checkpoint and I have already asked @tjruwase for a new feature to re-mold the checkpoint to a desired PP/DP/TP config (albeit only for weights; here we need optimizer states as well). So the main difference I propose is that instead of saving the checkpoint in a generic form, we save it exactly as laid out, and then have a tool to prepare it for a new PP/DP/TP layout and launch from that newly converted checkpoint. Besides, we already have an issue with reconfiguring both TP and PP degrees anyway, albeit PP should be easier to fix, I think. |
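A rough sketch of that proposed workflow, purely hypothetical: reshape_checkpoint stands in for the re-molding tool requested above, which does not exist yet (the linked convert_checkpoint utilities cover weights only).

```python
# Hypothetical workflow for the "save exact, convert as middleware" proposal.
# reshape_checkpoint is a placeholder name for the requested tool that would
# re-mold a fixed PP/DP/TP checkpoint (weights *and* optimizer states) into a
# new parallelism layout before relaunching training.

def reshape_checkpoint(src_dir: str, dst_dir: str, pp: int, dp: int, tp: int) -> None:
    """Placeholder for the proposed offline conversion step."""
    raise NotImplementedError("middleware tool requested in this thread")


# 1. Training saves the checkpoint exactly as laid out for the current PP/DP/TP.
# 2. Before resuming with a different topology, convert it offline, e.g.:
#    reshape_checkpoint("checkpoints/step1000", "checkpoints/step1000_pp2_dp8_tp4",
#                       pp=2, dp=8, tp=4)
# 3. Relaunch training, loading from the converted directory.
```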
I was also thinking about another potential problem: the small quantization statistics are also split and communicated. This seems to make training slower compared to 32-bit Adam. One way to solve both this and the communication issue is to have a size-dependent "blocklist" for optimizer states. Stas told me that a blocklist already exists for weights. Would this be a reasonable solution? With that, all the issues with BNB optimizers could be fixed in one go. |
It's not an explicit listing of names, but by param size: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training |
Can one of the admins verify this patch? |
just curious as to the status of this... |
Hi @stas00 and @TimDettmers - any idea if this is still needed? If so, happy to fix the conflicts and review/merge. Otherwise closing out some older PRs here. Thanks for any context you have! |
Logan, did Deepspeed add a feature to allow users to configure a list of param names not to be sharded? If I'm not mistaken, based on Tim's OP, this is (was?) the main need for bitsandbytes to work with Deepspeed. Unless @TimDettmers has already figured out a different solution. It has been quite a while since this issue was encountered. |
This is a fix for loading the bitsandbytes 8-bit optimizer with optimizer sharding. By default, DeepSpeed automatically shards all tensors held by the optimizer state. However, 8-bit optimizers also hold the quantization datatype (qmap1 and qmap2 in the case of Adam), without which the 8-bit state is undefined. The full datatype is needed to perform quantization and restore the optimizer state, and if this tensor is sharded the optimizer state cannot be recovered. This is a quick fix for the BigScience team. See this PR for more info. A more general fix would be to allow specifying in the config which tensor keys should not be sharded. Alternatively, in this case, it would also suffice to disable sharding for tensors equal to or smaller than 256 elements (the size of the 8-bit data type).