Add Compressedbackend for Onebit optimizers #5473
Conversation
csrc/xpu/packbits/packing.cpp
Outdated
at::Tensor packbits(at::Tensor tensor, int input_size, int rank)
{
/*
@Liangliang-Ma the function documentation needs to be moved to line 39, right before the function definition line.
Updated.
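For reference, here is a minimal sketch of the layout the review asks for, with the documentation placed directly above the function definition. The signature matches the snippet above; the body is a plain libtorch stand-in written for illustration only, not the SYCL kernel from this PR, and `input_size`/`rank` are left unused in this simplified version.

```cpp
#include <torch/extension.h>

/*
 * Pack the sign bits of `tensor` into a uint8 tensor, 8 values per byte
 * (most significant bit first), mirroring what cupy.packbits does on CUDA.
 */
at::Tensor packbits(at::Tensor tensor, int input_size, int rank)
{
    // Illustrative libtorch-only body; input_size and rank are unused here.
    // 1 where the value is non-negative, 0 otherwise.
    auto bits = (tensor >= 0).to(at::kByte).flatten();

    // Pad so the number of bits is a multiple of 8.
    int64_t padded = (bits.numel() + 7) / 8 * 8;
    bits = at::constant_pad_nd(bits, {0, padded - bits.numel()}, 0);

    // Weight each bit by its position within the byte (MSB first) and sum.
    auto weights = torch::tensor({128, 64, 32, 16, 8, 4, 2, 1}, bits.options());
    return bits.view({-1, 8}).mul(weights).sum(1).to(at::kByte);
}
```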
@tjruwase this PR is an approach to abstract the generic part of 1bit-adam and implement the accelerator-dependent part with a DeepSpeed custom op builder, so 1bit-adam does not need to depend on accelerator-specific libraries. @inkcherry I remember you investigated 1bit-adam portability before; FYI, this PR implements a portable version of 1bit-adam support.
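As a rough illustration of that split, the sketch below shows how an accelerator's packbits/unpackbits kernels could be exposed to Python through a torch extension compiled by a custom op builder, so the generic CompressedBackend can call them without device-specific imports. The unpackbits signature and the docstrings are assumptions made for this example, not necessarily what the PR uses.

```cpp
#include <torch/extension.h>

// Kernels supplied by the accelerator-specific source file (e.g. a SYCL
// implementation for XPU); the unpackbits signature is assumed here to
// mirror packbits.
at::Tensor packbits(at::Tensor tensor, int input_size, int rank);
at::Tensor unpackbits(at::Tensor tensor, int input_size, int rank);

// Extension entry point; the op builder compiles this module and the generic
// compressed_allreduce path calls the bindings regardless of the device.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
{
    m.def("packbits", &packbits, "Pack sign bits into uint8");
    m.def("unpackbits", &unpackbits, "Expand uint8 bytes back into sign values");
}
```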
Hi @tjruwase, could you please help review this PR? Thanks!
add README.md for onebit tests
@tjruwase I have noticed that in the onebit unit tests, the onebit comm backend is assigned like this:
@tjruwase Hi, may I ask if you could review my last comment, or merge this one first? Thanks
@Liangliang-Ma, apologies for the delay. I am still thinking about your last comment, but will not delay this PR.
In the process of adding onebit optimizer support for XPU devices, we noticed that, across accelerators, the main difference in the implementation of `compressed_allreduce` lies in `packbits` and `unpackbits`: CUDA uses cupy and NPU uses torch_npu. Instead of replacing these with XPU-only functions, we provide a CompressedBackend that performs the `compressed_allreduce` work and lets users plug in their own packbits/unpackbits kernels, giving a general path for all kinds of accelerators.

In this PR, we:
1. Add CompressedBackend for OnebitAdam, OnebitLamb and ZeroOneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built by PackbitsBuilder
3. Add tests for onebit with CompressedBackend

---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
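To make the round trip concrete, here is a hypothetical libtorch-only sketch of the unpacking side; the `unpackbits_reference` name, the assumption that `input_size` is the original element count used to trim padding, and the ±1.0 output convention are all illustrative rather than taken from this PR.

```cpp
#include <torch/extension.h>

// Illustrative inverse of the packing step: expand each uint8 byte back into
// 8 sign values (+1.0 for a set bit, -1.0 for a cleared bit) and trim the
// padding added during packing. Not the SYCL kernel; rank is unused here.
at::Tensor unpackbits_reference(at::Tensor packed, int input_size, int rank)
{
    // Test each bit position within the byte, MSB first, matching the packer.
    auto masks = torch::tensor({128, 64, 32, 16, 8, 4, 2, 1},
                               packed.options().dtype(at::kInt));
    auto bits = packed.to(at::kInt).unsqueeze(1).bitwise_and(masks).ne(0);

    // Map {0, 1} -> {-1.0, +1.0} and keep only the original input_size values.
    return bits.to(at::kFloat).mul(2).sub(1).flatten().narrow(0, 0, input_size);
}
```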
This one is a documentation supplement for #5473.

---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>