Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Compressedbackend for Onebit optimizers #5473

Merged
merged 19 commits into from
Jun 5, 2024

Conversation

Liangliang-Ma
Copy link
Contributor

In the process of adding onebit optimizers support for XPU devices, we have noticed that for different accelerator, the main difference of implementation of compressed_allreduce lies on packbits and unpackbits. CUDA uses cupy and NPU uses torch_npu. Instead of replace these to xpu only functions, we provided a CompressedBackend to do the compressed_allreduce work where users can add their own packbits/unpackbits kernels, which is a general path for all kinds of accelerators.

In this PR, we:

  1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
  2. Add XPU implement of packbits/unpackbits with SYCL, built in PackbitsBuilder
  3. Add tests for onebit with CompressedBackend


at::Tensor packbits(at::Tensor tensor, int input_size, int rank)
{
/*
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Liangliang-Ma the function documentation needs to be moved to line 39 right before the function def line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@delock
Copy link
Collaborator

delock commented May 7, 2024

@tjruwase this PR is an approach to abstract the generic part of 1bit-adam and implment accelerator dependent part with DeepSpeed custom op builder. So 1bit-adam does not need to depend on accelerator specific libraries.

@inkcherry I remember you investigated in 1bit adam portability before, FYI this PR implement a portable version of 1bit adam support.

@Liangliang-Ma Liangliang-Ma marked this pull request as draft May 11, 2024 08:50
@Liangliang-Ma Liangliang-Ma marked this pull request as ready for review May 14, 2024 06:44
@Liangliang-Ma
Copy link
Contributor Author

Hi @tjruwase , could you please help to review this PR? Thanks!

@Liangliang-Ma
Copy link
Contributor Author

Liangliang-Ma commented May 24, 2024

@tjruwase I have noticed that in onebit unit test, the onebit comm backend is assigned like this:
"comm_backend_name": get_accelerator().communication_backend_name().
But in fact, it's not the same thing, for get_accelerator().communication_backend_name() chooses which backend to do regular comm, like broadcast, allgather etc, while onebit backend specifies how to do error-compensated compression. We have MPI-based onebit backend not in accelerators' comm backend. And also I found that both HPU and NPU accelerator have 'hccl' communication_backend_name, which will lead to same onebit backend. So maybe we can change this part? How about we add a interface under accelerator like get_accelerator().onebit_backend() or change unit test to a more general way? Thanks.

@Liangliang-Ma
Copy link
Contributor Author

@tjruwase Hi, May I ask if you could help to review my last comment or merge this one first? Thanks

@tjruwase
Copy link
Contributor

tjruwase commented Jun 5, 2024

@Liangliang-Ma, apologies for delay. I am still thinking about your last comment, but will not delay this PR.

@tjruwase tjruwase added this pull request to the merge queue Jun 5, 2024
Merged via the queue into microsoft:master with commit 11a62a0 Jun 5, 2024
13 checks passed
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
In the process of adding onebit optimizers support for XPU devices, we
have noticed that for different accelerator, the main difference of
implementation of `compressed_allreduce` lies on `packbits` and
`unpackbits`. CUDA uses cupy and NPU uses torch_npu. Instead of replace
these to xpu only functions, we provided a CompressedBackend to do the
`compressed_allreduce` work where users can add their own
packbits/unpackbits kernels, which is a general path for all kinds of
accelerators.

In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add XPU implement of packbits/unpackbits with SYCL, built in
PackbitsBuilder
3. Add tests for onebit with CompressedBackend

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
loadams added a commit that referenced this pull request Jul 29, 2024
This one is document supplement for
#5473.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants