
Fix a bug in the implementation of dequantization for inference #3433

Merged: 8 commits into deepspeedai:master on Sep 14, 2023

Conversation

sakogan (Contributor) commented May 3, 2023

The implementation of dequantization for inference appears to be incorrect.

It sets the number of blocks to `ceil(output_size / (hid_cnt * groups))` and the size of each block (aka `merge_hidden` in `dequantize_kernel`) to `hid_cnt * hidden_dim`. However, for some combinations of arguments to `launch_dequantize` (e.g., output_size=768, hidden_dim=768, and groups=144, which corresponds to a BERT-base model), this leads to the creation of overlapping blocks in `dequantize_kernel` and hence an erroneous calculation of the dequantized input.

The proposed solution is to get rid of `hid_cnt` and simply set the number of blocks to `output_size / groups`, while the size of each block becomes `hidden_dim`.
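
To make the failure mode concrete, here is a minimal numeric sketch of the two block layouts (plain Python rather than the CUDA kernel; `hid_cnt = 5` and the per-group grid interpretation are assumptions for illustration, and groups=48 is the corrected value from the discussion further down):

```python
import math

# Assumed example: a BERT-base projection (768x768) with groups=48.
output_size, hidden_dim, groups = 768, 768, 48
hid_cnt = 5  # assumed value, purely for illustration

# Baseline scheme: ceil(output_size / (hid_cnt * groups)) blocks per group,
# each covering merge_hidden = hid_cnt * hidden_dim elements. The ceil makes
# the grid claim more rows than the output has, so block ranges overlap.
old_blocks = math.ceil(output_size / (hid_cnt * groups))  # = 4
rows_claimed = old_blocks * groups * hid_cnt              # = 960 > 768
print(f"baseline claims {rows_claimed} rows for a {output_size}-row output")

# Proposed scheme: output_size / groups blocks per group, each covering
# hidden_dim elements, so every row is covered exactly once.
assert output_size % groups == 0        # mirrors the assert added in the PR
new_blocks = output_size // groups      # = 16
assert new_blocks * groups == output_size
```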

The PR includes a simple unittest that demonstrates the problem (in the baseline implementation) and validates the proposed solution on a number of use cases.

sakogan added 2 commits May 2, 2023 14:38
Get rid of `hid_cnt` and simply set #blocks to output size / #groups
sakogan changed the title from "Dequantize bugfix" to "Fix a bug in the implementation of dequantization for inference" on May 3, 2023
sakogan marked this pull request as ready for review on May 8, 2023 at 20:58
sakogan (Contributor, Author) commented Jun 6, 2023

@loadams any update/ETA on this PR?

FYI, I just rebased from master.
Please let me know if there is anything else I can do to facilitate reviewing/merging this PR.

loadams (Collaborator) commented Jun 6, 2023

> @loadams any update/ETA on this PR?
>
> FYI, I just rebased from master. Please let me know if there is anything else I can do to facilitate reviewing/merging this PR.

I'm getting the right folks to review; sorry for the delay in getting this reviewed/merged.

RezaYazdaniAminabadi (Contributor) commented

Hi @sakogan,

Thanks for the PR. Sorry for my delay in getting back to this.
The issue you raised here is valid and definitely needs to be addressed. However, this kernel is aimed at token-wise dequantization, in which case the output dimension needs to be divisible by the number of groups. That is not the case in the example where you demonstrated the problem, since 768 is not divisible by 144 (I am also not seeing this case in your unit test), and thus the kernel implementation per se will not give you the right result.

So, may I suggest that we revert these changes for this kernel and create another one that takes both the hidden and output dimensions into account when computing the right group_index for reading the scale? For instance, if I understand correctly, each 4K elements in the 768*768 matrix will have a different scale (considering 144 scales).
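
For concreteness, here is a minimal sketch of the indexing this suggests (illustrative Python, not DeepSpeed code; it assumes the scales apply to contiguous chunks of a row-major matrix):

```python
# Each contiguous chunk of (output_size * hidden_dim) / groups elements
# shares one scale; for a 768x768 matrix with 144 scales that is 4096
# elements ("4K") per scale, matching the figure above.
output_size, hidden_dim, groups = 768, 768, 144
group_size = (output_size * hidden_dim) // groups  # = 4096

def group_index(flat_idx: int) -> int:
    """Scale index for the element at flat (row-major) position flat_idx."""
    return flat_idx // group_size

assert group_index(0) == 0
assert group_index(group_size) == 1
assert group_index(output_size * hidden_dim - 1) == groups - 1
```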

Best,
Reza

sakogan (Contributor, Author) commented Aug 29, 2023

@RezaYazdaniAminabadi, thanks for the review!

My code indeed assumes that the output dimension is divisible by the number of groups. It even includes an assert that would trigger if this is not the case.

I think I made a typo in my description -- I meant to say groups=48 instead of groups=144. (The issue also exists in the attention module, where the QKV weights are concatenated and use three times the number of groups, i.e., 48*3 = 144, but there the output size is three times larger as well.) Anyway, sorry about the confusion! The case of groups=48 is present in my unit test.
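
As a quick sanity check of the QKV arithmetic above (plain Python; the numbers come straight from the comment):

```python
# With concatenated Q, K, V weights, the output dimension triples along with
# the group count, so the divisibility assumption still holds.
hidden, groups = 768, 48
qkv_output_size, qkv_groups = 3 * hidden, 3 * groups  # 2304, 144
assert qkv_output_size % qkv_groups == 0              # 2304 / 144 = 16
```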

RezaYazdaniAminabadi (Contributor) commented

Thanks, this now makes sense 👍

RezaYazdaniAminabadi added this pull request to the merge queue on Sep 14, 2023
Merged via the queue into deepspeedai:master with commit 9bf7778 on Sep 14, 2023
CurryRice233 pushed a commit to CurryRice233/DeepSpeed that referenced this pull request Sep 15, 2023
* origin/master: (48 commits)
  Fix autotune to support Triton 2.1  (deepspeedai#4340)
  Fix skipped inference tests (deepspeedai#4336)
  Suppress noise (deepspeedai#4310)
  Fix a bug in the implementation of dequantization for inference (deepspeedai#3433)
  DS-Chat BLOOM: Fix Attention mask (deepspeedai#4338)
  clear redundant timers (deepspeedai#4308)
  Add release version checking (deepspeedai#4328)
  Fix Zero3 contiguous grads, reduce scatter false  accuracy issue (deepspeedai#4321)
  Clean up modeling code (deepspeedai#4320)
  Handle empty parameter groups (deepspeedai#4277)
  Update README.md (deepspeedai#4316)
  README update (deepspeedai#4303)
  Update release and bump patch versioning flow (deepspeedai#4286)
  added a bert-model check for triton (deepspeedai#4266)
  ZeRO-Inference v2 release
  bump to 0.10.4
  Update index.md (deepspeedai#4297)
  fix user args parsing of string with spaces on runner (deepspeedai#4265)
  ZeRO-Inference refresh (deepspeedai#4197)
  AMD Kernel Compatibility Fixes (deepspeedai#3180)
  ...