Fix a bug in the implementation of dequantization for inference #3433
Conversation
Get rid of `hid_cnt` and simply set #blocks to output size / #groups
@loadams any update/ETA on this PR? FYI, I just rebased from master.
Getting the right folks to review; sorry for the delay on getting this reviewed/merged.
Hi @sakogan, thanks for the PR, and sorry for my delay in getting back to this. May I suggest that we revert these changes for this kernel and create another one that takes into account both the hidden and output dimensions when computing the right `group_index` for reading the scale? For instance, if I understand correctly, each 4K elements in the 768*768 matrix will have a different scale (considering 144 scales). Best,
@RezaYazdaniAminabadi, thanks for the review! My code indeed assumes that the output dimension is divisible by the #groups. It even includes an assert that would trigger if this is not the case. I think I made a typo in my description -- I meant to say
Thanks, this now makes sense 👍
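For the shapes discussed in this exchange, the per-scale group size is easy to check. The snippet below is only an illustration of that arithmetic; `group_index` is a hypothetical name for the scale lookup, not code from the kernel:

```python
output_size, hidden_dim, groups = 768, 768, 144
elems_per_scale = output_size * hidden_dim // groups
print(elems_per_scale)  # 4096 -- the "each 4K elements" figure noted above

# Hypothetical scale lookup for a flat element index, taking both the
# hidden and output dimensions into account as the review suggests:
def group_index(flat_idx):
    return flat_idx // elems_per_scale

print(group_index(0), group_index(589823))  # 0 143
```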
* origin/master: (48 commits)
  - Fix autotune to support Triton 2.1 (deepspeedai#4340)
  - Fix skipped inference tests (deepspeedai#4336)
  - Suppress noise (deepspeedai#4310)
  - Fix a bug in the implementation of dequantization for inference (deepspeedai#3433)
  - DS-Chat BLOOM: Fix Attention mask (deepspeedai#4338)
  - clear redundant timers (deepspeedai#4308)
  - Add release version checking (deepspeedai#4328)
  - Fix Zero3 contiguous grads, reduce scatter false accuracy issue (deepspeedai#4321)
  - Clean up modeling code (deepspeedai#4320)
  - Handle empty parameter groups (deepspeedai#4277)
  - Update README.md (deepspeedai#4316)
  - README update (deepspeedai#4303)
  - Update release and bump patch versioning flow (deepspeedai#4286)
  - added a bert-model check for triton (deepspeedai#4266)
  - ZeRO-Inference v2 release
  - bump to 0.10.4
  - Update index.md (deepspeedai#4297)
  - fix user args parsing of string with spaces on runner (deepspeedai#4265)
  - ZeRO-Inference refresh (deepspeedai#4197)
  - AMD Kernel Compatibility Fixes (deepspeedai#3180)
  - ...
The implementation of dequantization for inference appears to be incorrect. It sets the number of blocks to `ceil(output_size / (hid_cnt * groups))` and the size of each block (aka `merge_hidden` in `dequantize_kernel`) to `hid_cnt * hidden_dim`. However, for some combinations of arguments to `launch_dequantize` (e.g., with `output_size=768`, `hidden_dim=768` and `groups=144`, which corresponds to a BERT-base model) this leads to the creation of overlapping blocks in `dequantize_kernel`, and hence to an erroneous calculation of the dequantized input.
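To make the arithmetic concrete, here is a minimal sketch of the baseline block layout described above; the derivation of `hid_cnt` from a fixed thread count is an assumption for illustration, not the exact DeepSpeed source:

```python
import math

def baseline_layout(output_size, hidden_dim, groups, threads=1024):
    # Assumed derivation of hid_cnt (rows of size hidden_dim packed into
    # one thread block); illustrative only.
    hid_cnt = max(1, threads // hidden_dim)
    blocks = math.ceil(output_size / (hid_cnt * groups))  # blocks per group
    merge_hidden = hid_cnt * hidden_dim                   # elements per block
    covered = groups * blocks * merge_hidden              # elements the grid spans
    needed = output_size * hidden_dim                     # elements in the matrix
    return hid_cnt, blocks, merge_hidden, covered, needed

# BERT-base-like shapes from the description above:
print(baseline_layout(768, 768, 144))
# -> (1, 6, 768, 663552, 589824): the grid has 144 * 6 = 864 block-sized
#    slots for a matrix of only 768 rows, so block ranges must overlap.
```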
The proposed solution is to get rid of `hid_cnt` and simply set the number of blocks to `output_size / groups`, while the size of each block becomes `hidden_dim`.
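Under the same illustrative model, the proposed layout tiles the matrix exactly. A divisible configuration (e.g. `groups=48`) is used here, since the fix assumes `output_size % groups == 0` (the assert mentioned in the discussion above):

```python
def fixed_layout(output_size, hidden_dim, groups):
    # The fix assumes the output dimension is divisible by #groups.
    assert output_size % groups == 0
    blocks = output_size // groups  # blocks per group, no hid_cnt
    block_size = hidden_dim         # each block covers exactly one row
    # Exact tiling: the grid covers every element once, with no overlap.
    assert groups * blocks * block_size == output_size * hidden_dim
    return blocks, block_size

print(fixed_layout(768, 768, 48))  # -> (16, 768)
```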
The PR includes a simple unit test that demonstrates the problem in the baseline implementation and validates the proposed solution on a number of use cases.
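The actual test ships with the PR; the sketch below only illustrates the kind of per-group round-trip check such a test might perform, with a pure-Python reference standing in for the CUDA kernel (all names here are hypothetical):

```python
import numpy as np

def reference_dequantize(q, scales, groups):
    # Split the flattened int8 tensor into `groups` equal chunks and scale
    # each chunk by its own scale (hypothetical reference implementation).
    chunks = np.split(q.reshape(-1).astype(np.float32), groups)
    return np.concatenate([c * s for c, s in zip(chunks, scales)])

def test_dequantize_roundtrip():
    rng = np.random.default_rng(0)
    output_size, hidden_dim, groups = 768, 768, 48  # divisible configuration
    x = rng.standard_normal((output_size, hidden_dim)).astype(np.float32)

    # Symmetric per-group quantization to int8.
    flat_chunks = np.split(x.reshape(-1), groups)
    scales = np.array([np.abs(c).max() / 127.0 for c in flat_chunks], np.float32)
    q = np.concatenate([np.round(c / s) for c, s in zip(flat_chunks, scales)])
    q = q.astype(np.int8)

    # Dequantize and check the round trip; the real test would compare
    # against the output of launch_dequantize instead.
    out = reference_dequantize(q, scales, groups)
    np.testing.assert_allclose(out, x.reshape(-1), atol=float(scales.max()))

test_dequantize_roundtrip()
```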