Qualcomm AI Engine Direct - Set llama io as quantized tensor #5383
Conversation
chunit-quic
commented
Sep 16, 2024
- Add a general function to tag io nodes to obtain/generate quantized tensors (see the sketch below)
- Add a quantizing io function to llama2.py
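For illustration, here is a minimal sketch of what such a general tagging helper could look like. `tag_quant_io` is the name visible in the diff further down; the `quantized_io_dtype` metadata key and the callback signature are assumptions made for this sketch, not the backend's actual API.

```python
# Minimal sketch (not the actual implementation): walk the graph and mark I/O
# nodes whose callback returns a quantized dtype, so a later lowering step can
# skip inserting quantize/dequantize ops around them.
from typing import Callable, Optional

import torch
from torch.fx import GraphModule, Node


def tag_quant_io(
    gm: GraphModule,
    get_quant_io_dtype_fn: Callable[[Node], Optional[torch.dtype]],
) -> None:
    for node in gm.graph.nodes:
        dtype = get_quant_io_dtype_fn(node)
        if dtype is not None:
            # Assumed metadata key; the backend would read this tag during lowering.
            node.meta["quantized_io_dtype"] = dtype
```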
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5383
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 57846da with merge base ca47839.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Hi @cccclai, just a gentle ping. Please have a look at this when you are available, thank you.
Sorry, I need to take a closer look at this one. My main concern is the change to export_llama_lib.py, and I'm trying to see how to make it more structured, with less backend-specific code there.
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Sorry, I have been really late on this. I was worried this commit would get into the beta release and cause confusion.
get_custom_quant_ios_dtype,
)

tag_quant_io(
Can it be part of the _transform() function?
I think if sharding is enabled, this function should be executed after model sharding, because it tags the tensors between the shards.
sharding_dtype=torch.uint16,
):
"""
This function is specific for llama inputs and outputs
I'm trying to understand why it is specific to llama inputs/outputs. Is it because of the sharding of the model? For example, the output of the first shard doesn't need a dequant node, and the input of the second shard doesn't need a quant node?
Let me try to make it clear.
This function cleans up the redundant Q/DQ nodes, such as those on the KV cache I/O and on the intermediate tensors between shards, as you mentioned.
In the original flow, we quantize the KV input and dequantize the KV output on every inference.
In fact, we don't need to do this: we can directly output the quantized KV cache and feed it back into the model for the next inference.
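To make this concrete, here is an illustrative callback of the kind the tagging pass could use. `get_custom_quant_ios_dtype` and `sharding_dtype=torch.uint16` appear in the diff above; the KV-cache dtype constant, the shape-based matching, and the `shard_boundary` marker are assumptions made only for this sketch.

```python
# Illustrative only: return a quantized dtype for nodes whose tensors should
# stay quantized (KV-cache I/O, tensors between shards), and None otherwise,
# so no Q/DQ ops are inserted for them on every inference.
from typing import Optional, Tuple

import torch
from torch.fx import Node

KV_CACHE_DTYPE = torch.uint8  # assumed quantized dtype of the KV cache


def get_custom_quant_ios_dtype(
    kv_cache_shape: Tuple[int, ...],
    node: Node,
    sharding_dtype: torch.dtype = torch.uint16,
) -> Optional[torch.dtype]:
    val = node.meta.get("val")
    # KV-cache inputs: keep them quantized so the cache produced by the previous
    # inference can be fed back in without a quantize op.
    if (
        node.op == "placeholder"
        and isinstance(val, torch.Tensor)
        and tuple(val.shape) == kv_cache_shape
    ):
        return KV_CACHE_DTYPE
    # Tensors flowing between shards can likewise stay in the quantized dtype
    # ("shard_boundary" is a hypothetical marker set by the sharding pass).
    if node.meta.get("shard_boundary"):
        return sharding_dtype
    return None
```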
Also, can we rebase this PR?
Force-pushed from 6b81d26 to 21f1745
Hi Chen, I just rebased the PR. If there is any unclear part, feel free to let me know. Thanks. :D
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi, it seems like 3 CIs are failing, likely due to the rebase issue, because …
- Add general function to tag io to obtain/generate quantized tensors - Add quantizing io function to llama2.py
Force-pushed from ebf32a2 to 57846da
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Hi @chunit-quic,
Maybe I'll let @chunit-quic and @shewu-quic align on this before merging?
Thanks for pointing it out. Basically this pass is not needed here; execution time and accuracy are the same without it. However, without it, tensors on the CPU side might require more memory (fp32). If that concerns us, we can simply add this pass to the passes list.
Thanks for checking. It looks OK to me. Let's merge.