
Qualcomm AI Engine Direct - Set llama io as quantized tensor #5383


Merged: 2 commits into pytorch:main, Oct 28, 2024

Conversation

chunit-quic (Collaborator)

  • Add a general function to tag I/O nodes that obtain/generate quantized tensors (a minimal sketch of the idea follows below)
  • Add a quantize-I/O function to llama2.py
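A minimal sketch of what such a general tagging helper can look like (the helper name, meta key, and callback shape here are illustrative assumptions, not the PR's actual identifiers): walk the FX graph and record the desired quantized dtype in node.meta for every node a caller-supplied callback selects, so the backend can later skip inserting Q/DQ ops around those tensors.

```python
# Illustrative sketch only -- not the PR's actual implementation.
from typing import Callable, Optional

import torch
from torch.fx import GraphModule, Node

QUANT_IO_KEY = "quant_io_dtype"  # hypothetical meta key


def tag_quant_io_sketch(
    gm: GraphModule,
    get_quant_io_dtype: Callable[[Node], Optional[torch.dtype]],
) -> None:
    """Tag graph I/O nodes that should stay quantized at runtime.

    `get_quant_io_dtype` returns the quantized dtype to keep for a node,
    or None if the node should go through the normal Q/DQ flow.
    """
    for node in gm.graph.nodes:
        dtype = get_quant_io_dtype(node)
        if dtype is not None:
            node.meta[QUANT_IO_KEY] = dtype
```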

pytorch-bot (bot) commented Sep 16, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5383

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 57846da with merge base ca47839:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Sep 16, 2024
@haowhsu-quic (Collaborator)

Hi @cccclai, just a gentle ping. Please have a look at this when you are available, thank you.

@cccclai (Contributor) commented Sep 27, 2024

Sorry, I need to take a closer look at this one. My main concern is the change to export_llama_lib.py; I'm trying to see how to make it more structured, with less backend-specific code there.

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@cccclai (Contributor) left a comment

Sorry, I have been really late on this. I was worried this commit would get into the beta release and cause confusion.

Review comment on this excerpt from the diff:

```python
    get_custom_quant_ios_dtype,
)

tag_quant_io(
```
@cccclai (Contributor):

Can it be part of the _transform() function?

Reply (Collaborator):

I think if sharding is enabled, this function should be executed after model sharding, because it tags the tensors between the shards (see the toy ordering sketch below).
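A toy sketch of that ordering concern, reusing tag_quant_io_sketch from the earlier sketch (split_into_shards is a hypothetical stand-in for the real sharding pass): sharding must run first, so that shard-boundary tensors exist as shard inputs/outputs before tagging happens.

```python
# Toy ordering sketch; split_into_shards stands in for the real sharding pass.
from typing import List

import torch
from torch.fx import GraphModule


def split_into_shards(gm: GraphModule, num_shards: int) -> List[GraphModule]:
    # Placeholder: a real pass would cut the graph at layer boundaries,
    # turning intermediate tensors into shard inputs/outputs.
    return [gm]


def shard_then_tag(gm: GraphModule, num_shards: int) -> List[GraphModule]:
    shards = split_into_shards(gm, num_shards)  # 1. shard first...
    for shard in shards:
        # 2. ...then tag, so the tensors that now sit at shard boundaries
        # (each shard's placeholders/outputs) can be tagged as well.
        tag_quant_io_sketch(
            shard,
            lambda n: torch.uint16 if n.op in ("placeholder", "output") else None,
        )
    return shards
```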

Review comment on this excerpt from the diff:

```python
    sharding_dtype=torch.uint16,
):
    """
    This function is specific for llama inputs and outputs
```
@cccclai (Contributor):

I'm trying to understand why it is specific to llama inputs/outputs. Is it because of the sharding of the model? Like, the output of the first shard doesn't need a dequant node and the input of the second shard doesn't need a quant node?

Reply (Collaborator):

Let me try to make it clear. This function cleans up the redundant Q/DQ nodes, such as those on the KV cache I/O and on the intermediate tensors between the shards, as you mentioned. In the original flow, we quantize the KV input and dequantize the KV output on every inference. In fact, we don't need to do this: we can directly output the quantized KV and feed it into the model for the next inference. A sketch of the idea follows below.
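A hedged sketch of that idea as a selection callback for the earlier tag_quant_io_sketch helper (the "cache" name match and the uint8 dtype are illustrative assumptions; the real change drives this from the model's known I/O layout): KV-cache inputs and outputs report a quantized dtype, so the backend can elide the per-inference Q/DQ round trip.

```python
# Hypothetical predicate for the sketch helper above.
from typing import Optional

import torch
from torch.fx import Node


def kv_cache_quant_io_dtype(node: Node) -> Optional[torch.dtype]:
    # Keep KV-cache I/O in its quantized dtype end to end: each step then
    # returns quantized KV and feeds it straight into the next step,
    # with no dequantize/quantize round trip in between.
    if node.op in ("placeholder", "output") and "cache" in node.name:
        return torch.uint8  # assumed quantized KV dtype
    return None
```

Tagging would then be a single call such as tag_quant_io_sketch(gm, kv_cache_quant_io_dtype).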

@cccclai (Contributor) commented Oct 16, 2024

Also, can we rebase this PR?

@chunit-quic force-pushed the dev1/chunit/quantize_llama_ios branch from 6b81d26 to 21f1745 on October 21, 2024 03:06
@chunit-quic (Collaborator, Author)

> Also can we rebase this PR?

Hi Chen, I just rebased the PR. If there is any unclear part, feel free to let me know. Thanks. :D

@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai (Contributor) commented Oct 21, 2024

Hi, it seems like 3 CIs are failing, likely due to a rebase issue, because test_eval_llama_wikitext.sh was landed on Oct 18. Do you mind rebasing again? Sorry for the inconvenience; it's mostly because this change touches the export_llama script and I want to be safe.

Joey Tsai added 2 commits October 22, 2024 08:38
- Add general function to tag io obtain/generate quantized tensor
- Add quantizing io function to llama2.py
@chunit-quic force-pushed the dev1/chunit/quantize_llama_ios branch from ebf32a2 to 57846da on October 22, 2024 00:39
@facebook-github-bot (Contributor)

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@shewu-quic (Collaborator) commented Oct 24, 2024

Hi @chunit-quic,
I have a question regarding the meta['spec'] of the delegate node. Previously, we used to run BuildQuantIO for to_executorch in static_llama to ensure the correct tensor size at runtime. Is this no longer necessary here?

@cccclai (Contributor) commented Oct 25, 2024

Maybe I'll let @chunit-quic and @shewu-quic align on this before merging?

@chunit-quic (Collaborator, Author)

> Hi @chunit-quic, I have a question regarding the meta['spec'] of the delegate node. Previously, we used to run [BuildQuantIO] …

> Maybe I'll let @chunit-quic and @shewu-quic align on this before merging?

Thanks for pointing that out. Basically, this pass is not needed here: execution time and accuracy are the same without it. However, it might require a bigger memory size (fp32) for tensors on the CPU side. If that concerns us, simply add this pass to the passes list, as in the sketch below.
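A hedged sketch of that suggestion (the import path and spelling of the pass are assumptions; check the QNN backend's passes module for the actual ones, and edge_prog stands for an EdgeProgramManager produced by the earlier lowering step):

```python
from executorch.exir import ExecutorchBackendConfig

# Assumed import path/spelling for the BuildQuantIO pass discussed above.
from executorch.backends.qualcomm.passes.build_quant_io import BuildQuantIo

# `edge_prog` is an EdgeProgramManager from the earlier to_edge/lowering step.
exec_prog = edge_prog.to_executorch(
    ExecutorchBackendConfig(
        # Rewrite the delegate node's tensor specs so CPU-side buffers are
        # allocated with the quantized dtype instead of fp32.
        passes=[BuildQuantIo()],
    )
)
```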

@shewu-quic (Collaborator)

Thanks for your check. It looks OK to me. Let's merge.

@cccclai merged commit b2f73a3 into pytorch:main on Oct 28, 2024 (45 checks passed)