Update static attention IO manager to use "smart mask" style update #9843
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9843
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ad4000b with merge base 1572381.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D72322014
…ytorch#9843) Summary: Pull Request resolved: pytorch#9843 Differential Revision: D72322014
  * Update the internal data pointers using the cache updates returned by the
  * model. The length of each individual update cannot exceed the max update
- * length specified during the creation, and the total length cannot exceed
- * the context length.
+ * length specified during creation, and the total length cannot exceed the
+ * cache length.
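A minimal sketch of the "smart mask" style update the comment above describes, under my own assumptions rather than the actual StaticAttentionIOManager API: cached entries are never shifted; each update is copied in at the current write position and the attention mask is simply unmasked for those positions. The names SmartMaskCache, update(), cache_len, max_update_len and head_dim are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

class SmartMaskCache {
 public:
  SmartMaskCache(size_t cache_len, size_t max_update_len, size_t head_dim)
      : cache_len_(cache_len),
        max_update_len_(max_update_len),
        head_dim_(head_dim),
        cache_(cache_len * head_dim, 0.0f),
        mask_(cache_len, true),  // true = masked out (position not yet valid)
        pos_(0) {}

  // Copy `update_len` new cache vectors returned by the model into the cache.
  // Each individual update is bounded by the max update length, and the total
  // number of cached positions is bounded by the cache length.
  void update(const float* update, size_t update_len) {
    if (update_len > max_update_len_) {
      throw std::invalid_argument("update exceeds max update length");
    }
    if (pos_ + update_len > cache_len_) {
      throw std::invalid_argument("total length exceeds cache length");
    }
    // "Smart mask" style: no shifting of existing cache data; write in place
    // at the current position and flip the mask for the new entries.
    std::memcpy(
        cache_.data() + pos_ * head_dim_,
        update,
        update_len * head_dim_ * sizeof(float));
    for (size_t i = 0; i < update_len; ++i) {
      mask_[pos_ + i] = false;  // position is now attendable
    }
    pos_ += update_len;
  }

 private:
  size_t cache_len_;
  size_t max_update_len_;
  size_t head_dim_;
  std::vector<float> cache_;
  std::vector<bool> mask_;
  size_t pos_;
};
```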
In the export_llama script this was called max_context_length and max_seq_length.
This is on purpose. If you fix the max context length and want to share the same cache between multiple methods with different input lengths (e.g. prefill + decode), you force yourself to also have different cache lengths. That makes things complicated when you need to switch back and forth between prefill and decode, as the wearable team found out when trying to use QC's implementation.
Here the cache length is fixed, and you combine it with different input lengths.
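A small sketch of the layout described above, assuming each method's attention mask covers its own inputs plus the shared fixed-length cache (mask shape roughly [input_len, cache_len + input_len]). The struct and the numbers are illustrative, not taken from the PR.

```cpp
#include <cstddef>
#include <cstdio>

struct MethodConfig {
  const char* name;
  size_t input_len;  // varies per method (prefill vs decode)
  size_t cache_len;  // fixed and shared across methods
};

int main() {
  // Both methods reuse the same cache_len, so the same cache buffers can be
  // shared when switching back and forth between prefill and decode.
  const MethodConfig methods[] = {
      {"prefill", 128, 1024},
      {"decode", 1, 1024},
  };
  for (const auto& m : methods) {
    size_t mask_cols = m.cache_len + m.input_len;
    std::printf(
        "%s: mask is %zu x %zu, cache holds %zu positions\n",
        m.name, m.input_len, mask_cols, m.cache_len);
  }
  return 0;
}
```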
Yeah, I understand. I am just highlighting the nomenclature in export_llama, not questioning why it is done this way.
Differential Revision: D72322014 Pull Request resolved: #9843