Update static attention IO manager to use "smart mask" style update #9843

facebook-github-bot merged 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9843

Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit ad4000b with merge base 1572381. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D72322014

Update static attention IO manager to use "smart mask" style update (pytorch#9843) Summary: Pull Request resolved: pytorch#9843 Differential Revision: D72322014
```diff
 * Update the internal data pointers using the cache updates returned by the
 * model. This length of each individual update cannot exceed the max update
-* length specified during the creation, and the total length cannot exceed
-* the context length.
+* length specified during creation, and the total length cannot exceed the
+* cache length.
```
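The invariant in the revised doc comment can be sketched as follows. This is a minimal, hypothetical illustration (the class and method names are made up, not the actual ExecuTorch API): each individual update is capped by the max update length fixed at creation, and the running total is capped by the fixed cache length.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical sketch of the invariant described in the doc comment:
// per-update length <= max update length chosen at creation, and the
// cumulative total <= the fixed cache length.
class CacheWindow {
 public:
  CacheWindow(size_t cache_len, size_t max_update_len)
      : cache_len_(cache_len), max_update_len_(max_update_len) {}

  // Apply one cache update of `update_len` tokens returned by the model.
  void update(size_t update_len) {
    if (update_len > max_update_len_) {
      throw std::invalid_argument("update exceeds max update length");
    }
    if (filled_ + update_len > cache_len_) {
      throw std::invalid_argument("total exceeds cache length");
    }
    filled_ += update_len;  // advance the write position in the cache
  }

  size_t filled() const { return filled_; }

 private:
  size_t cache_len_;
  size_t max_update_len_;
  size_t filled_ = 0;
};
```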
In the export llama script this was called `max_context_length` and `max_seq_length`.
This is on purpose. If you fix the max context length and want to share the same cache between multiple methods with different input lengths (e.g. prefill + decode), you force yourself to also have different cache lengths. That makes things complicated when you need to switch back and forth between prefill and decode, as the wearable team found out when trying to use QC's implementation.

Here the cache length is fixed, and you combine it with different input lengths.
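The point above can be illustrated with a rough sketch (made-up names and numbers, not the actual executorch implementation): one fixed-length cache is shared by methods with different input lengths, and in a "smart mask" style update the cache stays in place while the attention mask is updated to expose only the positions that have been filled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: a single fixed-length cache shared by
// prefill and decode, which differ only in input length. The mask
// records which cache positions hold valid entries.
struct SharedCache {
  explicit SharedCache(size_t cache_len)
      : mask(cache_len, false) {}  // false = masked (empty) position

  // Run one method invocation with the given input length: write that
  // many positions and unmask them so later steps can attend to them.
  void step(size_t input_len) {
    for (size_t i = 0; i < input_len && filled < mask.size(); ++i) {
      mask[filled++] = true;  // unmask the newly written cache slot
    }
  }

  std::vector<bool> mask;  // attention mask over the fixed-size cache
  size_t filled = 0;       // number of valid cache positions so far
};
```

For example, a prefill call with input length 4 followed by single-token decode calls all operate on the same cache-length-sized mask; only the input length changes per method.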
Yeah, I understand. I'm just highlighting the nomenclature in export_llama, not questioning why it is done this way.
Differential Revision: D72322014 Pull Request resolved: #9843