Inference support for T5 and FLAN-T5 model families #8141
In my case, a prompt consists of a static part, which is unchanged and makes use of the KV cache, and a dynamic part, which changes frequently. This works well with GPT-style models, where I can call `llama_kv_cache_seq_rm` to clean up the dynamic part of the KV cache and start evaluating again. Would a similar approach work with T5? In other words, what degree of control is there over the encoder output? Thank you.
@vladfaust No, the encoder requires all input tokens to be present in the input batch. That's because attention in the encoder is not causal, so each token in the input sequence attends to every other token in the sequence. The encoder doesn't even use a KV cache, because there's no need for one.
I guess it would theoretically be possible to allow "adding" tokens to the encoder output by calling `llama_encode()` multiple times, but the implementation would be much more complicated and is definitely outside the scope of this PR.
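For reference, the intended encoder-decoder flow looks roughly like the sketch below (greedy sampling, single sequence, no error handling; it assumes the `llama_encode()` and `llama_model_decoder_start_token()` calls from this PR, and `make_batch` is just an illustrative helper):

```cpp
#include "llama.h"
#include <vector>

// Build a single-sequence batch; request logits only for the last token if asked.
static llama_batch make_batch(const std::vector<llama_token> & tokens, int pos0, bool want_logits) {
    llama_batch batch = llama_batch_init((int32_t) tokens.size(), 0, 1);
    for (size_t i = 0; i < tokens.size(); ++i) {
        batch.token[i]     = tokens[i];
        batch.pos[i]       = pos0 + (int32_t) i;
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = want_logits && i == tokens.size() - 1;
    }
    batch.n_tokens = (int32_t) tokens.size();
    return batch;
}

static void t5_generate(llama_model * model, llama_context * ctx,
                        const std::vector<llama_token> & input, int max_new_tokens) {
    // 1. Encode: the WHOLE input has to go into a single llama_encode() call,
    //    because encoder attention is bidirectional and there is no encoder KV cache.
    llama_batch enc = make_batch(input, 0, false);
    llama_encode(ctx, enc);
    llama_batch_free(enc);

    // 2. Decode autoregressively, starting from the decoder start token
    //    (falling back to BOS if the model doesn't define one).
    llama_token tok = llama_model_decoder_start_token(model);
    if (tok == -1) {
        tok = llama_token_bos(model);
    }
    const int n_vocab = llama_n_vocab(model);

    for (int i = 0; i < max_new_tokens; ++i) {
        std::vector<llama_token> one = { tok };
        llama_batch dec = make_batch(one, i, true);
        llama_decode(ctx, dec);
        llama_batch_free(dec);

        // greedy pick of the next token
        const float * logits = llama_get_logits_ith(ctx, 0);
        tok = 0;
        for (int v = 1; v < n_vocab; ++v) {
            if (logits[v] > logits[tok]) tok = v;
        }
        if (tok == llama_token_eos(model)) break;
        // ... detokenize/print `tok` here ...
    }
}
```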
Just to clarify, @fairydreaming: one of my use cases is converting a growing chat history into some structured representation for each new message. Do I understand correctly that, for now, I'd have to re-encode the whole history for every inference, without any form of caching? (No offence meant, obviously, as I'm very grateful for the T5 support at all!)
@vladfaust Yes, there's no caching in the encoder, so if the input sequence grows even by one token you have to encode it again, and in the process all previous computations for that token sequence are repeated.
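In other words, the chat-history scenario ends up looking like the sketch below, where total encoder work grows quadratically with the number of turns (`tokenize_text()` and `encode_and_generate()` are stand-ins for `llama_tokenize()` plus the encode/decode flow sketched earlier, not llama.cpp API):

```cpp
#include "llama.h"
#include <string>
#include <vector>

// stand-ins for tokenization and the encode/decode flow sketched above
std::vector<llama_token> tokenize_text(llama_context * ctx, const std::string & text);
void encode_and_generate(llama_context * ctx, const std::vector<llama_token> & input);

static void run_chat(llama_context * ctx, const std::vector<std::string> & messages) {
    std::string history;
    for (const std::string & msg : messages) {
        history += msg + "\n";                          // history grows each turn
        auto toks = tokenize_text(ctx, history);
        encode_and_generate(ctx, toks);                 // re-encodes ALL of `history`
    }
}
```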