🔴 Isolate prefill from generation loops #40652
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Yay, finally, can't wait to get this shipped 🎉
@zucchini-nlp look also at this beauty #40657, no more `if generation_mode=blabla` / `elif` / `elif` 💖 ✨ ✨
gante left a comment
Looks like it's going in the right direction, added a few comments :)
src/transformers/generation/utils.py
Outdated
```python
if is_prefill:
    outputs = self(**model_inputs, return_dict=True)
    is_prefill = False
# ...
if prefill_outputs is not None:
```
We can prepare the first round of inputs outside the decoding loop, which should result in a cleaner decoding loop.
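A minimal sketch of the suggested shape (stub callables standing in for the real prepare/forward/update helpers in `generation/utils.py`, not the actual transformers internals): the prefill forward pass runs once before the loop, so the loop body becomes a uniform single-step decode.

```python
# Hedged sketch of the reviewer's suggestion, with hypothetical stub callables.
def generate_with_hoisted_prefill(prepare_inputs, forward, update_kwargs, unfinished):
    # Prefill: first prepare + forward happen once, before the decoding loop.
    model_inputs = prepare_inputs()
    outputs = forward(model_inputs)       # full-prompt forward pass
    update_kwargs(outputs)

    # Decoding loop: every iteration is now an identical single-step decode.
    while unfinished():
        model_inputs = prepare_inputs()
        outputs = forward(model_inputs)   # one new token per step
        update_kwargs(outputs)
    return outputs
```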
It is not an easy task 😢:

- We can't move the prepare and forward calls to the bottom of the loop, since it is critical that `_has_unfinished_sequences` is checked just after updating `this_peer_finished`.
- Another reason for the prepare, forward, and update-kwargs calls to stay at the top is that DeepSpeed's `synced_gpus` mode needs to run only those calls over and over if this peer finished but the others didn't.
- With these constraints, we either need to unroll the whole first iteration of the loop or have some sort of `if` at the beginning of the loop that accounts for the prefill. I think the `if` is less complex.

So I have kept the existing approach with a boolean variable for readability (sketched below), WDYT?
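A hedged, self-contained sketch of the retained shape (stub helpers, not the real `_sample` code): the `is_prefill` flag keeps prepare/forward/update at the top of the loop, the finish check runs right after `this_peer_finished` is updated, and a finished peer under `synced_gpus` can keep replaying just the top of the loop.

```python
def decoding_loop_with_prefill_flag(prepare_inputs, forward, update_kwargs, step):
    this_peer_finished = False
    is_prefill = True
    # The loop condition stands in for transformers' `_has_unfinished_sequences`,
    # which with synced GPUs keeps looping until *all* peers are done.
    while not this_peer_finished:
        # prepare/forward/update stay at the top: DeepSpeed's synced_gpus mode
        # can re-run exactly these calls after this peer has finished.
        model_inputs = prepare_inputs(is_prefill)
        outputs = forward(model_inputs)  # full prompt on the 1st pass, 1 token after
        is_prefill = False
        update_kwargs(outputs)
        # `this_peer_finished` is updated here and checked immediately by the
        # loop condition, preserving the ordering constraint described above.
        this_peer_finished = step(outputs)
    return outputs
```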
Force-pushed from 5385359 to 44274e6
run-slow: gemma

This comment contains run-slow, running the specified jobs: models: ['models/gemma']
gante left a comment
Looks mostly good to me 👍
thanks @gante, last bits corrected!! looking much better :)
[For maintainers] Suggested jobs to run (before merge): run-slow: csm, dia, musicgen, rag
gante left a comment
LGTM, thank you for iterating 🤗
(but let me check slow tests before merging)
run-slow: csm, dia, musicgen, rag, bart, gpt2, llama
This comment contains models: ["models/bart", "models/csm", "models/dia", "models/gpt2", "models/llama", "models/musicgen", "models/rag"]
CI Results: ✅ No failing test specific to this PR 🎉!
thanks @gante all green! (you have to merge it yourself, I don't have permissions to merge anymore 😄)
* isolate-prefill: squash
* prefill inside decoding methods
* simplify autocompile helpers
As per the title, this isolates prefill from the individual decoding methods into a separate function.
The only exception is assisted generation, where there are no clear prefill/decoding steps: on each step, a batch of candidates with a variable number of new tokens goes through the model.
Breaking changes: the behavior of `_beam_search` and `_sample` changed for anyone subclassing them; a sketch of the new split follows.
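A hedged sketch (hypothetical names, not the exact new transformers API) of what the split means for subclassers: prefill now lives in its own step, and the decoding methods consume its outputs, so an override of `_sample` or `_beam_search` no longer owns the prefill forward pass.

```python
class GenerationMixinSketch:
    def _prefill(self, prompt):
        # Hypothetical isolated prefill: the full-prompt forward pass.
        return f"prefill({prompt})"

    def _sample(self, prefill_outputs, max_new_tokens):
        # Decoding only: subclasses overriding this no longer see the prefill.
        return [f"decode(step={i})" for i in range(max_new_tokens)]

    def generate(self, prompt, max_new_tokens=3):
        prefill_outputs = self._prefill(prompt)       # isolated prefill step
        return self._sample(prefill_outputs, max_new_tokens)
```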