Conversation

@manueldeprada (Contributor) commented Sep 3, 2025

As per the title: isolate the prefill step from the particular decoding methods into a separate function.

The only exception is assisted generation, since it has no clear prefill/decoding split: on each step, a batch of candidates with a variable number of new tokens goes through the model.

Breaking change: the behavior of `_beam_search` and `_sample` changes for anyone subclassing them.
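
To make the pattern concrete, here is a minimal runnable sketch of the idea, using a toy model; names like `prefill` and `greedy_decode` are illustrative stand-ins, not the actual helpers this PR adds:

```python
# Minimal sketch of prefill isolation: one shared prefill step, reused by any
# decoding method. The model is a stub; all names here are illustrative and
# do not match the actual private helpers introduced by this PR.

def forward(tokens, cache=None):
    # Stand-in forward pass: returns a fake "next token" and the updated
    # KV cache (here just the running token list).
    cache = (cache or []) + list(tokens)
    return (sum(cache) + 1) % 100, cache

def prefill(prompt):
    # One forward pass over the whole prompt fills the cache and yields the
    # first generated token, independently of the decoding strategy.
    return forward(prompt)

def greedy_decode(prompt, max_new_tokens=5):
    next_token, cache = prefill(prompt)  # shared entry point for all methods
    generated = [next_token]
    for _ in range(max_new_tokens - 1):
        # The loop body now only ever handles single-token decode steps.
        next_token, cache = forward([next_token], cache)
        generated.append(next_token)
    return generated

print(greedy_decode([3, 1, 4]))  # [9, 18, 36, 72, 44]
```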

@HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) commented

Yay, finally, can't wait to get this shipped 🎉

@manueldeprada (Contributor, Author) commented

@zucchini-nlp also take a look at this beauty, #40657: no more `if generation_mode == ...` / `elif` / `elif` 💖 ✨ ✨
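
(For context, a hedged sketch of the pattern that change presumably moves toward: dispatching on the generation mode through a table instead of an `if`/`elif` chain. The names are illustrative, not the actual code of #40657.)

```python
# Hedged sketch: dispatch generation modes through a lookup table instead of
# an if/elif chain. Illustrative names only, not the code of #40657.
from enum import Enum

class GenerationMode(Enum):
    GREEDY = "greedy"
    SAMPLE = "sample"
    BEAM_SEARCH = "beam_search"

def _greedy(prompt): return f"greedy({prompt})"
def _sample(prompt): return f"sample({prompt})"
def _beam_search(prompt): return f"beam_search({prompt})"

DECODING_METHODS = {
    GenerationMode.GREEDY: _greedy,
    GenerationMode.SAMPLE: _sample,
    GenerationMode.BEAM_SEARCH: _beam_search,
}

def generate(prompt, mode):
    # One lookup replaces the `if generation_mode == ... / elif / elif` block.
    return DECODING_METHODS[mode](prompt)

print(generate("hello", GenerationMode.SAMPLE))  # sample(hello)
```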

@manueldeprada changed the title from "Isolate prefill" to "🔴 Isolate prefill from generation loops" on Sep 4, 2025
@gante (Contributor) left a comment

Looks like it's going in the right direction, added a few comments :)

```python
if is_prefill:
    outputs = self(**model_inputs, return_dict=True)
    is_prefill = False

if prefill_outputs is not None:
```
@gante (Contributor) commented on this diff:

we can prepare the 1st round of inputs outside the decoding loop, which should result in a cleaner decoding loop

@manueldeprada (Contributor, Author) replied:

It is not an easy task 😢:

  1. We can't move the prepare and forward calls to the bottom of the loop, since it is critical that `_has_unfinished_sequences` is checked right after `this_peer_finished` is updated.
  2. Another reason for the prepare, forward, and update-kwargs calls to sit at the top is that DeepSpeed's `synced_gpus` mode needs to keep running only that block, over and over, when this peer has finished but the others haven't.
  3. Given these constraints, we either unroll the whole first iteration of the loop or put some sort of `if` at the beginning of the loop that accounts for the prefill. I think the `if` is less complex.

So I have kept the existing approach with a boolean variable for readability (see the sketch below). WDYT?
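
A runnable toy of that approach, reusing the stub model from the sketch above; the boolean flag folds the prefill into the top of the loop. Names are stand-ins, not the actual `transformers` code:

```python
# Toy sketch of the kept approach: an `is_prefill` flag at the top of the
# loop. The loop condition is re-checked immediately after updating
# `this_peer_finished` (constraint 1), and the prepare/forward step stays at
# the top of the loop body, the part a synced-GPUs peer must keep running
# even after it has finished (constraint 2).

def forward(tokens, cache=None):
    cache = (cache or []) + list(tokens)
    return (sum(cache) + 1) % 100, cache

def decode(prompt, eos_token=72, max_new_tokens=10):
    is_prefill, cache, generated = True, None, []
    this_peer_finished = False
    while not this_peer_finished and len(generated) < max_new_tokens:
        if is_prefill:
            # Prefill folded into the first iteration instead of unrolling
            # the whole iteration above the loop.
            next_token, cache = forward(prompt, cache)
            is_prefill = False
        else:
            next_token, cache = forward([generated[-1]], cache)
        generated.append(next_token)
        this_peer_finished = next_token == eos_token
        # The loop condition runs right after `this_peer_finished` updates.
    return generated

print(decode([3, 1, 4]))  # [9, 18, 36, 72] with eos_token=72
```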

@manueldeprada manueldeprada requested a review from gante September 15, 2025 07:09
@manueldeprada (Contributor, Author) commented

run-slow: gemma

@github-actions (bot) commented

This comment contains run-slow, running the specified jobs:

models: ['models/gemma']
quantizations: [] ...

@gante (Contributor) left a comment

Looks mostly good to me 👍

@manueldeprada manueldeprada marked this pull request as ready for review November 4, 2025 14:41
@manueldeprada manueldeprada requested a review from gante November 4, 2025 14:41
@manueldeprada (Contributor, Author) commented

thanks @gante, last bits corrected!! Looking much better :)

@github-actions (bot) commented Nov 4, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: csm, dia, musicgen, rag

@gante (Contributor) left a comment

LGTM, thank you for iterating 🤗

(but let me check slow tests before merging)

@gante (Contributor) commented Nov 5, 2025

run-slow: csm, dia, musicgen, rag, bart, gpt2, llama

@github-actions (bot) commented Nov 5, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/bart", "models/csm", "models/dia", "models/gpt2", "models/llama", "models/musicgen", "models/rag"]
quantizations: []

@github-actions (bot) commented Nov 5, 2025

CI Results

Workflow Run ⚙️

✅ No failing test specific to this PR! 🎉

@manueldeprada (Contributor, Author) commented

thanks @gante, all green! (You have to merge it yourself, I don't have permissions to merge anymore 😄)

@gante merged commit 571352d into huggingface:main on Nov 5, 2025
25 checks passed
Abdennacer-Badaoui pushed a commit to Abdennacer-Badaoui/transformers that referenced this pull request Nov 10, 2025
* isolate-prefill: squash

* prefill inside decoding methods

* simplify autocompile helpers