Conversation

@zucchini-nlp
Member

What does this PR do?

Fixes #41863 and fixes #40910

We have always had an imperfect way to infer whether we're in the prefill or decoding stage, which has caused us many bugs in the past. The most reliable way is to check the cache position values, but that is not compile-compatible and also has an edge case.

Recently Manuel merged a PR that split prefill into its own function, so we can now benefit from it and know with 100% certainty which stage we're in. This PR adds an is_prefill flag to generation input preparation and replaces the existing inference logic with the flag.

It also adds a test case for the issues linked above.
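To illustrate the idea (a minimal sketch, not the actual transformers code; the function names and signatures here are hypothetical): the old approach inferred the stage from cache position values, which has an edge case and is data-dependent, while the new approach has the caller pass is_prefill explicitly, so multimodal inputs like pixel_values are only forwarded during prefill.

```python
# Hypothetical sketch of the two approaches; names and signatures are
# illustrative, not the real transformers API.

def stage_from_cache_position(cache_position):
    # Old heuristic: assume prefill when the first cache position is 0.
    # Data-dependent branches like this are problematic for compilation,
    # and the heuristic has edge cases.
    return "prefill" if cache_position[0] == 0 else "decode"

def prepare_inputs(input_ids, pixel_values=None, is_prefill=False):
    # New approach: the caller (the separated-out prefill step) passes the
    # stage explicitly. Multimodal inputs are only forwarded on prefill;
    # on decode, only the newest token is fed to the model.
    inputs = {"input_ids": input_ids if is_prefill else input_ids[-1:]}
    if is_prefill and pixel_values is not None:
        inputs["pixel_values"] = pixel_values
    return inputs
```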

@github-actions
Contributor

github-actions bot commented Nov 7, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, aya_vision, chameleon, clvp, cohere2_vision, deepseek_vl, deepseek_vl_hybrid, emu3, florence2, fuyu, gemma3, gemma3n, glm4v, glm4v_moe, got_ocr2, granite_speech

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

Another worm of cans, assisted decoding has no prefill separated out and is causing issues now 😢

@manueldeprada
Contributor

> Another worm of cans, assisted decoding has no prefill separated out and is causing issues now 😢

worm of cans?? 🤣 haha love it

Sooo this already arose on my PR. The main gist is that assisted generation does not prefill with the prompt tokens, but waits for the first batch of candidates and then prefills. Thus, we could not apply the standard prefill. But surely assisted_gen can pass the prefill flag on the first call, or we could maybe call _prefill with the first batch of candidates.

@zucchini-nlp
Member Author

> assisted_gen can pass the prefill flag on the first call

yeah, this seemed to be the easiest option. The only issue with VLMs is that we should not be passing certain inputs (pixels/etc) after the prefill phase. But with the assistant model calling generate() many times internally, we end up with several "prefill" phases
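A minimal sketch of the problem being described (purely illustrative, not the actual transformers code): if every internal generate() call naively flagged its first forward pass as prefill, pixel inputs would be re-sent on each candidate batch, whereas only the very first call of the outer assisted-decoding loop should prefill.

```python
# Hypothetical sketch: tracking prefill across repeated internal generate()
# calls in an assisted-decoding loop. Names are illustrative.

def assisted_loop(candidate_batches, pixel_values):
    """Return the list of pixel inputs actually sent to the model."""
    sent_pixels = []
    prefilled = False
    for batch in candidate_batches:
        # Correct rule: only the first call of the outer loop is a prefill.
        # A naive per-call rule (is_prefill=True on every generate() call)
        # would append pixel_values once per candidate batch instead.
        is_prefill = not prefilled
        prefilled = True
        if is_prefill and pixel_values is not None:
            sent_pixels.append(pixel_values)
    return sent_pixels
```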
