Conversation

@zucchini-nlp
Member

What does this PR do?

Fixes #41863 and fixes #40910

We have always had an imperfect way to infer whether we're in the prefill or decoding stage, which has caused us many bugs in the past. The most reliable way is to check the cache position values, but that is not compile-compatible and also has an edge case.

Recently Manuel merged a PR that split prefill into its own function, so we can now benefit from it and know with 100% certainty which stage we're in. This PR adds an is_prefill flag to generation input preparation and replaces the existing inference logic with the flag.

It also adds a test case for the issues linked above.
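To illustrate the idea (a minimal sketch, not the actual transformers code; the function names and signatures here are hypothetical): the old approach inferred the stage from cache position values, which has an edge case and is data-dependent, while the new approach has the caller pass is_prefill explicitly, so multimodal inputs like pixel_values are only forwarded during prefill.

```python
# Hypothetical sketch of the two approaches; names and signatures are
# illustrative, not the real transformers API.

def stage_from_cache_position(cache_position):
    # Old heuristic: assume prefill when the first cache position is 0.
    # Data-dependent branches like this are problematic for compilation,
    # and the heuristic has edge cases.
    return "prefill" if cache_position[0] == 0 else "decode"

def prepare_inputs(input_ids, pixel_values=None, is_prefill=False):
    # New approach: the caller (the separated-out prefill step) passes the
    # stage explicitly. Multimodal inputs are only forwarded on prefill;
    # on decode, only the newest token is fed to the model.
    inputs = {"input_ids": input_ids if is_prefill else input_ids[-1:]}
    if is_prefill and pixel_values is not None:
        inputs["pixel_values"] = pixel_values
    return inputs
```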

@github-actions
Contributor

github-actions bot commented Nov 7, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, aya_vision, chameleon, clvp, cohere2_vision, deepseek_vl, deepseek_vl_hybrid, emu3, florence2, fuyu, gemma3, gemma3n, glm4v, glm4v_moe, got_ocr2, granite_speech

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

Another worm of cans, assisted decoding has no prefill separated out and is causing issues now 😢

@manueldeprada
Contributor

> Another worm of cans, assisted decoding has no prefill separated out and is causing issues now 😢

worm of cans?? 🤣 haha love it

Sooo this already arose on my PR. The main gist is that assisted generation does not prefill with the prompt tokens, but waits for the first batch of candidates and then prefills. Thus, we could not apply the standard prefill. But surely assisted_gen can pass the prefill flag on the first call, or we could maybe call _prefill with the first batch of candidates.

@zucchini-nlp
Member Author

> assisted_gen can pass the prefill flag on the first call

yeah, this seemed to be the easiest option. The only issue with VLMs is that we should not be passing certain inputs (pixels/etc) after the prefill phase. But with the assistant model calling generate() many times internally, we end up with several "prefill" phases
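A minimal sketch of the problem being described (purely illustrative, not the actual transformers code): if every internal generate() call naively flagged its first forward pass as prefill, pixel inputs would be re-sent on each candidate batch, whereas only the very first call of the outer assisted-decoding loop should prefill.

```python
# Hypothetical sketch: tracking prefill across repeated internal generate()
# calls in an assisted-decoding loop. Names are illustrative.

def assisted_loop(candidate_batches, pixel_values):
    """Return the list of pixel inputs actually sent to the model."""
    sent_pixels = []
    prefilled = False
    for batch in candidate_batches:
        # Correct rule: only the first call of the outer loop is a prefill.
        # A naive per-call rule (is_prefill=True on every generate() call)
        # would append pixel_values once per candidate batch instead.
        is_prefill = not prefilled
        prefilled = True
        if is_prefill and pixel_values is not None:
            sent_pixels.append(pixel_values)
    return sent_pixels
```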
