
Conversation


@manueldeprada manueldeprada commented Sep 15, 2025

Each decoding method has a common block of output handling boilerplate that worsens readability:

```python
output_attentions = generation_config.output_attentions
output_hidden_states = generation_config.output_hidden_states
output_scores = generation_config.output_scores
output_logits = generation_config.output_logits
return_dict_in_generate = generation_config.return_dict_in_generate

# init attention / hidden states / scores tuples
scores = () if (return_dict_in_generate and output_scores) else None
raw_logits = () if (return_dict_in_generate and output_logits) else None
decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
cross_attentions = () if (return_dict_in_generate and output_attentions) else None
decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None

# if model is an encoder-decoder, retrieve encoder attention weights and hidden states
if return_dict_in_generate and self.config.is_encoder_decoder:
    encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
    encoder_hidden_states = (
        model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
    )

...

while not finished:
    # Store scores, attentions and hidden_states when required
    if return_dict_in_generate:
        if output_scores:
            scores += (next_token_scores,)
        if output_logits:
            raw_logits += (next_token_logits,)
        if output_attentions:
            decoder_attentions += (
                (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
            )
            if self.config.is_encoder_decoder:
                cross_attentions += (outputs.cross_attentions,)

        if output_hidden_states:
            decoder_hidden_states += (
                (outputs.decoder_hidden_states,)
                if self.config.is_encoder_decoder
                else (outputs.hidden_states,)
            )

...

if return_dict_in_generate:
    if self.config.is_encoder_decoder:
        return XXXEncoderDecoderOutput(
            sequences=input_ids,
            scores=scores,
            logits=raw_logits,
            encoder_attentions=encoder_attentions,
            encoder_hidden_states=encoder_hidden_states,
            decoder_attentions=decoder_attentions,
            cross_attentions=cross_attentions,
            decoder_hidden_states=decoder_hidden_states,
            past_key_values=model_kwargs.get("past_key_values"),
        )
    else:
        return XXXDecoderOnlyOutput(
            sequences=input_ids,
            scores=scores,
            logits=raw_logits,
            attentions=decoder_attentions,
            hidden_states=decoder_hidden_states,
            past_key_values=model_kwargs.get("past_key_values"),
        )
else:
    return input_ids
```

This PR moves that boilerplate into reusable `generate` helpers.

TODO: add a generalization so that users can request `output_x` and have `x` from the forward pass forwarded to the generation output.
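A minimal sketch of what such a shared helper could look like. All names here (`FakeGenerationConfig`, `init_generate_output`) are illustrative stand-ins, not the PR's actual API:

```python
# Hypothetical sketch of a reusable output-init helper; this is NOT the PR's
# real implementation, just the shape of the idea.
from dataclasses import dataclass


@dataclass
class FakeGenerationConfig:
    return_dict_in_generate: bool = True
    output_scores: bool = True
    output_logits: bool = False
    output_attentions: bool = False
    output_hidden_states: bool = False


def init_generate_output(generation_config, is_encoder_decoder=False):
    """Build the accumulator dict that every decoding loop can share."""
    if not generation_config.return_dict_in_generate:
        return None
    out = {
        "scores": () if generation_config.output_scores else None,
        "logits": () if generation_config.output_logits else None,
        "decoder_attentions": () if generation_config.output_attentions else None,
        "decoder_hidden_states": () if generation_config.output_hidden_states else None,
    }
    # cross-attentions only exist for encoder-decoder models
    if is_encoder_decoder and generation_config.output_attentions:
        out["cross_attentions"] = ()
    return out


acc = init_generate_output(FakeGenerationConfig())
print(acc)
```

Each decoding method would then call one helper instead of repeating the five-way `() if ... else None` dance.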


@manueldeprada manueldeprada requested a review from gante September 17, 2025 19:36
@manueldeprada

@gante this PR is more of an RFC to see what you think than a full PR. If you agree this simplifies `generate`, I will put in more work to make it clean for assisted generation and make the tests pass!

@manueldeprada

related: #39834

@gante (Contributor) left a comment:

Very much on board with this 👍 👍 👍

@manueldeprada (Contributor Author) commented Nov 4, 2025:

@gante This is ready for review; I left some comments in the code. The `_accumulate` method in this PR does not hardcode arg names (attentions, hidden states, etc.) but rather iterates over a general dict. This is less readable, but it prepares the ground for custom `output_xxx` outputs like `image_hidden_states` in a future PR (as suggested in #39834).

I would suggest merging this PR as-is, and then I can make a second PR that enables custom `output_xxx`. For that second PR, we need to agree on the best interface for users to specify which extra args from the model output they want. A big caveat also to be discussed: assisted generation needs to use `_split_model_outputs`, which might not work for special model outputs, so we might simply not support extra outputs in assisted generation. WDYT?
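The dict-driven accumulation idea can be sketched like this (the function name `accumulate` and the plain-string stand-in values are hypothetical; the PR's actual `_accumulate` may differ):

```python
# Illustrative sketch: append each step's values to every enabled (non-None)
# tuple, keyed by name, instead of hardcoding one branch per attribute.
def accumulate(generate_output, step_values):
    """Extend each requested output tuple with this decoding step's value."""
    for key, value in step_values.items():
        # a None entry means the user did not request that output
        if generate_output.get(key) is not None:
            generate_output[key] += (value,)
    return generate_output


state = {"scores": (), "attentions": None}
state = accumulate(state, {"scores": "step0_scores", "attentions": "step0_attn"})
state = accumulate(state, {"scores": "step1_scores", "attentions": "step1_attn"})
print(state)  # "attentions" stays None because it was never requested
```

Because the loop is generic over keys, supporting a new `output_xxx` later only means adding an entry to the dict, not another `if` branch in every decoding method.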

@manueldeprada manueldeprada marked this pull request as ready for review November 4, 2025 11:02
@manueldeprada manueldeprada requested a review from gante November 4, 2025 11:02
```python
output_attentions = generation_config.output_attentions
output_hidden_states = generation_config.output_hidden_states
output_scores = generation_config.output_scores
output_logits = generation_config.output_logits
```
@manueldeprada (Contributor Author) commented:
Since output_x comes from generation config, how do you suggest we enable extra generation outputs?

It could be an `output_features=['attentions', 'hidden_states', 'scores']` list, etc.
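A rough sketch of how such an `output_features` list could map onto the existing per-feature flags. This is purely illustrative; no such parameter exists in transformers today, and `resolve_output_flags` is a made-up name:

```python
# Hypothetical: translate an `output_features` list into the per-feature
# boolean flags that the decoding loops already consume.
def resolve_output_flags(output_features):
    known = ("attentions", "hidden_states", "scores", "logits")
    unknown = set(output_features) - set(known)
    if unknown:
        raise ValueError(f"Unknown output features: {sorted(unknown)}")
    return {f"output_{name}": name in output_features for name in known}


flags = resolve_output_flags(["attentions", "scores"])
print(flags)
```

One open design question is whether unknown names should error out (as above) or be passed through, so that model-specific outputs like `image_hidden_states` can be requested without a registry.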

"will be skipped."
)

if can_compile:
@manueldeprada (Contributor Author) commented Nov 6, 2025:
this was missing from #40652!! Just noticed while merging main here.

I added it so that it gets merged this week, @gante; otherwise I can push a separate fix!

@zucchini-nlp zucchini-nlp requested review from zucchini-nlp and removed request for gante November 10, 2025 13:07
@zucchini-nlp (Member) left a comment:

hey @manueldeprada, great job! I am happy to have a first step toward better generation output handling.

Do you think we can add the dynamic output dict in this PR, since we already started the refactor? Would be super cool to get rid of the near-duplicate code.

Comment on lines +3720 to +3733
```python
if not generation_config.return_dict_in_generate:
    return {"return_dict_in_generate": False, "next_scores": None}
output_attentions = generation_config.output_attentions
output_hidden_states = generation_config.output_hidden_states
output_scores = generation_config.output_scores
output_logits = generation_config.output_logits

next_scores = () if output_scores else None
next_logits = () if output_logits else None
decoder_attentions = () if output_attentions else None
cross_attentions = () if output_attentions and self.config.is_encoder_decoder else None
decoder_hidden_states = () if output_hidden_states else None

encoder_attentions = encoder_hidden_states = None
```

i think we have to push further and make it output any value dynamically, as requested by users. Currently the PR splits the existing logic out into its own fn, but that code is still very repetitive.

IMO we can check `model_outputs.keys()` and dynamically update our generation output dict with the keys for which `getattr(generation_config, f"output_{key}")` is set. Since all models follow standard naming in the output dict, it should have no edge cases.
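That suggestion can be sketched roughly as follows. `Cfg` and `keys_to_record` are illustrative names, not transformers API; the point is only that the recorded keys are discovered from the config at runtime instead of being hardcoded:

```python
# Sketch of the reviewer's idea: decide which model-output keys to record by
# probing `output_{key}` flags on the generation config.
class Cfg:
    output_attentions = True
    output_hidden_states = False


def keys_to_record(model_output_keys, generation_config):
    """Keep only the output keys whose `output_{key}` flag is set."""
    return [
        key
        for key in model_output_keys
        if getattr(generation_config, f"output_{key}", False)
    ]


print(keys_to_record(["attentions", "hidden_states", "logits"], Cfg()))
# → ['attentions']
```

With standard naming across models, a model-specific key like `image_hidden_states` would be picked up automatically once a user sets a matching `output_image_hidden_states` flag.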

Comment on lines +3782 to +3798
```python
if cur_len is not None:
    for arg in splittable_args:
        if generate_output.get(arg) is not None:
            kwargs[arg] = _split_model_outputs(
                kwargs[arg],
                cur_len,
                added_len,
                is_prefill_pass=len(generate_output[arg]) == 0,
                is_decoder_attention=(arg == "decoder_attentions"),
            )
    for arg in cropable_args:
        if generate_output[arg] is not None:
            kwargs[arg] = tuple(kwargs[arg][:, i, :] for i in range(added_len))
else:
    for arg in all_args:
        if generate_output.get(arg) is not None:
            kwargs[arg] = (kwargs[arg],)
```

hmm, this could be simplified, no? If we set `cur_len=1` as the default, then we can always try to split the output, and it will catch up depending on the length value.

Comment on lines +3839 to +3841
```python
if any(cache_key in model_kwargs for cache_key in ALL_CACHE_NAMES):
    cache_key = next(cache_key for cache_key in ALL_CACHE_NAMES if cache_key in model_kwargs)
    cache = model_kwargs[cache_key]
```

nit: with smth like `caches_in_kwargs := [cache_key for cache_key in ALL_CACHE_NAMES if cache_key in model_kwargs]` we can avoid looping twice
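A runnable sketch of that single-pass lookup; `ALL_CACHE_NAMES` and `model_kwargs` are stand-in values here, not the real transformers objects:

```python
# One pass instead of `any(...)` + `next(...)`: the walrus expression both
# filters the matching cache keys and serves as the truthiness check.
ALL_CACHE_NAMES = ["past_key_values", "cache_params"]  # stand-in list
model_kwargs = {"cache_params": "dummy-cache"}  # stand-in kwargs

cache = None
if caches_in_kwargs := [k for k in ALL_CACHE_NAMES if k in model_kwargs]:
    cache = model_kwargs[caches_in_kwargs[0]]

print(cache)  # → dummy-cache
```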

Comment on lines +3844 to +3866
```python
    return encoder_decoder_cls(
        sequences=sequences,
        scores=generate_output["next_scores"],
        logits=generate_output["next_logits"],
        encoder_attentions=generate_output["encoder_attentions"],
        encoder_hidden_states=generate_output["encoder_hidden_states"],
        decoder_attentions=generate_output["decoder_attentions"],
        cross_attentions=generate_output["cross_attentions"],
        decoder_hidden_states=generate_output["decoder_hidden_states"],
        past_key_values=cache,
        **kwargs,
    )
else:
    return decoder_only_cls(
        sequences=sequences,
        scores=generate_output["next_scores"],
        logits=generate_output["next_logits"],
        attentions=generate_output["decoder_attentions"],
        hidden_states=generate_output["decoder_hidden_states"],
        past_key_values=cache,
        **kwargs,
    )
```


maybe for the future: it would be great to not rely on an expected set of keys and instead unpack everything in `generate_output` into the output class. The `GenerationOutput` class would have to be able to hold anything for that.

Pseudo code like below:

```python
output_cls = encoder_decoder_cls if self.config.is_encoder_decoder else decoder_only_cls
return output_cls(sequences=sequences, past_key_values=cache, **generate_output)
```
