Description
Feature request
Currently, in model.generate, the Transformers implementation keeps the logits for every generated token on the GPU until generation finishes, at which point they are returned as PyTorch tensors (one per generation step).
This design causes significant GPU memory usage, especially for long generations, since the logits for every token remain in GPU memory until the end. As a result, less GPU memory is available for the model itself, limiting batch size and sequence length whenever users want the logits returned.
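For context, a minimal reproduction of the current behavior might look like the following sketch (the model name, prompt, and generation length are arbitrary placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM behaves the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    output_scores=True,
    return_dict_in_generate=True,
)

# `outputs.scores` is a tuple with one (batch_size, vocab_size) tensor per
# generated token, and every one of these tensors stays on the GPU until
# generation has finished.
print(len(outputs.scores), outputs.scores[0].device)
```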
Proposed feature:
It would be very useful to add an option that transfers the logits to CPU at each step of generation (i.e., per token) and stores them as NumPy arrays (or CPU tensors). This would free GPU memory during generation while still letting users access the logits afterwards.
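Until such an option exists, one possible workaround is a custom LogitsProcessor that copies each step's scores to CPU and passes them through unchanged. This is only a sketch of the idea, not the proposed built-in option; the CPUOffloadScores class below is hypothetical, and depending on which other processors run before it, the captured values may already have warping (e.g. temperature) applied:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class CPUOffloadScores(LogitsProcessor):
    """Hypothetical helper: keeps a CPU copy of each step's logits."""

    def __init__(self):
        self.cpu_scores = []

    def __call__(self, input_ids, scores):
        # Copy the per-step logits to CPU, then return the original tensor
        # unchanged so generation itself is unaffected.
        self.cpu_scores.append(scores.detach().to("cpu"))
        return scores

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

offload = CPUOffloadScores()
model.generate(
    **inputs,
    max_new_tokens=512,
    logits_processor=LogitsProcessorList([offload]),
)

# One CPU tensor per generated token; convert to NumPy if desired.
scores_np = torch.stack(offload.cpu_scores).numpy()
```

A built-in option would be cleaner than this workaround, since it could also cover the scores/logits already returned by output_scores and output_logits without the user wiring up a processor themselves.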
Motivation
When using model.generate with output_scores=True, the logits for all generated tokens are accumulated on the GPU until generation finishes. For long sequences or larger models, this quickly consumes a large portion of GPU memory, which limits batch size, sequence length, and overall usability.
Your contribution
I’m happy to help, but I’m not very familiar with the current codebase.