
Feature Request: Option to transfer logits to CPU during generation #40794

@YunruiZhang


Feature request

Currently, in model.generate, the Transformers implementation keeps all logits on GPU until generation finishes, at which point they are returned as a tuple of PyTorch tensors.

This design causes significant GPU memory usage, especially for long generations, since the logits for every generated token must remain in GPU memory until the end. As a result, usable GPU memory shrinks, limiting batch size and sequence length whenever users want logits returned.

Proposed feature:
It would be very useful to add an option that transfers the logits to CPU at each step of generation (e.g., per token), and stores them as NumPy arrays (or CPU tensors). This would free up GPU memory during generation while still allowing users to access the logits afterwards.
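Until such an option exists, one possible workaround is a custom `LogitsProcessor` that offloads each step's scores to CPU as a side effect and passes them through unchanged. This is only a minimal sketch: the class name `CPUOffloadRecorder`, the model name, and the token count are illustrative, and note that a processor sees the scores after any earlier processors in the list, not necessarily the raw logits:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class CPUOffloadRecorder(LogitsProcessor):
    """Copies each step's scores to CPU instead of letting them accumulate on GPU."""

    def __init__(self):
        self.cpu_scores = []

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Copy the (batch, vocab_size) scores to CPU at this step,
        # then return them unchanged so decoding is unaffected.
        self.cpu_scores.append(scores.to("cpu"))
        return scores

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")
recorder = CPUOffloadRecorder()

# output_scores is deliberately left off so generate() does not
# also accumulate the per-step scores on GPU.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    logits_processor=LogitsProcessorList([recorder]),
)

# (steps, batch, vocab_size) tensor on CPU; call .numpy() if NumPy arrays are preferred.
logits = torch.stack(recorder.cpu_scores)
```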

Motivation

When using model.generate with output_scores=True, the logits for all generated tokens are accumulated on GPU until the generation finishes. For long sequences or larger models, this quickly consumes a large portion of GPU memory, which limits batch size, sequence length, and overall usability.
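For reference, a minimal reproduction of the pattern in question (the model name and token count are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    return_dict_in_generate=True,
    output_scores=True,  # per-step logits are accumulated on GPU
)

# outputs.scores is a tuple with one (batch, vocab_size) tensor per generated
# token; every element stays on the GPU until generation ends.
print(len(outputs.scores), outputs.scores[0].device)
```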

Your contribution

I’m happy to help, but I’m not very familiar with the current codebase.
