Description
Feature request
Currently, in model.generate, the Transformers implementation keeps the logits for every generated token on the GPU until generation finishes, at which point they are returned as PyTorch tensors (one per generation step).
This design causes significant GPU memory usage, especially for long generations, since the logits for every token remain in GPU memory until the end. As a result, less GPU memory is available for the model itself, limiting batch size and sequence length whenever users want the logits returned.
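For context, a minimal reproduction of the current behavior might look like the following sketch (the model name, prompt, and generation length are arbitrary placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM behaves the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    output_scores=True,
    return_dict_in_generate=True,
)

# `outputs.scores` is a tuple with one (batch_size, vocab_size) tensor per
# generated token, and every one of these tensors stays on the GPU until
# generation has finished.
print(len(outputs.scores), outputs.scores[0].device)
```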
Proposed feature:
It would be very useful to add an option that transfers the logits to CPU at each step of generation (i.e., per token) and stores them as NumPy arrays (or CPU tensors). This would free GPU memory during generation while still letting users access the logits afterwards.
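Until such an option exists, one possible workaround is a custom LogitsProcessor that copies each step's scores to CPU and passes them through unchanged. This is only a sketch of the idea, not the proposed built-in option; the CPUOffloadScores class below is hypothetical, and depending on which other processors run before it, the captured values may already have warping (e.g. temperature) applied:

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class CPUOffloadScores(LogitsProcessor):
    """Hypothetical helper: keeps a CPU copy of each step's logits."""

    def __init__(self):
        self.cpu_scores = []

    def __call__(self, input_ids, scores):
        # Copy the per-step logits to CPU, then return the original tensor
        # unchanged so generation itself is unaffected.
        self.cpu_scores.append(scores.detach().to("cpu"))
        return scores

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")
inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

offload = CPUOffloadScores()
model.generate(
    **inputs,
    max_new_tokens=512,
    logits_processor=LogitsProcessorList([offload]),
)

# One CPU tensor per generated token; convert to NumPy if desired.
scores_np = torch.stack(offload.cpu_scores).numpy()
```

A built-in option would be cleaner than this workaround, since it could also cover the scores/logits already returned by output_scores and output_logits without the user wiring up a processor themselves.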
Motivation
When using model.generate with output_scores=True, the logits for all generated tokens are accumulated on the GPU until generation finishes. For long sequences or larger models, this quickly consumes a large portion of GPU memory, which limits batch size, sequence length, and overall usability.
Your contribution
I’m happy to help, but I’m not very familiar with the current codebase.