
Conversation

@SamuelBarryCS
Contributor

@SamuelBarryCS SamuelBarryCS commented Sep 13, 2025

What

  • Fixes "Feature Request: Option to transfer logits to CPU during generation" #40794 by adding a parameter offload_logits_to_cpu to GenerationConfig that moves the logits and scores tensors to the CPU after generation.
  • Frees up memory during large runs, trading reduced vRAM usage for CPU/GPU transfer time, potentially enabling a larger batch size or sequence length during generation.
  • Adds the test tests.generation.test_utils.test_offload_logits_to_cpu to guard against regressions.
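The core idea can be sketched in plain Python (a toy stand-in for the real implementation: FakeTensor mimics torch.Tensor and its .to("cpu") device transfer; all names here are illustrative, not the PR's actual code):

```python
# Toy model of per-step score offloading during decoding.
# FakeTensor stands in for torch.Tensor; .to() mimics a device transfer.

class FakeTensor:
    def __init__(self, data, device="cuda"):
        self.data = data
        self.device = device

    def to(self, device):
        # Return a copy on the target device, like torch.Tensor.to()
        return FakeTensor(self.data, device)


def generate(num_steps, offload_logits_to_cpu=False):
    scores = []
    for step in range(num_steps):
        step_scores = FakeTensor([0.1 * step])   # produced on the GPU
        if offload_logits_to_cpu:
            step_scores = step_scores.to("cpu")  # free vRAM immediately
        scores.append(step_scores)
    return scores


on_gpu = generate(3)
on_cpu = generate(3, offload_logits_to_cpu=True)
print([t.device for t in on_gpu])  # ['cuda', 'cuda', 'cuda']
print([t.device for t in on_cpu])  # ['cpu', 'cpu', 'cpu']
```

The trade-off is exactly as described above: each step pays a small device-to-host copy, but accumulated scores no longer occupy GPU memory.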

How to review

  • Read diff
  • Check that the new test test_offload_logits_to_cpu is correct and that existing tests still pass

Testing performed

  • All existing tests still pass
  • tests.generation.test_utils.test_offload_logits_to_cpu passes as well:
(hf) samuel.barry@RNO:slurm-h100-reserved-rno-199-065:~/workspace/transformers(transfer-logits-to-cpu)$ python -m unittest tests.models.gpt2.test_modeling_gpt2.GPT2ModelTest.test_offload_logits_to_cpu -v
[2025-09-14 03:50:08,430] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
test_offload_logits_to_cpu (tests.models.gpt2.test_modeling_gpt2.GPT2ModelTest) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.252s

OK

Benchmark

  • Developed memory_test.py (to be deleted before merging) to showcase the impact
  • Results with GPT2-large and max_new_tokens=1000: ~50% reduction in additional peak memory usage for <2% time overhead.
Testing model: gpt2-large
Tokens to generate: 1000
Device: NVIDIA H100 80GB HBM3
Loading model...

Testing without CPU offloading...
Initial memory: 1512.4 MB
Peak memory: 1916.8 MB
Memory increase: 404.5 MB
Generation time: 10.08s
Tokens generated: 1000

Testing with CPU offloading...
Initial memory: 1544.4 MB
Peak memory: 1725.0 MB
Memory increase: 180.6 MB
Generation time: 9.87s
Tokens generated: 1000

Results:
Number of tokens generated: 1019
Memory saved: 223.9 MB
Memory reduction: 55.4%
Time overhead: -1.8%
Sequences match: True
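The headline memory figures follow directly from the two runs above; a quick sanity check of the arithmetic (the time-overhead figure in the log appears to be computed on a slightly different basis, so only the memory numbers are recomputed here):

```python
# Recompute the benchmark summary from the per-run measurements above.
no_offload_mb = 404.5  # memory increase without offloading
offload_mb = 180.6     # memory increase with offloading

saved_mb = no_offload_mb - offload_mb
reduction_pct = 100 * saved_mb / no_offload_mb

print(f"Memory saved: {saved_mb:.1f} MB")         # 223.9 MB
print(f"Memory reduction: {reduction_pct:.1f}%")  # 55.4%
```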

@SamuelBarryCS SamuelBarryCS changed the title from "[WIP] Transfer logits to CPU" to "Reduce vRAM usage by allowing transfer of generated logits to CPU" Sep 14, 2025
@SamuelBarryCS SamuelBarryCS changed the title from "Reduce vRAM usage by allowing transfer of generated logits to CPU" to "Reduce vRAM usage during generation by allowing to transfer logits to CPU" Sep 14, 2025
@SamuelBarryCS SamuelBarryCS marked this pull request as ready for review September 14, 2025 03:45
@SamuelBarryCS
Contributor Author

cc @gante @SunMarc for a quick review when you get the time! :)

@@ -0,0 +1,123 @@
"""
Contributor Author

@SamuelBarryCS SamuelBarryCS Sep 14, 2025


(This script will of course be deleted before merging)

@YunruiZhang

Thank you for the rapid response to the request! Since we’re at it, maybe it would be a good idea to do the same for output_attentions?

@SamuelBarryCS
Contributor Author

SamuelBarryCS commented Sep 15, 2025

> Thank you for the rapid response to the request! Since we’re at it, maybe it would be a good idea to do the same for output_attentions?

Very fair point @YunruiZhang
Done here d1691e5 :)
I feel like the parameter offload_logits_to_cpu isn't well named anymore. Maybe I should rename it to offload_output_to_cpu or something similar, wdyt?
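For reference, the same move-to-CPU pattern generalizes to any per-step generation output. A toy sketch of offloading both scores and attentions under one flag (illustrative names only, not the PR's actual code; "tensors" are plain dicts with a "device" key):

```python
# Toy sketch: one flag offloads every per-step generation output.
# _to_cpu mimics torch.Tensor.to("cpu") on a dict-based fake tensor.

def _to_cpu(t):
    return {**t, "device": "cpu"}

def collect_outputs(num_steps, offload_to_cpu=False):
    outputs = {"scores": [], "attentions": []}
    for step in range(num_steps):
        score = {"step": step, "device": "cuda"}
        attn = {"step": step, "device": "cuda"}
        if offload_to_cpu:
            score, attn = _to_cpu(score), _to_cpu(attn)
        outputs["scores"].append(score)
        outputs["attentions"].append(attn)
    return outputs

out = collect_outputs(2, offload_to_cpu=True)
print({k: [t["device"] for t in v] for k, v in out.items()})
# {'scores': ['cpu', 'cpu'], 'attentions': ['cpu', 'cpu']}
```

A single flag covering all outputs is what motivates a broader name such as offload_output_to_cpu.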

@ArthurZucker ArthurZucker requested review from gante and removed request for ArthurZucker and Rocketknight1 September 15, 2025 10:08
@ArthurZucker
Collaborator

cc @gante

Contributor

@gante gante left a comment


Please see my full reply here: #40794 (comment)

TL;DR the feature is desirable! But it will clash with an ongoing refactor, it will be much simpler if we add the feature after the refactor 💛



Development

Successfully merging this pull request may close these issues.

Feature Request: Option to transfer logits to CPU during generation

4 participants