
Conversation

@SamuelBarryCS
Contributor

@SamuelBarryCS SamuelBarryCS commented Sep 13, 2025

What

  • Fixes "Feature Request: Option to transfer logits to CPU during generation" #40794 by adding a parameter offload_logits_to_cpu to GenerationConfig that moves the logits and scores tensors to the CPU after generation.
  • Frees up memory during large runs, trading reduced vRAM usage for CPU/GPU transfer time, potentially enabling a larger batch size or sequence length during generation.
  • Adds the test tests.generation.test_utils.test_offload_logits_to_cpu to guard against regressions.
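The core idea can be sketched in plain Python (a toy stand-in for the real implementation: FakeTensor mimics torch.Tensor and its .to("cpu") device transfer; all names here are illustrative, not the PR's actual code):

```python
# Toy model of per-step score offloading during decoding.
# FakeTensor stands in for torch.Tensor; .to() mimics a device transfer.

class FakeTensor:
    def __init__(self, data, device="cuda"):
        self.data = data
        self.device = device

    def to(self, device):
        # Return a copy on the target device, like torch.Tensor.to()
        return FakeTensor(self.data, device)


def generate(num_steps, offload_logits_to_cpu=False):
    scores = []
    for step in range(num_steps):
        step_scores = FakeTensor([0.1 * step])   # produced on the GPU
        if offload_logits_to_cpu:
            step_scores = step_scores.to("cpu")  # free vRAM immediately
        scores.append(step_scores)
    return scores


on_gpu = generate(3)
on_cpu = generate(3, offload_logits_to_cpu=True)
print([t.device for t in on_gpu])  # ['cuda', 'cuda', 'cuda']
print([t.device for t in on_cpu])  # ['cpu', 'cpu', 'cpu']
```

The trade-off is exactly as described above: each step pays a small device-to-host copy, but accumulated scores no longer occupy GPU memory.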

How to review

  • Read diff
  • Check that the new test test_offload_logits_to_cpu is correct and that existing tests still pass

Testing performed

  • All existing tests still pass
  • tests.generation.test_utils.test_offload_logits_to_cpu passes as well:
(hf) samuel.barry@RNO:slurm-h100-reserved-rno-199-065:~/workspace/transformers(transfer-logits-to-cpu)$ python -m unittest tests.models.gpt2.test_modeling_gpt2.GPT2ModelTest.test_offload_logits_to_cpu -v
[2025-09-14 03:50:08,430] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False
test_offload_logits_to_cpu (tests.models.gpt2.test_modeling_gpt2.GPT2ModelTest) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.252s

OK

Benchmark

  • Developed memory_test.py (to be deleted before merging) to showcase the impact
  • Results with GPT2-large and max_new_tokens=1000: ~50% reduction in additional peak memory usage for <2% time overhead.
Testing model: gpt2-large
Tokens to generate: 1000
Device: NVIDIA H100 80GB HBM3
Loading model...

Testing without CPU offloading...
Initial memory: 1512.4 MB
Peak memory: 1916.8 MB
Memory increase: 404.5 MB
Generation time: 10.08s
Tokens generated: 1000

Testing with CPU offloading...
Initial memory: 1544.4 MB
Peak memory: 1725.0 MB
Memory increase: 180.6 MB
Generation time: 9.87s
Tokens generated: 1000

Results:
Number of tokens generated: 1019
Memory saved: 223.9 MB
Memory reduction: 55.4%
Time overhead: -1.8%
Sequences match: True
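The headline memory figures follow directly from the two runs above; a quick sanity check of the arithmetic (the time-overhead figure in the log appears to be computed on a slightly different basis, so only the memory numbers are recomputed here):

```python
# Recompute the benchmark summary from the per-run measurements above.
no_offload_mb = 404.5  # memory increase without offloading
offload_mb = 180.6     # memory increase with offloading

saved_mb = no_offload_mb - offload_mb
reduction_pct = 100 * saved_mb / no_offload_mb

print(f"Memory saved: {saved_mb:.1f} MB")         # 223.9 MB
print(f"Memory reduction: {reduction_pct:.1f}%")  # 55.4%
```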

@SamuelBarryCS SamuelBarryCS changed the title from "[WIP] Transfer logits to CPU" to "Reduce vRAM usage by allowing transfer of generated logits to CPU" Sep 14, 2025
@SamuelBarryCS SamuelBarryCS changed the title from "Reduce vRAM usage by allowing transfer of generated logits to CPU" to "Reduce vRAM usage during generation by allowing to transfer logits to CPU" Sep 14, 2025
@SamuelBarryCS SamuelBarryCS marked this pull request as ready for review September 14, 2025 03:45
@SamuelBarryCS
Contributor Author

cc @gante @SunMarc for a quick review when you get the time! :)

@@ -0,0 +1,123 @@
"""
Contributor Author

@SamuelBarryCS SamuelBarryCS Sep 14, 2025


(This script will of course be deleted before merging)

@YunruiZhang

Thank you for the rapid response to the request! Since we’re at it, maybe it would be a good idea to do the same for output_attentions?

@SamuelBarryCS
Contributor Author

SamuelBarryCS commented Sep 15, 2025

> Thank you for the rapid response to the request! Since we’re at it, maybe it would be a good idea to do the same for output_attentions?

Very fair point @YunruiZhang
Done here d1691e5 :)
I feel like the parameter offload_logits_to_cpu isn't well named anymore. Maybe I should rename it to offload_output_to_cpu or something similar, wdyt?
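For reference, the same move-to-CPU pattern generalizes to any per-step generation output. A toy sketch of offloading both scores and attentions under one flag (illustrative names only, not the PR's actual code; "tensors" are plain dicts with a "device" key):

```python
# Toy sketch: one flag offloads every per-step generation output.
# _to_cpu mimics torch.Tensor.to("cpu") on a dict-based fake tensor.

def _to_cpu(t):
    return {**t, "device": "cpu"}

def collect_outputs(num_steps, offload_to_cpu=False):
    outputs = {"scores": [], "attentions": []}
    for step in range(num_steps):
        score = {"step": step, "device": "cuda"}
        attn = {"step": step, "device": "cuda"}
        if offload_to_cpu:
            score, attn = _to_cpu(score), _to_cpu(attn)
        outputs["scores"].append(score)
        outputs["attentions"].append(attn)
    return outputs

out = collect_outputs(2, offload_to_cpu=True)
print({k: [t["device"] for t in v] for k, v in out.items()})
# {'scores': ['cpu', 'cpu'], 'attentions': ['cpu', 'cpu']}
```

A single flag covering all outputs is what motivates a broader name such as offload_output_to_cpu.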

@ArthurZucker ArthurZucker requested review from gante and removed request for ArthurZucker and Rocketknight1 September 15, 2025 10:08
@ArthurZucker
Collaborator

cc @gante

Contributor

@gante gante left a comment


Please see my full reply here: #40794 (comment)

TL;DR the feature is desirable! But it will clash with an ongoing refactor, it will be much simpler if we add the feature after the refactor 💛



Development

Successfully merging this pull request may close these issues.

Feature Request: Option to transfer logits to CPU during generation

4 participants