Optimize RLOO Trainer memory usage with string-level processing #3837

luckyvickyricky · 2025-08-02T06:07:39Z

What does this PR do?

This PR introduces a string-level processing optimization for the RLOO Trainer that reduces GPU memory usage by 55-59% in the tested configurations. It addresses a memory inefficiency in the current implementation, which physically duplicates prompt tokens rloo_k times in GPU memory and causes OOM errors at higher rloo_k values.

Changes Made

Core optimization: Replace token-level duplication with string-level processing inspired by OnlineDPO:

# Before (inefficient): Token-level duplication
queries = queries.repeat(args.rloo_k, 1)  # Memory usage × rloo_k

# After (efficient): String-level processing  
repeated_prompts = prompts_text * rloo_k
queries = processing_class(repeated_prompts, ...)["input_ids"]
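One detail worth checking is that both paths produce rows in the same order, since downstream code that groups the rloo_k samples per prompt depends on it. The sketch below is a toy illustration in plain Python, not the trainer code: `toy_tokenize` and `tile_rows` are hypothetical stand-ins for the tokenizer call (elided as `...` above) and for `tensor.repeat(k, 1)`, and padding is ignored.

```python
# Toy illustration only: `toy_tokenize` is a hypothetical stand-in for the
# real processing_class call, whose arguments the PR description elides.
def toy_tokenize(texts):
    """Map each whitespace-separated word to a fake integer id (its length)."""
    return [[len(word) for word in text.split()] for text in texts]


def tile_rows(rows, k):
    """Pure-Python analogue of tensor.repeat(k, 1): tile the whole batch k times."""
    return [list(row) for _ in range(k) for row in rows]


prompts_text = ["hello world", "a bc def"]
rloo_k = 3

# Token-level path: tokenize once, then tile the token rows.
token_level = tile_rows(toy_tokenize(prompts_text), rloo_k)

# String-level path: repeat the list of strings, then tokenize.
repeated_prompts = prompts_text * rloo_k
string_level = toy_tokenize(repeated_prompts)

# Both paths yield the order [p0, p1, p0, p1, p0, p1].
same_order = token_level == string_level
```

With a real tokenizer the two paths would also need the same padding and truncation settings to yield identical tensors; that part is not modeled here.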

Improvements

55-59% reduction in GPU memory usage
Eliminates OOM errors at higher rloo_k values (e.g., rloo_k=8)
Full backward compatibility - no API changes required
No degradation in training quality

Test Results

Experimental validation on RTX 5090 with 3.1K token sequences:

Configuration	Baseline Memory	Optimized Memory	Reduction
rloo_k=2	14.38GB	6.40GB	55.5%
rloo_k=4	26.99GB	11.01GB	59.2%
rloo_k=8	OOM Error	20.25GB	N/A (baseline OOM; optimized run succeeds)
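The reduction figures can be re-derived from the two memory columns. This is a small arithmetic check only; the function name `reduction_percent` is made up for the example, and the rloo_k=8 row is left out because it has no baseline measurement.

```python
def reduction_percent(baseline_gb, optimized_gb):
    """Relative memory saving, in percent of the baseline."""
    return (baseline_gb - optimized_gb) / baseline_gb * 100.0


# (baseline GB, optimized GB) taken from the table above, keyed by rloo_k.
results = {2: (14.38, 6.40), 4: (26.99, 11.01)}

reductions = {
    k: round(reduction_percent(baseline, optimized), 1)
    for k, (baseline, optimized) in results.items()
}
```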

Reproducibility

Complete experimental framework available at: https://github.com/luckyvickyricky/trl-rloo/tree/feat/rloo_string-level-mem-eff

Quick reproduction with 4 scripts:

./setup_env.sh → ./run_baseline-rloo.sh → ./run_improve-rloo.sh → ./visualize_results.sh

Fixes #3829

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

shirinyamani · 2025-08-04T04:16:02Z

Hey @luckyvickyricky, thanks for your contribution! This makes sense!
However, we are refactoring RLOO entirely to make it look like GRPO, except for the algorithm-specific parts of the code, which are mostly: 1. including the KL in the reward, and 2. calculating the baseline and advantage = rewards - baseline.
I'll ping you once the PR is ready here!
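For readers unfamiliar with the algorithm-specific part mentioned here, a minimal sketch of the leave-one-out baseline follows. It assumes the usual RLOO convention, in which each sample's baseline is the mean reward of the other samples for the same prompt and the advantage is the reward minus that baseline. The function name `rloo_advantages` is hypothetical, the KL term is omitted, and this is not the code of the refactored trainer.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for the k completions of a single prompt.

    rewards: list of k scalar rewards (KL penalty, if any, already folded in).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    # Baseline for sample i is the mean reward of the other k - 1 samples.
    baselines = [(total - r) / (k - 1) for r in rewards]
    return [r - b for r, b in zip(rewards, baselines)]


advantages = rloo_advantages([1.0, 2.0, 6.0])
# baselines are 4.0, 3.5, 1.5, so advantages are -3.0, -1.5, 4.5
```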

HuggingFaceDocBuilderDev · 2025-08-04T04:21:01Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

luckyvickyricky · 2025-08-04T14:08:41Z

Thank you @shirinyamani for the explanation!

I'll wait for the refactoring to be completed. Whether during the process or afterward, I'd be happy to help with this optimization or anything else if possible.

Please ping me when #3801 is ready. Thanks again!

feat: implements

1ffaae4

luckyvickyricky changed the title ~~feat: implements~~ Optimize RLOO Trainer memory usagenwith string-level processing Aug 2, 2025

luckyvickyricky changed the title ~~Optimize RLOO Trainer memory usagenwith string-level processing~~ Optimize RLOO Trainer memory usage with string-level processing Aug 2, 2025

Merge branch 'main' into feature/rloo-string-level

ce9c73b
