
Optimize RLOO Trainer memory usage with string-level processing #3837


Open · wants to merge 2 commits into main
Conversation


@luckyvickyricky commented on Aug 2, 2025

What does this PR do?

This PR introduces a string-level processing optimization for the RLOO Trainer that reduces GPU memory usage by 55–59% on average. It addresses a memory inefficiency in the current implementation, which physically duplicates prompt tokens rloo_k times in GPU memory and causes OOM errors at higher rloo_k values.

Changes Made

Core optimization: Replace token-level duplication with string-level processing inspired by OnlineDPO:

```python
# Before (inefficient): token-level duplication
queries = queries.repeat(args.rloo_k, 1)  # memory usage scales with rloo_k

# After (efficient): string-level processing
repeated_prompts = prompts_text * rloo_k
queries = processing_class(repeated_prompts, ...)["input_ids"]
```
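
A slightly fuller sketch of the idea is below; it is a minimal sketch with illustrative names (`prompts_text`, the `gpt2` tokenizer), not the exact `RLOOTrainer` internals:

```python
from transformers import AutoTokenizer

# Illustrative setup; the real trainer uses its own processing_class.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

prompts_text = ["Summarize: ...", "Translate to French: ..."]  # raw prompt strings
rloo_k = 4

# Token-level duplication (current behavior): tokenize and pad first,
# then physically repeat the padded tensor rloo_k times.
queries = tokenizer(prompts_text, return_tensors="pt", padding=True)["input_ids"]
queries_repeated = queries.repeat(rloo_k, 1)

# String-level processing (this PR): repeat the cheap Python strings and
# tokenize the repeated batch in a single pass.
repeated_prompts = prompts_text * rloo_k
queries_from_strings = tokenizer(repeated_prompts, return_tensors="pt", padding=True)["input_ids"]
```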

Improvements

  • 55-59% reduction in GPU memory usage
  • Eliminates OOM errors at higher rloo_k values (e.g., rloo_k=8)
  • Full backward compatibility - no API changes required
  • No degradation in training quality

Test Results

Experimental validation on RTX 5090 with 3.1K token sequences:

| Configuration | Baseline Memory | Optimized Memory | Reduction |
|---|---|---|---|
| rloo_k=2 | 14.38 GB | 6.40 GB | 55.5% |
| rloo_k=4 | 26.99 GB | 11.01 GB | 59.2% |
| rloo_k=8 | OOM error | 20.25 GB | Success |
(Screenshot: GPU memory usage comparison, baseline vs. optimized.)
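
For reference, peak-memory numbers like those above can be collected with PyTorch's CUDA memory statistics. The helper below is only a sketch of how such a measurement might look; the actual measurement code lives in the reproduction repo linked under "Reproducibility":

```python
import torch

def measure_peak_gpu_memory_gib(fn, *args, **kwargs):
    """Run fn once and return the peak CUDA memory allocated, in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn(*args, **kwargs)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3

# Hypothetical usage: peak_gib = measure_peak_gpu_memory_gib(trainer.train)
```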

Reproducibility

Complete experimental framework available at: https://github.com/luckyvickyricky/trl-rloo/tree/feat/rloo_string-level-mem-eff

Quick reproduction with 4 scripts:

```bash
./setup_env.sh
./run_baseline-rloo.sh
./run_improve-rloo.sh
./visualize_results.sh
```

Fixes #3829

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@shirinyamani (Member)

Hey @luckyvickyricky, thanks for your contribution! This makes sense!
However, we are refactoring RLOO entirely to make it look like GRPO, except for the algorithm-specific parts of the code, which are mostly: 1. including the KL term in the reward, and 2. calculating the leave-one-out baseline and the advantage (rewards minus baseline).
I'll ping you once the PR is ready here!
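
For reference, the leave-one-out baseline and advantage mentioned above can be sketched as follows; this is just the standard RLOO formulation, not the refactored trainer code:

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards has shape (rloo_k, batch_size): one reward per sampled completion.
    rloo_k = rewards.shape[0]
    # Leave-one-out baseline: mean reward of the other rloo_k - 1 samples.
    baseline = (rewards.sum(dim=0, keepdim=True) - rewards) / (rloo_k - 1)
    return rewards - baseline
```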

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@luckyvickyricky (Author)

Thank you @shirinyamani for the explanation!

I'll wait for the refactoring to be completed. I'd be happy to help with this optimization, either during the refactoring or afterward, or with anything else where I can be useful.

Please ping me when #3801 is ready. Thanks again!
