
Fix memory leak in LeRobotDataset._save_episode_table #1113


Open. Wants to merge 2 commits into base: main


@Leeez Leeez commented May 15, 2025

Fix memory leak in LeRobotDataset

Description

This PR fixes a memory leak in the LeRobotDataset class that causes memory usage to grow continuously when processing multiple episodes sequentially.

Issue

When converting data to LeRobot format and processing multiple episodes in sequence, memory usage grows continuously and is never released. Through memory profiling, I traced the issue to the _save_episode_table method, specifically to how dataset objects accumulate in memory.

Solution

Modified the _save_episode_table method to:

  1. Save the current episode data to a parquet file as before
  2. Reset self.hf_dataset to a minimal empty dataset after saving, instead of accumulating data
  3. Explicitly release memory of temporary objects that are no longer needed

This approach preserves all functionality while preventing memory accumulation.
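
A minimal sketch of the save-then-reset pattern described above. It is a toy stand-in, not the lerobot code: `ToyEpisodeWriter`, `add_frame`, `save_episode`, and `run_demo` are hypothetical names, a plain list plays the role of self.hf_dataset, and JSON files stand in for parquet so the example runs with the standard library only. Step 3 (explicitly releasing temporaries) is not shown.

```python
import json
import tempfile
from pathlib import Path


class ToyEpisodeWriter:
    """Hypothetical stand-in for the dataset class; not the lerobot API."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.buffer: list[dict] = []  # plays the role of self.hf_dataset
        self.num_episodes = 0

    def add_frame(self, frame: dict) -> None:
        self.buffer.append(frame)

    def save_episode(self) -> Path:
        # 1. Write the current episode to its own file
        #    (JSON here as an assumption; the PR writes parquet).
        path = self.root / f"episode_{self.num_episodes:06d}.json"
        path.write_text(json.dumps(self.buffer))
        self.num_episodes += 1
        # 2. Replace the in-memory table with an empty one instead of
        #    keeping every saved episode alive.
        self.buffer = []
        return path


def run_demo(num_episodes: int = 3, frames_per_episode: int = 4) -> tuple[int, int]:
    """Return (files written, frames still held in memory afterwards)."""
    with tempfile.TemporaryDirectory() as tmp:
        writer = ToyEpisodeWriter(Path(tmp))
        for ep in range(num_episodes):
            for t in range(frames_per_episode):
                writer.add_frame({"episode_index": ep, "frame_index": t})
            writer.save_episode()
        files = len(list(Path(tmp).glob("episode_*.json")))
        return files, len(writer.buffer)
```

Each episode ends up in its own file on disk, and the in-memory buffer is empty again after every save, so its size is bounded by a single episode rather than by the whole recording session.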

Validation

Tested by converting multiple episodes:

  • Before fix: Memory usage grew from ~3.5GB to ~7.1GB+ when processing 5 episodes
  • After fix: Memory usage stabilized around ~3.5-4GB across all episodes
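
One way to reproduce this kind of before/after comparison on a small scale is sketched below. It assumes the standard-library `tracemalloc` module and fake 1 KiB frames; the PR does not say which profiler produced the numbers above, and `traced_memory_after_episodes` is a hypothetical helper, so the absolute values will not match.

```python
import tracemalloc


def traced_memory_after_episodes(
    reset_after_save: bool,
    num_episodes: int = 5,
    frames_per_episode: int = 2000,
) -> list[int]:
    """Return the traced Python heap size (bytes) after each simulated episode."""
    tracemalloc.start()
    table: list[bytes] = []  # stand-in for the in-memory dataset
    readings: list[int] = []
    for _episode in range(num_episodes):
        for _frame in range(frames_per_episode):
            table.append(bytes(1024))  # fake 1 KiB frame
        # A real converter would write the episode to disk at this point.
        if reset_after_save:
            table = []  # drop the saved frames, as in the fix
        current, _peak = tracemalloc.get_traced_memory()
        readings.append(current)
    tracemalloc.stop()
    return readings


if __name__ == "__main__":
    print("accumulating:", traced_memory_after_episodes(reset_after_save=False))
    print("resetting:   ", traced_memory_after_episodes(reset_after_save=True))
```

Without the reset, the reading grows with every episode; with it, the reading stays roughly flat, which is the same shape as the measurements reported above.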

Leeez and others added 2 commits May 15, 2025 16:41
This commit fixes a memory leak in the LeRobotDataset class that caused
memory usage to grow continuously when processing multiple episodes.
The issue was in the _save_episode_table method where dataset objects
were being concatenated and retained in memory.

The fix resets self.hf_dataset after each episode is saved to disk,
preventing memory accumulation while preserving functionality.
@Cadene
Collaborator

Cadene commented May 16, 2025

Nice, I am taking this into account in #969.

@imstevenpmwork imstevenpmwork added the labels "enhancement" (Suggestions for new features or improvements) and "dataset" (Issues regarding data inputs, processing, or datasets) May 22, 2025
@atyshka

atyshka commented Jul 8, 2025

Not sure when the V3 release is planned, but can we get this merged asap? I'm recording hundreds of sim episodes and this is definitely an issue.

@AdilZouitine
Member

> Not sure when the V3 release is planned, but can we get this merged asap? I'm recording hundreds of sim episodes and this is definitely an issue.

Hey sorry for the late reply. We will merge dataset v3 after the port of processors 😄
