Add Distributed, Parallel Dataset Merging #2391
base: main
Conversation
… then we raise. This is a first step towards standardizing the dataset format; otherwise (as it is now) everything would be allowed.
…ee-based thread pool
Pull Request Overview
This PR adds parallel dataset merging capability to the LeRobot dataset aggregation system and fixes several issues with dataset creation and merging logic.
- Introduces a `num_workers` parameter to enable parallel dataset aggregation using a tree-based reduction strategy
- Fixes incorrect data file index mapping during aggregation by tracking the actual destination files used
- Replaces the previous pattern of merging default features into the provided features with a validation check ensuring the default features are present
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/lerobot/scripts/lerobot_edit_dataset.py | Adds num_workers configuration parameter to the merge operation config and passes it to the merge function |
| src/lerobot/datasets/lerobot_dataset.py | Changes feature merging logic from automatic merge to validation that default features are already present in provided features |
| src/lerobot/datasets/dataset_tools.py | Adds num_workers parameter to merge_datasets function and contains an incorrect Path constructor call |
| src/lerobot/datasets/aggregate.py | Implements parallel aggregation with ThreadPoolExecutor, adds file mapping logic to correctly track destination files during concatenation, and refactors the original function into _aggregate_datasets |
```diff
 if repo_id is None:
     repo_id = f"{dataset.repo_id}_modified"
-output_dir = Path(output_dir) if output_dir is not None else HF_LEROBOT_HOME / repo_id
+output_dir = Path(output_dir, exists_ok=True) if output_dir is not None else HF_LEROBOT_HOME / repo_id
```
Copilot AI · Nov 6, 2025
The `Path` constructor does not accept an `exists_ok` parameter; the similarly named `exist_ok` parameter belongs to `Path.mkdir()`. This should be `Path(output_dir)` without the `exists_ok` argument.
```diff
-output_dir = Path(output_dir, exists_ok=True) if output_dir is not None else HF_LEROBOT_HOME / repo_id
+output_dir = Path(output_dir) if output_dir is not None else HF_LEROBOT_HOME / repo_id
```
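For illustration only, here is a small sketch of the distinction the comment points at; `resolve_output_dir` and `default_root` are hypothetical names, not part of the PR:

```python
from pathlib import Path

def resolve_output_dir(output_dir: str | None, default_root: Path, repo_id: str) -> Path:
    # Path() only constructs a path object; it never touches the filesystem,
    # so it has no exists_ok/exist_ok parameter.
    resolved = Path(output_dir) if output_dir is not None else default_root / repo_id
    # Creating the directory is a separate, explicit step; mkdir's flag is exist_ok.
    resolved.mkdir(parents=True, exist_ok=True)
    return resolved
```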
```diff
-features = {**features, **DEFAULT_FEATURES}
+assert set(DEFAULT_FEATURES.keys()).issubset(set(features.keys())), (
+    f"Default features must be a subset of the provided features! Got {set(features.keys())} and {set(DEFAULT_FEATURES.keys())}"
```
Copilot AI · Nov 6, 2025
The error message format is confusing: it shows the provided features first and the default features second, which is the opposite of the assertion order and of the message text. Consider: `f"Default features {set(DEFAULT_FEATURES.keys())} must be present in the provided features {set(features.keys())}"`
| f"Default features must be a subset of the provided features! Got {set(features.keys())} and {set(DEFAULT_FEATURES.keys())}" | |
| f"Default features {set(DEFAULT_FEATURES.keys())} must be present in the provided features {set(features.keys())}" |
```python
if num_workers is None or num_workers <= 1:
    # Run aggregation sequentially
    _aggregate_datasets(
        repo_ids=repo_ids,
        aggr_repo_id=aggr_repo_id,
        aggr_root=aggr_root,
        roots=roots,
        data_files_size_in_mb=data_files_size_in_mb,
        video_files_size_in_mb=video_files_size_in_mb,
        chunk_size=chunk_size,
    )
```
Copilot AI · Nov 6, 2025
The sequential path (when num_workers is None or <= 1) does not return a value or log completion, but the parallel path returns explicitly at line 417. This creates inconsistent behavior. Either add a return statement after line 307, or move the logging and return after the entire if/elif block to ensure both paths behave consistently.
```python
    )

# Uses a parallel fan-out/fan-in strategy when num_workers is provided
elif num_workers > 1:
```
Copilot AI · Nov 6, 2025
This test is always true when reached, because the preceding branch already handles `num_workers is None or num_workers <= 1`; the `elif` can be simplified to `else`.
```diff
-elif num_workers > 1:
+else:
```
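A minimal sketch of the restructuring these two review comments suggest (the inconsistent-return note and the `elif`/`else` simplification). `run_aggregation`, `_parallel_aggregate`, and the log message are placeholders; only `_aggregate_datasets` and `num_workers` come from the PR:

```python
import logging

def _aggregate_datasets(**kwargs):  # stand-in for the PR's sequential kernel
    ...

def _parallel_aggregate(**kwargs):  # hypothetical parallel entry point
    ...

def run_aggregation(repo_ids, aggr_repo_id, num_workers=None, **kwargs):
    """Sketch: both branches funnel into one shared epilogue, so the sequential
    and parallel paths log and return consistently."""
    if num_workers is None or num_workers <= 1:
        _aggregate_datasets(repo_ids=repo_ids, aggr_repo_id=aggr_repo_id, **kwargs)
    else:  # covers every remaining case, i.e. num_workers > 1
        _parallel_aggregate(
            repo_ids=repo_ids, aggr_repo_id=aggr_repo_id, num_workers=num_workers, **kwargs
        )
    logging.info("Aggregated %d datasets into %s", len(repo_ids), aggr_repo_id)
    return aggr_repo_id
```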
What this does
This feature implements a highly scalable, parallel algorithm for merging a collection of input datasets into a single, concatenated dataset. The implementation is designed for a single, shared-memory machine (multi-core processor); however, the primitive dataset aggregation kernel `_aggregate_datasets` can also be used in a distributed setting with `datatrove`.

Core Design
Instead of a simple parallel "map" followed by a single-threaded "reduce" (which would be an $O(k)$ bottleneck, where $k$ is the number of datasets), the algorithm uses a tree-based reduction.
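For example, with $k = 8$ datasets a linear reduce performs its 7 merges strictly one after another, while a tree reduction performs the same 7 merges in only $\lceil \log_2 8 \rceil = 3$ levels (4 merges, then 2, then 1), with every merge inside a level running in parallel.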
The merge process is broken down into a series of dynamically created tasks managed by a central thread pool and task queue. This ensures workers are constantly operating on progressively more aggregated datasets.
Phase 1 (Local Merges)
The initial set of datasets is partitioned, and "Level 0" tasks (e.g., `aggregate(d1, d2)`, `aggregate(d3, d4)`) are added to the task queue. Then, tasks are dispatched to workers.

Dynamic Task Creation
A pool of worker threads grabs these tasks. When a thread completes a task (e.g., producing `partial_result_1`), its result is paired with another completed result to create and enqueue a new "Level 1" task (e.g., `merge(partial_result_1, partial_result_2)`).
Parallel Aggregation
This process repeats, with threads continuously grabbing the next available merge task from any level of the tree. This ensures all CPU cores remain fully saturated. The total aggregation time scales logarithmically, $O(\log k)$.
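To make the fan-out/fan-in idea concrete, here is a minimal, self-contained sketch. It is not the PR's implementation: it merges plain Python lists instead of LeRobot datasets, uses a simple level-synchronous reduction rather than the dynamic task queue described above, and `merge_pair`/`tree_merge` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def merge_pair(a: list, b: list) -> list:
    # Stand-in for the real pairwise aggregation kernel
    # (in the PR this would be an _aggregate_datasets call on two datasets).
    return a + b

def tree_merge(parts: list[list], num_workers: int = 4) -> list:
    """Pairwise tree reduction: each level merges pairs in parallel,
    so k inputs are reduced in ceil(log2(k)) levels."""
    if not parts:
        return []
    level = list(parts)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while len(level) > 1:
            pairs = [(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
            futures = [pool.submit(merge_pair, a, b) for a, b in pairs]
            next_level = [f.result() for f in futures]
            if len(level) % 2 == 1:  # odd leftover is carried to the next level
                next_level.append(level[-1])
            level = next_level
    return level[0]

if __name__ == "__main__":
    datasets = [[i] for i in range(10)]  # ten toy "datasets"
    print(tree_merge(datasets, num_workers=4))  # [0, 1, 2, ..., 9]
```

The PR's version avoids this sketch's per-level barrier: as soon as two partial results are available, a new merge task is enqueued, which keeps workers busy even when individual merges take uneven amounts of time.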