fix: multi-node parquet indexing #583


Merged

Conversation

deependujha
Collaborator

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #578

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in a GitHub issue, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes issues with multi-node Parquet dataset indexing by introducing distributed synchronization using a new barrier utility and updating indexing logic across the Hugging Face dataset and Parquet writer modules.

  • Introduces a new maybe_barrier function in torch_utils to synchronize distributed processes.
  • Updates the index_hf_dataset function to better handle multi-node environments via _DistributedEnv and barrier synchronization.
  • Adjusts the index_parquet_dataset function in the streaming writer to ensure only one process writes the index in a distributed setting.
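
As a rough illustration of the first bullet, a maybe_barrier helper along these lines would synchronize ranks only when a process group is actually running (a minimal sketch; the real implementation in src/litdata/utilities/torch_utils.py may differ):

```python
import torch.distributed as dist

def maybe_barrier() -> None:
    # Synchronize all ranks only when a distributed process group has
    # actually been initialized; otherwise this is a no-op, so plain
    # single-process runs are unaffected.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```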

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/litdata/utilities/torch_utils.py | Added a new maybe_barrier function for distributed process sync. |
| src/litdata/utilities/hf_dataset.py | Updated distributed synchronization and index retrieval logic. |
| src/litdata/streaming/writer.py | Integrated maybe_barrier and refined index creation in distributed mode. |

Comments suppressed due to low confidence (1)

src/litdata/utilities/hf_dataset.py:39

  • The condition for reusing the existing index may fail in scenarios where nodes have an unequal number of processes. Consider refactoring this check to use a more robust distributed coordination mechanism.
if (env.num_nodes == 1 and env.global_rank == 0) or (env.num_nodes > 1 and env.global_rank % env.num_nodes == 0):
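
To see why this check is fragile (a hypothetical layout, not taken from the PR): with 2 nodes running 4 processes each, `global_rank % num_nodes == 0` selects several ranks per node rather than a single writer:

```python
# Hypothetical setup: 2 nodes x 4 processes each => global ranks 0..7,
# node 0 holding ranks 0-3 and node 1 holding ranks 4-7.
num_nodes = 2
world_size = 8

passing = [rank for rank in range(world_size) if rank % num_nodes == 0]
print(passing)  # [0, 2, 4, 6] -- ranks 0 and 2 (node 0) and 4 and 6 (node 1)
                # all pass, so multiple processes may race to write the index.
```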


codecov bot commented May 6, 2025

Codecov Report

Attention: Patch coverage is 92.72727% with 4 lines in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (76ec34b) to head (337f7a3).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #583   +/-   ##
===================================
- Coverage    79%    79%   -0%     
===================================
  Files        40     41    +1     
  Lines      6112   6135   +23     
===================================
+ Hits       4819   4835   +16     
- Misses     1293   1300    +7     

@deependujha deependujha requested a review from Copilot May 6, 2025 18:18
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes issues with multi-node indexing for Parquet datasets by improving process synchronization and avoiding concurrent index creation. Key changes include:

  • Adding synchronization utilities (maybe_barrier and is_local_rank_0) in torch_utils.
  • Updating the HF dataset indexing logic to use these utilities.
  • Adjusting the Parquet dataset writer to ensure proper barrier synchronization in distributed settings.
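
The is_local_rank_0 helper mentioned above could plausibly read the launcher's environment (a sketch assuming LOCAL_RANK is set, as torchrun does; the actual implementation may differ):

```python
import os

def is_local_rank_0() -> bool:
    # LOCAL_RANK is exported per node by torchrun and similar launchers;
    # default to "0" so non-distributed runs behave like local rank 0.
    return int(os.environ.get("LOCAL_RANK", "0")) == 0
```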

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/litdata/utilities/torch_utils.py | Added utility functions for distributed synchronization. |
| src/litdata/utilities/hf_dataset.py | Updated cache index creation and synchronization via barriers. |
| src/litdata/streaming/writer.py | Modified distributed index creation with barrier synchronization. |

Comments suppressed due to low confidence (1)

src/litdata/utilities/hf_dataset.py:34

  • Consider inserting a barrier (maybe_barrier()) before returning cache_directory to ensure that all processes are synchronized and the index is fully available before any process proceeds.
if cache_directory:
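
Applied to the quoted line, the suggestion would look roughly like this (a sketch of the reviewer's proposal, not the merged code):

```python
if cache_directory:
    # Wait until the rank responsible for writing the index has finished
    # before any rank returns and starts reading from the cache.
    maybe_barrier()
    return cache_directory
```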

Collaborator

@bhimrazy bhimrazy left a comment

Nice, @deependujha. This seems like a more solid solution.

Maybe, if it can be made to work as a decorator, we could also introduce one, something like local_rank_zero_only, inspired by this.
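
For reference, such a decorator might look like this (a hypothetical sketch modeled on Lightning's rank_zero_only, not code from this PR):

```python
import functools
import os
from typing import Any, Callable, Optional, TypeVar

T = TypeVar("T")

def local_rank_zero_only(fn: Callable[..., T]) -> Callable[..., Optional[T]]:
    # Run the wrapped function only on local rank 0; every other rank
    # skips the call and receives None instead.
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Optional[T]:
        if int(os.environ.get("LOCAL_RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None
    return wrapper
```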

@bhimrazy bhimrazy requested a review from Borda May 7, 2025 08:58
@deependujha
Collaborator Author

I also thought of a decorator inspired by Lightning, but right now I'm only interested in blocking a segment of code, not a whole function.

Maybe we'll introduce it in the future.

@deependujha
Collaborator Author

Regarding the failing docs build in CI:

An issue is open in the Sphinx repo: sphinx-doc/sphinx#13533

@deependujha deependujha merged commit 716af8b into Lightning-AI:main May 11, 2025
32 checks passed
@deependujha deependujha deleted the fix/multi-node-parquet-indexing branch May 11, 2025 15:59
Development

Successfully merging this pull request may close these issues.

There is still a risk of race condition when indexing an s3 bucket with multi node (Parquet datasets)