
fix: Consolidate Cache Handling + Fix DDP Multi-Indexing for huggingface datasets #569


Merged: 15 commits merged into Lightning-AI:main from fix/hf-cache-dir on May 2, 2025

Conversation

@bhimrazy (Collaborator) commented on Apr 26, 2025

What does this PR do?

🛠️ Changes in this PR

  1. Consolidate Cache Directories

    • Removed .cache/litdata-cache-index-pq.
    • Now, all chunks and index files are stored under DEFAULT_CACHE_DIR (or a user-passed cache_dir); see the usage sketch after this list.
    • Users can still separately manage indexes if needed by manually calling index_hf_dataset and passing the cache_dir.
  2. Fix DDP Multi-Indexing

    • Only one process per node now handles indexing.
    • Fixes race conditions, cache conflicts, and random DDP failures during streaming.
  3. Minor Refactoring and Tests

    • Small code cleanups and updates to related tests to match the new behavior.
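For example, passing an explicit cache_dir to StreamingDataset keeps the chunks and the index together under that one directory. A minimal usage sketch (the hf:// URL is only an example, taken from the discussion later in this thread):

import litdata as ld

# All chunks and index files end up under this single directory
# (or under DEFAULT_CACHE_DIR when cache_dir is not passed).
dataset = ld.StreamingDataset(
    "hf://datasets/philgzl/ears/data/train-*.parquet",
    cache_dir="cache",
)
dataloader = ld.StreamingDataLoader(dataset)

for batch in dataloader:
    ...  # training / processing loop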

Related issues:

Fixes #562
Fixes the issue of chunks and index files being split across multiple cache directories

PR review:

Anyone in the community is welcome to review!

Did you have fun?

Absolutely! 🚀
It’s exciting to improve LitData for smoother distributed and streaming workflows!

📋 Extra Note:

After this PR, the .cache/litdata-cache-index-pq directory becomes obsolete.
Users upgrading should delete it manually if they want to free up the space.

@bhimrazy marked this pull request as ready for review on April 26, 2025, 18:01
@bhimrazy changed the title from "[wip]: Fix/hf cache dir" to "fix: Consolidate Cache Handling + Fix DDP Multi-Indexing for huggingface datasets" on Apr 26, 2025
@bhimrazy self-assigned this on Apr 26, 2025
@bhimrazy added the enhancement (New feature or request) and bugfix labels on Apr 26, 2025
@bhimrazy requested review from deependujha and Copilot on April 26, 2025, 18:12
@Copilot (Copilot AI, Contributor) left a comment

Pull Request Overview

This PR consolidates cache directories and fixes DDP multi-indexing for Hugging Face datasets while also doing some test refactoring and minor cleanups.

  • Consolidates all cache handling under a common directory (or a user‐provided one) to simplify cache management.
  • Fixes race conditions in indexing by ensuring only one process per node handles the job.
  • Updates tests and refactors functions (e.g. switching to temporary directories when writing index files) for improved clarity and reliability.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/streaming/test_parquet.py: Updated cache directory handling in tests and adjusted function calls.
  • src/litdata/utilities/parquet.py: Refactored index file writing to use temporary directories and removed redundant prints.
  • src/litdata/utilities/hf_dataset.py: Changed function signature and logic for indexing HF datasets with filelocks.
  • src/litdata/utilities/dataset_utilities.py: Introduced generate_md5_hash to standardize hashing (see the sketch below).
  • src/litdata/streaming/dataset.py: Updated index lookup logic for HF datasets to use the provided cache_dir.
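(generate_md5_hash is presumably a thin wrapper around hashlib; the following is a sketch of what such a helper typically looks like, not necessarily the exact code in dataset_utilities.py.)

import hashlib


def generate_md5_hash(value: str) -> str:
    # Stable hex digest, e.g. used to key per-dataset cache sub-directories.
    return hashlib.md5(value.encode("utf-8")).hexdigest()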

codecov bot commented Apr 26, 2025

Codecov Report

Attention: Patch coverage is 98.27586% with 1 line in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (e789fb6) to head (a58d0b8).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #569   +/-   ##
===================================
- Coverage    79%    79%   -0%     
===================================
  Files        40     40           
  Lines      6098   6111   +13     
===================================
- Hits       4818   4812    -6     
- Misses     1280   1299   +19     

@bhimrazy (Collaborator, Author) commented:

Hi @philgzl,
would you mind giving this PR a try when you have a moment?
I'd also love to hear any feedback you might have.
Thanks 😊!

@tchaton (Collaborator) left a comment:
There is still a risk of a race condition when indexing an S3 bucket with multiple nodes.

@philgzl (Contributor) commented on Apr 28, 2025

I am getting unexpected behavior with this code:

import litdata as ld


def foo():
    dset = ld.StreamingDataset("hf://datasets/philgzl/ears/data/train-*.parquet")
    dloader = ld.StreamingDataLoader(dset)
    for i, _ in enumerate(dloader):
        if i == 100:
            print("done")
            break


if __name__ == "__main__":
    foo()
    foo()

On main I get the expected behavior; the second foo() runs almost instantly as it finds the cached data in the default cache dir. However on this branch the second foo() attempts to index the dataset again and then throws an error:

Indexing progress:  83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                      | 43/52 [00:04<00:00, 10.18step/s]
Indexing progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:04<00:00, 10.88step/s]
Index created at /home/philgzl/.lightning/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1745860357.1718044/index.json.
train-000-000000.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.8M/63.8M [00:05<00:00, 11.5MB/s]
done
Indexing progress:  65%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                             | 34/52 [00:02<00:01, 13.03step/s]
Indexing progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:03<00:00, 15.44step/s]
Index created at /home/philgzl/.lightning/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1745860367.1151164/index.json.
train-000-000001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.7M/63.7M [00:07<00:00, 8.34MB/s]
Exception in thread Thread-2:%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.7M/63.7M [00:07<00:00, 7.27MB/s]
Traceback (most recent call last):
  File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/home/philgzl/dev/litData/src/litdata/streaming/reader.py", line 236, in run
    self._config.download_chunk_from_index(chunk_index)
  File "/home/philgzl/dev/litData/src/litdata/streaming/config.py", line 145, in download_chunk_from_index
    self._downloader.download_chunk_from_index(chunk_index)
  File "/home/philgzl/dev/litData/src/litdata/streaming/downloader.py", line 70, in download_chunk_from_index
    self.download_file(remote_chunkpath, local_chunkpath)
  File "/home/philgzl/dev/litData/src/litdata/streaming/downloader.py", line 288, in download_file
    shutil.copyfile(downloaded_path, temp_file_path)
  File "/usr/lib64/python3.12/shutil.py", line 262, in copyfile
    with open(dst, 'wb') as fdst:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/philgzl/.lightning/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1745860357.1718044/train-000-000001.parquet.tmp'
train-000-000000.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.8M/63.8M [00:07<00:00, 8.71MB/s]
done

What's weird is that if I provide cache_dir="cache" to StreamingDataset then I get no errors.

I am getting this error on two different systems (Ubuntu (WSL) + Python 3.12.3, and Fedora + Python 3.12.9).

Didn't have time to look more into it.

@bhimrazy (Collaborator, Author) commented:

There is still a risk of a race condition when indexing an S3 bucket with multiple nodes.

Umm, yes, thanks @tchaton — I hadn’t realized that. I’ll cover it shortly in a separate PR.

@bhimrazy (Collaborator, Author) commented:

Thank you @philgzl.
I think the indexing run exceeded the current lock timeout, leading to race conditions during indexing.

I’ll also test it shortly. We might need to increase the lock timeout so that the first process finishes indexing before the waiting processes give up on the lock.
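Roughly, the locking looks like the sketch below (simplified, not the exact code): a process that cannot acquire the lock within the timeout stops waiting because the Timeout is suppressed, which is where the duplicated indexing can come from, so a larger or blocking timeout should help.

import os
import tempfile
from contextlib import suppress

from filelock import FileLock, Timeout

lock_path = os.path.join(tempfile.gettempdir(), "hf_index.lock")

# A waiting process gives up after `timeout` seconds; a larger value
# (or timeout=-1 to block until the lock is released) keeps late processes
# waiting for the first one to finish indexing.
with suppress(Timeout), FileLock(lock_path, timeout=120):
    ...  # check for an existing index, otherwise build it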

@bhimrazy (Collaborator, Author) commented on May 2, 2025

On main I get the expected behavior; the second foo() runs almost instantly as it finds the cached data in the default cache dir. However on this branch the second foo() attempts to index the dataset again and then throws an error:
What's weird is that if I provide cache_dir="cache" to StreamingDataset then I get no errors.

Thanks @philgzl for catching this! 🙏
I had missed properly checking or setting the default cache_dir when no user-provided value is given. That’s why explicitly passing cache_dir="cache" avoided the re-indexing.

This should now be resolved, and the cache behavior should work as expected.

Indexing progress:  37%|███████████████████████████████▍                                                      | 19/52 [00:01<00:01, 17.47step/s]
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:01<00:00, 26.82step/s]
Index created at /cache/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1746175939.1470342/index.json.
train-000-000000.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 63.8M/63.8M [00:00<00:00, 298MB/s]
train-000-000001.parquet:   0%|                                                                                     | 0.00/63.7M [00:00<?, ?B/s]done
Using existing index at /cache/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1746175939.1470342.
train-000-000001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 63.7M/63.7M [00:00<00:00, 228MB/s]
done
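The gist of the fix is simply falling back to the default cache dir when none is passed, so both runs resolve to the same location. A simplified sketch (resolve_cache_dir is illustrative; only the DEFAULT_CACHE_DIR name comes from the PR description, and the path mirrors the default visible in the earlier logs):

import os

# Placeholder value mirroring the ~/.lightning/chunks default seen in the logs above.
DEFAULT_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".lightning", "chunks")


def resolve_cache_dir(cache_dir=None):
    # Fall back to the default so repeated runs find the same index and chunks.
    return cache_dir if cache_dir is not None else DEFAULT_CACHE_DIR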

Thanks again for the sharp feedback!

@tchaton merged commit 1dadc00 into Lightning-AI:main on May 2, 2025
29 checks passed
@deependujha (Collaborator) commented:

Hi @bhimrazy, can you please let me know where the logic is that prevents multiple processes from indexing the HF dataset in DDP?

@bhimrazy (Collaborator, Author) commented on May 6, 2025

Hi @bhimrazy, can you please let me know where the logic is that prevents multiple processes from indexing the HF dataset in DDP?

Sure @deependujha.

https://github.com/Lightning-AI/litData/pull/569/files#diff-e72f6de8a1274f83acce050ba4e129a720ea4525ebcabfda8a640b246eb30917R35-R38

    # Acquire a file lock to guarantee exclusive access,
    # ensuring that multiple processes do not create the index simultaneously.
    with suppress(Timeout), FileLock(os.path.join(tempfile.gettempdir(), "hf_index.lock"), timeout=20):
        # Check for existing index in the cache

It’s actually the lock that handles this. One of the processes acquires the lock (blocking the other processes from acquiring it), completes the indexing, and then releases it.

Once it’s released, the other processes see that the index file already exists and skip the indexing step.

There’s still a potential issue if the first process doesn’t complete indexing within the set timeframe — in that case, other processes might start indexing too, leading to the same problem. But this is unlikely since most indexing finishes within 2–5 seconds.
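Put together, the pattern is roughly the following self-contained sketch (index_hf_dataset_once and build_index are illustrative stand-ins, not the actual helpers in hf_dataset.py):

import json
import os
import tempfile
from contextlib import suppress

from filelock import FileLock, Timeout


def build_index(index_path: str) -> None:
    # Hypothetical stand-in for the real indexing of the parquet files.
    with open(index_path, "w") as f:
        json.dump({"chunks": []}, f)


def index_hf_dataset_once(cache_dir: str) -> str:
    # One process builds the index; the rest wait on the lock, then reuse it.
    os.makedirs(cache_dir, exist_ok=True)
    index_path = os.path.join(cache_dir, "index.json")
    lock_path = os.path.join(tempfile.gettempdir(), "hf_index.lock")

    # The first process to acquire the lock builds the index; the others block
    # here and, once the lock is released, find the existing file and skip the work.
    with suppress(Timeout), FileLock(lock_path, timeout=20):
        if not os.path.exists(index_path):
            build_index(index_path)

    return index_path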

@bhimrazy deleted the fix/hf-cache-dir branch on May 24, 2025, 09:14
Labels: bugfix, enhancement (New feature or request)
Projects: None yet

Development

Successfully merging this pull request may close these issues:

  • All DDP processes attempt to index the HF dataset — should be limited to one per node
  • HuggingFace not using specified cache_dir

4 participants