
fix: Consolidate Cache Handling + Fix DDP Multi-Indexing for huggingface datasets #569


Merged: 15 commits merged into Lightning-AI:main from fix/hf-cache-dir on May 2, 2025

Conversation

@bhimrazy (Collaborator) commented on Apr 26, 2025

What does this PR do?

🛠️ Changes in this PR

  1. Consolidate Cache Directories

    • Removed .cache/litdata-cache-index-pq.
    • Now, all chunks and index files are stored under DEFAULT_CACHE_DIR (or a user-passed cache_dir); see the usage sketch after this list.
    • Users can still separately manage indexes if needed by manually calling index_hf_dataset and passing the cache_dir.
  2. Fix DDP Multi-Indexing

    • Only one process per node now handles indexing.
    • Fixes race conditions, cache conflicts, and random DDP failures during streaming.
  3. Minor Refactoring and Tests

    • Small code cleanups and updates to related tests to match the new behavior.
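For example, passing an explicit cache_dir to StreamingDataset keeps the chunks and the index together under that one directory. A minimal usage sketch (the hf:// URL is only an example, taken from the discussion later in this thread):

import litdata as ld

# All chunks and index files end up under this single directory
# (or under DEFAULT_CACHE_DIR when cache_dir is not passed).
dataset = ld.StreamingDataset(
    "hf://datasets/philgzl/ears/data/train-*.parquet",
    cache_dir="cache",
)
dataloader = ld.StreamingDataLoader(dataset)

for batch in dataloader:
    ...  # training / processing loop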

Related issues:

Fixes #562
Fixes the issue of chunks and index files being split across multiple cache directories

PR review:

Anyone in the community is welcome to review!

Did you have fun?

Absolutely! 🚀
It’s exciting to improve LitData for smoother distributed and streaming workflows!

📋 Extra Note:

After this PR, the .cache/litdata-cache-index-pq directory becomes obsolete.
Users upgrading should delete it manually if they want to free up the space.

@bhimrazy marked this pull request as ready for review on April 26, 2025, 18:01
@bhimrazy changed the title from "[wip]: Fix/hf cache dir" to "fix: Consolidate Cache Handling + Fix DDP Multi-Indexing for huggingface datasets" on Apr 26, 2025
@bhimrazy self-assigned this on Apr 26, 2025
@bhimrazy added the enhancement (New feature or request) and bugfix labels on Apr 26, 2025
@bhimrazy requested review from deependujha and Copilot on April 26, 2025, 18:12
@Copilot (Copilot AI, Contributor) left a comment

Pull Request Overview

This PR consolidates cache directories and fixes DDP multi-indexing for Hugging Face datasets while also doing some test refactoring and minor cleanups.

  • Consolidates all cache handling under a common directory (or a user‐provided one) to simplify cache management.
  • Fixes race conditions in indexing by ensuring only one process per node handles the job.
  • Updates tests and refactors functions (e.g. switching to temporary directories when writing index files) for improved clarity and reliability.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • tests/streaming/test_parquet.py: Updated cache directory handling in tests and adjusted function calls.
  • src/litdata/utilities/parquet.py: Refactored index file writing to use temporary directories and removed redundant prints.
  • src/litdata/utilities/hf_dataset.py: Changed function signature and logic for indexing HF datasets with filelocks.
  • src/litdata/utilities/dataset_utilities.py: Introduced generate_md5_hash to standardize hashing (see the sketch below).
  • src/litdata/streaming/dataset.py: Updated index lookup logic for HF datasets to use the provided cache_dir.
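(generate_md5_hash is presumably a thin wrapper around hashlib; the following is a sketch of what such a helper typically looks like, not necessarily the exact code in dataset_utilities.py.)

import hashlib


def generate_md5_hash(value: str) -> str:
    # Stable hex digest, e.g. used to key per-dataset cache sub-directories.
    return hashlib.md5(value.encode("utf-8")).hexdigest()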

codecov bot commented Apr 26, 2025

Codecov Report

Attention: Patch coverage is 98.27586% with 1 line in your changes missing coverage. Please review.

Project coverage is 79%. Comparing base (e789fb6) to head (a58d0b8).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #569   +/-   ##
===================================
- Coverage    79%    79%   -0%     
===================================
  Files        40     40           
  Lines      6098   6111   +13     
===================================
- Hits       4818   4812    -6     
- Misses     1280   1299   +19     

@bhimrazy (Collaborator, Author) commented:

Hi @philgzl,
would you mind giving this PR a try when you have a moment?
I'd also love to hear any feedback you might have.
Thanks 😊!

@tchaton (Collaborator) left a comment:
There is still a risk of a race condition when indexing an S3 bucket with multiple nodes.

@philgzl (Contributor) commented on Apr 28, 2025

I am getting unexpected behavior with this code:

import litdata as ld


def foo():
    dset = ld.StreamingDataset("hf://datasets/philgzl/ears/data/train-*.parquet")
    dloader = ld.StreamingDataLoader(dset)
    for i, _ in enumerate(dloader):
        if i == 100:
            print("done")
            break


if __name__ == "__main__":
    foo()
    foo()

On main I get the expected behavior; the second foo() runs almost instantly as it finds the cached data in the default cache dir. However on this branch the second foo() attempts to index the dataset again and then throws an error:

Indexing progress:  83%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                      | 43/52 [00:04<00:00, 10.18step/s]
Indexing progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:04<00:00, 10.88step/s]
Index created at /home/philgzl/.lightning/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1745860357.1718044/index.json.
train-000-000000.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.8M/63.8M [00:05<00:00, 11.5MB/s]
done
Indexing progress:  65%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                                                                             | 34/52 [00:02<00:01, 13.03step/s]
Indexing progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:03<00:00, 15.44step/s]
Index created at /home/philgzl/.lightning/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1745860367.1151164/index.json.
train-000-000001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.7M/63.7M [00:07<00:00, 8.34MB/s]
Exception in thread Thread-2:%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.7M/63.7M [00:07<00:00, 7.27MB/s]
Traceback (most recent call last):
  File "/usr/lib64/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/home/philgzl/dev/litData/src/litdata/streaming/reader.py", line 236, in run
    self._config.download_chunk_from_index(chunk_index)
  File "/home/philgzl/dev/litData/src/litdata/streaming/config.py", line 145, in download_chunk_from_index
    self._downloader.download_chunk_from_index(chunk_index)
  File "/home/philgzl/dev/litData/src/litdata/streaming/downloader.py", line 70, in download_chunk_from_index
    self.download_file(remote_chunkpath, local_chunkpath)
  File "/home/philgzl/dev/litData/src/litdata/streaming/downloader.py", line 288, in download_file
    shutil.copyfile(downloaded_path, temp_file_path)
  File "/usr/lib64/python3.12/shutil.py", line 262, in copyfile
    with open(dst, 'wb') as fdst:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/philgzl/.lightning/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1745860357.1718044/train-000-000001.parquet.tmp'
train-000-000000.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63.8M/63.8M [00:07<00:00, 8.71MB/s]
done

What's weird is that if I provide cache_dir="cache" to StreamingDataset then I get no errors.

I am getting this error on two different systems (Ubuntu (WSL) + Python 3.12.3, and Fedora + Python 3.12.9).

Didn't have time to look more into it.

@bhimrazy (Collaborator, Author) commented:

There is still a risk of a race condition when indexing an S3 bucket with multiple nodes.

Umm, yes, thanks @tchaton — I hadn’t realized that. I’ll cover it shortly in a separate PR.

@bhimrazy (Collaborator, Author) commented:

Thank you @philgzl.
I think the indexing run exceeded the current lock timeout, leading to race conditions during indexing.

I’ll also test it shortly. We might need to increase the lock timeout so that the first process finishes indexing before the waiting processes give up on the lock.
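Roughly, the locking looks like the sketch below (simplified, not the exact code): a process that cannot acquire the lock within the timeout stops waiting because the Timeout is suppressed, which is where the duplicated indexing can come from, so a larger or blocking timeout should help.

import os
import tempfile
from contextlib import suppress

from filelock import FileLock, Timeout

lock_path = os.path.join(tempfile.gettempdir(), "hf_index.lock")

# A waiting process gives up after `timeout` seconds; a larger value
# (or timeout=-1 to block until the lock is released) keeps late processes
# waiting for the first one to finish indexing.
with suppress(Timeout), FileLock(lock_path, timeout=120):
    ...  # check for an existing index, otherwise build it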

@bhimrazy (Collaborator, Author) commented on May 2, 2025

On main I get the expected behavior; the second foo() runs almost instantly as it finds the cached data in the default cache dir. However on this branch the second foo() attempts to index the dataset again and then throws an error:
What's weird is that if I provide cache_dir="cache" to StreamingDataset then I get no errors.

Thanks @philgzl for catching this! 🙏
I had missed properly checking or setting the default cache_dir when no user-provided value is given. That’s why explicitly passing cache_dir="cache" avoided the re-indexing.

This should now be resolved, and the cache behavior should work as expected.

Indexing progress:  37%|███████████████████████████████▍                                                      | 19/52 [00:01<00:01, 17.47step/s]
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:01<00:00, 26.82step/s]
Index created at /cache/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1746175939.1470342/index.json.
train-000-000000.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 63.8M/63.8M [00:00<00:00, 298MB/s]
train-000-000001.parquet:   0%|                                                                                     | 0.00/63.7M [00:00<?, ?B/s]done
Using existing index at /cache/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1746175939.1470342.
train-000-000001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 63.7M/63.7M [00:00<00:00, 228MB/s]
done
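The gist of the fix is simply falling back to the default cache dir when none is passed, so both runs resolve to the same location. A simplified sketch (resolve_cache_dir is illustrative; only the DEFAULT_CACHE_DIR name comes from the PR description, and the path mirrors the default visible in the earlier logs):

import os

# Placeholder value mirroring the ~/.lightning/chunks default seen in the logs above.
DEFAULT_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".lightning", "chunks")


def resolve_cache_dir(cache_dir=None):
    # Fall back to the default so repeated runs find the same index and chunks.
    return cache_dir if cache_dir is not None else DEFAULT_CACHE_DIR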

Thanks again for the sharp feedback!

@tchaton merged commit 1dadc00 into Lightning-AI:main on May 2, 2025
29 checks passed
@deependujha (Collaborator) commented:

Hi @bhimrazy, can you please let me know where the logic is that prevents multiple processes from indexing the HF dataset in DDP?

@bhimrazy (Collaborator, Author) commented on May 6, 2025

Hi @bhimrazy, can you please let me know where the logic is that prevents multiple processes from indexing the HF dataset in DDP?

Sure @deependujha.

https://github.com/Lightning-AI/litData/pull/569/files#diff-e72f6de8a1274f83acce050ba4e129a720ea4525ebcabfda8a640b246eb30917R35-R38

    # Acquire a file lock to guarantee exclusive access,
    # ensuring that multiple processes do not create the index simultaneously.
    with suppress(Timeout), FileLock(os.path.join(tempfile.gettempdir(), "hf_index.lock"), timeout=20):
        # Check for existing index in the cache

It’s actually the lock that handles this. One of the processes acquires the lock (blocking the other processes from acquiring it), completes the indexing, and then releases it.

Once it’s released, the other processes see that the index file already exists and skip the indexing step.

There’s still a potential issue if the first process doesn’t complete indexing within the set timeframe — in that case, other processes might start indexing too, leading to the same problem. But this is unlikely since most indexing finishes within 2–5 seconds.
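Put together, the pattern is roughly the following self-contained sketch (index_hf_dataset_once and build_index are illustrative stand-ins, not the actual helpers in hf_dataset.py):

import json
import os
import tempfile
from contextlib import suppress

from filelock import FileLock, Timeout


def build_index(index_path: str) -> None:
    # Hypothetical stand-in for the real indexing of the parquet files.
    with open(index_path, "w") as f:
        json.dump({"chunks": []}, f)


def index_hf_dataset_once(cache_dir: str) -> str:
    # One process builds the index; the rest wait on the lock, then reuse it.
    os.makedirs(cache_dir, exist_ok=True)
    index_path = os.path.join(cache_dir, "index.json")
    lock_path = os.path.join(tempfile.gettempdir(), "hf_index.lock")

    # The first process to acquire the lock builds the index; the others block
    # here and, once the lock is released, find the existing file and skip the work.
    with suppress(Timeout), FileLock(lock_path, timeout=20):
        if not os.path.exists(index_path):
            build_index(index_path)

    return index_path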

@bhimrazy deleted the fix/hf-cache-dir branch on May 24, 2025, 09:14
Labels: bugfix, enhancement (New feature or request)
Projects: None yet

Development

Successfully merging this pull request may close these issues:

  • All DDP processes attempt to index the HF dataset — should be limited to one per node
  • HuggingFace not using specified cache_dir

4 participants