fix: Consolidate Cache Handling + Fix DDP Multi-Indexing for huggingface datasets #569
Conversation
Pull Request Overview
This PR consolidates cache directories and fixes DDP multi-indexing for Hugging Face datasets while also doing some test refactoring and minor cleanups.
- Consolidates all cache handling under a common directory (or a user-provided one) to simplify cache management.
- Fixes race conditions in indexing by ensuring only one process per node handles the job.
- Updates tests and refactors functions (e.g. switching to temporary directories when writing index files) for improved clarity and reliability; a sketch of the temp-directory pattern follows below.
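The temporary-directory pattern mentioned in the last bullet is worth making concrete. Below is a minimal, hypothetical sketch (not the library's actual API; `write_index` and its arguments are placeholder names) of writing `index.json` into a temp directory and then moving it into the cache, so readers never observe a half-written file:

```python
import json
import os
import tempfile


def write_index(index: dict, cache_dir: str) -> str:
    # Hypothetical helper, for illustration only.
    os.makedirs(cache_dir, exist_ok=True)
    # Create the temp dir inside cache_dir so the final rename stays on
    # one filesystem and os.replace remains atomic.
    with tempfile.TemporaryDirectory(dir=cache_dir) as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "index.json")
        with open(tmp_path, "w") as f:
            json.dump(index, f)
        final_path = os.path.join(cache_dir, "index.json")
        os.replace(tmp_path, final_path)  # atomic swap into place
    return final_path
```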
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/streaming/test_parquet.py | Updated cache directory handling in tests and adjusted function calls. |
| src/litdata/utilities/parquet.py | Refactored index file writing to use temporary directories and removed redundant prints. |
| src/litdata/utilities/hf_dataset.py | Changed function signature and logic for indexing HF datasets with filelocks. |
| src/litdata/utilities/dataset_utilities.py | Introduced generate_md5_hash to standardize hashing (see the sketch after this table). |
| src/litdata/streaming/dataset.py | Updated index lookup logic for HF datasets to use the provided cache_dir. |
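For reference, here is a minimal sketch of what a `generate_md5_hash` helper typically looks like; the exact signature and usage in the PR may differ:

```python
import hashlib


def generate_md5_hash(data: str) -> str:
    # Hex MD5 digest of a string, e.g. to key cache subdirectories by
    # dataset URL. (Sketch only; the PR's actual signature may differ.)
    return hashlib.md5(data.encode("utf-8")).hexdigest()


# Example: a stable cache key derived from a dataset URI.
print(generate_md5_hash("hf://datasets/philgzl/ears/data/train-*.parquet"))
```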
Codecov Report

Attention: Patch coverage is …

```
@@ Coverage Diff @@
##           main   #569   +/- ##
==================================
- Coverage    79%    79%    -0%
==================================
  Files        40     40
  Lines      6098   6111   +13
==================================
- Hits       4818   4812    -6
- Misses     1280   1299   +19
```
Hi @philgzl,

There is still a risk of a race condition when indexing an S3 bucket with multiple nodes.
I am getting unexpected behaviors with this code:

```python
import litdata as ld


def foo():
    dset = ld.StreamingDataset("hf://datasets/philgzl/ears/data/train-*.parquet")
    dloader = ld.StreamingDataLoader(dset)
    for i, _ in enumerate(dloader):
        if i == 100:
            print("done")
            break


if __name__ == "__main__":
    foo()
    foo()
```

On …

What's weird is that if I provide …

I am getting this error in two different systems (Ubuntu (WSL) + Python 3.12.3 and Fedora + Python 3.12.9). Didn't have time to look more into it.
Umm, yes, thanks @tchaton; I hadn't realized that. I'll cover it shortly in a separate PR.
Thank you @philgzl. I'll also test it shortly. We might need to increase the lock timeout so that the first process finishes indexing before the lock expires.
Thanks @philgzl for catching this! 🙏 This should now be resolved, and the cache behavior should work as expected.

```
Indexing progress:  37%|███████████████████████████████▍ | 19/52 [00:01<00:01, 17.47step/s]
Indexing progress: 100%|██████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:01<00:00, 26.82step/s]
Index created at /cache/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1746175939.1470342/index.json.
train-000-000000.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 63.8M/63.8M [00:00<00:00, 298MB/s]
train-000-000001.parquet:   0%| | 0.00/63.7M [00:00<?, ?B/s]done
Using existing index at /cache/chunks/143564c1c17c53221f6f6a59fcb7ea8d/1746175939.1470342.
train-000-000001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████| 63.7M/63.7M [00:00<00:00, 228MB/s]
done
```

Thanks again for the sharp feedback!
Hi @bhimrazy, can you please let me know where the logic is that prevents multiple processes from indexing the HF dataset in DDP?
Sure @deependujha.

```python
# Acquire a file lock to guarantee exclusive access,
# ensuring that multiple processes do not create the index simultaneously.
with suppress(Timeout), FileLock(os.path.join(tempfile.gettempdir(), "hf_index.lock"), timeout=20):
    # Check for existing index in the cache
    ...
```

It's actually the lock that handles this. One of the processes acquires the lock (blocking the other processes from acquiring it), completes the indexing, and then releases the lock. Once it's released, the other processes see that the index file already exists and skip the indexing step.

There's still a potential issue if the first process doesn't complete indexing within the set timeframe; in that case, other processes might start indexing too, leading to the same problem. But this is unlikely, since most indexing finishes within 2–5 seconds.
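Spelled out as a self-contained sketch of this lock-then-check pattern (with `ensure_index` and `build_index` as hypothetical placeholders, not litdata's actual functions):

```python
import os
import tempfile
from contextlib import suppress

from filelock import FileLock, Timeout


def build_index(index_path: str) -> None:
    # Placeholder for the expensive indexing step.
    with open(index_path, "w") as f:
        f.write("{}")


def ensure_index(cache_dir: str) -> str:
    index_path = os.path.join(cache_dir, "index.json")
    lock_path = os.path.join(tempfile.gettempdir(), "hf_index.lock")
    # Only one process holds the lock at a time; the rest block here.
    with suppress(Timeout), FileLock(lock_path, timeout=20):
        if not os.path.exists(index_path):
            build_index(index_path)
    # By now the index was built either by this process or by another one.
    return index_path
```

Note that if the lock acquisition times out, `suppress(Timeout)` lets the process fall through anyway, which is exactly the residual risk described above.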
What does this PR do?
🛠️ Changes in this PR
Consolidate Cache Directories
- Moved cache handling from `.cache/litdata-cache-index-pq` to `DEFAULT_CACHE_DIR` (or a user-passed `cache_dir`), updating `index_hf_dataset` and passing the `cache_dir` through (a sketch of the resolution logic follows after this list).

Fix DDP Multi-Indexing
- Ensures only one process per node performs the indexing, via a file lock (see the discussion above).

Minor Refactoring and Tests
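As a rough illustration of the consolidation (the value of `DEFAULT_CACHE_DIR` below is an assumption for the sketch, not necessarily litdata's actual constant, and `resolve_cache_dir` is a placeholder name):

```python
import os
from typing import Optional

# Assumed default root for this sketch; the library's actual constant
# and location may differ.
DEFAULT_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".cache", "litdata")


def resolve_cache_dir(cache_dir: Optional[str] = None) -> str:
    # All index/cache files land under one root: the user-passed
    # cache_dir if given, otherwise DEFAULT_CACHE_DIR.
    resolved = cache_dir or DEFAULT_CACHE_DIR
    os.makedirs(resolved, exist_ok=True)
    return resolved
```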
Related issues:
Fixes #562
Fixes multi cache dir issue
PR review:
Anyone in the community is welcome to review!
Did you have fun?
📋 Extra Note:
After this PR, the `.cache/litdata-cache-index-pq` directory becomes obsolete. Users upgrading should manually delete it if needed to free up space; a minimal cleanup snippet follows below.
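For that cleanup, something like the following works, assuming the obsolete directory sits under the home directory (adjust the path if yours lives elsewhere):

```python
import shutil
from pathlib import Path

# Assumed location of the now-obsolete cache directory.
obsolete = Path.home() / ".cache" / "litdata-cache-index-pq"
shutil.rmtree(obsolete, ignore_errors=True)
```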