Feat: Using fsspec to download files #348

Merged

@deependujha deependujha commented Sep 1, 2024

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #181

  • Basic fsspec setup is done, working, and tested on S3 & GS.

  • Failing tests will be modified to match fsspec tests after approval.

  • Requirements.txt file will also be optimized after approval.
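
As a rough illustration of the approach described above, the sketch below shows how one downloader can serve several providers through fsspec. It is not the code from this PR: `split_remote_url` and `download_file` are hypothetical helper names, the file name in the usage note is only an example, and the sketch assumes the matching backend package (s3fs, gcsfs, or adlfs) is installed for the scheme in use.

```python
from urllib.parse import urlparse


def split_remote_url(url: str) -> tuple[str, str]:
    """Return (scheme, path without scheme) for a URL such as s3://bucket/key."""
    parsed = urlparse(url)
    if not parsed.scheme:
        raise ValueError(f"Expected a URL with a scheme, got: {url!r}")
    return parsed.scheme, f"{parsed.netloc}{parsed.path}"


def download_file(remote_url: str, local_path: str, **storage_options) -> None:
    """Download one remote file to local_path via fsspec (hypothetical helper)."""
    # Imported lazily so the URL helper above works without fsspec installed.
    import fsspec

    scheme, remote_path = split_remote_url(remote_url)
    # fsspec picks the backend (s3fs, gcsfs, adlfs, ...) from the scheme.
    fs = fsspec.filesystem(scheme, **storage_options)
    fs.get(remote_path, local_path)
```

A call might look like `download_file("s3://my-dummy-bucket-litdata/deep_data/index.json", "/tmp/index.json")`, with credentials resolved by the backend's usual mechanisms unless `storage_options` are passed.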


A small script to test

import litdata as ld

# Change this to an Azure or GCS URI to try another provider (tested with s3:// and gs://)
s3_uri = 's3://my-dummy-bucket-litdata/deep_data/'

dataset = ld.StreamingDataset(s3_uri, shuffle=True)
dataloader = ld.StreamingDataLoader(dataset)

for sample in dataloader:
    print(sample)
    print("="*80)
print("\nAll done\n")

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

Three resolved review threads on src/litdata/streaming/downloader.py (one outdated).
@deependujha deependujha marked this pull request as draft September 2, 2024 07:06
@deependujha
Collaborator Author

Tested successfully on S3 and GS for optimize (mode = none | append | overwrite), checkpoint, merge_datasets, and streaming_dataset.

All the "# todo: add support for other providers" comments have been removed.

codecov bot commented Sep 4, 2024

Codecov Report

Attention: Patch coverage is 55.76923% with 92 lines in your changes missing coverage. Please review.

Project coverage is 78%. Comparing base (2f78ec1) to head (b5ec077).
Report is 4 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #348   +/-   ##
===================================
- Coverage    78%    78%   -0%     
===================================
  Files        34     33    -1     
  Lines      5008   4983   -25     
===================================
- Hits       3928   3890   -38     
- Misses     1080   1093   +13     
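
Because the report rounds both project percentages to 78%, the size of the drop is not visible. The short calculation below recomputes it from the hit and line counts in the table. The patch size is derived from the stated 55.76923% and 92 missing lines; it is an inference, not a number the report states.

```python
# Figures copied from the Codecov report above.
base_hits, base_lines = 3928, 5008
head_hits, head_lines = 3890, 4983
patch_coverage, patch_missing = 0.5576923, 92

base_pct = 100 * base_hits / base_lines
head_pct = 100 * head_hits / head_lines

# Derived, not stated in the report: the patch size implied by
# "55.76923% covered with 92 lines missing".
patch_lines = round(patch_missing / (1 - patch_coverage))
patch_hits = patch_lines - patch_missing

print(f"base {base_pct:.2f}%  head {head_pct:.2f}%  change {head_pct - base_pct:+.2f} points")
print(f"patch: {patch_hits}/{patch_lines} lines covered")
```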

@bhimrazy
Collaborator

Here's a sneak peek of my version of the benchmark tests. I think we might need to run a few more to ensure everything is in good shape.

[Image: benchmark results]

cc: @Borda @tchaton


gitguardian bot commented Sep 18, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

Since your pull request originates from a forked repository, GitGuardian is not able to associate the secrets uncovered with secret incidents on your GitGuardian dashboard.
Skipping this check run and merging your pull request will create secret incidents on your GitGuardian dashboard.

🔎 Detected hardcoded secret in your pull request
GitGuardian id | GitGuardian status | Secret | Commit | Filename
5685611 | Triggered | Generic High Entropy Secret | 79a3ad8 | tests/streaming/test_resolver.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@deependujha
Collaborator Author

deependujha commented Sep 18, 2024

If you hit this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
boto3 1.35.21 requires botocore<1.36.0,>=1.35.21, but you have botocore 1.35.16 which is incompatible.
  • Run this command:
pip install -U boto boto3 botocore aiobotocore
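
To see why pip reports the conflict, the sketch below checks the installed botocore version against the range boto3 1.35.21 requires. It is a simplified illustration that handles plain dotted numeric versions only; `parse_version` and `satisfies` are hypothetical helpers, and real tooling should rely on pip or the `packaging` library.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.35.16' into (1, 35, 16) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed: str, minimum: str, below: str) -> bool:
    """True if minimum <= installed < below."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


# boto3 1.35.21 requires botocore>=1.35.21,<1.36.0 (from the error above).
for candidate in ("1.35.16", "1.35.21"):
    ok = satisfies(candidate, minimum="1.35.21", below="1.36.0")
    print(f"botocore {candidate}: {'compatible' if ok else 'incompatible'}")
```

Upgrading the packages together, as the command above does, lets pip pick versions that fall inside each other's ranges.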

Current update

  • Docs updated (fsspec storage options) ✅
  • Deleted commented files ✅
  • Tested on all three storage providers (S3, GCS, Azure blob) ✅
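
For readers unfamiliar with fsspec storage options, the sketch below shows the kind of per-provider credentials dictionary the fsspec backends accept. The keys are constructor arguments of s3fs, gcsfs, and adlfs; the values are placeholders, and `storage_options_for` is a hypothetical helper. Passing the result as `storage_options=` to `StreamingDataset` is an assumption based on the docs update mentioned above, so check the docs for the exact parameter name.

```python
from urllib.parse import urlparse

# Placeholder values only. Real credentials should come from the environment
# or a secrets manager, never from source code.
EXAMPLE_STORAGE_OPTIONS = {
    "s3": {"key": "<access-key-id>", "secret": "<secret-access-key>"},  # s3fs
    "gs": {"project": "<gcp-project>", "token": "<path-to-service-account.json>"},  # gcsfs
    "az": {"account_name": "<account>", "account_key": "<account-key>"},  # adlfs
}


def storage_options_for(url: str) -> dict:
    """Return a copy of the example options matching the URL scheme."""
    scheme = urlparse(url).scheme
    if scheme not in EXAMPLE_STORAGE_OPTIONS:
        raise ValueError(f"No example storage options for scheme {scheme!r}")
    return dict(EXAMPLE_STORAGE_OPTIONS[scheme])


# Assumed usage (not verified against the merged code):
#   ld.StreamingDataset(url, storage_options=storage_options_for(url))
```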

litdata-fsspec

@deependujha deependujha requested review from awaelchli and removed request for awaelchli September 18, 2024 08:03
@deependujha deependujha mentioned this pull request Sep 18, 2024
4 tasks
Collaborator

@tchaton tchaton left a comment

Awesome work!

@tchaton tchaton merged commit 719bae2 into Lightning-AI:main Sep 19, 2024
29 checks passed
@deependujha deependujha deleted the feat/using-fsspec-to-download-files branch September 19, 2024 08:19
tchaton added a commit that referenced this pull request Sep 19, 2024
Successfully merging this pull request may close these issues.

Using fsspec to download files
4 participants