
Feat: Using fsspec to download files #348

Merged

Changes from all commits (48 commits)
96ec15d
fsspec basic setup done and working for s3
deependujha Sep 1, 2024
45b59ae
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 1, 2024
74dae21
fix storage option in fsspec
deependujha Sep 2, 2024
fcb4d95
pass down `storage_options` in dataset utilities
deependujha Sep 2, 2024
3080c2c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 2, 2024
c31e259
tested successfully on S3 and GS for (mode= none | append | overwrite…
deependujha Sep 3, 2024
0c761b1
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 3, 2024
2377983
fixed mypy errors and lock files when uploading/downloading
deependujha Sep 4, 2024
ffbf51d
update
deependujha Sep 4, 2024
de8b83b
fixed test `test_try_create_cache_dir`
deependujha Sep 4, 2024
e712327
fixed test: `test_reader_chunk_removal`
deependujha Sep 4, 2024
e118ba9
all tests passed
deependujha Sep 4, 2024
d3450dc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 4, 2024
ed0fff8
update
deependujha Sep 4, 2024
08236e8
update
deependujha Sep 4, 2024
12b049b
boto3 stop bothering me
deependujha Sep 4, 2024
d560d91
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 4, 2024
27644d3
update
deependujha Sep 4, 2024
bf06cf9
update
deependujha Sep 4, 2024
bdc13f4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 4, 2024
909b5cb
update
deependujha Sep 4, 2024
f661ae1
Merge branch 'main' into feat/using-fsspec-to-download-files
deependujha Sep 4, 2024
5671d11
tested on azure and made sure `storage_option` is working in all cases
deependujha Sep 5, 2024
2beebc9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
dd2e742
update
deependujha Sep 5, 2024
f555069
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
87a9556
update
deependujha Sep 5, 2024
8e9d448
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
555eb19
use s5cmd to download files if available
deependujha Sep 5, 2024
5ef4004
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2024
67205ea
add default storage_options
deependujha Sep 6, 2024
69fb43d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
b49a126
raise error if cloud is not supported
deependujha Sep 6, 2024
dbe8b0e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
5a81f04
update
deependujha Sep 6, 2024
848484a
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
4d62fdd
fix windows error related to urllib parse scheme
deependujha Sep 6, 2024
6b961f7
Merge branch 'main' into feat/using-fsspec-to-download-files
deependujha Sep 6, 2024
e544d09
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
31f6e5a
Merge branch 'main' into feat/using-fsspec-to-download-files
bhimrazy Sep 16, 2024
5d1ec46
Merge branch 'main' into feat/using-fsspec-to-download-files
bhimrazy Sep 17, 2024
e68076d
cleanup commented code
deependujha Sep 18, 2024
79a3ad8
Merge branch 'main' into feat/using-fsspec-to-download-files
deependujha Sep 18, 2024
2036e37
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 18, 2024
e230ceb
update
deependujha Sep 18, 2024
e60f9ae
readme updated
deependujha Sep 18, 2024
feb5d48
increase test_dataset_resume_on_future_chunk timeout time to 120 seconds
deependujha Sep 18, 2024
b5ec077
update
deependujha Sep 18, 2024
31 changes: 22 additions & 9 deletions README.md
@@ -217,9 +217,8 @@ Additionally, you can inject client connection settings for [S3](https://boto3.a
 from litdata import StreamingDataset
 
 storage_options = {
-    "endpoint_url": "your_endpoint_url",
-    "aws_access_key_id": "your_access_key_id",
-    "aws_secret_access_key": "your_secret_access_key",
+    "key": "your_access_key_id",
+    "secret": "your_secret_access_key",
 }
 
 dataset = StreamingDataset('s3://my-bucket/my-data', storage_options=storage_options)
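The hunk above renames the boto3-style credential keys to the names `s3fs` expects. A small hypothetical migration helper for old option dicts — `translate_storage_options` is not a litdata API, and the mapping covers only the common credential keys:

```python
# boto3-style option names (left) -> fsspec/s3fs names (right).
_KEY_RENAMES = {
    "aws_access_key_id": "key",
    "aws_secret_access_key": "secret",
    "aws_session_token": "token",
}


def translate_storage_options(options: dict) -> dict:
    """Rename boto3-style credential keys to their s3fs equivalents,
    passing any other keys (e.g. endpoint_url) through unchanged."""
    return {_KEY_RENAMES.get(k.lower(), k): v for k, v in options.items()}
```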
@@ -264,33 +263,47 @@ for batch in val_dataloader:

 

 The StreamingDataset supports reading optimized datasets from common cloud providers.
 
 ```python
 import os
 import litdata as ld
 
 # Read data from AWS S3
 aws_storage_options={
-    "AWS_ACCESS_KEY_ID": os.environ['AWS_ACCESS_KEY_ID'],
-    "AWS_SECRET_ACCESS_KEY": os.environ['AWS_SECRET_ACCESS_KEY'],
+    "key": os.environ['AWS_ACCESS_KEY_ID'],
+    "secret": os.environ['AWS_SECRET_ACCESS_KEY'],
 }
 dataset = ld.StreamingDataset("s3://my-bucket/my-data", storage_options=aws_storage_options)
 
 # Read data from GCS
 gcp_storage_options={
     "project": os.environ['PROJECT_ID'],
+    "token": {
+        # dumped from cat ~/.config/gcloud/application_default_credentials.json
+        "account": "",
+        "client_id": "your_client_id",
+        "client_secret": "your_client_secret",
+        "quota_project_id": "your_quota_project_id",
+        "refresh_token": "your_refresh_token",
+        "type": "authorized_user",
+        "universe_domain": "googleapis.com",
+    }
 }
 dataset = ld.StreamingDataset("gs://my-bucket/my-data", storage_options=gcp_storage_options)
 
 # Read data from Azure
 azure_storage_options={
-    "account_url": f"https://{os.environ['AZURE_ACCOUNT_NAME']}.blob.core.windows.net",
-    "credential": os.environ['AZURE_ACCOUNT_ACCESS_KEY']
+    "account_name": "azure_account_name",
+    "account_key": os.environ['AZURE_ACCOUNT_ACCESS_KEY']
 }
 dataset = ld.StreamingDataset("azure://my-bucket/my-data", storage_options=azure_storage_options)
 ```
 
+- For more details on which storage options are supported, please refer to:
+  - [AWS S3 storage options](https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L176)
+  - [GCS storage options](https://github.com/fsspec/gcsfs/blob/main/gcsfs/core.py#L154)
+  - [Azure storage options](https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L124)

</details>
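The GCS example above embeds the contents of `application_default_credentials.json` directly in the options dict. Rather than hard-coding it, the token can be loaded from that file; this sketch, with its assumed default path and `gcs_storage_options` helper name, is illustrative and not part of litdata:

```python
import json
import os


def gcs_storage_options(project, credentials_path=None):
    """Build a gcsfs-style storage_options dict, reading the token from
    the gcloud application-default credentials file (sketch only)."""
    path = credentials_path or os.path.expanduser(
        "~/.config/gcloud/application_default_credentials.json"
    )
    with open(path) as f:
        token = json.load(f)  # the same dict shown inline in the README example
    return {"project": project, "token": token}
```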

<details>
4 changes: 3 additions & 1 deletion requirements.txt
@@ -2,5 +2,7 @@ torch
 lightning-utilities
 filelock
 numpy
-boto3
+# boto3
 requests
+fsspec
+fsspec[s3] # aws s3
2 changes: 2 additions & 0 deletions requirements/extras.txt
@@ -5,3 +5,5 @@ pyarrow
 tqdm
 lightning-sdk ==0.1.17 # Must be pinned to ensure compatibility
 google-cloud-storage
+fsspec[gs] # google cloud storage
+fsspec[abfs] # azure blob
1 change: 1 addition & 0 deletions src/litdata/constants.py
@@ -85,3 +85,4 @@
 _TIME_FORMAT = "%Y-%m-%d_%H-%M-%S.%fZ"
 _IS_IN_STUDIO = bool(os.getenv("LIGHTNING_CLOUD_PROJECT_ID", None)) and bool(os.getenv("LIGHTNING_CLUSTER_ID", None))
 _ENABLE_STATUS = bool(int(os.getenv("ENABLE_STATUS_REPORT", "0")))
+_SUPPORTED_CLOUD_PROVIDERS = ["s3", "gs", "azure", "abfs"]