Feat: Add support for parquet files #443

Merged: 28 commits from feat/add-hf-parquet-support into Lightning-AI:main on Feb 3, 2025

Conversation

@deependujha (Collaborator) commented Jan 6, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #191

  • Index a Parquet dataset stored locally or in the cloud (S3 or GS):
import litdata as ld

pq_data_uri = "gs://deep-litdata-parquet/my-parquet-data"

ld.index_parquet_dataset(pq_data_uri)
  • Use it as a normal optimized dataset (see also the dataloader sketch below):
import litdata as ld
from litdata.streaming.item_loader import ParquetLoader

ds = ld.StreamingDataset('gs://deep-litdata-parquet/my-parquet-data', item_loader=ParquetLoader())

for _ds in ds:
    print(f"{_ds=}")
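
For completeness, the same dataset plugs into litdata's dataloader path as well. A minimal sketch (the `StreamingDataLoader` arguments shown here are illustrative, not taken from this PR):

import litdata as ld
from litdata.streaming.item_loader import ParquetLoader

ds = ld.StreamingDataset('gs://deep-litdata-parquet/my-parquet-data', item_loader=ParquetLoader())

# Batch and stream with litdata's own dataloader
dl = ld.StreamingDataLoader(ds, batch_size=4, num_workers=2)
for batch in dl:
    print(batch)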

Benchmark on the data prep machine

[Screenshot: benchmark results, Feb 1, 2025]

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@deependujha deependujha requested a review from tchaton as a code owner January 6, 2025 09:38
@deependujha deependujha marked this pull request as draft January 6, 2025 09:38

codecov bot commented Jan 6, 2025

Codecov Report

Attention: Patch coverage is 64.70588% with 54 lines in your changes missing coverage. Please review.

Project coverage is 78%. Comparing base (ee77852) to head (59d9d14).
Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff          @@
##           main   #443    +/-   ##
====================================
- Coverage    78%    78%    -0%     
====================================
  Files        36     37     +1     
  Lines      5217   5365   +148     
====================================
+ Hits       4088   4182    +94     
- Misses     1129   1183    +54     

@tchaton (Collaborator) commented Jan 24, 2025

Hey @deependujha, nice progress ;)

@tchaton (Collaborator) left a comment

This is dope. If we could automatically index an s3 folder and generate an index file, it would be dope.

import polars as pl
import fsspec

file_path = "s3://your-bucket/path/to/your-file.parquet"

# Open the Parquet file with fsspec
with fsspec.open(file_path, mode="rb") as f:
    # Read the Parquet file and count its rows
    num_rows = pl.read_parquet(f, use_pyarrow=True).shape[0]
    print(f"Number of rows: {num_rows}")
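
Note that `pl.read_parquet` materializes the whole file just to count rows. A metadata-only variant, assuming pyarrow is available, reads only the Parquet footer:

import fsspec
import pyarrow.parquet as pq

file_path = "s3://your-bucket/path/to/your-file.parquet"

with fsspec.open(file_path, mode="rb") as f:
    # Only the footer metadata is parsed; the row groups are never read
    num_rows = pq.ParquetFile(f).metadata.num_rows
    print(f"Number of rows: {num_rows}")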


gitguardian bot commented Jan 28, 2025

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

Since your pull request originates from a forked repository, GitGuardian is not able to associate the secrets uncovered with secret incidents on your GitGuardian dashboard.
Skipping this check run and merging your pull request will create secret incidents on your GitGuardian dashboard.

🔎 Detected hardcoded secret in your pull request
GitGuardian id: 5685611
GitGuardian status: Triggered
Secret: Generic High Entropy Secret
Commit: 76efafb
Filename: tests/streaming/test_resolver.py
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely, following best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@deependujha deependujha marked this pull request as ready for review January 28, 2025 12:48
@deependujha deependujha changed the title WIP: Add support for parquet files & HF datasets Feat: Add support for parquet files Jan 28, 2025
@deependujha (Collaborator, Author) commented

Adding support for directly consuming HF datasets is an exciting direction!

For HF datasets, my current idea involves iterating through all the Parquet files in the HF repository and creating an index.json file that is stored in a cache (since modifying the original dataset is not feasible); a rough sketch follows below.

When using the streaming dataset/dataloader, we would then pass this separate index.json file from the cache.

At this point, I'm uncertain about the exact approach for handling HF datasets comprehensively. This PR is ready for review and lays the groundwork for future enhancements. We can discuss HF dataset integration in a subsequent PR.
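
A rough sketch of that indexing idea. The helper name, the cache layout, and the exact index schema are illustrative assumptions; only the "chunk_size" key is borrowed from the loader code further down, which reads `curr_chunk["chunk_size"]`:

import json
import os

import fsspec
import pyarrow.parquet as pq

def build_parquet_index(dataset_uri: str, cache_dir: str) -> str:
    """Record per-file row counts for every Parquet file under `dataset_uri`
    in a cache-local index.json, leaving the original dataset untouched."""
    fs, _, _ = fsspec.get_fs_token_paths(dataset_uri)
    chunks = []
    for path in sorted(fs.glob(dataset_uri.rstrip("/") + "/**/*.parquet")):
        # Footer-only read: cheap even for large remote files
        with fs.open(path, "rb") as f:
            num_rows = pq.ParquetFile(f).metadata.num_rows
        chunks.append({"filename": os.path.basename(path), "chunk_size": num_rows})
    index_path = os.path.join(cache_dir, "index.json")
    with open(index_path, "w") as f:
        json.dump({"chunks": chunks}, f)
    return index_path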

@tchaton (Collaborator) left a comment

This is quite awesome!

@tchaton (Collaborator) left a comment

Can you add the benchmarks to the description?

@justusschock (Member) left a comment

looks great!

@tchaton tchaton merged commit 7cbb3ef into Lightning-AI:main Feb 3, 2025
29 checks passed
@lantiga (Collaborator) left a comment

Looks great, Deepend

@@ -642,6 +642,37 @@ The `overwrite` mode will delete the existing data and start from fresh.

</details>

<details>
<summary> ✅ Index parquet datasets</summary>
Collaborator:

"Index" may not be immediately clear to users, imo. Ultimately, what users get is the ability to "stream Parquet datasets"; I'd use that as the title. Indexing is a technical detail.

Collaborator:

I'd also add a line or two explaining how big of a deal this is : ) "Stream Parquet files directly without converting them to the LitData optimized binary format" or something of this nature.

Collaborator (Author):

I've made the changes. Since this PR was already merged, the new changes are in PR #460.

class ParquetLoader(BaseItemLoader):
    def __init__(self) -> None:
        if not _POLARS_AVAILABLE:
            raise ModuleNotFoundError("Please, run: `pip install polars`")
Collaborator:

Might be good to prepend "You are using the Parquet item loader, which depends on Polars. Please run: pip install polars"
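
Applied to the snippet above, that suggestion would read something like this (a sketch of the reviewer's wording, not the merged code):

if not _POLARS_AVAILABLE:
    # More actionable error: name the feature that needs the dependency
    raise ModuleNotFoundError(
        "You are using the Parquet item loader, which depends on Polars. "
        "Please run: `pip install polars`"
    )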

Collaborator:

Do we need to make bound checks on the version?

@deependujha (Collaborator, Author) commented Feb 4, 2025

I'm not sure about the exact version bound. To be on the safer side for now, I've simply updated it to `polars>1.0.0`.

Is this fine, or should I refine it further?
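
A minimal sketch of such a bound check, assuming the `lightning_utilities` RequirementCache pattern (the constant name matches the loader snippet above; the pattern itself is an assumption about this codebase):

from lightning_utilities.core.imports import RequirementCache

# Truthy only if polars is installed and satisfies the version specifier
_POLARS_AVAILABLE = RequirementCache("polars>1.0.0")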

            begin += curr_chunk["chunk_size"]
        return intervals

    def pre_load_chunk(self, chunk_index: int, chunk_filepath: str) -> None:
Collaborator:

Is there a fundamental reason why we're not pre-loading, or is it just for sequencing?

Collaborator (Author):

Apologies for the oversight. Thanks for pointing that out!

I've made the necessary changes now to include pre-loading as suggested.
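
For illustration, a minimal version of such a pre-load hook could look like this (the `_df_cache` attribute and the polars-based read are assumptions, not the merged implementation):

import os

import polars as pl

class ParquetLoader(BaseItemLoader):  # continuing the loader sketch from above
    def pre_load_chunk(self, chunk_index: int, chunk_filepath: str) -> None:
        # Eagerly parse the Parquet chunk into an in-memory cache so that
        # later item lookups don't block on parsing the file.
        if chunk_filepath not in self._df_cache and os.path.exists(chunk_filepath):
            self._df_cache[chunk_filepath] = pl.read_parquet(chunk_filepath)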

@deependujha deependujha mentioned this pull request Feb 4, 2025
@deependujha deependujha deleted the feat/add-hf-parquet-support branch February 4, 2025 06:07
Development

Successfully merging this pull request may close these issues.

Add support for parquet files for storing the chunks
4 participants