Feat: Add support for parquet files #443
Conversation
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff           @@
##            main    #443     +/-  ##
=======================================
- Coverage     78%     78%     -0%
=======================================
  Files         36      37      +1
  Lines       5217    5365    +148
=======================================
+ Hits        4088    4182     +94
- Misses      1129    1183     +54
```
Hey @deependujha, nice progress ;)
This is dope. If we could automatically index an S3 folder and generate an index file, it would be dope.
```python
import fsspec
import polars as pl

file_path = "s3://your-bucket/path/to/your-file.parquet"

# Open the Parquet file with fsspec
with fsspec.open(file_path, mode="rb") as f:
    # Read the file and count its rows
    num_rows = pl.read_parquet(f, use_pyarrow=True).shape[0]

print(f"Number of rows: {num_rows}")
```
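If only the row count is needed, a lighter-weight alternative sketch (not from the original comment) is to read just the Parquet footer metadata instead of materializing the whole table:

```python
import fsspec
import pyarrow.parquet as pq

file_path = "s3://your-bucket/path/to/your-file.parquet"

with fsspec.open(file_path, mode="rb") as f:
    # num_rows is stored in the footer metadata, so no row groups are decoded
    num_rows = pq.ParquetFile(f).metadata.num_rows

print(f"Number of rows: {num_rows}")
```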
| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
|---|---|---|---|---|
| 5685611 | Triggered | Generic High Entropy Secret | 76efafb | tests/streaming/test_resolver.py |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secret safely. Learn the best practices here.
- Revoke and rotate this secret.
- If possible, rewrite git history. Rewriting git history is not a trivial act: you might completely break other contributing developers' workflows, and you risk accidentally deleting legitimate data.

To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials;
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation.

🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.
Adding support for directly consuming HF datasets is an exciting direction! For HF datasets, my current idea involves iterating through all the Parquet files in the HF repository and creating an index.json file that is stored in a cache (since modifying the original dataset is not feasible). When using the streaming dataset/dataloader, we would then pass this separate index.json file from the cache. At this point, I'm uncertain about the exact approach for handling HF datasets comprehensively. This PR is ready for review and lays the groundwork for future enhancements. We can discuss HF dataset integration in a subsequent PR.
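As a rough illustration of that idea, the sketch below lists the Parquet files of an HF dataset repo, reads each file's row count from its footer metadata, and writes an index.json into a local cache. The repo id, cache path, and index schema are assumptions for illustration, not the format this PR produces.

```python
import json
import os

import pyarrow.parquet as pq
from huggingface_hub import HfFileSystem

repo_id = "user/dataset"  # hypothetical HF dataset repo
cache_dir = os.path.expanduser("~/.cache/parquet_index")  # illustrative cache location
os.makedirs(cache_dir, exist_ok=True)

fs = HfFileSystem()
chunks = []
for path in fs.glob(f"datasets/{repo_id}/**/*.parquet"):
    with fs.open(path, "rb") as f:
        # Row count comes from the Parquet footer; no data pages are read.
        num_rows = pq.ParquetFile(f).metadata.num_rows
    chunks.append({"filename": path, "chunk_size": num_rows})

# The index lives in the cache, since the original dataset cannot be modified.
with open(os.path.join(cache_dir, "index.json"), "w") as f:
    json.dump({"chunks": chunks}, f)
```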
This is quite awesome!
Can you add the benchmarks in the description?
looks great!
Looks great, Deepend
```
@@ -642,6 +642,37 @@ The `overwrite` mode will delete the existing data and start from fresh.

</details>

<details>
  <summary> ✅ Index parquet datasets</summary>
```
"Index" may not be immediately clear to users imo.
Ultimately what users get is the ability to "Stream Parquet datasets", I'd have this as the title. Index is a technical detail.
I'd also add a line or two explaining how big of a deal this is :) "Stream Parquet files directly without converting them to the LitData optimized binary format", or something of this nature.
I've made the changes. Since this PR was already merged, the new changes are in PR #460.
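For context, here is a rough, hedged sketch of what "streaming Parquet datasets" could look like from the user's side. It assumes the new ParquetLoader is passed via StreamingDataset's `item_loader` argument and that an index has already been generated for the Parquet folder; the exact entry points and module paths may differ from the merged API.

```python
# Hedged usage sketch, not the documented API. Assumes:
#   * a ParquetLoader item loader exists (added in this PR),
#   * the Parquet folder has already been indexed (index.json present),
#   * StreamingDataset accepts the loader via `item_loader`.
from litdata import StreamingDataset
from litdata.streaming.item_loader import ParquetLoader

dataset = StreamingDataset(
    "s3://my-bucket/my-parquet-folder",  # illustrative, already-indexed location
    item_loader=ParquetLoader(),
)

for item in dataset:
    print(item)  # each item corresponds to one row of the Parquet dataset
```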
```python
class ParquetLoader(BaseItemLoader):
    def __init__(self) -> None:
        if not _POLARS_AVAILABLE:
            raise ModuleNotFoundError("Please, run: `pip install polars`")
```
Might be good to prepend something like: "You are using the Parquet item loader, which depends on Polars. Please run: `pip install polars`"
Do we need to make bound checks on the version?
I'm not sure of the exact version bound. To be on the safer side for now, I've simply updated it to `polars>1.0.0`.
Is this fine, or should I refine it further?
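For illustration, here is a hedged sketch of one way to combine both suggestions from this thread (the friendlier error message and the version bound) using lightning-utilities' RequirementCache; the names and exact wiring are assumptions, not necessarily how the merged code reads.

```python
from lightning_utilities.core.imports import RequirementCache

# Availability flag that also enforces the version bound discussed above.
_POLARS_AVAILABLE = RequirementCache("polars>1.0.0")

if not _POLARS_AVAILABLE:
    raise ModuleNotFoundError(
        "You are using the Parquet item loader, which depends on Polars > 1.0.0. "
        "Please run: `pip install polars`"
    )
```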
```python
            begin += curr_chunk["chunk_size"]
        return intervals

    def pre_load_chunk(self, chunk_index: int, chunk_filepath: str) -> None:
```
Is there a fundamental reason why we're not pre-loading, or is it just for sequencing?
Apologies for the oversight. Thanks for pointing that out!
I've made the necessary changes now to include pre-loading as suggested.
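For illustration, a minimal sketch of what such a pre-loading hook could look like, assuming the loader keeps a per-chunk dataframe cache (the `self._df` dict and the method body are assumptions, not the merged implementation):

```python
import os

import polars as pl


def pre_load_chunk(self, chunk_index: int, chunk_filepath: str) -> None:
    # Skip chunks that are already cached or not yet downloaded.
    if chunk_filepath in self._df or not os.path.exists(chunk_filepath):
        return
    # Eagerly load the chunk so later item reads are served from memory.
    self._df[chunk_filepath] = pl.read_parquet(chunk_filepath)
```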
Before submitting

What does this PR do?

Fixes #191

Benchmark on Data prep machine

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃