This would enable users to avoid converting their dataset if they already have it as parquet folders. We would need to run an indexing pass, but this isn't too painful.
If the dataset consists only of parquet files, use pyarrow's read_table function to load each parquet file one by one; the writer then only records the number of files and the column types in the index.json file, and no chunk files are created.
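A minimal sketch of that indexing pass, assuming an index.json layout with per-file row counts and column types (the helper name and the exact schema are assumptions, not the real format):

```python
import json
import os

import pyarrow.parquet as pq


def generate_parquet_index(data_dir: str) -> dict:
    # Hypothetical helper: load each parquet file one by one with pyarrow and
    # record only row counts and column types -- no chunk files are written.
    # The index.json schema below is an assumption, not the actual format.
    files = sorted(f for f in os.listdir(data_dir) if f.endswith(".parquet"))
    chunks = []
    column_types = None
    for name in files:
        table = pq.read_table(os.path.join(data_dir, name))
        if column_types is None:
            column_types = {field.name: str(field.type) for field in table.schema}
        chunks.append({"filename": name, "num_rows": table.num_rows})

    index = {"chunks": chunks, "column_types": column_types}
    with open(os.path.join(data_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index
```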
In the reader, all indices will remain as usual; only the read at index i changes:
df.slice(7, 1).to_pandas().to_dict()  # value at row 7 of that parquet file
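A rough sketch of how the reader could map a global sample index to a (file, row) pair and then slice out that single row (the ParquetReader name and the cumulative-row bookkeeping are assumptions, not the actual reader):

```python
import bisect
import itertools
import os

import pyarrow.parquet as pq


class ParquetReader:
    """Hypothetical reader sketch: maps a global sample index to (file, row)."""

    def __init__(self, data_dir: str, index: dict):
        self.data_dir = data_dir
        self.files = [c["filename"] for c in index["chunks"]]
        # Cumulative row counts so bisect can find the owning file quickly.
        self.cum_rows = list(itertools.accumulate(c["num_rows"] for c in index["chunks"]))

    def __getitem__(self, i: int) -> dict:
        file_idx = bisect.bisect_right(self.cum_rows, i)
        row_in_file = i - (self.cum_rows[file_idx - 1] if file_idx else 0)
        df = pq.read_table(os.path.join(self.data_dir, self.files[file_idx]))
        # Same idea as above: slice out a single row and return it as a dict.
        return df.slice(row_in_file, 1).to_pandas().to_dict("records")[0]
```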
If a parquet dataset has no index.json file, we can still call the helper function to generate index.json on the fly and then StreamingDataset takes control.
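Usage of that fallback could look roughly like this (generate_parquet_index is the hypothetical helper sketched above, the path is a placeholder, and the import/constructor details may differ from the actual library):

```python
import os

# Import path is an assumption and depends on the library version.
from litdata import StreamingDataset

data_dir = "/data/my-parquet-dataset"  # placeholder path

# If the folder has no index.json yet, build one on the fly first
# (generate_parquet_index is the hypothetical helper sketched above).
if not os.path.exists(os.path.join(data_dir, "index.json")):
    generate_parquet_index(data_dir)

# StreamingDataset then takes control as usual.
dataset = StreamingDataset(data_dir)
```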
Why no multithreading or multiprocessing while creating the index.json file:
Parquet files are uncompressed once loaded into memory, so loading several of them at once may exceed the memory limit.
Yes, that's what I had in mind. The main challenge will be to make the slicing and reading as fast as possible. It might be worth using https://github.com/pola-rs/polars
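For example, polars can lazily scan a parquet file and slice out a single row, which may avoid materializing the whole file, assuming slice pushdown applies here; the file name is just a placeholder, and this would need benchmarking against the pyarrow path:

```python
import polars as pl

# Lazily scan the parquet file and slice out a single row; polars may be able
# to push the slice down so only the needed part of the file is decoded.
row = (
    pl.scan_parquet("chunk-0007.parquet")
    .slice(7, 1)  # row 7 of that file
    .collect()
    .to_dicts()[0]
)
print(row)
```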