Problem Description
Using PyArrow to read only the first n rows of a file, as suggested in #24511, could make it more convenient to read part of a large Parquet file whose full contents may not fit into memory as a DataFrame.
Feature Description
Add a parameter, n_rows, to read_parquet. The PyArrow implementation would use the ParquetFile.iter_batches() generator, and the fastparquet implementation would use ParquetFile.head() (although the latter would be purely for convenience and would not bring any performance benefit).
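For illustration, here is a minimal sketch of what the PyArrow-backed path could look like, assuming the proposed n_rows parameter; the helper name read_parquet_head is hypothetical and not an existing pandas API:

import pyarrow as pa
from pyarrow.parquet import ParquetFile

def read_parquet_head(path, n_rows):
    # Hypothetical helper: return the first n_rows of a Parquet file,
    # reading only as many batches as needed rather than the whole file.
    pf = ParquetFile(path)
    batches = []
    remaining = n_rows
    for batch in pf.iter_batches(batch_size=n_rows):
        batches.append(batch)
        remaining -= batch.num_rows
        if remaining <= 0:
            break
    table = pa.Table.from_batches(batches, schema=pf.schema_arrow)
    # Trim in case the accumulated batches overshot n_rows.
    return table.slice(0, n_rows).to_pandas()

The loop is needed because iter_batches can yield batches smaller than batch_size, e.g. at row group boundaries, so a single next() call is not guaranteed to return n_rows rows.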
Alternative Solutions
You can currently use PyArrow directly:

from pyarrow.parquet import ParquetFile
import pyarrow as pa

pf = ParquetFile('file_name.pq')
# Take the first batch of 10 rows from the batch iterator.
first_ten_rows = next(pf.iter_batches(batch_size=10))
df = pa.Table.from_batches([first_ten_rows]).to_pandas()
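The fastparquet route mentioned above would look like this (a sketch assuming a recent fastparquet, whose ParquetFile.head takes the number of rows to return):

from fastparquet import ParquetFile

pf = ParquetFile('file_name.pq')
# Returns the first 10 rows as a pandas DataFrame; whole row groups
# are still read under the hood, so this is convenience only.
df = pf.head(10)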
The structure of Parquet files includes "row groups," but these are not equivalent to rows. Apache's documentation explains:
Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.
The iter_batches method mentioned above does yield batches of rows, but it does so by reading whole row groups; it does not change how the Parquet file format is fundamentally read.
There is no upper limit on the number of rows a row group may contain, and row groups of only a few rows are rare in practice. For example, when DuckDB writes Parquet, the minimum row group size is 2,048 rows and the default is 122,880. Fastparquet's default is 50,000,000.
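To see how much data a "first n rows" read would actually touch for a given file, you can inspect the row-group metadata with PyArrow (a small sketch; the file name is a placeholder):

from pyarrow.parquet import ParquetFile

pf = ParquetFile('file_name.pq')
meta = pf.metadata
# iter_batches reads whole row groups, so these counts bound the
# number of rows decoded before the first batch can be returned.
for i in range(meta.num_row_groups):
    print(f"row group {i}: {meta.row_group(i).num_rows} rows")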
For these reasons, in my view, an nrows argument for read_parquet would be misleading at best. Parquet files are not sequential rows the way CSV files are, and users will be unpleasantly surprised to find that pandas reads and discards, for example, 49,999,900 rows just to return 100. (Or even more unpleasantly surprised to run out of memory in the attempt.)