
ENH: Read first n_rows of Parquet File #51830

Open
1 of 3 tasks
Kalaweksh opened this issue Mar 7, 2023 · 1 comment
Kalaweksh commented Mar 7, 2023

Feature Type

  [x] Adding new functionality to pandas

  [ ] Changing existing functionality in pandas

  [ ] Removing existing functionality in pandas

Problem Description

Using PyArrow to read the first n rows of a file, as suggested in #24511, could make it more convenient to read part of a large DataFrame that may otherwise not fit into memory.

Feature Description

Add a parameter, n_rows, to read_parquet. It would use the ParquetFile.iter_batches() generator in the PyArrow implementation and ParquetFile.head() in the fastparquet implementation (although the latter would be purely for convenience and offer no performance benefit).

Alternative Solutions

Currently, you can use PyArrow directly:

from pyarrow.parquet import ParquetFile
import pyarrow as pa

# Stream the file in record batches and keep only the first batch of 10 rows.
pf = ParquetFile('file_name.pq')
first_ten_rows = next(pf.iter_batches(batch_size=10))
df = pa.Table.from_batches([first_ten_rows]).to_pandas()

Additional Context

No response

@Kalaweksh Kalaweksh added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 7, 2023
@Kalaweksh Kalaweksh changed the title ENH: ENH: Read first n_rows of Parquet File Mar 7, 2023
@jbrockmendel jbrockmendel added IO Parquet parquet, feather and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 1, 2023
@vaneseltine commented
The structure of Parquet files includes "row groups," but these are not equivalent to rows. Apache's documentation explains:

Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.

The iter_batches mentioned above does create batches of rows, but it does so by reading row groups, not by changing the way the Parquet file format is fundamentally read.

There is no upper limit on the number of rows that can be contained in a row group, but one row per row group is likely rare. For example, when DuckDB writes Parquet, the minimum row group size is 2,048 with a default of 122,880. Fastparquet's default is 50,000,000.

For these reasons, in my view, an n_rows argument for read_parquet would be misleading at best. Parquet files are not sequential rows the way CSV files are, and users will be unpleasantly surprised to find that pandas reads and discards, for example, 49,999,900 rows just to return 100. (Or even more unpleasantly surprised to run out of memory in the attempt.)
