ENH: Read first n_rows of Parquet File #51830

Open
@Kalaweksh

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Using PyArrow to read the first n rows of a file, as suggested in #24511, would make it more convenient to read part of a large DataFrame that might otherwise not fit into memory.

Feature Description

Add a parameter, n_rows, to read_parquet. The PyArrow implementation would use the ParquetFile.iter_batches() generator, and the fastparquet implementation would use ParquetFile.head() (although the latter would be purely for convenience and would not have any performance benefits).

Alternative Solutions

You could currently use PyArrow:

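from pyarrow.parquet import ParquetFile
import pyarrow as pa

# Open the file; only the footer metadata is read at this point.
pf = ParquetFile('file_name.pq')
# Pull a single batch of at most 10 rows from the start of the file.
first_ten_rows = next(pf.iter_batches(batch_size=10))
# Wrap the batch in a Table and convert it to a pandas DataFrame.
df = pa.Table.from_batches([first_ten_rows]).to_pandas()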

Additional Context

No response
