Unable to read first n rows from parquet file. #24511

Closed
san089 opened this issue Dec 31, 2018 · 7 comments
Labels
IO Parquet parquet, feather

Comments

@san089

san089 commented Dec 31, 2018

df = pd.read_parquet(path='filepath', nrows=10)

Problem description

I have a parquet file and I want to read the first n rows from it into a pandas DataFrame. I did not find any way to do this in the documentation. I tried the 'nrows' and 'skiprows' parameters, but they did not work with the read_parquet() method. Please let me know if there is a way to achieve this that is not mentioned in the documentation.

Alternatively, I can read the complete parquet file and then keep only the first n rows, but that requires reading and computing more than necessary, which I want to avoid.

@TomAugspurger
Contributor

Does pyarrow or fastparquet allow reading a subset of the file?

@san089
Author

san089 commented Dec 31, 2018

Unfortunately, it does not.

@TomAugspurger
Contributor

TomAugspurger commented Dec 31, 2018 via email

jreback added the IO Parquet (parquet, feather) label on Jan 1, 2019
jreback added this to the No action milestone on Jan 1, 2019
@jreback
Contributor

jreback commented Jan 1, 2019

closing as out of scope; we would need the engines to do this. Note that you might be able to simply filter by row groups.

jreback closed this as completed on Jan 1, 2019
@AlJohri

AlJohri commented Aug 27, 2019

Is there any chance we can define a flag like num_chunks? It would take N chunks from each column and limit the result to the column with the fewest rows, since I think 1 chunk from each column may yield a different number of rows depending on the data type.

Basically, it would be great to have a way to read just a small portion of a parquet file for local development.

@TomAugspurger
Contributor

TomAugspurger commented Aug 27, 2019 via email

@fny

fny commented Sep 9, 2022

I want to reopen this issue. PyArrow now supports many of the required operations to make nrows possible and even usecols.
