[rust][datafusion] optimize count(*) queries on parquet sources #89

Closed
alamb opened this issue Apr 26, 2021 · 3 comments
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8902

Currently, as far as I can tell, when you run select count(*) from dataset in DataFusion against a Parquet dataset, the query is implemented by scanning column 0 and counting all of the rows (specifically, I think it adds up the number of rows in each batch).

 

However, for the specific case of counting everything in a Parquet file, you can read the row count directly from the footer metadata, so the query is O(1) instead of O(n).
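To make the difference concrete, here is a minimal sketch of the two strategies. It uses hypothetical stand-in types (`RowGroupMeta`, `ParquetLikeFile`) rather than the real `parquet` crate or DataFusion internals, so treat it as an illustration of the idea, not the actual implementation. In the real crate I believe the footer total is exposed through `FileMetaData::num_rows()`. The shortcut is only valid for an unfiltered count(*), as described above.

```rust
/// Hypothetical stand-in for the per-row-group metadata kept in a Parquet footer.
/// This is an illustrative type, not the `parquet` crate's API.
#[derive(Debug, Clone)]
pub struct RowGroupMeta {
    pub num_rows: i64,
}

/// Hypothetical stand-in for a Parquet file: footer metadata plus the data of column 0.
pub struct ParquetLikeFile {
    /// Footer metadata, one entry per row group.
    pub row_groups: Vec<RowGroupMeta>,
    /// Values of column 0, one batch per row group.
    pub column0: Vec<Vec<i32>>,
}

impl ParquetLikeFile {
    /// Build a file from batches, recording each batch's length in the footer metadata.
    pub fn from_batches(batches: Vec<Vec<i32>>) -> Self {
        let row_groups = batches
            .iter()
            .map(|batch| RowGroupMeta { num_rows: batch.len() as i64 })
            .collect();
        Self { row_groups, column0: batches }
    }
}

/// Scan strategy: walk every value of column 0 and count it. Cost grows with the row count.
pub fn count_by_scan(file: &ParquetLikeFile) -> i64 {
    file.column0
        .iter()
        .map(|batch| batch.iter().count() as i64)
        .sum()
}

/// Metadata strategy: add up the row counts recorded in the footer without touching any data.
/// Cost grows with the number of row groups, not the number of rows.
pub fn count_from_footer(file: &ParquetLikeFile) -> i64 {
    file.row_groups.iter().map(|rg| rg.num_rows).sum()
}
```

Both functions return the same answer for any file built with `from_batches`; only the amount of data touched differs.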

@alamb alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Andrew Lamb(alamb) @ 2021-04-26T12:31:08.482+0000:

Migrated to github: https://github.com/apache/arrow-rs/issues/75

@Dandandan
Contributor

This is implemented

@alamb
Contributor Author

alamb commented Aug 18, 2021

Closed in #620

@alamb alamb closed this as completed Aug 18, 2021