[rust][datafusion] optimize count(*) queries on parquet sources #89

alamb · 2021-04-26T13:17:57Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8902

Currently, as far as I can tell, when you run select count(*) from dataset in DataFusion against a Parquet dataset, it is implemented by scanning column 0 and counting all of the rows (specifically, I think it counts the number of rows in each batch).

However, for the specific case of just counting everything in a Parquet file, you can read the row count directly from the footer metadata, so the cost is O(1) in the number of rows instead of O(n).
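To make the difference concrete, here is a minimal sketch of the two strategies. It deliberately uses a hypothetical in-memory model (`ParquetFile`, `Footer`, `count_by_scan`, and `count_from_footer` are all made-up names for illustration), not the types of the parquet crate or DataFusion, and it assumes the footer's row count is trustworthy:

```rust
/// Hypothetical stand-in for the footer of a Parquet file.
struct Footer {
    /// Total row count, mirroring the `num_rows` field in Parquet's file metadata.
    num_rows: u64,
}

/// Hypothetical in-memory model of a Parquet file with a single column.
struct ParquetFile {
    footer: Footer,
    column0: Vec<i64>,
}

impl ParquetFile {
    /// Build the model and record the row count in the footer,
    /// as a Parquet writer would do when closing the file.
    fn new(column0: Vec<i64>) -> Self {
        let footer = Footer {
            num_rows: column0.len() as u64,
        };
        ParquetFile { footer, column0 }
    }
}

/// Scan-based count: walk column 0 batch by batch and add up the
/// number of rows in each batch. Cost grows with the number of rows.
/// `batch_size` must be non-zero (`chunks` panics on zero).
fn count_by_scan(file: &ParquetFile, batch_size: usize) -> u64 {
    file.column0
        .chunks(batch_size)
        .map(|batch| batch.len() as u64)
        .sum()
}

/// Metadata-based count: return the row count stored in the footer
/// without touching any column data.
fn count_from_footer(file: &ParquetFile) -> u64 {
    file.footer.num_rows
}
```

For any file built with `ParquetFile::new`, the two functions return the same number; only the second avoids reading column data. In a real reader the footer still has to be read and decoded, so the saving is in skipping the data pages, not in skipping I/O entirely.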

alamb · 2021-04-26T13:17:58Z

Comment from Andrew Lamb(alamb) @ 2021-04-26T12:31:08.482+0000:

Migrated to github: https://github.com/apache/arrow-rs/issues/75

Dandandan · 2021-08-18T13:00:10Z

This is implemented

alamb · 2021-08-18T13:11:16Z

Closed in #620

alamb added the "datafusion" label (Changes in the datafusion crate) on Apr 26, 2021

alamb closed this as completed Aug 18, 2021
