Skip to content

Better Grouping / aggregation pushdown #8699

Open
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

@devinjdangelo asked on slack
https://the-asf.slack.com/archives/C04RJ0C85UZ/p1703891500973189?thread_ts=1703891392.037839&cid=C04RJ0C85UZ

Related, is there any mechanism to push partial aggregations down to a table provider? E.g. “select count(*) from custom_provider” would push the aggregate expression down to the TableProvider rather than doing a scan.

Pushing grouping down in a plan is a classic analytic database optimization.

Among other things, it would allow doing fast data exploration like looking at counts / mins / maxes in different columns from only metadata.

Describe the solution you'd like

I would like someone to propose

  1. a change to TableProvider that would allow pushing aggregation down into the TableProider
  2. A basic optimizer pass that pushes aggregation down when possible

We shouldn't innovate on API here, but should look at what other engines support in this area (e.g. TRINO / Spark / Postgres) and follow them unless there is a good reason to do something different

A good initial proof of concept would be to push COUNT(*), MIN(col), MAX(col) type queries down into parquet or other table providers that can provide such values from statistics or metadata with minimal work / data fetch

For example

datafusion-cli -c "select count(*), min(bigint_col), max(bigint_col) from './parquet-testing/data/alltypes_tiny_pages.parquet';"
DataFusion CLI v34.0.0
+----------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| COUNT(*) | MIN(./parquet-testing/data/alltypes_tiny_pages.parquet.bigint_col) | MAX(./parquet-testing/data/alltypes_tiny_pages.parquet.bigint_col) |
+----------+--------------------------------------------------------------------+--------------------------------------------------------------------+
| 7300     | 0                                                                  | 90                                                                 |
+----------+--------------------------------------------------------------------+--------------------------------------------------------------------+
1 row in set. Query took 0.007 seconds.

The optimizer framework should eventually extend to pushing more sophisticated groupings (like pushing grouping down below joins) but many

Describe alternatives you've considered

I believe Eager Aggregation and Lazy Aggregation is the classic paper on the optimizer portion

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions