Description
Is your feature request related to a problem or challenge?
When we perform a grouped aggregate on a filtered input (such as with TPC-H q1), the filter operator performs two main tasks:
- Evaluate the filter predicate (usually very fast)
- Create new batches and copy over the filtered data (very slow if the filter is not very selective, as in q1)
I wonder if we would see a significant performance improvement if we could avoid creating the filtered batches in this case.
One idea would be to create the filtered batches by copying the arrays and mutating the validity bitmap to hide the rows that are filtered out. This would potentially change the semantics in some cases though so we can probably only do this under certain conditions.
Another idea is to update the aggregate logic to perform the predicate evaluation and then use the resulting bitmap to determine which rows to accumulate.
Describe the solution you'd like
I am working on a small prototype of this, outside of DataFusion, that I will share once the code is less embarrassing.
Describe alternatives you've considered
It would be worth seeing how other engines handle this.
Additional context
No response