Add Parquet RowFilter API #2335
```diff
     arrow_schema: SchemaRef,
     mask: ProjectionMask,
-    row_groups: Box<dyn RowGroupCollection>,
+    row_groups: &dyn RowGroupCollection,
```
Drive by cleanup
```diff
@@ -110,8 +110,8 @@ pub trait RowGroupCollection {
 }

 impl RowGroupCollection for Arc<dyn FileReader> {
-    fn schema(&self) -> Result<SchemaDescPtr> {
-        Ok(self.metadata().file_metadata().schema_descr_ptr())
+    fn schema(&self) -> SchemaDescPtr {
```
Drive by cleanup
```rust
use arrow::record_batch::RecordBatch;

/// A predicate operating on [`RecordBatch`]
pub trait ArrowPredicate: Send + 'static {
```
This is to make things more extensible in the long run
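To illustrate that extensibility, here is a minimal sketch of a custom predicate. It assumes the trait also exposes a `projection()` accessor returning the `ProjectionMask` of columns the predicate needs (not shown in the hunk above), and the import paths are illustrative:

```rust
use arrow::array::{Array, BooleanArray, Int64Array};
use arrow::compute::gt_scalar;
use arrow::error::Result as ArrowResult;
use arrow::record_batch::RecordBatch;
use parquet::arrow::ProjectionMask;

/// Keep rows where the projected Int64 column exceeds a threshold
struct GtThreshold {
    projection: ProjectionMask,
    threshold: i64,
}

impl ArrowPredicate for GtThreshold {
    /// The columns this predicate needs decoded (assumed accessor)
    fn projection(&self) -> &ProjectionMask {
        &self.projection
    }

    /// The batch passed in contains only the projected columns,
    /// so column 0 is the filter column
    fn filter(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<Int64Array>()
            .expect("filter column should be Int64");
        gt_scalar(col, self.threshold)
    }
}
```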
```diff
@@ -349,22 +323,13 @@ impl RecordBatchReader for ParquetRecordBatchReader {
 }

 impl ParquetRecordBatchReader {
-    pub fn try_new(
```
This module is not public, and this method was only being used in one place, so we can just remove it
```rust
) -> Result<RowSelection> {
    let reader =
        ParquetRecordBatchReader::new(batch_size, array_reader, selection.clone());
    let mut filters = vec![];
```
We could theoretically keep the decoded arrays around, but that requires a lot of non-trivial `take` + `concat` in order to sync up the yielded batches. It also potentially balloons the memory consumption. I decided it was not worth it.
```rust
/// Once all predicates have been evaluated, the resulting [`RowSelection`] will be
/// used to return just the desired rows.
///
/// This design has a couple of implications:
```
This is the major change vs #2310, FYI @thinkharderdev
Nice, I like this
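To make the sequencing concrete, here is a hedged sketch of assembling such a filter; `RowFilter::new` taking a list of boxed predicates is assumed from context rather than quoted from the diff, and both predicate variables are hypothetical:

```rust
// Predicates run in the order given; each one is only evaluated
// against rows that every earlier predicate kept
let filter = RowFilter::new(vec![
    Box::new(cheap_predicate),     // e.g. a comparison on a small column
    Box::new(expensive_predicate), // sees far fewer rows
]);
```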
```rust
    self,
    selection: impl Into<Vec<RowSelection>>,
) -> Self {
/// TODO: Revisit this API, as [`Self`] is provided before the file metadata is available
```
I intend to revisit this as part of the next (21) arrow release, I suspect we can move to a builder and deprecate the current API which is quite clunky
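For reference, such a builder might read something like the following; every name here is hypothetical, sketching the direction rather than an existing API:

```rust
// Hypothetical builder shape, not part of the current API
let reader = ParquetRecordBatchReaderBuilder::new(file)?
    .with_batch_size(1024)
    .with_projection(mask)
    .with_row_filter(filter)
    .build()?;
```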
```rust
    fn into(self) -> VecDeque<RowSelector> {
        self.selectors.into()
    }
}
```
This file definitely needs some tests prior to merge. The code is largely lifted from #2201
I agree -- I didn't review the logic in detail either as I figured we are just at the "API feedback" design phase
```rust
    vec![None; row_group_metadata.columns().len()];

// TODO: Combine consecutive ranges
let fetch_ranges = (0..column_chunks.len())
```
This logic is moved into `InMemoryRowGroup::fetch`.

FYI @thinkharderdev @alamb @crepererum @Ted-Jiang, I'd appreciate any feedback you might have on this.
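The `TODO: Combine consecutive ranges` above amounts to a simple coalescing pass over the sorted byte ranges; a self-contained sketch (the helper name is made up):

```rust
use std::ops::Range;

/// Merge byte ranges that touch or overlap, so adjacent column chunks
/// can be fetched with a single read
fn coalesce_ranges(mut ranges: Vec<Range<usize>>) -> Vec<Range<usize>> {
    ranges.sort_unstable_by_key(|r| r.start);
    let mut out: Vec<Range<usize>> = Vec::with_capacity(ranges.len());
    for range in ranges {
        match out.last_mut() {
            // Extends or overlaps the previous range: grow it
            Some(last) if range.start <= last.end => {
                last.end = last.end.max(range.end);
            }
            // Disjoint: start a new fetch range
            _ => out.push(range),
        }
    }
    out
}
```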
```rust
row_group
    .fetch(&mut self.input, meta, &projection, selection.as_ref())
    .await?;

let reader = ParquetRecordBatchReader::new(
    batch_size,
    build_array_reader(self.schema.clone(), projection, &row_group)?,
    selection,
);
```
This would require decoding the filter columns twice (or multiple times in the case where we have multiple predicates) right?
If a column appears in multiple predicates and/or the final projection, it will need to be decoded multiple times. I don't really see a way around this: keeping the data around and doing `take` + `concat` adds significant complexity, and it is unclear that it would necessarily be faster.

Eventually it might be possible to push simple predicates down to operate directly on the encoded data, which would avoid this. But that is a wee ways off 😅
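For illustration, the rejected alternative would look roughly like the following: cache each decoded chunk of a filter column together with its predicate result, then reassemble it with the arrow `filter` and `concat` kernels instead of decoding again. The helper is hypothetical; it exists only to show where the complexity and memory overhead come from:

```rust
use arrow::array::{Array, ArrayRef, BooleanArray};
use arrow::compute::{concat, filter};
use arrow::error::Result as ArrowResult;

/// Hypothetical: stitch cached decoded chunks of a column back together,
/// keeping only rows that passed the predicate, to avoid a second decode
fn reuse_decoded(
    chunks: &[ArrayRef],    // arrays decoded while evaluating the predicate
    masks: &[BooleanArray], // the predicate result for each chunk
) -> ArrowResult<ArrayRef> {
    let filtered: Vec<ArrayRef> = chunks
        .iter()
        .zip(masks)
        .map(|(chunk, mask)| filter(chunk.as_ref(), mask))
        .collect::<ArrowResult<_>>()?;
    // Everything here stays resident until the final projection is
    // emitted, which is where the memory consumption balloons
    let refs: Vec<&dyn Array> = filtered.iter().map(|a| a.as_ref()).collect();
    concat(&refs)
}
```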
Yeah, I think in many cases the difference would be negligible but in the degenerate cases (lots of predicates, filters which don't do much filtering, etc) I think it could potentially add up. The reason I worry about it in general is that we would have to rely on the engine to determine which predicates to apply and in which order. And in a situation where all we have is row group metadata we don't have a ton to go on.
I took a crack at seeing what it might look like preserving the decoded arrays and came up with tustvold#24. It certainly involves a lot of array slicing and dicing but the complexity seems manageable and would help ensure that applying filters doesn't ever come with a significant performance cost.
I think trying to eliminate redundant decoding is a good idea, for the reasons @thinkharderdev gives above.

Conveniently, it seems like nothing in the API of this PR requires decoding multiple times, so I think we could also potentially implement the 'use take rather than redundant decode' in a follow on PR as well.

In terms of "Eventually it might be possible to push simple predicates down to operate directly on the encoded data, which would avoid this": I agree it is a ways off. However, I think it could fit into this API with something like adding a list of `ParquetFilter`s to apply during the decode itself that could be efficiently implemented:
```rust
/// Filter that is applied during decoding of a single column;
/// semantically takes the form `<col> <op> <constant>`
struct ParquetFilter {
    op: ParquetFilterOp,
    constant: ParquetConst,
}

enum ParquetFilterOp {
    Eq,
    Neq,
}

enum ParquetConst {
    Int64(i64),
    Float(f64),
    // ...
}
```
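Continuing the hypothetical names above, a decoder could evaluate such a filter against individual values before assembling arrow arrays; a sketch:

```rust
/// Sketch: evaluate a ParquetFilter against a single decoded i64 value
fn matches(filter: &ParquetFilter, value: i64) -> bool {
    match (&filter.op, &filter.constant) {
        (ParquetFilterOp::Eq, ParquetConst::Int64(v)) => value == *v,
        (ParquetFilterOp::Neq, ParquetConst::Int64(v)) => value != *v,
        // Mismatched types: conservatively keep the row
        _ => true,
    }
}
```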
> Conveniently, it seems like nothing in the API of this PR requires decoding multiple times, so I think we could also potentially implement the 'use take rather than redundant decode' in a follow on PR as well.

I agree. As @tustvold points out, we need to be careful about memory overhead, so maybe the best course is to go with the current approach and tackle avoiding the redundant decoding in a follow up. I prototyped the DataFusion piece (based on my other draft PR, but it should be roughly similar) to see how it affected our internal benchmarks and saw a roughly 50% improvement on queries with reasonably selective predicates. We went from being mostly CPU bound with parquet decoding to being mostly IO bound, which means I expect there is even more room for improvement once we are using the selection and the page index to avoid IO altogether. That's all to say I'm super excited about this work and think it will be a huge step forward!
I will create a follow up ticket to investigate this 👍
> We went from being mostly CPU bound with parquet decoding to being mostly IO bound which means I expect there is even more room for improvement once we are using the selection and the page index to avoid IO altogether. That's all to say I'm super excited about this work and think it will be a huge step forward!

That is terrific news 🎉
I also really like this API -- 👍 @tustvold
thanks for the feedback @thinkharderdev. I think this design will be better due to all the community feedback ❤️
```rust
/// Row group filtering is applied prior to this, and rows from skipped
/// row groups should not be included in the [`RowSelection`]
///
/// TODO: Make public once stable (#1792)
```
Probably also a good idea to link to the docs that describe the order of filter application in decoding (`RowSelection` followed by `RowFilter`)
The doc links should do this automatically?
```rust
async fn read_row_group(
    mut self,
    row_group_idx: usize,
    mut selection: Option<RowSelection>,
```
Is there a situation where we need to read row groups with a previous row group's selection? 🤔

Edit: sorry, it may come from the `pageIndex`, forgive me
Maybe we need the previous each filter rate 😂 (just an idea)
> previous each filter rate

I'm not sure what you mean?
```rust
    projection: &ProjectionMask,
    _selection: Option<&RowSelection>,
) -> Result<()> {
    // TODO: Use OffsetIndex and selection to prune pages
```
👍 this avoids huge IO work in some situations and makes the `pageIndex` more useful!

I think it will take a lot of testing to decide when to use random skip reads.
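To sketch what pruning with the `OffsetIndex` could look like, the following uses a simplified stand-in for the index entries (the real metadata types differ) and a per-row selection mask:

```rust
use std::ops::Range;

/// Simplified stand-in for an OffsetIndex entry of a column chunk
struct PageInfo {
    offset: usize,
    compressed_page_size: usize,
    first_row_index: usize,
}

/// Return byte ranges for only those pages containing a selected row;
/// `selected` is a per-row mask covering the whole row group
fn pages_to_fetch(pages: &[PageInfo], selected: &[bool]) -> Vec<Range<usize>> {
    let mut out = Vec::new();
    for (i, page) in pages.iter().enumerate() {
        let start = page.first_row_index;
        // A page's rows end where the next page begins
        let end = pages
            .get(i + 1)
            .map(|next| next.first_row_index)
            .unwrap_or(selected.len());
        if selected[start..end].iter().any(|&s| s) {
            out.push(page.offset..page.offset + page.compressed_page_size);
        }
    }
    out
}
```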
I think this is now good for review; it isn't public (yet) and so doesn't need to be perfect, but I think we can continue to iterate on this. I will file follow-up tickets for follow-on work tomorrow.
```rust
/// with `true` values in the returned [`BooleanArray`] indicating rows
/// matching the predicate.
///
/// All row that are `true` in returned [`BooleanArray`] will be returned to the reader.
```
minor typo: `All row` is probably meant to be `All rows` (missing "s")
```rust
///
/// All row that are `true` in returned [`BooleanArray`] will be returned to the reader.
/// Any rows that are `false` or `Null` will not be
fn filter(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray>;
```
should this be named `fn filter_array` instead, to better indicate that the result is a boolean filter array rather than actually filtering the `RecordBatch` passed in as the `batch` parameter?
I think both names are kind of confusing tbh, I'll rename it to `evaluate` as I think that should be clear
I think this PR looks good to me, so we can start hooking it up and getting everything ready.

I definitely have some questions about the `and` test -- but since this code isn't used yet I don't think it would block merge.

@Ted-Jiang or @thinkharderdev do you have any other comments or suggestions? We can also address any additional changes in a follow on PR, as this one is already fairly large
```rust
    RowSelector::skip(4),
]);

let mut expected = RowSelection::from(vec![
```
The expected answer doesn't make sense to me. When I worked it out there seems to be something wrong.

```
N = Skip
Y = Select

a: NNNNNNNNNNNNYYYYYYYYYYYYYYYYYYYYYYNNNYYYYY
b: YYYYYNNNNYYYYYYYYYYYYYYYNNN
```

What is here:

```
e: NNNNNNNNNNNNYYYYYNNNNYYYYYYYYYYYYYYNNYNNNN
```

What I think the answer should be:

```
e: NNNNNNNNNNNNYYYYYYYYYYYYNNNYYYYYYYNNNYYYYY
```

Though to be honest I am not sure what an `AND` and nulls should be. I am probably missing something obvious here.
That looks right to me. The `and` should just be giving you the result of applying the filters sequentially. Visually, it makes sense (to me) like:

```
a: NNNNNNNNNNNNYYYYYYYYYYYYYYYYYYYYYYNNNYYYYY
b:             YYYYYNNNNYYYYYYYYYYYYY   YYNNN
   NNNNNNNNNNNNYYYYYNNNNYYYYYYYYYYYYYYNNYNNNN
```
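As a reference model of that composition (plain boolean masks rather than `RowSelection` run lengths; `b` carries one entry per row `a` selects):

```rust
/// Apply `b` only to the rows selected by `a`: the result selects a
/// row iff `a` selects it and the corresponding entry of `b` keeps it
fn and_then(a: &[bool], b: &[bool]) -> Vec<bool> {
    assert_eq!(b.len(), a.iter().filter(|&&x| x).count());
    let mut b_iter = b.iter().copied();
    a.iter()
        // `&&` short-circuits, so `b` is only consumed for selected rows
        .map(|&selected| selected && b_iter.next().unwrap())
        .collect()
}
```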
Perhaps `and` is the wrong name for this function? I will incorporate @thinkharderdev's example as a comment, as I think it is helpful 👍
FWIW I think `and` is a little confusing, since we're dealing with a boolean array and it is not clear from the name that this is essentially a composition operator. Maybe `and_then`?
I think this is good to merge.
Benchmark runs are scheduled for baseline = 4481993 and contender = 21ba02e. 21ba02e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
nice 👌
```rust
}

/// Given a [`RowSelection`] computed under `self`, returns the [`RowSelection`]
/// representing their conjunction
```
Yeah, I think the wording saying this is a `conjunction` is misleading; it is more like the conjunction of the subsequent filters with only the rows that were selected previously
Draft, as it needs a lot more test coverage and general cleanup.

Which issue does this PR close?
Closes #2270
Rationale for this change
What changes are included in this PR?
This adds a `RowFilter` API and refines the existing `RowSelection` API. There are a couple of things worth highlighting here:

- `RowFilter` is pushed down to the IO level. @crepererum gave a good use case where this allows eliminating an entire column chunk from consideration, etc...
- A `RecordBatchReaderBuilder` will likely be needed, as the current API can't really be used for this purpose.

Are there any user-facing changes?