WIP: Test DataFusion with experimental IncrementalRecordBatchBuilder #16208
Conversation
🤖: Benchmark completed
Hmm, that is somewhat depressing. I will investigate tomorrow.
Interesting, I am confused why it's different from the arrow-rs ClickBench result.
Ok, I misunderstood... this test is not the one with parquet filter pushdown enabled; it is "Does using `IncrementalRecordBatchBuilder` in Filter help".
```rust
.output_batch_builder
.as_mut()
.expect("output_batch_builder should be Some");
Ok(output_batch_builder.append_filtered(batch, filter)?)
```
I expect it should be faster, rather than incrementally building a new batch based on a number of arrays, to first evaluate a number of filters (`self.predicate.evaluate(&batch)`) until the number of true values reaches the target batch size, and then have a filter API to filter a list of batches. This avoids reallocations / copying as the target capacity can be calculated.

To avoid buffering too many batches, we probably have to limit this / create a batch anyway after X batches or X megabytes in memory.
My feeling is this should do this "multi-batch filter" and then concat anyway if smaller batches are generated by this approach, rather than using a builder approach.
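A minimal sketch of the idea in the two comments above, with hypothetical names (`BufferedFilter`, `push`) that are not from this PR. Arrow does not currently expose a dedicated "filter a list of batches" kernel, so this sketch approximates it with `filter_record_batch` per batch followed by `concat_batches` once the selected row count reaches the target:

```rust
use arrow::array::BooleanArray;
use arrow::compute::{concat_batches, filter_record_batch};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Hypothetical buffering filter: buffer (batch, filter) pairs until the
/// number of selected rows reaches the target batch size, then materialize
/// one output batch whose total size is known up front.
struct BufferedFilter {
    target_batch_size: usize,
    buffered: Vec<(RecordBatch, BooleanArray)>,
    selected_rows: usize, // number of `true` filter values accumulated so far
}

impl BufferedFilter {
    /// Buffer `batch` with its already-evaluated `filter`; returns `None`
    /// while still buffering, `Some(output)` once enough rows are selected.
    fn push(
        &mut self,
        batch: RecordBatch,
        filter: BooleanArray,
    ) -> Result<Option<RecordBatch>, ArrowError> {
        self.selected_rows += filter.true_count();
        self.buffered.push((batch, filter));
        if self.selected_rows < self.target_batch_size {
            // Keep buffering; a real implementation would also cap the
            // number of buffered batches / bytes, as suggested above.
            return Ok(None);
        }
        // Filter each buffered batch, then concatenate them in one pass.
        let schema = self.buffered[0].0.schema();
        let filtered = self
            .buffered
            .drain(..)
            .map(|(b, f)| filter_record_batch(&b, &f))
            .collect::<Result<Vec<_>, _>>()?;
        self.selected_rows = 0;
        Ok(Some(concat_batches(&schema, &filtered)?))
    }
}
```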
> This avoids reallocations / copying as the target capacity can be calculated. To avoid buffering too many batches, we probably have to limit this / create a batch anyway after X batches or X megabytes in memory.
I was trying to avoid having any reallocations in the `IncrementalRecordBatchBuilder` -- since we know the target output batch size (`batch_size`), it knows how much space each batch will take up front and can just straight up allocate it (the `instantiate_builder` function creates the builders with `with_capacity`).

However, now that I think about it, after a call to `finish()` the updated builder doesn't have the right allocation 🤔

I'll look into that more later today.
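As a rough illustration of the pre-allocation idea (hypothetical single-column function; the PR's actual `instantiate_builder` covers all column types), a primitive builder can be sized to the output batch size up front:

```rust
use arrow::array::Int64Builder;

/// Capacity is known up front, so appending up to `batch_size` values
/// should not reallocate.
fn instantiate_int64_builder(batch_size: usize) -> Int64Builder {
    Int64Builder::with_capacity(batch_size)
}

fn demo(batch_size: usize) {
    let mut builder = instantiate_int64_builder(batch_size);
    for v in 0..batch_size as i64 {
        builder.append_value(v);
    }
    let _array = builder.finish();
    // The caveat noted above: `finish()` leaves the builder empty, so the
    // next round of appends may reallocate rather than reuse the original
    // `with_capacity` allocation.
}
```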
> space each batch will take up front and can just straight up allocate
Hm yeah, with primitives and views this might be okay - we might have to test other data sources though (with normal binary types where it can't be preallocated).
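To illustrate that limitation for variable-length types (the 16-bytes-per-row estimate below is an assumption for the sketch, not anything from the PR): the item count is known, but the value-byte capacity can only be guessed, so reallocation remains possible:

```rust
use arrow::array::StringBuilder;

/// For a string column we can pre-allocate the offsets for `batch_size`
/// items, but the total value bytes must be estimated (here a guessed
/// average of 16 bytes per row).
fn instantiate_string_builder(batch_size: usize) -> StringBuilder {
    StringBuilder::with_capacity(batch_size, batch_size * 16)
}
```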
I ran q24 locally, did see a small slowdown, and did some profiling. As expected, filtering is about 30% of the overall execution time; of the filtering time, about 1/2 goes to creating the output. The analysis reveals some more places to potentially improve (also aligned with @Dandandan's suggestions). There is also some evidence of reallocation, as @Dandandan mentions: #16208 (comment)

Next steps:
Details

On this branch (alamb/test_filter_pushdown) I run:

```shell
./datafusion-cli-alamb_test_filter_pushdown -f q24-many.sql | grep Elapsed
Elapsed 0.258 seconds.
Elapsed 0.237 seconds.
Elapsed 0.220 seconds.
Elapsed 0.213 seconds.
Elapsed 0.223 seconds.
Elapsed 0.225 seconds.
Elapsed 0.223 seconds.
Elapsed 0.219 seconds.
Elapsed 0.223 seconds.
```

On main:

```shell
datafusion-cli -f q24-many.sql | grep Elapsed
Elapsed 0.220 seconds.
Elapsed 0.217 seconds.
Elapsed 0.203 seconds.
Elapsed 0.211 seconds.
Elapsed 0.216 seconds.
Elapsed 0.197 seconds.
Elapsed 0.203 seconds.
Elapsed 0.199 seconds.
Elapsed 0.218 seconds.
```
Closing in favor of #16249
This PR is for testing DataFusion with the code in the following PR.

I want to run 2 experiments:

1. Does using `IncrementalRecordBatchBuilder` in Filter help?
2. Is ClickBench performance better with filter_pushdown enabled? (WIP: Test DataFusion with experimental Parquet Filter Pushdown #16222)

If I can see significant improvements I will proceed to making apache/arrow-rs#7513 real.