-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While working on enabling StringView
by default in #12092 I noticed that some of the clickbench queries got 10% slower and looked into it.
The plan looks like this:
DataFusion CLI v41.0.0
+---------------+----------------------------------------------------------------------------------------------------$
| plan_type | plan $
+---------------+----------------------------------------------------------------------------------------------------$
| physical_plan | AggregateExec: mode=Final, gby=[], aggr=[count(*)] $
| | CoalescePartitionsExec $
| | AggregateExec: mode=Partial, gby=[], aggr=[count(*)] $
| | ProjectionExec: expr=[] $
| | CoalesceBatchesExec: target_batch_size=8192 $
| | FilterExec: CAST(URL@0 AS Utf8View) LIKE %google% $
| | ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Software/datafusion/benchmarks/$
+---------------+----------------------------------------------------------------------------------------------------$
2 row(s) fetched.
Elapsed 0.065 seconds.
When looking at the flamegraphs, you can see the CAST
spends a huge amount of time validating utf8 (more time than actually evaluating the LIKE
predicate actually):
Here are the full flamegraphs for comparison:
q20-flamegraph-main
q20-flamegraph-stringview
I belive the issue is here:
| | FilterExec: CAST(URL@0 AS Utf8View) LIKE %google%
This filter first *CASTs the URL column to Utf8View and then evaluates
LIKE`
Converting BinaryArray
--> StringArray
as is done without StringView is relatively faster because it is done with a single large function call
However, converting BinaryViewArrar
--> StringViewArray
is not as it makes many small function calls. The parquet reader has a special optimization for this as descsribed in "Section 2.1: From binary to strings" of the Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet from @XiangpengHao
Describe the solution you'd like
I would like this query to go as fast / faster with Utf8View / BinaryView enabled.
Bonus points if it went faster even without Utf-8 enabled
Describe alternatives you've considered
Option 1: LIKE
for binary
One option is to skip validating UTF8 entirely and evaluate LIKE
directly on binary. This would mean if the column is read as binary we could cast the argument '%google%'
to binary and then evaluate LIKE
directly on the binary column. This would skip validaitng utf8 completely
Unfortunately, it appears that the like
kernel is only implemented for StringArray
and StringViewArray
at the moment, not BinaryArray: https://docs.rs/arrow-string/53.0.0/src/arrow_string/like.rs.html#110-149
Another related option would be to potentially special case the LIKE
rewite in this case for just prefix / contians / suffix -- in this case rewrite <binary> LIKE <const that starts and ends with '%'
> --> <binary> CONTAINS <string>
Option 2: resolve the column as Utf8
rather than Binary
For some reason the schema of hits.parquet
(the single file from ClickBench) has the URL
column (and others) as Utf8
(strings) but the hits_partitioned
file resolves it as Binary. \
We could change the schema resolution logicic to resolve the column as a String instead.
This option is probably slower than option 1 but I think it is more inline with what the intended semantics (these columns contain logical stirngs) and the parquet reader includes the fast read path for such strings and would be more general.
Filed #12510 to track this ideae
Additional context