Closed
Labels
enhancement: New feature or request
Description
Is your feature request related to a problem or challenge?
Part of #10918
In order to take advantage of the parquet reader generating StringViewArrays (apache/arrow-rs#5530 from @ariesdevil (❤️)), we need to make sure DataFusion doesn't immediately cast the array back to StringArray, which would undo the benefits.

                             ▲
        ┌ ─ ─ ─ ─ ─ ─ ┐      │    After filtering,
          StringArray        │    any unfiltered rows
        └ ─ ─ ─ ─ ─ ─ ┘      │    are gathered via
              ...            │    the `take` kernel
                             │
              ┌────────────────────────────┐
              │                            │
              │         FilterExec         │
              │                            │
              └────────────────────────────┘
                             ▲
        ┌ ─ ─ ─ ─ ─ ─ ┐      │
          StringArray        │
        └ ─ ─ ─ ─ ─ ─ ┘      │    Reading String data
                             │    from a Parquet file
              ...            │    results in
                             │    StringArrays passed
        ┌ ─ ─ ─ ─ ─ ─ ┐      │
          StringArray        │
        └ ─ ─ ─ ─ ─ ─ ┘      │
                             │
              ┌────────────────────────────┐
              │                            │
              │        ParquetExec         │
              │                            │
              └────────────────────────────┘

Current situation
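
For intuition about what is lost by casting back to StringArray, here is a toy, std-only model of a view-style string array. It is not the arrow-rs API: `ToyViewArray`, `View`, and their methods are made-up names, and real Arrow views are 16 bytes and store short strings inline, which is left out. The point of the sketch is only that filtering a view array copies the small fixed-size views and keeps sharing the underlying bytes, whereas a contiguous string array has to copy the bytes of every kept string.

```rust
use std::sync::Arc;

/// Toy view (hypothetical): where one string lives inside the shared buffer.
#[derive(Debug, Clone, Copy)]
struct View {
    offset: usize,
    len: usize,
}

/// Toy stand-in for a view-style string array (not the arrow-rs type).
struct ToyViewArray {
    buffer: Arc<str>,
    views: Vec<View>,
}

impl ToyViewArray {
    /// Pack all strings into one buffer and record a view per string.
    fn from_strs(values: &[&str]) -> Self {
        let mut data = String::new();
        let mut views = Vec::new();
        for v in values {
            views.push(View { offset: data.len(), len: v.len() });
            data.push_str(v);
        }
        Self { buffer: Arc::from(data), views }
    }

    /// Resolve the i-th view to the string it points at.
    fn value(&self, i: usize) -> &str {
        let v = self.views[i];
        let bytes: &str = &self.buffer;
        &bytes[v.offset..v.offset + v.len]
    }

    /// Keep rows whose mask entry is true: only views are copied,
    /// the string bytes stay in the shared buffer.
    fn filter(&self, mask: &[bool]) -> Self {
        let views = self
            .views
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect();
        Self { buffer: Arc::clone(&self.buffer), views }
    }
}

fn main() {
    let batch = ToyViewArray::from_strs(&["apple", "banana", "cherry"]);
    let kept = batch.filter(&[true, false, true]);
    // The filtered array still points at the same bytes as the input.
    assert!(Arc::ptr_eq(&batch.buffer, &kept.buffer));
    println!("{} {}", kept.value(0), kept.value(1));
}
```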
Describe the solution you'd like
To support a phased rollout of this feature, I recommend we focus at first on only the first filtering operation.
Specifically, get to the point where the parquet reader reads data out as StringView, like this:

                               ▲
        ┌ ─ ─ ─ ─ ─ ─ ─ ┐      │
           StringArray         │
        └ ─ ─ ─ ─ ─ ─ ─ ┘      │
               ...             │
                               │
                ┌────────────────────────────┐
                │                            │
                │         FilterExec         │
                │                            │
                └────────────────────────────┘
        ┌ ─ ─ ─ ─ ─ ─ ─ ┐      ▲
         StringViewArray       │
        └ ─ ─ ─ ─ ─ ─ ─ ┘      │
               ...             │
                               │
        ┌ ─ ─ ─ ─ ─ ─ ─ ┐      │
         StringViewArray       │
        └ ─ ─ ─ ─ ─ ─ ─ ┘      │
                               │
                ┌────────────────────────────┐
                │                            │
                │        ParquetExec         │
                │                            │
                └────────────────────────────┘

Intermediate Situation 1: pass StringViewArray between ParquetExec and FilterExec
Describe alternatives you've considered
I suggest we:
- Make a configuration setting like "force StringViewArray" when reading parquet so we can test this. When this setting is enabled, DataFusion should configure the ParquetExec to produce `StringViewArray` regardless of the type stored in the parquet file
- Then work on incrementally rolling out support / testing for various filter expressions (especially string functions like substring and Implement equality `=` and inequality `<>` support for `StringView` #10919)
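
To make the first bullet a bit more concrete, here is a rough, std-only sketch of what such a setting could do to the schema the reader is asked to produce. Everything here is hypothetical: `ParquetReadOptions`, `force_view_types`, and `output_type` are placeholder names, and the `DataType` enum is a toy stand-in for Arrow's, not the real one. The actual option name and where it lives would be decided during implementation.

```rust
/// Toy stand-in for Arrow's `DataType`; only the variants needed here.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Hypothetical reader options; `force_view_types` is an assumed name.
#[derive(Debug, Default)]
struct ParquetReadOptions {
    force_view_types: bool,
}

/// Type the reader should produce for a column stored as `file_type`.
/// With the flag on, string columns come out as views regardless of
/// how they are stored in the file; other types pass through unchanged.
fn output_type(file_type: &DataType, opts: &ParquetReadOptions) -> DataType {
    match file_type {
        DataType::Utf8 | DataType::LargeUtf8 if opts.force_view_types => DataType::Utf8View,
        other => other.clone(),
    }
}

fn main() {
    let file_schema = [DataType::Int64, DataType::Utf8, DataType::LargeUtf8];
    let opts = ParquetReadOptions { force_view_types: true };
    let read_schema: Vec<DataType> = file_schema
        .iter()
        .map(|t| output_type(t, &opts))
        .collect();
    println!("{:?}", read_schema);
}
```

Keeping the flag off by default would leave existing plans untouched while the filter expressions in the second bullet are rolled out and tested one at a time.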
Additional context
No response