use StringViewArray when reading String columns from Parquet #10921

@alamb

Is your feature request related to a problem or challenge?

Part of #10918

To take advantage of the Parquet reader generating StringViewArrays (apache/arrow-rs#5530 from @ariesdevil (❤️)), we need to make sure DataFusion doesn't immediately cast the array back to StringArray, which would undo the benefits.

                ▲                       
┌ ─ ─ ─ ─ ─ ─ ┐ │   After filtering,    
  StringArray   │   any unfiltered rows 
└ ─ ─ ─ ─ ─ ─ ┘ │   are gathered via    
      ...       │   the `take` kernel   
                │                       
 ┌────────────────────────────┐         
 │                            │         
 │         FilterExec         │         
 │                            │         
 └────────────────────────────┘         
                ▲                       
┌ ─ ─ ─ ─ ─ ─ ┐ │                       
  StringArray   │                       
└ ─ ─ ─ ─ ─ ─ ┘ │   Reading String data 
                │   from a Parquet file 
      ...       │   results in          
                │   StringArrays passed 
┌ ─ ─ ─ ─ ─ ─ ┐ │                       
  StringArray   │                       
└ ─ ─ ─ ─ ─ ─ ┘ │                       
                │                       
 ┌────────────────────────────┐         
 │                            │         
 │        ParquetExec         │         
 │                            │         
 └────────────────────────────┘         
                                        
                                        
                                        
      Current situation                 
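To make the cost in the diagram above concrete, here is a minimal toy model (std-only Rust, not the arrow-rs API) of why gathering rows is cheaper on a view-based layout. `OffsetStrings`, `ViewStrings`, and `View` are hypothetical names invented for this sketch; the real StringViewArray uses packed 16-byte views and inlines short strings, which this model ignores.

```rust
use std::rc::Rc;

/// Toy offset-based layout (in the spirit of StringArray):
/// one contiguous byte buffer plus `n + 1` offsets.
struct OffsetStrings {
    data: Vec<u8>,
    offsets: Vec<usize>,
}

impl OffsetStrings {
    fn from_strs(values: &[&str]) -> Self {
        let mut data = Vec::new();
        let mut offsets = vec![0];
        for v in values {
            data.extend_from_slice(v.as_bytes());
            offsets.push(data.len());
        }
        Self { data, offsets }
    }

    /// Keep rows whose mask entry is true (mask must not be longer than the array).
    /// Returns the new array and the number of string bytes copied.
    fn filter(&self, mask: &[bool]) -> (Self, usize) {
        let mut data = Vec::new();
        let mut offsets = vec![0];
        for (i, keep) in mask.iter().enumerate() {
            if *keep {
                // Every surviving string is copied into the new buffer.
                data.extend_from_slice(&self.data[self.offsets[i]..self.offsets[i + 1]]);
                offsets.push(data.len());
            }
        }
        let copied = data.len();
        (Self { data, offsets }, copied)
    }

    fn get(&self, i: usize) -> &str {
        std::str::from_utf8(&self.data[self.offsets[i]..self.offsets[i + 1]]).unwrap()
    }
}

/// Toy view: where a string lives inside a shared buffer.
#[derive(Clone, Copy)]
struct View {
    offset: usize,
    len: usize,
}

/// Toy view-based layout (in the spirit of StringViewArray).
struct ViewStrings {
    buffer: Rc<Vec<u8>>, // shared between input and output of a filter
    views: Vec<View>,
}

impl ViewStrings {
    fn from_strs(values: &[&str]) -> Self {
        let mut buffer = Vec::new();
        let mut views = Vec::new();
        for v in values {
            views.push(View { offset: buffer.len(), len: v.len() });
            buffer.extend_from_slice(v.as_bytes());
        }
        Self { buffer: Rc::new(buffer), views }
    }

    /// Keep rows whose mask entry is true. Only the small views are copied;
    /// the string bytes stay in the shared buffer, so 0 string bytes are copied.
    fn filter(&self, mask: &[bool]) -> (Self, usize) {
        let views: Vec<View> = self
            .views
            .iter()
            .zip(mask)
            .filter(|(_, keep)| **keep)
            .map(|(v, _)| *v)
            .collect();
        (Self { buffer: Rc::clone(&self.buffer), views }, 0)
    }

    fn get(&self, i: usize) -> &str {
        let v = self.views[i];
        std::str::from_utf8(&self.buffer[v.offset..v.offset + v.len]).unwrap()
    }
}
```

In this model, casting a view-based array back to the offset-based layout forces the same byte copy that `OffsetStrings::filter` performs, which is the cost this issue is trying to avoid.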

Describe the solution you'd like

To support a phased rollout of this feature, I recommend we focus at first on only the first filtering operation.

Specifically, get to the point where the Parquet reader reads data out as StringViewArray, like this:

                ▲              
┌ ─ ─ ─ ─ ─ ─ ┐ │              
  StringArray   │              
└ ─ ─ ─ ─ ─ ─ ┘ │              
      ...       │              
                │              
 ┌────────────────────────────┐
 │                            │
 │         FilterExec         │
 │                            │
 └────────────────────────────┘
┌ ─ ─ ─ ─ ─ ─ ┐ ▲
StringViewArray │
└ ─ ─ ─ ─ ─ ─ ┘ │
      ...       │
                │
┌ ─ ─ ─ ─ ─ ─ ┐ │
StringViewArray │
└ ─ ─ ─ ─ ─ ─ ┘ │
                │              
 ┌────────────────────────────┐
 │                            │
 │        ParquetExec         │
 │                            │
 └────────────────────────────┘
                               
                               
                               
      Intermediate             
      Situation 1: pass        
      StringViewArray          
      between ParquetExec
      and FilterExec
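For FilterExec to consume StringViewArray input directly, its predicates have to be evaluated on views rather than on a cast copy. As a rough illustration of why that can be fast, here is a std-only sketch of equality with a length-and-prefix fast path, loosely following the Arrow view layout (a length plus the first 4 bytes kept next to it). `ToyView` and `view_eq` are hypothetical names for this sketch, not arrow-rs or DataFusion APIs.

```rust
/// Toy view: length and a zero-padded 4-byte prefix stored next to each other,
/// plus a reference to the full string bytes.
struct ToyView<'a> {
    len: u32,
    prefix: [u8; 4],
    data: &'a [u8],
}

impl<'a> ToyView<'a> {
    fn new(s: &'a str) -> Self {
        let bytes = s.as_bytes();
        let mut prefix = [0u8; 4];
        let n = bytes.len().min(4);
        prefix[..n].copy_from_slice(&bytes[..n]);
        Self { len: bytes.len() as u32, prefix, data: bytes }
    }
}

/// Compare two views for equality.
/// Returns `(equal, touched_data)`, where `touched_data` reports whether the
/// full string bytes had to be read (false when the fast path decided).
fn view_eq(a: &ToyView, b: &ToyView) -> (bool, bool) {
    // Fast path: different length or different prefix means "not equal"
    // without reading the string bytes at all.
    if a.len != b.len || a.prefix != b.prefix {
        return (false, false);
    }
    // Slow path: lengths and prefixes match, so compare the full bytes.
    (a.data == b.data, true)
}
```

Inequality (`<>`) would simply negate the first element of the result.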

Describe alternatives you've considered

I suggest we:

  1. Make a configuration setting like "force StringViewArray" when reading parquet so we can test this. When this setting is enabled, DataFusion should configure the ParquetExec to produce StringViewArray for string columns, regardless of the type stored in the parquet file.
  2. Then work on incrementally rolling out support / testing for various filter expressions (especially string functions like substring, and equality `=` and inequality `<>` on StringView, see #10919).
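A minimal sketch of what step 1 could look like, assuming a boolean option and a schema rewrite applied before the reader is configured. Everything here is hypothetical: `ToyParquetOptions`, `force_string_view`, and `reader_schema` are names invented for illustration, and the `DataType` enum is a toy stand-in (its variant names mirror Arrow's `Utf8` and `Utf8View`), not the real DataFusion configuration or Arrow types.

```rust
/// Toy stand-in for a column data type.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,     // offset-based strings (StringArray)
    Utf8View, // view-based strings (StringViewArray)
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// Hypothetical reader options; the flag defaults to off so that
/// existing behavior is unchanged unless a user opts in.
#[derive(Default)]
struct ToyParquetOptions {
    force_string_view: bool,
}

/// Compute the schema the reader is asked to produce.
/// With the flag on, string columns are requested as views no matter
/// what the file schema says; all other columns pass through untouched.
fn reader_schema(file_schema: &[Field], opts: &ToyParquetOptions) -> Vec<Field> {
    file_schema
        .iter()
        .map(|f| {
            let data_type = match (&f.data_type, opts.force_string_view) {
                (DataType::Utf8, true) => DataType::Utf8View,
                (other, _) => other.clone(),
            };
            Field { name: f.name.clone(), data_type }
        })
        .collect()
}
```

Keeping the switch off by default fits the phased rollout described above: tests can enable it per query while operators that do not yet support views keep receiving StringArray.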

Additional context

No response
