Skip to content

Convert Utf8View/BinaryView --> Utf8 / Binary at output #12119

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

Part of #11752

We are trying to change DataFusion to use StringViewArray by default when reading parquet (and, for example, when it makes more sense such as the substr function), StringView enables many interesting optimization opportunities. However, as StringView is still being adopted across the rest of the arrow ecosystem, if DataFusion begins to emit StringViewArray in some places, it may cause issues with other parts of the ecosystem (e.g. flight clients may not be able to interpret data sent by a server using DataFusion)

Describe the solution you'd like

I would like DataFusion to retain maximum compatibility at the interfaces, but be able to use StringViewArray internally when it improves performance

Describe alternatives you've considered

I recommend a config flag that makes it possible to convert Utf8View/BinaryView --> Utf8 / Binary at the query output and I think this conversion should be done by default.

For example we might add this configuration flag:

datafusion.optimizer.expand_views_at_output=true

If this flag is true,

  1. add code in the Analyzer (maybe in the TypeCOercion code)
  2. check the output columns of a plan, and if any are DataType::Utf8View or DataType::BinaryView, add ProjectionExecthat converts them to Utf8/Binary (by adding a cast toDataType::Utf8orDataType::Binary` respectively

Additional context

We already have to do something similar in flight with dictionary arrays

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions