Description
Is your feature request related to a problem or challenge?
Part of #11752
We are trying to change DataFusion to use StringViewArray by default when reading Parquet, and in other places where it makes sense, such as the substr function. StringView enables many interesting optimization opportunities. However, StringView is still being adopted across the rest of the Arrow ecosystem, so if DataFusion begins to emit StringViewArray in some places, it may cause issues with other parts of the ecosystem (e.g. Flight clients may not be able to interpret data sent by a server using DataFusion).
Describe the solution you'd like
I would like DataFusion to retain maximum compatibility at the interfaces, but be able to use StringViewArray internally when it improves performance
Describe alternatives you've considered
I recommend a config flag that makes it possible to convert Utf8View / BinaryView --> Utf8 / Binary at the query output, and I think this conversion should be done by default.
For example we might add this configuration flag:
datafusion.optimizer.expand_views_at_output=true
If this flag is true,
- add code in the Analyzer (maybe in the TypeCoercion code)
- check the output columns of a plan, and if any are DataType::Utf8View or DataType::BinaryView, add a ProjectionExec that converts them to Utf8/Binary (by adding a cast to DataType::Utf8 or DataType::Binary respectively)
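
To make the proposed check concrete, here is a minimal, std-only sketch of the decision logic. It is not DataFusion code: the DataType enum is a cut-down stand-in for arrow's, and Field, OutputExpr, expanded_type and expand_views_at_output are hypothetical names invented for illustration. A real implementation would operate on the plan's schema and expression types instead.

```rust
// Simplified stand-in for arrow's DataType; only the variants needed here.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Binary,
    Utf8View,
    BinaryView,
}

/// A named output column of a plan (hypothetical, simplified).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// One expression in the projection that would be added at the plan output.
#[derive(Debug, Clone, PartialEq)]
enum OutputExpr {
    /// Pass the column through unchanged.
    Column(String),
    /// Cast the column to the given type.
    Cast { column: String, to: DataType },
}

/// Returns the type a view column should be expanded to, if any.
fn expanded_type(data_type: &DataType) -> Option<DataType> {
    match data_type {
        DataType::Utf8View => Some(DataType::Utf8),
        DataType::BinaryView => Some(DataType::Binary),
        _ => None,
    }
}

/// Builds the expressions for an output projection.
/// Returns None when the flag is off or no output column is a view type,
/// in which case no projection needs to be added.
fn expand_views_at_output(schema: &[Field], enabled: bool) -> Option<Vec<OutputExpr>> {
    if !enabled {
        return None;
    }
    if !schema.iter().any(|f| expanded_type(&f.data_type).is_some()) {
        return None;
    }
    Some(
        schema
            .iter()
            .map(|f| match expanded_type(&f.data_type) {
                // View columns get a cast to their non-view counterpart.
                Some(to) => OutputExpr::Cast { column: f.name.clone(), to },
                // Every other column is passed through untouched.
                None => OutputExpr::Column(f.name.clone()),
            })
            .collect(),
    )
}
```

Returning None when nothing needs converting keeps plans without view columns unchanged, so the flag would only add work for queries that actually produce Utf8View or BinaryView output.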
Additional context
We already have to do something similar in flight with dictionary arrays