Skip to content

[Epic] Implement support for StringView in DataFusion #10918

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

StringView / BinaryView were added to the Arrow format that make it more suitable for certain types of operations on strings. Specifically when filtering with string data, creating the output StringArray requires copying the strings to a new, packed binary buffer.

GenericByteViewArrary was designed to solve this limitation and the arrow-rs implementation, tracked by apache/arrow-rs#5374, is now complete enough to start adding support into DataFusion

I think we can improve performance in certain cases by using StringView (this is also described in more details in the Pola.rs blog post)

  1. Reading strings / binary from Parquet files as StringViewArray/BinaryViewArray rather than StringArray / BinaryArray saves a copy (and @ariesdevil is quite close to having it integrated into the parquet reader Implement StringViewArray and BinaryViewArray reading/writing in parquet arrow-rs#5530)
  2. Evaluating predicates on string expressions (for example substr(url, 4) = 'http') as the intermediate result of substr can be called without copying string values

Describe the solution you'd like

I would like to support StringView / BinaryView support in DataFusion.

While my primary usecase is for reading data from parquet, I think teaching DataFusion to use StringView (at least as intermediate values when evaluating expressions may help significantly

Development branch: string-view

Since this feature requires upstream arrow-rs support apache/arrow-rs#5374 that is not yet released we plan to do development on a string-view feature branch:

https://github.com/apache/datafusion/tree/string-view

Task List

Here are some high level tasks (I can help flesh these out for anyone who is interested in helping)

Describe alternatives you've considered

No response

Additional context

Polars implemented it recently in rust so that can serve as a motivation
Blog Post https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790

Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/

Related PRs:
pola-rs/polars#13748
pola-rs/polars#13839
pola-rs/polars#13489

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions