-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Recently two new types were added to the Arrow format that make it more suitable for certain types of operations on strings
Specifically when doing filtering / take with string data, creating a new Utf8Array
requires copying the strings to a new, packed binary buffer. The "VariableSizeBinaryView" was designed to solve this limitation and recently added to the Arrow spec.
Describe the solution you'd like
I would like to implement StringViewArray
and BinaryViewArray
following the spec:
The spec: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
https://github.com/apache/arrow/blob/3fe598ae4dfd7805ab05452dd5ed4b0d6c97d8d5/format/Schema.fbs#L187-L205
Initially, I would suggest we get the basic types in place:
- Add
DataType::Utf8View
andDataType::BinaryView
#5468 - Add
StringViewArray
implementation and layout and basic construction + tests #5469 - Add
BinaryViewArray
implementation and layout and basic construction
Then as follow on PRs, add support for key features
- IPC format support for
StringViewArray
andBinaryViewArray
#5506 - Arrow Flight format support for
StringViewArray
andBinaryViewArray
#5507 -
cast
kernel support forStringViewArray
andBinaryViewArray
#5508 - Display support for
StringViewArray
andBinaryViewArray
#5509 -
filter
kernel support forStringViewArray
andBinaryViewArray
#5510 -
take
kernel support forStringViewArray
andBinaryViewArray
#5511 -
cast
kernel support forStringViewArray
andBinaryViewArray
<-->
DictionaryArray` #5861 - Implement
compare_op
forGenericBinaryView
#5897 - Implement arrow-row en/decoding for GenericByteView types #5921
-
like
benchmark for StringView #5936 - Implement sort for String/BinaryViewArray #5963 (review)
- Implement like/ilike etc for StringViewArray #5931
- Implement benchmarks for
compare_op
forGenericBinaryView
#5903 - Add
gc
garbage collector support forStringViewArray
andBinaryViewArray
#5513 - Implement
StringViewArray
andBinaryViewArray
reading/writing in parquet #5530 - Add arrow IPC cross implementation serialization tests
- Potential performance improvements for reading Parquet to StringViewArray/BinaryViewArray #5904
- Consider implementing some sort of
deduplicate
/intern
functionality for StringView #5910 - New null with view types are not supported #5893
- Min/max support for String/BinaryViewArray #6052
- Default block_size for
StringViewArray
#6094 - Faster min/max for string/binary view arrays #6088
-
interleave
(used in Sort)
Potential Follow ons / additional optimizations
- Optimize StringView row decoding #5945
- Improve performance of constructing
ByteView
s for small strings #6034 - Improve arrow-row --> StringView/BinaryView memory usage #6057
- Improve speed of row converter by skipping utf8 checks #6058
- Optimize
like
/ilike
kernels for StringView #5951
Describe alternatives you've considered
I think a good plan would be to dust off the prototype on #4585 from @tustvold (linked from #4253).
Initially, the idea would be to dust off the PR and split it into a few smaller PRs with tests and docs.
Additional context
Polars implemented it recently in rust so that can serve as a motivation
Blog Post https://pola.rs/posts/polars-string-type/
https://twitter.com/RitchieVink/status/1749466861069115790
Facebook/Velox's take: https://engineering.fb.com/2024/02/20/developer-tools/velox-apache-arrow-15-composable-data-management/
Related PRs:
pola-rs/polars#13748
pola-rs/polars#13839
pola-rs/polars#13489