Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider implementing some sort of deduplicate / intern functionality for StringView #5910

Closed
Tracked by #5374
alamb opened this issue Jun 18, 2024 · 4 comments · Fixed by #6005
Closed
Tracked by #5374

Consider implementing some sort of deduplicate / intern functionality for StringView #5910

alamb opened this issue Jun 18, 2024 · 4 comments · Fixed by #6005
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jun 18, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Part of implementing StringView #5374

@XiangpengHao implemented gc which compacts all the strings in a StringView/BinaryView into contiguous storage in #5513

However, that functionality does not deduplicate/intern the strings -- it just copies them over

Describe the solution you'd like

We should make it easy to deduplicate the strings in a StringView.

I do think we should change gc to do deduplication without an explict as (as deduplication is expensive)

Describe alternatives you've considered

  1. Do nothing (users can implement their own version of this code without any addtional apis)
  2. Add a new function (e.g. GenericBinaryView::dedupe) that deduplicated such arrays (likely not moving any strings, but just updating views)
  3. Add an argument to GenericBinaryView::gc that controlled the behavior (as in could also specify doing gc)

Additional context
@alexwilcoxson-rel asked in #5904 (comment)

Can/will this incorporate deduping/interning/implicitly using the gc function that landed recently?

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jun 18, 2024
@alamb
Copy link
Contributor Author

alamb commented Jun 18, 2024

BTW the https://docs.rs/arrow/latest/arrow/array/type.StringDictionaryBuilder.html structure has code to do the deduplication quickly

So one way to implement a combination of gc and deduplication would be to create a DictionaryArray with a GenericByteDictionaryBuilder and then cast back to StringViewArray

With the code for fast DictionaryArray --> StringViewArray added in #5861, this would only copy the strings once (though it would build up intermediate indexes that maybe could be avoided with a direct approach)

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 2, 2024

BTW the https://docs.rs/arrow/latest/arrow/array/type.StringDictionaryBuilder.html structure has code to do the deduplication quickly

So one way to implement a combination of gc and deduplication would be to create a DictionaryArray with a GenericByteDictionaryBuilder and then cast back to StringViewArray

With the code for fast DictionaryArray --> StringViewArray added in #5861, this would only copy the strings once (though it would build up intermediate indexes that maybe could be avoided with a direct approach)

The idea is quite similar to what we have in ArrowBytesMap in datafusion

Check hash before Insert https://github.com/apache/datafusion/blob/48a1754b33e983a8201ca3fefa36136fa44f0c55/datafusion/physical-expr-common/src/binary_map.rs#L332

Ideally, if we have well-supported StringViewArray, we don't need specialized SSO ArrowBytesMap but convert any kind of arrays (including StringViewArray) to optimized arrow Row for group by 🤔

@alamb
Copy link
Contributor Author

alamb commented Jul 2, 2024

Ideally, if we have well-supported StringViewArray, we don't need specialized SSO ArrowBytesMap but convert any kind of arrays (including StringViewArray) to optimized arrow Row for group by 🤔

One difference is that ArrowBytesMap forms the buffer for a StringArray (including strings shorter than 12 bytes) into a single buffer. So if we changed the rest of the plan to use StringViewArray switching to use StringViewArray in grouping makes sense. However, if we had to convert from StringViewArray --> StringArray that conversion is potentially slow as it has to copy the strings around

@alamb
Copy link
Contributor Author

alamb commented Jul 24, 2024

label_issue.py automatically added labels {'arrow'} from #6005

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants