Improve coalesce and concat performance for views #7614
Dandandan merged 20 commits into apache:main
@alamb FYI
Pull Request Overview
This PR targets performance improvements for view concatenation and coalescing in Arrow by specializing functions and reducing unnecessary memory copies. Key changes include:
- Adding specialized concatenation functions (concat_byte_view and concat_string_view) for view types.
- Refactoring gc_string_view_batch to take ownership of RecordBatches to avoid redundant cloning.
- Updating the GenericByteViewBuilder implementation to use a vector-based buffer instead of a BufferBuilder for better performance and clarity.
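To illustrate why a plain `Vec<u128>` is a natural fit for the views buffer, here is a minimal, hypothetical sketch (std-only, not the arrow-rs implementation) of the Arrow view layout: each view is a packed 16-byte value, so `Vec<u128>` can hold them directly and benefit from std's `extend_from_slice`/`collect` optimizations. The `inline_view` helper is invented for this sketch.

```rust
// Hypothetical helper, not part of arrow-rs: packs a short string into the
// Arrow "inline view" layout (length in the low 4 bytes, data in the rest).
fn inline_view(s: &str) -> u128 {
    // Inline form: strings of <= 12 bytes live entirely inside the view.
    assert!(s.len() <= 12, "this sketch only handles inline views");
    let mut bytes = [0u8; 16];
    // First 4 bytes: length, little-endian, per the Arrow view layout.
    bytes[0..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    // Remaining bytes: the string data itself.
    bytes[4..4 + s.len()].copy_from_slice(s.as_bytes());
    u128::from_le_bytes(bytes)
}

fn main() {
    // A views buffer is then just a Vec of these packed values.
    let views: Vec<u128> = ["a", "hello", "world!"]
        .iter()
        .map(|s| inline_view(s))
        .collect();
    // The low 4 bytes of each view hold the length.
    assert_eq!((views[1] & 0xFFFF_FFFF) as u32, 5);
    println!("{} views", views.len());
}
```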
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| arrow-select/src/concat.rs | Introduces optimized concat functions for Utf8View and BinaryView. |
| arrow-select/src/coalesce.rs | Refactors gc_string_view_batch for improved efficiency and updates associated tests. |
| arrow-array/src/builder/generic_bytes_view_builder.rs | Replaces BufferBuilder with Vec for views and updates related debug output. |
Comments suppressed due to low confidence (3)
arrow-select/src/concat.rs:89
- Consider adding inline documentation for the concat_byte_view function to explain its purpose, expected input, and output behavior.
```rust
fn concat_byte_view(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> {
```
arrow-select/src/coalesce.rs:305
- Add a comment to explain the safety guarantees around using the unsafe StringViewArray::new_unchecked call and why it is acceptable in this context.
```rust
let gc_string = unsafe {
```
arrow-array/src/builder/generic_bytes_view_builder.rs:449
- Update the debug field label from "views_builder" to "views_buffer" so that the debug output accurately reflects the name of the underlying variable.
```rust
.field("views_builder", &self.views_buffer)
```
LOVE IT!
alamb
left a comment
This is a thing of Beauty @Dandandan
I kicked off some other benchmarks just to be sure and left some small potential improvements, but I also think this is better than what is currently on main, so we could merge it as is and keep improving in a follow-on.
```diff
 /// using [`GenericByteViewBuilder::try_append_view`]
 pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
-    views_builder: BufferBuilder<u128>,
+    views_buffer: Vec<u128>,
```

```rust
self.flush_in_progress();
self.completed.extend(array.data_buffers().iter().cloned());

if self.completed.is_empty() {
```
This checks `completed.is_empty` after pushing the new data buffers, which I think means the fast path will never be taken. I think the check could be done prior to calling `self.completed.extend` to improve performance.
I couldn't resist and tried the following change locally and saw some additional improvements in concat but also some slowdowns with entirely inlined views, which I don't understand 🤔
```diff
diff --git a/arrow-array/src/builder/generic_bytes_view_builder.rs b/arrow-array/src/builder/generic_bytes_view_builder.rs
index 1e06cd7698..3936ca06d4 100644
--- a/arrow-array/src/builder/generic_bytes_view_builder.rs
+++ b/arrow-array/src/builder/generic_bytes_view_builder.rs
@@ -121,7 +121,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// growing buffer size exponentially from 8KB up to 2MB. The
     /// first buffer allocated is 8KB, then 16KB, then 32KB, etc up to 2MB.
     ///
-    /// If this method is used, any new buffers allocated are
+    /// If this method is used, any new buffers allocated are
     /// exactly this size. This can be useful for advanced users
     /// that want to control the memory usage and buffer count.
     ///
@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let update_views = !(self.completed.is_empty() || array.data_buffers().is_empty());
+
         self.completed.extend(array.data_buffers().iter().cloned());
-        if self.completed.is_empty() {
+        if update_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;
```

```text
group                                                          coalesce_improvements      coalesce_improvements_fix
-----                                                          ---------------------      -------------------------
concat utf8_view max_str_len=128 null_density=0                1.53   80.4±0.42µs ? ?/sec 1.00   52.5±2.85µs ? ?/sec
concat utf8_view max_str_len=128 null_density=0.2              1.51   86.4±0.34µs ? ?/sec 1.00   57.3±2.97µs ? ?/sec
concat utf8_view max_str_len=20 null_density=0                 1.58   80.2±0.36µs ? ?/sec 1.00   50.6±2.33µs ? ?/sec
concat utf8_view max_str_len=20 null_density=0.2               1.65   98.9±0.36µs ? ?/sec 1.00   59.9±2.19µs ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00   40.6±2.82µs ? ?/sec 1.94   78.7±1.18µs ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00   53.9±2.78µs ? ?/sec 1.95  105.1±0.26µs ? ?/sec
```
Shouldn't this be:
```diff
@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let keep_views = self.completed.is_empty() || array.data_buffers().is_empty();
+
         self.completed.extend(array.data_buffers().iter().cloned());
-        if self.completed.is_empty() {
+        if keep_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;
```
For me this is an all-positive change 🎉 I really like the >2x improvement on inlined views.
```text
concat utf8_view all_inline max_str_len=12 null_density=0
    time:   [25.775 µs 27.046 µs 28.546 µs]
    change: [−55.148% −53.600% −51.725%] (p = 0.00 < 0.05)
    Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
    7 (7.00%) low mild
    5 (5.00%) high mild
    11 (11.00%) high severe

concat utf8_view max_str_len=20 null_density=0
    time:   [53.475 µs 53.934 µs 54.689 µs]
    change: [−7.1979% −6.4937% −5.5943%] (p = 0.00 < 0.05)
    Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
    1 (1.00%) low mild
    3 (3.00%) high mild
    3 (3.00%) high severe

concat utf8_view max_str_len=128 null_density=0
    time:   [52.999 µs 53.106 µs 53.245 µs]
    change: [−7.0948% −6.6147% −6.0319%] (p = 0.00 < 0.05)
    Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
    7 (7.00%) high mild
    8 (8.00%) high severe

concat utf8_view all_inline max_str_len=12 null_density=0.2
    time:   [25.532 µs 25.741 µs 25.943 µs]
    change: [−56.519% −56.108% −55.710%] (p = 0.00 < 0.05)
    Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
    8 (8.00%) low mild
    3 (3.00%) high mild

concat utf8_view max_str_len=20 null_density=0.2
    time:   [56.062 µs 56.342 µs 56.688 µs]
    change: [−6.1050% −5.3513% −4.6591%] (p = 0.00 < 0.05)
    Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
    1 (1.00%) low mild
    4 (4.00%) high severe

concat utf8_view max_str_len=128 null_density=0.2
    time:   [55.406 µs 55.623 µs 55.891 µs]
    change: [−6.0536% −5.4045% −4.8059%] (p = 0.00 < 0.05)
    Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
    1 (1.00%) low mild
    1 (1.00%) high mild
    12 (12.00%) high severe
```
🤖: Benchmark completed
🤔 seems there is no
I also ran this PR with a new concat benchmark, and it also shows pretty sweet gains.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?

- Related to #7614

# Rationale for this change

We expect #7614 to improve concat performance, but we don't have any concat benchmarks for StringViewArray. Let's fix that.

# What changes are included in this PR?

1. Add a `concat` benchmark for StringViewArray

# Are there any user-facing changes?

No, this is an internal benchmarking tool only.
alamb
left a comment
Looking great -- I kicked off some more tests
```rust
fn gc_string_view_batch(batch: RecordBatch) -> RecordBatch {
    let (schema, columns, num_rows) = batch.into_parts();
    let new_columns: Vec<ArrayRef> = columns
        .iter()
```
I don't know if it matters but in #7620 I simply modified the existing Vec rather than collecting into a new one to save maybe an allocation
```rust
for mut c in columns.iter_mut() {
    ...
}
```
AFAIK there is a Rust compiler optimization that will reuse a Vec allocation if using into_iter / collect (on owned Vec) 🤔
This is one of the reasons I think using Vec in arrow-rs makes sense: there are many optimizations like this that would not be feasible to implement in arrow-rs itself.
Yeah I think using Vec is a great idea - it is both simpler and as can be seen by this PR often quite a bit faster
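The allocation-reuse optimization mentioned above can be observed with a std-only sketch (the names here stand in for the column Vec in `gc_string_view_batch`): collecting an owned `Vec` through `into_iter().map(..)` into a `Vec` of the same size and alignment can reuse the original allocation. This is a std specialization, not a language guarantee, so the pointer comparison is only indicative.

```rust
fn main() {
    // Stand-in for the owned column Vec taken from RecordBatch::into_parts.
    let columns: Vec<u64> = (0..1024).collect();
    let ptr_before = columns.as_ptr();

    // Owned Vec + into_iter + map + collect: std can reuse the allocation.
    let new_columns: Vec<u64> = columns.into_iter().map(|c| c + 1).collect();

    // Typically true on current std, but not guaranteed; don't rely on it.
    println!("allocation reused: {}", ptr_before == new_columns.as_ptr());
    assert_eq!(new_columns.len(), 1024);
    assert_eq!(new_columns[0], 1);
}
```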
```rust
    })
    .collect();

let buffers = if buffer.is_empty() {
```
I think the only way this could happen is if ideal_buffer_size is zero, right? In that case we could make a more efficient inner loop that simply copied the views without checking length, etc
```rust
if ideal_buffer_size == 0 {
    // just reuse views as is...
}
```
🤖: Benchmark completed
```rust
let views: Vec<u128> = s
    .views()
    .as_ref()
    .iter()
    .cloned()
    .map(|v| {
        let mut b: ByteView = ByteView::from(v);

        if b.length > 12 {
            let offset = buffer.len() as u32;
            buffer.extend_from_slice(
                buffers[b.buffer_index as usize]
                    .get(b.offset as usize..b.offset as usize + b.length as usize)
                    .expect("Invalid buffer slice"),
            );
            b.offset = offset;
            b.buffer_index = 0; // Set buffer index to 0, as we only have one buffer
        }

        b.into()
    })
    .collect();
```
I'm curious whether it would be better to implement it like this:
```rust
let mut views = Vec::with_capacity(s.views().len());
for &v in s.views().as_ref() {
    let mut b: ByteView = ByteView::from(v);
    if b.length > 12 {
        ...
    }
    views.push(b.into());
}
```
The vector pre-allocates and there is no clone 🤔
No, that is generally worse: `.iter().collect()` on Rust core types optimizes to better code:
- it knows the exact final length based on the slice and does some extra optimizations based on that ([`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html))
- `push` generally needs to check whether the item fits capacity on each call (so it adds some branching)

The `cloned()` you see in `iter().cloned()` will generally not make a difference in the generated code for primitive types.
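A std-only sketch contrasting the two styles discussed above (the `adjust` helper is a stand-in for the `ByteView` fix-up, not the arrow-rs code). Both produce identical results; the iterator version lets std use the slice's exact length (`TrustedLen`) to allocate once and fill without per-element capacity checks, while `push` must branch on capacity in the general case.

```rust
// Stand-in for the per-view adjustment done in gc_string_view_batch.
fn adjust(v: u128) -> u128 {
    v | 1
}

fn main() {
    let src: Vec<u128> = (0..8u128).collect();

    // Style 1: iterator chain, as in the PR. Length known up front.
    let a: Vec<u128> = src.iter().cloned().map(adjust).collect();

    // Style 2: manual pre-allocated loop, as the reviewer suggested.
    // with_capacity avoids reallocation, but each push still checks capacity.
    let mut b = Vec::with_capacity(src.len());
    for &v in &src {
        b.push(adjust(v));
    }

    // Both styles produce the same views.
    assert_eq!(a, b);
    println!("{} views adjusted", a.len());
}
```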
Thanks for the reviews @alamb - let's continue making it faster!
(#7623) This reverts commit 7739a83.

# Which issue does this PR close?

# Rationale for this change

I found these errors in DataFusion (see apache/datafusion#16249 (comment)), so let's revert it and find the error.

# What changes are included in this PR?

# Are there any user-facing changes?
…apache#7614)" (apache#7623)" This reverts commit da461c8.
(#7625) Reapply "… (#7614)" (#7623). This reverts commit da461c8. This adds a test and a fix for the wrong-index issue. I also verified the change with DataFusion (and benchmarks show notable improvements).

# Which issue does this PR close?

Closes #NNN.

# Rationale for this change

# What changes are included in this PR?

# Are there any user-facing changes?
Which issue does this PR close?
- Improve performance of `coalesce` and `concat` for views: #7615
- `coalesce` kernel and `BatchCoalescer` for statefully combining selected batches: #7597

Rationale for this change
Improve performance of `gc_string_view_batch`

What changes are included in this PR?

Are there any user-facing changes?
no