Improve coalesce and concat performance for views
#7614
Conversation
Force-pushed from 9ff57e5 to b3d4337, then from b3d4337 to c3648f4.
Title changed: "gc_string_view_batch performance" → "coalesce performance" → "coalesce and concat performance for views"
@alamb FYI
Pull Request Overview
This PR targets performance improvements for view concatenation and coalescing in Arrow by specializing functions and reducing unnecessary memory copies. Key changes include:
- Adding specialized concatenation functions (concat_byte_view and concat_string_view) for view types.
- Refactoring gc_string_view_batch to take ownership of RecordBatches to avoid redundant cloning.
- Updating the GenericByteViewBuilder implementation to use a vector-based buffer instead of a BufferBuilder for better performance and clarity.
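To make the trick behind these specialized functions concrete, here is a hedged, self-contained sketch (plain Rust, not the PR's actual code): when view arrays are concatenated, the data buffers can be appended wholesale, and each packed u128 view only needs its buffer_index shifted by the number of buffers already present.

```rust
// Packed view layout for non-inline values (per the Arrow spec):
//   bits 0..32: length, 32..64: prefix, 64..96: buffer_index, 96..128: offset
// `starting_buffer` is the number of data buffers already in the builder.
fn shift_buffer_index(view: u128, starting_buffer: u32) -> u128 {
    let length = view as u32;
    if length > 12 {
        // extract bits 64..96, bump the buffer index, and splice it back in
        let buffer_index = (view >> 64) as u32 as u128;
        let bumped = (buffer_index + starting_buffer as u128) << 64;
        (view & !(0xFFFF_FFFFu128 << 64)) | bumped
    } else {
        // values of 12 bytes or fewer are stored inline in the view itself,
        // so there is no buffer_index to adjust
        view
    }
}

fn main() {
    let inline = 3u128; // length = 3: data is inline, view unchanged
    assert_eq!(shift_buffer_index(inline, 2), inline);

    let long = 20u128; // length = 20, buffer_index = 0, offset = 0
    assert_eq!((shift_buffer_index(long, 2) >> 64) as u32, 2);
}
```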
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| arrow-select/src/concat.rs | Introduces optimized concat functions for Utf8View and BinaryView. |
| arrow-select/src/coalesce.rs | Refactors gc_string_view_batch for improved efficiency and updates associated tests. |
| arrow-array/src/builder/generic_bytes_view_builder.rs | Replaces BufferBuilder with Vec for views and updates related debug output. |
Comments suppressed due to low confidence (3)
arrow-select/src/concat.rs:89
- Consider adding inline documentation for the concat_byte_view function to explain its purpose, expected input, and output behavior.
fn concat_byte_view(arrays: &[&dyn Array]) -> Result<ArrayRef, ArrowError> {
arrow-select/src/coalesce.rs:305
- Add a comment to explain the safety guarantees around using the unsafe StringViewArray::new_unchecked call and why it is acceptable in this context (see the sketch after this list).
let gc_string = unsafe {
arrow-array/src/builder/generic_bytes_view_builder.rs:449
- Update the debug field label from "views_builder" to "views_buffer" so that the debug output accurately reflects the name of the underlying variable.
.field("views_builder", &self.views_buffer)
LOVE IT!
alamb left a comment
This is a thing of Beauty @Dandandan
I kicked off some other benchmarks just to be sure and left some small potential improvements, but I also think this is better than what is currently on main, so we could merge it as is and keep improving in a follow-on.
```diff
 /// using [`GenericByteViewBuilder::try_append_view`]
 pub struct GenericByteViewBuilder<T: ByteViewType + ?Sized> {
-    views_builder: BufferBuilder<u128>,
+    views_buffer: Vec<u128>,
```
this is a great idea 💯
```rust
self.flush_in_progress();
self.completed.extend(array.data_buffers().iter().cloned());

if self.completed.is_empty() {
```
This checks `completed.is_empty` after pushing new data buffers, which I think means the fast path will never be taken. I think the check could be done prior to calling `self.completed.extend`, which would improve performance.
I couldn't resist and tried the following change locally, and saw some additional improvements in concat but also some slowdowns with entirely inlined views, which I don't understand 🤔
```diff
diff --git a/arrow-array/src/builder/generic_bytes_view_builder.rs b/arrow-array/src/builder/generic_bytes_view_builder.rs
index 1e06cd7698..3936ca06d4 100644
--- a/arrow-array/src/builder/generic_bytes_view_builder.rs
+++ b/arrow-array/src/builder/generic_bytes_view_builder.rs
@@ -121,7 +121,7 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// growing buffer size exponentially from 8KB up to 2MB. The
     /// first buffer allocated is 8KB, then 16KB, then 32KB, etc up to 2MB.
     ///
-    /// If this method is used, any new buffers allocated are 
+    /// If this method is used, any new buffers allocated are
     /// exactly this size. This can be useful for advanced users
     /// that want to control the memory usage and buffer count.
     ///
@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let update_views = !(self.completed.is_empty() || array.data_buffers().is_empty());
+
         self.completed.extend(array.data_buffers().iter().cloned());
-        if self.completed.is_empty() {
+        if update_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;
```

```
group                                                          coalesce_improvements         coalesce_improvements_fix
-----                                                          ---------------------         -------------------------
concat utf8_view max_str_len=128 null_density=0                1.53   80.4±0.42µs ? ?/sec    1.00   52.5±2.85µs ? ?/sec
concat utf8_view max_str_len=128 null_density=0.2              1.51   86.4±0.34µs ? ?/sec    1.00   57.3±2.97µs ? ?/sec
concat utf8_view max_str_len=20 null_density=0                 1.58   80.2±0.36µs ? ?/sec    1.00   50.6±2.33µs ? ?/sec
concat utf8_view max_str_len=20 null_density=0.2               1.65   98.9±0.36µs ? ?/sec    1.00   59.9±2.19µs ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0      1.00   40.6±2.82µs ? ?/sec    1.94   78.7±1.18µs ? ?/sec
concat utf8_view all_inline max_str_len=12 null_density=0.2    1.00   53.9±2.78µs ? ?/sec    1.95  105.1±0.26µs ? ?/sec
```
Shouldn't this be:
```diff
@@ -210,9 +210,11 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
     /// and add the (adapted) views.
     pub fn append_array(&mut self, array: &GenericByteViewArray<T>) {
         self.flush_in_progress();
+        let keep_views = self.completed.is_empty() || array.data_buffers().is_empty();
+
         self.completed.extend(array.data_buffers().iter().cloned());
-        if self.completed.is_empty() {
+        if keep_views {
             self.views_buffer.extend_from_slice(array.views());
         } else {
             let starting_buffer = self.completed.len() as u32;
```
For me this is an all-positive change 🎉 I really like the >2x improvement on inlined views.
```
concat utf8_view all_inline max_str_len=12 null_density=0
                        time:   [25.775 µs 27.046 µs 28.546 µs]
                        change: [−55.148% −53.600% −51.725%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 23 outliers among 100 measurements (23.00%)
  7 (7.00%) low mild
  5 (5.00%) high mild
  11 (11.00%) high severe

concat utf8_view max_str_len=20 null_density=0
                        time:   [53.475 µs 53.934 µs 54.689 µs]
                        change: [−7.1979% −6.4937% −5.5943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

concat utf8_view max_str_len=128 null_density=0
                        time:   [52.999 µs 53.106 µs 53.245 µs]
                        change: [−7.0948% −6.6147% −6.0319%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) high mild
  8 (8.00%) high severe

concat utf8_view all_inline max_str_len=12 null_density=0.2
                        time:   [25.532 µs 25.741 µs 25.943 µs]
                        change: [−56.519% −56.108% −55.710%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) low mild
  3 (3.00%) high mild

concat utf8_view max_str_len=20 null_density=0.2
                        time:   [56.062 µs 56.342 µs 56.688 µs]
                        change: [−6.1050% −5.3513% −4.6591%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high severe

concat utf8_view max_str_len=128 null_density=0.2
                        time:   [55.406 µs 55.623 µs 55.891 µs]
                        change: [−6.0536% −5.4045% −4.8059%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe
```
> Shouldn't this be:

🤦
🤔 seems there is no concat benchmark
I also ran this PR with a new concat benchmark, and it also shows pretty sweet gains.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Which issue does this PR close?
- Related to #7614
# Rationale for this change
We expect #7614 to improve concat performance, but we don't have any concat benchmarks for StringViewArray. Let's fix that.
# What changes are included in this PR?
1. Add a `concat` benchmark for StringViewArray
# Are there any user-facing changes?
No, this is an internal benchmarking tool only.
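A hedged sketch of what such a criterion benchmark could look like (the data shape, sizes, and bench name are assumptions, not the benchmark actually added):

```rust
use arrow_array::StringViewArray;
use arrow_select::concat::concat;
use criterion::{criterion_group, criterion_main, Criterion};

fn bench_concat_utf8_view(c: &mut Criterion) {
    // Strings longer than 12 bytes exercise the out-of-line view path.
    let a: StringViewArray = (0..8192)
        .map(|i| Some(format!("value-{i}-with-enough-padding")))
        .collect();
    let b = a.clone();

    c.bench_function("concat utf8_view", |bench| {
        bench.iter(|| concat(&[&a, &b]).unwrap())
    });
}

criterion_group!(benches, bench_concat_utf8_view);
criterion_main!(benches);
```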
alamb left a comment
Looking great -- I kicked off some more tests
```rust
fn gc_string_view_batch(batch: RecordBatch) -> RecordBatch {
    let (schema, columns, num_rows) = batch.into_parts();
    let new_columns: Vec<ArrayRef> = columns
        .iter()
```
I don't know if it matters, but in #7620 I simply modified the existing Vec rather than collecting into a new one, to save maybe an allocation:

```rust
for mut c in columns.iter_mut() {
    ...
}
```
AFAIK there is a Rust compiler optimization that will reuse a Vec allocation if using `into_iter` / `collect` (on an owned Vec) 🤔
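A minimal, self-contained illustration of that point (plain std Rust, not arrow-rs code; allocation reuse is an implementation detail of the standard library, not a documented guarantee):

```rust
fn main() {
    let columns: Vec<u64> = (0..1024).collect();
    let ptr_before = columns.as_ptr();

    // `into_iter` consumes the Vec, so `collect` is free to write the mapped
    // values back into the same allocation (same element size and alignment).
    let new_columns: Vec<u64> = columns.into_iter().map(|c| c * 2).collect();

    // In practice these pointers compare equal: no new allocation was made.
    println!("allocation reused: {}", ptr_before == new_columns.as_ptr());
}
```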
This is one of the reasons I think using Vec in arrow-rs makes sense: there are many optimizations like this that are not feasible to implement in arrow-rs itself.
Yeah, I think using Vec is a great idea - it is both simpler and, as can be seen in this PR, often quite a bit faster.
```rust
    })
    .collect();

let buffers = if buffer.is_empty() {
```
I think the only way this could happen is if `ideal_buffer_size` is zero, right? In that case we could make a more efficient inner loop that simply copied the views without checking length, etc:

```rust
if ideal_buffer_size == 0 {
    // just reuse views as is...
}
```
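A hedged, self-contained sketch of that fast path (the names are stand-ins, not the PR's actual code): when `ideal_buffer_size` is zero, every value fits inline in its view (12 bytes or fewer), so no data buffers exist and the packed views can be copied verbatim instead of being rewritten one by one.

```rust
fn gc_views(src_views: &[u128], ideal_buffer_size: usize) -> Vec<u128> {
    if ideal_buffer_size == 0 {
        // all values are inline: nothing references a data buffer, so the
        // views can be reused as-is -- effectively a single memcpy
        src_views.to_vec()
    } else {
        // general path: rewrite offset/buffer_index per view, as in the
        // snippet discussed further down this review
        unimplemented!()
    }
}
```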
Added 🚀
```rust
let views: Vec<u128> = s
    .views()
    .as_ref()
    .iter()
    .cloned()
    .map(|v| {
        let mut b: ByteView = ByteView::from(v);

        if b.length > 12 {
            let offset = buffer.len() as u32;
            buffer.extend_from_slice(
                buffers[b.buffer_index as usize]
                    .get(b.offset as usize..b.offset as usize + b.length as usize)
                    .expect("Invalid buffer slice"),
            );
            b.offset = offset;
            b.buffer_index = 0; // Set buffer index to 0, as we only have one buffer
        }

        b.into()
    })
    .collect();
```
I'm curious whether it would be better to implement it like this:

```rust
let mut views = Vec::with_capacity(s.views().len());
for &v in s.views().as_ref() {
    let mut b: ByteView = ByteView::from(v);
    if b.length > 12 {
        ...
    }
    views.push(b.into());
}
```

The vector pre-allocates and there is no clone 🤔
No, that is generally worse - `.iter().collect()` on Rust core types optimizes to better code:

- it knows the exact final length based on the slice and does some extra optimizations based on that ([`TrustedLen`](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html))
- `push` generally needs to check whether the item fits the capacity on each call (so it adds some branching)

The `cloned()` you see in `iter().cloned()` will generally not make a difference for the generated code on primitive types.
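To make the comparison concrete, a hedged, self-contained illustration in plain std Rust (not arrow-rs code):

```rust
// Slices report an exact length (TrustedLen), so `collect` can allocate once
// and write elements without per-element capacity checks.
fn via_collect(src: &[u128]) -> Vec<u128> {
    src.iter().cloned().map(|v| v.wrapping_add(1)).collect()
}

// `push` must check capacity on every call, adding a branch per element,
// even though the Vec is pre-allocated with `with_capacity`.
fn via_push(src: &[u128]) -> Vec<u128> {
    let mut out = Vec::with_capacity(src.len());
    for &v in src {
        out.push(v.wrapping_add(1));
    }
    out
}

fn main() {
    let src = vec![1u128, 13, 42];
    assert_eq!(via_collect(&src), via_push(&src));
}
```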
Thanks for the reviews @alamb - let's continue making it faster!
#7623) This reverts commit 7739a83.
# Which issue does this PR close?
# Rationale for this change
I found these errors in DataFusion (see apache/datafusion#16249 (comment)), so let's revert it and find the error.
# What changes are included in this PR?
# Are there any user-facing changes?
…apache#7614)" (apache#7623)" This reverts commit da461c8.
#7625) … (#7614)" (#7623)" This reverts commit da461c8.

This adds a test and fix for the wrong index issue. I also verified the change for DataFusion (and benchmarks show notable improvements).
# Which issue does this PR close?
Closes #NNN.
# Rationale for this change
# What changes are included in this PR?
# Are there any user-facing changes?
Which issue does this PR close?
- `coalesce` and `concat` for views: #7615
- `coalesce` kernel and `BatchCoalescer` for statefully combining selected batches: #7597

Rationale for this change
Improve performance of `gc_string_view_batch`

What changes are included in this PR?

Are there any user-facing changes?
no