Optimization: concat function #9732
Conversation
Looks very nice to me @JasonLi-cn -- thank you
}
}

struct StringArrayBuilder {
I think some comments that explained how this was different than https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html would help. Maybe simply a note that it didn't check UTF8 again?
I wonder if we could get the same effect by adding an unsafe function to StringBuilder, like
/// Adds bytes to the in progress string, without checking for valid utf8
///
/// Safety: requires that bytes are valid utf8, otherwise an invalid StringArray will result
unsafe fn append_unchecked(&mut self, bytes: &[u8])
And then using StringBuilder here
The Write trait impl of StringBuilder can meet my current needs, but it is not very convenient to use, so I've defined a StringArrayBuilder. I agree with your suggestion to add an unsafe function to StringBuilder.
use std::fmt::Write;

let mut builder = GenericStringBuilder::<i32>::new();
// Write data
write!(builder, "foo").unwrap();
write!(builder, "bar").unwrap();
// Finish value
builder.append_value("baz");
// Write second value
write!(builder, "v2").unwrap();
builder.append_value("");
let array = builder.finish();
assert_eq!(array.value(0), "foobarbaz");
assert_eq!(array.value(1), "v2");
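For illustration, the proposed append_unchecked could be approximated today on top of this Write impl by skipping validation with std::str::from_utf8_unchecked. The helper below is only a sketch; its name and shape follow the proposal above and are not part of this PR or of arrow-rs:

use std::fmt::Write;
use arrow::array::GenericStringBuilder;

/// Append raw bytes to the in-progress value without re-validating UTF8.
///
/// Safety: `bytes` must be valid UTF8, e.g. copied out of an existing StringArray.
unsafe fn append_unchecked(builder: &mut GenericStringBuilder<i32>, bytes: &[u8]) {
    // from_utf8_unchecked skips validation; write_str then copies the bytes
    // into the current value without finishing it.
    let s = unsafe { std::str::from_utf8_unchecked(bytes) };
    builder
        .write_str(s)
        .expect("writing to a string builder does not fail");
}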
Another issue is that we don't need NullBufferBuilder here, but GenericByteBuilder defaults to using NullBufferBuilder, which I believe introduces unnecessary overhead.
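To make the overhead point concrete: when the validity of the result is tracked elsewhere, a StringArray can be assembled from offsets and values directly, passing None for the null buffer, so no NullBufferBuilder is involved at all. A minimal sketch; the helper name and its inputs are illustrative assumptions, not the PR's code:

use arrow::array::StringArray;
use arrow::buffer::{Buffer, OffsetBuffer, ScalarBuffer};

fn finish_without_nulls(offsets: Vec<i32>, values: Vec<u8>) -> StringArray {
    // OffsetBuffer::new checks that the offsets are non-empty, non-negative,
    // and monotonically non-decreasing.
    let offsets = OffsetBuffer::new(ScalarBuffer::from(offsets));
    // Passing None means no validity bitmap is allocated or maintained.
    StringArray::new(offsets, Buffer::from(values), None)
}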
fn write<const CHECK_VALID: bool>(&mut self, column: &ColumnarValueRef, i: usize) {
    match column {
        ColumnarValueRef::Scalar(s) => {
            self.value_buffer.extend_from_slice(s);
Are the primary speed savings gained from not checking UTF8 validity (and just copying byte slices)?
- My initial intention for optimizing this function was to avoid creating a new String and then using push_str each time in the concat function, like:
let mut owned_string: String = "".to_owned();
for arg in args {
match arg {
ColumnarValue::Scalar(ScalarValue::Utf8(maybe_value)) => {
if let Some(value) = maybe_value {
owned_string.push_str(value);
}
}
ColumnarValue::Array(v) => {
if v.is_valid(index) {
let v = as_string_array(v).unwrap();
owned_string.push_str(v.value(index));
}
}
_ => unreachable!(),
}
}
Some(owned_string)
- Additionally, by precalculating the expected length of the result, I avoided the need to reallocate memory (a simplified sketch of this idea follows the list).
- I used the extend_from_slice function because I referred to the append_slice function of BufferBuilder.
#[inline]
pub fn append_slice(&mut self, slice: &[T]) {
self.buffer.extend_from_slice(slice);
self.len += slice.len();
}
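A simplified, standalone sketch of the first two points (precompute the final length, then copy without reallocation). The function and its input are purely illustrative and are not the PR's code:

fn concat_parts(parts: &[&str]) -> String {
    // Precalculate the exact output length so the buffer is allocated once.
    let total_len: usize = parts.iter().map(|s| s.len()).sum();
    let mut out = String::with_capacity(total_len);
    for part in parts {
        // push_str copies bytes into the already-sized buffer, never reallocating.
        out.push_str(part);
    }
    out
}

The PR applies the same idea at the buffer level, using extend_from_slice as in the append_slice snippet above.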
I also filed #9742 to track this improvement
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look.
Force-pushed from ceba13d to 9d985c1 (commits: fix: concat_ws; chore: add license header; add arrow feature; update concat).
Thank you @JasonLi-cn -- I think this looks good to me.
I had some suggestions to improve the comments, but I think we could do that as a follow-on PR if you prefer.
let mut offsets_buffer = MutableBuffer::with_capacity(
    (item_capacity + 1) * std::mem::size_of::<i32>(),
);
unsafe { offsets_buffer.push_unchecked(0_i32) };
Is it really necessary to avoid the bounds check for a single offset?
Since it is safe here and there is a theoretical performance improvement, I have opted to use push_unchecked in this case.
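Putting that reasoning together in one standalone sketch (assuming item_capacity is the number of rows; this mirrors the diff above but is not the PR's code):

use arrow::buffer::MutableBuffer;

fn init_offsets_buffer(item_capacity: usize) -> MutableBuffer {
    // Room is reserved up front for item_capacity + 1 offsets.
    let mut offsets_buffer =
        MutableBuffer::with_capacity((item_capacity + 1) * std::mem::size_of::<i32>());
    // Safety: the reservation above guarantees space for at least one i32,
    // so skipping the capacity check for the initial offset cannot overflow.
    unsafe { offsets_buffer.push_unchecked(0_i32) };
    offsets_buffer
}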
unsafe { self.offsets_buffer.push_unchecked(next_offset) };
}

fn finish(self, null_buffer: Option<NullBuffer>) -> StringArray {
I was trying to think if there was some way to create an invalid result with this API, and I think the answer is no (even if append_offset was never called, the offsets would still be valid).
Thanks again @JasonLi-cn 🙏
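As a quick check of the "answer is no" reasoning above: with only the initial 0 offset and no values, the result is simply a valid, zero-length StringArray. The snippet below is illustrative only and uses the public constructor rather than the PR's builder:

use arrow::array::StringArray;
use arrow::buffer::{Buffer, OffsetBuffer, ScalarBuffer};

fn empty_result_is_still_valid() {
    // offsets = [0], empty values, no nulls -> a valid, zero-length StringArray.
    let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0_i32]));
    let array = StringArray::new(offsets, Buffer::from(Vec::<u8>::new()), None);
    assert_eq!(array.len(), 0);
}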
Which issue does this PR close?
Closes #9742
Rationale for this change
Optimize the concat and concat_ws functions.
Benchmark (only concat)
Attention
For the purpose of benchmarking, I haven't officially replaced the concat and concat_ws functions yet. If the community finds this PR meaningful, I will proceed with the replacement.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?