Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: concat function #9732

Merged
merged 4 commits into from
Apr 4, 2024

Conversation

JasonLi-cn
Copy link
Contributor

@JasonLi-cn JasonLi-cn commented Mar 22, 2024

Which issue does this PR close?

Closes #9742

Rationale for this change

optimize concat and concat_ws function.

Benchmark(only concat)

Gnuplot not found, using plotters backend
concat function/concat(old)/1024
                        time:   [91.562 µs 91.725 µs 91.905 µs]
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
concat function/concat(new)/1024
                        time:   [17.831 µs 17.877 µs 17.934 µs]
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

concat function/concat(old)/4096
                        time:   [357.95 µs 358.98 µs 360.02 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
concat function/concat(new)/4096
                        time:   [71.003 µs 71.168 µs 71.326 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

concat function/concat(old)/8192
                        time:   [758.82 µs 761.43 µs 764.82 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
concat function/concat(new)/8192
                        time:   [141.11 µs 141.44 µs 141.79 µs]

Attention

For the purpose of benchmarking, I haven't officially replaced the concat and concat_ws function yet. If the community finds this PR meaningful, I will proceed with the replacement.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very nice to me @JasonLi-cn -- thank you

datafusion/physical-expr/benches/concat.rs Outdated Show resolved Hide resolved
}
}

struct StringArrayBuilder {
Copy link
Contributor

@alamb alamb Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some comments that explained how this was different than https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html would help. Maybe simply a note that it didn't check UTF8 again?

I wonder if we could get the same effect by adding an unsafe function to StringBuilder, like

/// Adds bytes to the in progress string, without checking for valid utf8
/// 
/// Safety: requires that bytes are valid utf8, otherwise an invalid StringArray will result
unsafe fn append_unchecked(&mut self, bytes: &[u8]) 

And then using StringBuilder here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Write trait impl of StringBuilder can meet my current needs, but it is not very convenient to use, so I've defined a StringArrayBuilder. I agree with your suggestion to add an unsafe function to StringBuilder.

https://github.com/apache/arrow-rs/blob/9f36c883459405ecd9a5f4fdfa9a3317ab52302c/arrow-array/src/builder/generic_bytes_builder.rs#L231-L257

let mut builder = GenericStringBuilder::<i32>::new();

 // Write data
write!(builder, "foo").unwrap();
write!(builder, "bar").unwrap();

// Finish value
builder.append_value("baz");

// Write second value
write!(builder, "v2").unwrap();
builder.append_value("");
let array = builder.finish();

assert_eq!(array.value(0), "foobarbaz");
assert_eq!(array.value(1), "v2");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue is that we don't need NullBufferBuilder here, but GenericByteBuilder defaults to using NullBufferBuilder, which I believe introduces unnecessary overhead.

fn write<const CHECK_VALID: bool>(&mut self, column: &ColumnarValueRef, i: usize) {
match column {
ColumnarValueRef::Scalar(s) => {
self.value_buffer.extend_from_slice(s);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the primary speed savings gained from not checking UTF8 validity (and just copying byte slices)?

Copy link
Contributor Author

@JasonLi-cn JasonLi-cn Mar 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. My initial intention for optimizing this function was to avoid creating a new String and then using push_str each time in concat function, like:
let mut owned_string: String = "".to_owned();
for arg in args {
    match arg {
        ColumnarValue::Scalar(ScalarValue::Utf8(maybe_value)) => {
            if let Some(value) = maybe_value {
                owned_string.push_str(value);
            }
        }
        ColumnarValue::Array(v) => {
            if v.is_valid(index) {
                let v = as_string_array(v).unwrap();
                owned_string.push_str(v.value(index));
            }
        }
        _ => unreachable!(),
    }
}
Some(owned_string)
  1. Additionally, by precalculating the expected length of the result, I avoided the need to reallocate memory.
  2. I used the extend_from_slice function because I referred to the append_slice function of BufferBuilder.
    #[inline]
    pub fn append_slice(&mut self, slice: &[T]) {
        self.buffer.extend_from_slice(slice);
        self.len += slice.len();
    }

@alamb
Copy link
Contributor

alamb commented Mar 22, 2024

I also filed #9742 to track this improvement

@alamb
Copy link
Contributor

alamb commented Mar 24, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft March 24, 2024 09:50
fix: concat_ws

chore: add license header

add arrow feature

update concat
@JasonLi-cn JasonLi-cn force-pushed the feature/optimize_concat_function branch from ceba13d to 9d985c1 Compare March 30, 2024 14:04
@JasonLi-cn JasonLi-cn marked this pull request as ready for review March 30, 2024 14:25
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THank you @JasonLi-cn -- I think this looks good to me

I had some suggestions to improve the comments but I think we could do that as a follow on PR if you prefer

let mut offsets_buffer = MutableBuffer::with_capacity(
(item_capacity + 1) * std::mem::size_of::<i32>(),
);
unsafe { offsets_buffer.push_unchecked(0_i32) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to avoid the bounds check for a single offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is safe here and there is a theoretical performance improvement, I have opted to use push_unchecked in this case.

unsafe { self.offsets_buffer.push_unchecked(next_offset) };
}

fn finish(self, null_buffer: Option<NullBuffer>) -> StringArray {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to think if there was some way to create an invalid result with this API and I think the answer is No (even if append_offset wasn't called ever the offsets would still be valid.

@alamb alamb merged commit 24fc99c into apache:main Apr 4, 2024
25 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 4, 2024

Thanks again @JasonLi-cn 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve the performance of concat and concat_ws
2 participants