Skip to content

Support Utf8View for Avro #7434

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 21, 2025
Merged

Conversation

kumarlokesh
Copy link
Contributor

@kumarlokesh kumarlokesh commented Apr 22, 2025

Which issue does this PR close?

Closes #7262.

Rationale for this change

What changes are included in this PR?

This PR adds support for Utf8View to the arrow-avro reader, enabling the creation of StringViewArray instead of regular StringArray for string data.

  • Added Utf8View variant to Codec enum and with_utf8view method for codec conversion
  • Updated Decoder enum with a StringView variant to handle StringViewArray creation
  • Added a configuration option use_utf8view to ReadOptions struct
  • Added benchmarks
  • Created an example application demonstrating how to use the feature

Benchmark output:

array_creation/string_array_10_chars                                                                            
                        time:   [9.5921 µs 9.6520 µs 9.7296 µs]
                        change: [-7.4725% -5.8906% -4.2024%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_view_10_chars                                                                            
                        time:   [10.053 µs 10.128 µs 10.207 µs]
                        change: [-3.1158% -0.9311% +1.0989%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild
array_creation/string_array_100_chars                                                                            
                        time:   [16.242 µs 16.372 µs 16.576 µs]
                        change: [-1.3374% +0.4015% +2.4089%] (p = 0.71 > 0.05)
                        No change in performance detected.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_view_100_chars                                                                            
                        time:   [13.027 µs 13.290 µs 13.592 µs]
                        change: [-16.763% -12.893% -9.4265%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe
array_creation/string_array_1000_chars                                                                           
                        time:   [76.218 µs 77.540 µs 78.691 µs]
                        change: [-8.4455% -6.3045% -4.1053%] (p = 0.00 < 0.05)
                        Performance has improved.
array_creation/string_view_1000_chars                                                                            
                        time:   [45.964 µs 46.399 µs 46.770 µs]
                        change: [-8.7387% -3.9085% -0.5912%] (p = 0.07 > 0.05)
                        No change in performance detected.

string_operations/string_array_value_10_chars                                                                            
                        time:   [546.50 ns 565.27 ns 585.60 ns]
                        change: [-16.303% -10.503% -5.2608%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
string_operations/string_view_value_10_chars                                                                            
                        time:   [973.27 ns 992.09 ns 1.0101 µs]
                        change: [-14.527% -5.2942% +1.9214%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) low mild
  2 (10.00%) high mild
string_operations/string_array_value_100_chars                                                                            
                        time:   [548.03 ns 559.51 ns 573.23 ns]
                        change: [+1.1381% +3.0271% +5.0339%] (p = 0.01 < 0.05)
                        Performance has regressed.
string_operations/string_view_value_100_chars                                                                            
                        time:   [987.22 ns 1.0068 µs 1.0248 µs]
                        change: [-8.6431% +0.4588% +6.7220%] (p = 0.93 > 0.05)
                        No change in performance detected.
string_operations/string_array_value_1000_chars                                                                            
                        time:   [545.70 ns 556.69 ns 571.33 ns]
                        change: [-13.942% -8.4282% -3.7589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) high mild
  2 (10.00%) high severe
string_operations/string_view_value_1000_chars                                                                            
                        time:   [947.71 ns 960.19 ns 974.77 ns]
                        change: [+0.9469% +4.6877% +8.4201%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild

avro_reader/string_array_10_chars                                                                           
                        time:   [149.00 µs 152.79 µs 155.90 µs]
                        change: [+2.5473% +5.4637% +8.5654%] (p = 0.00 < 0.05)
                        Performance has regressed.
avro_reader/string_view_10_chars                                                                           
                        time:   [141.75 µs 142.96 µs 144.64 µs]
                        change: [-14.307% -5.2067% +1.3228%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
avro_reader/string_array_100_chars                                                                           
                        time:   [173.20 µs 177.79 µs 183.34 µs]
                        change: [-4.6247% +1.1333% +5.8015%] (p = 0.72 > 0.05)
                        No change in performance detected.
avro_reader/string_view_100_chars                                                                           
                        time:   [174.64 µs 178.95 µs 183.48 µs]
                        change: [-10.183% -0.2004% +7.0467%] (p = 0.97 > 0.05)
                        No change in performance detected.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
avro_reader/string_array_1000_chars                                                                           
                        time:   [404.78 µs 415.51 µs 428.94 µs]
                        change: [-12.609% -2.9688% +4.7088%] (p = 0.64 > 0.05)
                        No change in performance detected.
avro_reader/string_view_1000_chars                                                                           
                        time:   [392.38 µs 403.15 µs 416.30 µs]
                        change: [+4.5005% +11.040% +18.788%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe

From the above benchmark results, we can draw following conclusions:

1. Array Creation Performance
1.1 Small Strings (10 chars):
StringArray: 9.65 µs
StringViewArray: 10.13 µs
StringArray is about 5% faster for small strings

1.2 Medium Strings (100 chars):
StringArray: 16.37 µs
StringViewArray: 13.29 µs
StringViewArray is about 19% faster

1.3 Large Strings (1000 chars):
StringArray: 77.54 µs
StringViewArray: 46.40 µs
StringViewArray is about 40% faster

2. String Value Access Performance
Value Access (all string sizes):
StringArray: ~560 ns
StringViewArray: ~980 ns
StringArray is consistently about 43% faster for value access

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 22, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kumarlokesh -- this looks really nice

To make reviewing this PR easier, is there any way you can break it into multiple smaller PRs? Specifically,

  1. Improve the comments
  2. Support for Map
  3. Support for Utf8View?

Also it is not clear to me how much of a conflict this PR would have with:

pub fn codec(&self) -> &Codec {
&self.codec
}

/// Returns the nullability status of this data type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really nice -- thank you

ComplexType::Map(m) => Err(ArrowError::NotYetImplemented(format!(
"Map of {m:?} not currently supported"
))),
ComplexType::Map(m) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be a new feature (not just string view, but also map support)

Have you had a chance to review this PR?

@@ -23,9 +23,16 @@ use std::io::Read;
pub const CODEC_METADATA_KEY: &str = "avro.codec";

#[derive(Debug, Copy, Clone, Eq, PartialEq)]
/// Supported compression codecs for Avro data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these comments

@kumarlokesh kumarlokesh marked this pull request as draft April 28, 2025 20:17
@kumarlokesh
Copy link
Contributor Author

@alamb moved improvements to comments to another PR #7449

@kumarlokesh kumarlokesh force-pushed the support-utf8-view-for-avro branch from 13d05d5 to ae4fe0b Compare May 18, 2025 16:27
@kumarlokesh kumarlokesh marked this pull request as ready for review May 18, 2025 16:49
@kumarlokesh
Copy link
Contributor Author

@alamb updated this PR to remove "2. Support for Map". The sole focus of this PR is "3. Support for Utf8View".

In terms of scope of work between between #7434 and #6965, don't see a major overlap between what's intended. Therefore, I think the code conflict between the two should be minimal, and fairly trivial to resolve.

Thank you @kumarlokesh -- this looks really nice

To make reviewing this PR easier, is there any way you can break it into multiple smaller PRs? Specifically,

  1. Improve the comments
  2. Support for Map
  3. Support for Utf8View?

Also it is not clear to me how much of a conflict this PR would have with:

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @kumarlokesh -- I think this looks really nicely documented and tested.

I had a few API questions / suggestions but I don't think they are necessary.

It seems to me that this PR does not change any APIs (only adds new ones) so I think we could merge it before the next major release. What do you think?

Thanks again for pushing this along and breaking it up into smaller chunks

}

#[test]
fn test_string_with_utf8view_enabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -36,19 +36,37 @@ pub mod reader;
/// Avro schema parsing and representation
///
/// Provides types for parsing and representing Avro schema definitions.
mod schema;
pub mod schema;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it needed to make these module public?

/// * `use_utf8view` - If true, use StringViewArray instead of StringArray for string data
pub fn try_new_with_options(
data_type: &AvroDataType,
use_utf8view: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we use ReadOptions here rather than the single field? That would make updating these easier in the future

/// into Arrow's StringViewArray instead of the standard StringArray.
///
/// Default: false
pub use_utf8view: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice if this field was not pub -- that way we could add new options to ReadOptions in the future without it being a breaking API change

@alamb alamb added the next-major-release the PR has API changes and it waiting on the next major version label May 19, 2025
@alamb
Copy link
Contributor

alamb commented May 19, 2025

Marking as next-major-release as a precaution to avoid merging until we convince ourselves this PR has no breaking changes

@kumarlokesh
Copy link
Contributor Author

Thank you @kumarlokesh -- I think this looks really nicely documented and tested.

I had a few API questions / suggestions but I don't think they are necessary.

It seems to me that this PR does not change any APIs (only adds new ones) so I think we could merge it before the next major release. What do you think?

Thanks again for pushing this along and breaking it up into smaller chunks

@alamb thanks for the review! Yes, I think this PR could be merged before the next major release.

Addressed review comments here cf50b5b

@alamb alamb removed the next-major-release the PR has API changes and it waiting on the next major version label May 21, 2025

let mut file = temp_file.reopen()?;

file.write_all(b"AVRO")?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered why this benchmark isn't using a avro writer and then I realized it is because there is no writer in avro-arrow 👍

@@ -139,6 +155,11 @@ pub enum Codec {
Binary,
/// String data represented as UTF-8 encoded bytes, corresponding to Arrow's StringArray
Utf8,
/// String data represented as UTF-8 encoded bytes with an optimized view representation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not pub it seems: https://docs.rs/arrow-avro/latest/arrow_avro/?search=Codec

Though now I look it seems there is nothing that is pub: https://docs.rs/arrow-avro/latest/arrow_avro/all.html 🤔

@alamb
Copy link
Contributor

alamb commented May 21, 2025

Let's keep thing moving and merge this PR.

@tustvold do you remember what the plan / status of the arrow-avro crate is? Did we mark it as experimental or something so we could change the APIs between major releases?

@alamb alamb merged commit 36b2a27 into apache:main May 21, 2025
24 checks passed
@alamb
Copy link
Contributor

alamb commented May 21, 2025

Thanks again @kumarlokesh

@kumarlokesh kumarlokesh deleted the support-utf8-view-for-avro branch May 21, 2025 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Utf8View for avro
2 participants