Improve performance of constructing `ByteView`s for small strings

Part of https://github.com/apache/arrow-rs/issues/5374

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

NoteL This is copied from the wonderful description on  https://github.com/apache/arrow-rs/pull/6031 from @XiangpengHao  and @aoil-al

In our benchmark, [every string is larger than 12 bytes](https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/parquet/benches/arrow_reader.rs#L288), this means that when making the views, we always fall into [this branch](https://github.com/XiangpengHao/arrow-rs/blob/view-buffer/arrow-array/src/builder/generic_bytes_view_builder.rs#L419-L424). The specialty about this branch is that it only reads the first 4 bytes of the string, and LLVM is smart enough to inline loading the values (i.e., prevent calling into `copy_non_overlapping`). 

#### Is that a big deal?
Unfortunately, yes. If you change the benchmark string to very small, e.g., `"test"`, you should be able to see (using the same benchmark script above) the performance of loading `ViewArray` drops significantly, and it doesn't make sense -- how could building `StringViewArray`s wit shorter strings be so much slower than building longer strings?

The root cause is that when string is short, we need to `memcpy` the bytes to the `view`, as the compiler has no idea how long the string is, it can not inline the bytes and has to call to `memcpy`, which is slow.

**Describe the solution you'd like**

I would like to improve the performance of StringView creation more

**Describe alternatives you've considered**

here's a special variant of `make_view`, you can replace [this function](https://github.com/apache/arrow-rs/blob/master/arrow-array/src/builder/generic_bytes_view_builder.rs#L411) with the following new implementation, and the performance improves. 

This new implementation makes 13 copies of make_view_inner; each copy is a specialized length known to the compiler (`LEN` is const), and then the compiler can optimize each of the implementations as needed. Looking at the assembly code, there is indeed no call to `memcpy`.

```rust
/// Creates a view from a fixed length input (the compiler can generate
/// specialized code for this)
fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
    let mut view_buffer = [0; 16];
    view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
    u128::from_le_bytes(view_buffer)
}

/// Create a view based on the given data, block id and offset
#[inline(always)]
pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    // Generate specialized code for each potential small string length
    // to improve performance
    match len {
        0 => make_view_inner::<0>(data),
        1 => make_view_inner::<1>(data),
        2 => make_view_inner::<2>(data),
        3 => make_view_inner::<3>(data),
        4 => make_view_inner::<4>(data),
        5 => make_view_inner::<5>(data),
        6 => make_view_inner::<6>(data),
        7 => make_view_inner::<7>(data),
        8 => make_view_inner::<8>(data),
        9 => make_view_inner::<9>(data),
        10 => make_view_inner::<10>(data),
        11 => make_view_inner::<11>(data),
        12 => make_view_inner::<12>(data),
        _ => {
            let view = ByteView {
                length: len,
                prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
                buffer_index: block_id,
                offset,
            };
            view.into()
        }
    }
}

```
 (special thanks to @aoli-al for triangulating the root cause and prototype the fix)


While the above is somewhat non intuitive, if it improves performance and is sufficiently commented, it would be a good thing to add



**Additional context**

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve performance of constructing `ByteView`s for small strings #6034

Is that a big deal?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Improve performance of constructing ByteViews for small strings #6034

Description

Is that a big deal?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Improve performance of constructing `ByteView`s for small strings #6034