Skip to content

Improve performance of constructing ByteViews for small strings #6034

@alamb

Description

@alamb

Part of #5374

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

NoteL This is copied from the wonderful description on #6031 from @XiangpengHao and @aoil-al

In our benchmark, every string is larger than 12 bytes, this means that when making the views, we always fall into this branch. The specialty about this branch is that it only reads the first 4 bytes of the string, and LLVM is smart enough to inline loading the values (i.e., prevent calling into copy_non_overlapping).

Is that a big deal?

Unfortunately, yes. If you change the benchmark string to very small, e.g., "test", you should be able to see (using the same benchmark script above) the performance of loading ViewArray drops significantly, and it doesn't make sense -- how could building StringViewArrays wit shorter strings be so much slower than building longer strings?

The root cause is that when string is short, we need to memcpy the bytes to the view, as the compiler has no idea how long the string is, it can not inline the bytes and has to call to memcpy, which is slow.

Describe the solution you'd like

I would like to improve the performance of StringView creation more

Describe alternatives you've considered

here's a special variant of make_view, you can replace this function with the following new implementation, and the performance improves.

This new implementation makes 13 copies of make_view_inner; each copy is a specialized length known to the compiler (LEN is const), and then the compiler can optimize each of the implementations as needed. Looking at the assembly code, there is indeed no call to memcpy.

/// Creates a view from a fixed length input (the compiler can generate
/// specialized code for this)
fn make_view_inner<const LEN: usize>(data: &[u8]) -> u128 {
    let mut view_buffer = [0; 16];
    view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
    view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
    u128::from_le_bytes(view_buffer)
}

/// Create a view based on the given data, block id and offset
#[inline(always)]
pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    // Generate specialized code for each potential small string length
    // to improve performance
    match len {
        0 => make_view_inner::<0>(data),
        1 => make_view_inner::<1>(data),
        2 => make_view_inner::<2>(data),
        3 => make_view_inner::<3>(data),
        4 => make_view_inner::<4>(data),
        5 => make_view_inner::<5>(data),
        6 => make_view_inner::<6>(data),
        7 => make_view_inner::<7>(data),
        8 => make_view_inner::<8>(data),
        9 => make_view_inner::<9>(data),
        10 => make_view_inner::<10>(data),
        11 => make_view_inner::<11>(data),
        12 => make_view_inner::<12>(data),
        _ => {
            let view = ByteView {
                length: len,
                prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
                buffer_index: block_id,
                offset,
            };
            view.into()
        }
    }
}

(special thanks to @aoli-al for triangulating the root cause and prototype the fix)

While the above is somewhat non intuitive, if it improves performance and is sufficiently commented, it would be a good thing to add

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrowChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelogparquetChanges to the parquet crate

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions