Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions rust/arrow/benches/comparison_kernels.rs
Original file line number Diff line number Diff line change
Expand Up @@ -121,8 +121,8 @@ fn bench_nlike_utf8_scalar(arr_a: &StringArray, value_b: &str) {

fn add_benchmark(c: &mut Criterion) {
let size = 65536;
let arr_a = create_primitive_array::<Float32Type>(size, 0.0);
let arr_b = create_primitive_array::<Float32Type>(size, 0.0);
let arr_a = create_primitive_array_with_seed::<Float32Type>(size, 0.0, 42);
let arr_b = create_primitive_array_with_seed::<Float32Type>(size, 0.0, 43);

let arr_string = create_string_array(size, 0.0);

Expand Down
55 changes: 55 additions & 0 deletions rust/arrow/src/buffer/mutable.rs
Original file line number Diff line number Diff line change
Expand Up @@ -415,6 +415,61 @@ impl MutableBuffer {
buffer
}

/// Creates a [`MutableBuffer`] from a boolean [`Iterator`] with a trusted (upper) length.
/// # use arrow::buffer::MutableBuffer;
/// # Example
/// ```
/// # use arrow::buffer::MutableBuffer;
/// let v = vec![false, true, false];
/// let iter = v.iter().map(|x| *x || true);
/// let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(iter) };
/// assert_eq!(buffer.len(), 1) // 3 booleans have 1 byte
/// ```
/// # Safety
/// This method assumes that the iterator's size is correct and is undefined behavior
/// to use it on an iterator that reports an incorrect length.
// This implementation is required for two reasons:
// 1. there is no trait `TrustedLen` in stable rust and therefore
// we can't specialize `extend` for `TrustedLen` like `Vec` does.
// 2. `from_trusted_len_iter_bool` is faster.
pub unsafe fn from_trusted_len_iter_bool<I: Iterator<Item = bool>>(
mut iterator: I,
) -> Self {
let (_, upper) = iterator.size_hint();
let upper = upper.expect("from_trusted_len_iter requires an upper limit");

let mut result = {
let byte_capacity: usize = upper.saturating_add(7) / 8;
MutableBuffer::new(byte_capacity)
};

'a: loop {
let mut byte_accum: u8 = 0;
let mut mask: u8 = 1;

//collect (up to) 8 bits into a byte
while mask != 0 {
if let Some(value) = iterator.next() {
byte_accum |= match value {
Copy link
Contributor

@yordan-pavlov yordan-pavlov Mar 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the bool iterator could be split into chunks (for example, using https://docs.rs/itertools/0.4.2/itertools/struct.Chunks.html or alternatively using https://doc.rust-lang.org/std/primitive.slice.html#method.chunks) of 8 bool values, then each chunk is mapped into a byte by converting each bool value into a byte (for example using std::mem::transmute::<bool, u8>), then shifting according to the position in the chunk, and applying in the output byte, and finally the resulting byte iterator would be used to build the buffer directly. This is the fastest implementation I can imagine because it eliminates as many conditions / checks as possible (and conditions are the enemy of fast).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think there are some faster ways to speed up the inner loop, yours sounds like a great idea to try out. I was also looking at the arrow2 repository of @jorgecarleitao , but I think I have been looking to an older commit before which turned out to be slower (I expected it to be faster, but sometimes the compiler can be quite surprising in what compiles to efficient code.

I think the latest version is over here:

https://github.com/jorgecarleitao/arrow2/blob/be905f1b1f0293ef427387bc35b2e9956ec3336f/src/bitmap/mutable.rs#L209

Copy link
Member

@jorgecarleitao jorgecarleitao Mar 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you all ❤️

I admit I have spent an immoral amount of time trying to optimize bitmaps, but I have unfortunately not yet concluded what is the best way to handle them. I think that we may not being able to express to the compiler what we want it to do (some kind of operation over a single byte). @yordan-pavlov suggestion is a great one in that direction, though.

FWIW, on my computer (a VM on azure), arrow master (not this PR) is giving

eq Float32              time:   [113.40 us 114.81 us 116.28 us]
eq scalar Float32       time:   [96.824 us 98.638 us 101.34 us]

and arrow2 is giving

eq Float32              time:   [84.519 us 86.065 us 87.772 us]
eq scalar Float32       time:   [57.682 us 58.315 us 59.014 us]

This PR's idea on arrow2 (with corresponding changes) is giving me -14% on eq Float32 and +35% on eq scalar Float32. I pushed these benches to master there.

Note the difference between scalar and non-scalar: it is the exact same code on the trusted_len function, but a 30% difference in performance between them; imo this indicates that we are fighting with the compiler to try to explain what we are trying to express here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I wrote a minimal repo to evaluate these things and added a reddit post to try to get some help / feedback on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yordan-pavlov's idea yields -50% on array-to-scalar and -10% on array-to-array 🚀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yordan-pavlov
I am not sure where the win would be in that case.
I would expect the first idea to be compiled to roughly the same code (if all compiler optimizations work out)?

For the Vec one - I would expect that would be slower as it introduces an extra loop / allocation and barrier for optimization?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan yes, I agree that it's counter-intuitive, but I find that the compiler often surprises me so it's best to try every option; I will try to extend the benchmark repo when I have a bit more free time later today

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, sometimes the result can be surprising.
I tried this variation, this compiles to the same unrolled 38 instructions:

            chunk.iter().map(|&c_i| c_i == rhs).enumerate().for_each(|(i, c_i)| {
                *byte |= if c_i {
                    1 << i
                } else {
                    0
                };

Copy link
Contributor

@yordan-pavlov yordan-pavlov Mar 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I tried to benchmark doing the comparison separately, but it's not faster; on my machine the fastest version is:

(0..8).for_each(|i| {
                if chunk[i] == rhs {
                    *byte = set(*byte, i)
                }
            });

and that's even faster than:

chunk.iter().enumerate().for_each(|(i, &c_i)| {
                *byte |= unsafe { mem::transmute::<bool, u8>(c_i == rhs) << i };
            });

this is with the test data configured as:

let vec = (0..20049).map(|x| (x * x + x) % 2).collect::<Vec<_>>();

I think 2000 items (the old length of the test data) is much too small for realistic benchmarking, and it would make more sense to benchmark with test data with length same as the default batch size in DataFusion (I think this was recently increased).

Copy link
Contributor Author

@Dandandan Dandandan Mar 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yordan-pavlov

Also see findings in the PR here (with updated benchmark). (x * x + x) % 2 will give 0 on even and 1 on uneven inputs I, so the pattern/branches will be very predictable, especially in the scalar version.

jorgecarleitao/arrow2#17

true => mask,
false => 0,
};
mask <<= 1;
} else {
if mask != 1 {
// Add last byte
result.push_unchecked(byte_accum);
}
break 'a;
}
}

// Soundness: from_trusted_len
result.push_unchecked(byte_accum);
}
result
}

/// Creates a [`MutableBuffer`] from an [`Iterator`] with a trusted (upper) length or errors
/// if any of the items of the iterator is an error.
/// Prefer this to `collect` whenever possible, as it is faster ~60% faster.
Expand Down
112 changes: 102 additions & 10 deletions rust/arrow/src/compute/kernels/comparison.rs
Original file line number Diff line number Diff line change
Expand Up @@ -47,17 +47,72 @@ macro_rules! compare_op {
let null_bit_buffer =
combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;

let buffer = (0..$left.len())
.map(|i| $op($left.value(i), $right.value(i)))
.collect();
let comparison = (0..$left.len()).map(|i| $op($left.value(i), $right.value(i)));
// same size as $left.len() and $right.len()
let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) };

let data = ArrayData::new(
DataType::Boolean,
$left.len(),
None,
null_bit_buffer,
0,
vec![buffer],
vec![Buffer::from(buffer)],
vec![],
);
Ok(BooleanArray::from(Arc::new(data)))
}};
}

macro_rules! compare_op_primitive {
($left: expr, $right:expr, $op:expr) => {{
if $left.len() != $right.len() {
return Err(ArrowError::ComputeError(
"Cannot perform comparison operation on arrays of different length"
.to_string(),
));
}

let null_bit_buffer =
combine_option_bitmap($left.data_ref(), $right.data_ref(), $left.len())?;

let mut values = MutableBuffer::from_len_zeroed(($left.len() + 7) / 8);
let lhs_chunks_iter = $left.values().chunks_exact(8);
let lhs_remainder = lhs_chunks_iter.remainder();
let rhs_chunks_iter = $right.values().chunks_exact(8);
let rhs_remainder = rhs_chunks_iter.remainder();
let chunks = $left.len() / 8;

values[..chunks]
.iter_mut()
.zip(lhs_chunks_iter)
.zip(rhs_chunks_iter)
.for_each(|((byte, lhs), rhs)| {
lhs.iter()
.zip(rhs.iter())
.enumerate()
.for_each(|(i, (&lhs, &rhs))| {
*byte |= if $op(lhs, rhs) { 1 << i } else { 0 };
});
});

if !lhs_remainder.is_empty() {
let last = &mut values[chunks];
lhs_remainder
.iter()
.zip(rhs_remainder.iter())
.enumerate()
.for_each(|(i, (&lhs, &rhs))| {
*last |= if $op(lhs, rhs) { 1 << i } else { 0 };
});
};
let data = ArrayData::new(
DataType::Boolean,
$left.len(),
None,
null_bit_buffer,
0,
vec![Buffer::from(values)],
vec![],
);
Ok(BooleanArray::from(Arc::new(data)))
Expand All @@ -68,17 +123,54 @@ macro_rules! compare_op_scalar {
($left: expr, $right:expr, $op:expr) => {{
let null_bit_buffer = $left.data().null_buffer().cloned();

let buffer = (0..$left.len())
.map(|i| $op($left.value(i), $right))
.collect();
let comparison = (0..$left.len()).map(|i| $op($left.value(i), $right));
// same as $left.len()
let buffer = unsafe { MutableBuffer::from_trusted_len_iter_bool(comparison) };

let data = ArrayData::new(
DataType::Boolean,
$left.len(),
None,
null_bit_buffer,
0,
vec![Buffer::from(buffer)],
vec![],
);
Ok(BooleanArray::from(Arc::new(data)))
}};
}

macro_rules! compare_op_scalar_primitive {
($left: expr, $right:expr, $op:expr) => {{
let null_bit_buffer = $left.data().null_buffer().cloned();

let mut values = MutableBuffer::from_len_zeroed(($left.len() + 7) / 8);
let lhs_chunks_iter = $left.values().chunks_exact(8);
let lhs_remainder = lhs_chunks_iter.remainder();
let chunks = $left.len() / 8;

values[..chunks]
.iter_mut()
.zip(lhs_chunks_iter)
.for_each(|(byte, chunk)| {
chunk.iter().enumerate().for_each(|(i, &c_i)| {
*byte |= if $op(c_i, $right) { 1 << i } else { 0 };
});
});
if !lhs_remainder.is_empty() {
let last = &mut values[chunks];
lhs_remainder.iter().enumerate().for_each(|(i, &lhs)| {
*last |= if $op(lhs, $right) { 1 << i } else { 0 };
});
};

let data = ArrayData::new(
DataType::Boolean,
$left.len(),
None,
null_bit_buffer,
0,
vec![buffer],
vec![Buffer::from(values)],
vec![],
);
Ok(BooleanArray::from(Arc::new(data)))
Expand All @@ -96,7 +188,7 @@ where
T: ArrowNumericType,
F: Fn(T::Native, T::Native) -> bool,
{
compare_op!(left, right, op)
compare_op_primitive!(left, right, op)
}

/// Evaluate `op(left, right)` for [`PrimitiveArray`] and scalar using
Expand All @@ -110,7 +202,7 @@ where
T: ArrowNumericType,
F: Fn(T::Native, T::Native) -> bool,
{
compare_op_scalar!(left, right, op)
compare_op_scalar_primitive!(left, right, op)
}

/// Perform SQL `left LIKE right` operation on [`StringArray`] / [`LargeStringArray`].
Expand Down
31 changes: 28 additions & 3 deletions rust/arrow/src/util/bench_util.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,15 @@

//! Utils to make benchmarking easier

use rand::distributions::{Alphanumeric, Distribution, Standard};
use rand::Rng;

use crate::array::*;
use crate::datatypes::*;
use crate::util::test_util::seedable_rng;
use rand::Rng;
use rand::SeedableRng;
use rand::{
distributions::{Alphanumeric, Distribution, Standard},
prelude::StdRng,
};

/// Creates an random (but fixed-seeded) array of a given size and null density
pub fn create_primitive_array<T>(size: usize, null_density: f32) -> PrimitiveArray<T>
Expand All @@ -43,6 +46,28 @@ where
.collect()
}

pub fn create_primitive_array_with_seed<T>(
size: usize,
null_density: f32,
seed: u64,
) -> PrimitiveArray<T>
where
T: ArrowPrimitiveType,
Standard: Distribution<T::Native>,
{
let mut rng = StdRng::seed_from_u64(seed);

(0..size)
.map(|_| {
if rng.gen::<f32>() < null_density {
None
} else {
Some(rng.gen())
}
})
.collect()
}

/// Creates an random (but fixed-seeded) array of a given size and null density
pub fn create_boolean_array(
size: usize,
Expand Down