[Variant] Optimize the object header generation logic in ObjectBuilder::finish #8031
base: main
Conversation
@alamb @scovich @viirya Please help review this when you're free. Thanks. I've tried to benchmark three implementations:
append_packed_u32 into a tmp buf
PackedU32Iterator with extend
The results show that the first wins (but not by much).
(if is_large { 4 } else { 1 }) + // num_fields
(num_fields * id_size as usize) + // field IDs
((num_fields + 1) * offset_size as usize); // field offsets + data_size
let num_fileds_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
This variable name seems to have a typo:
- let num_fileds_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
+ let num_fields_size = if is_large { 4 } else { 1 }; // is_large: 4 bytes, else 1 byte.
also: I personally find the comment unhelpful -- it just restates the code.
/// An iterator that yields the bytes of a packed u32 iterator.
/// Will yield the first `packed_bytes` bytes of each item in the iterator.
struct PackedU32Iterator<T: Iterator<Item = [u8; 4]>> {
    packed_bytes: usize,
I think to get this really fast we will need to use generics based on the packed_bytes size -- so we would end up with different versions of the code for 1,2 and 4 byte offsets
We could try to make this particular iterator generic, but it might get a bit messy.
I was thinking maybe we can somehow structure the code so there is a function like this that writes the header for a certain offset size:
fn write_header<const SIZE: usize>(dst: &mut Vec<u8>, ...) {
...
}
Then we would basically have a switch like this to instantiate the appropriate versions (can probably avoid the panic)
match int_size(max_id as usize) {
1 => write_header::<1>(dst, ...),
2 => write_header::<2>(dst, ...),
4 => write_header::<4>(dst, ...),
_ => panic!("unsupported size")
}
Even just specializing the PackedU32Iterator with a const N: usize
parameter would probably give most of the benefit, for this iterator-based approach?
However -- the temp buf append+truncate is fundamentally faster on a per-entry basis, because of how much simpler it is at machine code level. And it will run at exactly the same speed for all packing sizes, because there are no branches or even conditional moves based on the packing size (**). The only downside is the up-front cost of allocating said temp buffer first, which has to amortize across all entries.
(**) If the compiler were a bit smarter, it could even eliminate the bounds checks in the loop's push calls (based on the result of the with_capacity that preceded it), and also eliminate the bounds checks in the loop's truncate calls (based on the fact that the arg passed to truncate is never larger than the vec size).
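As a concrete illustration of the append+truncate pattern being discussed, a minimal sketch (hypothetical helper, not the actual `append_packed_u32` implementation):

```rust
// Illustrative sketch of the append+truncate pattern: always write all
// four little-endian bytes, then truncate back to the packed width.
// Hypothetical helper, not the actual append_packed_u32 implementation.
fn append_packed_u32s(values: &[u32], nbytes: usize) -> Vec<u8> {
    let mut buf = Vec::with_capacity(values.len() * 4);
    for &v in values {
        let len = buf.len();
        buf.extend_from_slice(&v.to_le_bytes()); // unconditional 4-byte write
        buf.truncate(len + nbytes); // keep only the first `nbytes` of them
    }
    buf
}
```

The per-entry body is branchless with respect to `nbytes`, which is the property the analysis above relies on.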
I haven't looked into the variant implementation or this PR in detail, but does this iterator need a custom implementation, or can it be an anonymous impl Iterator<Item = u8>? It looks like the implementation is basically iter.flat_map(u32::to_le_bytes). The benefit of that is that if the underlying iterator has a trusted length then the flat-mapped iterator is still trusted, and extending a slice from a TrustedLen iterator is much more efficient.
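For the full-width case, the flat_map form is a one-liner. A sketch (only valid when all four bytes of each value are kept; flat_map over a `[u8; 4]` cannot drop a per-entry tail):

```rust
// Sketch of the flat_map equivalent mentioned above. Arrays implement
// IntoIterator by value, so `u32::to_le_bytes` can feed flat_map directly.
// This only covers the 4-byte case; it cannot keep just the first
// `packed_bytes` bytes of each entry.
fn flat_map_bytes(values: &[u32]) -> Vec<u8> {
    values.iter().copied().flat_map(u32::to_le_bytes).collect()
}
```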
@jhorstmann IIUC, the benefit of this iterator is its size_hint: Rust can use the size_hint to do a better job, since it knows the size of the iterator.
However -- the temp buf append+truncate is fundamentally faster on a per-entry basis, because of how much simpler it is at machine code level. And it will run at exactly the same speed for all packing sizes, because there are no branches or even conditional moves based on the packing size (**). The only downside is the up-front cost of allocating said temp buffer first, which has to amortize across all entries.
Yes I agree with this analysis.
This is why my opinion still remains that the way to avoid the allocation is to calculate the header size and shift the bytes first so we write directly to the target. I realize that is trickier code, but as @klion26 has said somewhere else I think we could write some good assertions and tests to avoid problems
When I next get a chance to work on Variant related things with time, I want to focus on unsticking the shredding implementation. Once that is done, I may return here to try and restructure things
One thought I had was to encapsulate the logic / templating in a
struct ListHeaderWriter<'a> {
    offsets: &'a [usize],
}
impl ListHeaderWriter<'_> {
    fn header_size<const OFFSET_SIZE: usize>(&self) -> usize {
        // put the header size calculation here
    }
    // write the header into dst starting at start_offset
    fn write<const OFFSET_SIZE: usize>(&self, dst: &mut Vec<u8>, start_offset: usize) {
        ...
    }
}
Seems you are right: FlatMap would only work for constant-length arrays, but not dynamic slices whose length is based on packed_bytes. And the TrustedLen optimization also cannot kick in, since the base iterator is from a crate outside the standard library.
Thanks for taking the time to do this!
Technically, the first two are equivalent because each result's mean is comfortably inside the other's confidence window. While it may be suggestive that the temp buf approach has a higher mean in all four results, any real difference is dwarfed by noise:
Meanwhile, it's unsurprising in retrospect that the complex iterator approach would be slower than the append+truncate approach, given how simple (branchless!) the latter's machine code is.
I mean, yes we can. But given a choice to eliminate a bug surface entirely from the code base, vs. try to validate it with unit tests that could be incomplete and/or go out of sync? I almost always favor eliminating the possibility of bugs over testing for them -- unless the perf and/or complexity difference is large enough to make me question that default choice.
If we care about the performance (and correctness) of that case, it seems like we need to have benchmarks and unit tests to cover it?
this appears somewhat complicated for what it is doing and doesn't really make a huge difference
If we want to measure the cost of safety, it might be instructive to modify append_offset_array_start_from_buf_pos to take the same approach as append_packed_u32 in writing more bytes than we keep -- something like:
let mut current_pos = start_pos;
for relative_offset in offsets {
- write_offset_at_pos(buf, current_pos, relative_offset, nbytes);
+ write_u32_at_pos(buf, current_pos, relative_offset as u32);
current_pos += nbytes as usize;
}
// Write data_size
if let Some(data_size) = data_size {
// Write data_size at the end of the offsets
write_offset_at_pos(buf, current_pos, data_size, nbytes);
current_pos += nbytes as usize;
}
where
// always writes the four LE bytes of a value, relying on the caller to only advance
// the buffer's position by the number of bytes that should be kept (1..=4).
//
// WARNING: Caller must ensure that it's safe to produce as many as 3 extra bytes
// without corrupting the buffer. Usually by ensuring that something else overwrites
// the space immediately after.
fn write_u32_at_pos(buf: &mut [u8], start_pos: usize, value: u32) {
let bytes = value.to_le_bytes();
buf[start_pos..start_pos + bytes.len()].copy_from_slice(&bytes);
}
Highly dangerous -- it relies on the fact that the field array is followed by the value array, and the value array is followed by the data size (which is written in the safer/slower way) -- but we could at least see whether it gives enough performance boost to consider keeping it.
IMO, if we want to keep the current unsafe approach that uses offset arithmetic in a pre-allocated buffer... we may as well go all the way and maximize the performance benefit we gain by giving up safety.
let header = object_header(is_large, id_size, offset_size);
let bytess_to_splice = std::iter::once(header)
- let bytess_to_splice = std::iter::once(header)
+ let bytes_to_splice = std::iter::once(header)
@alamb I think it might be better to stick with the current solution: 1) The solution in the current PR is more complex than the original, but the performance improvement isn't significant; 2) I originally hoped to keep the same implementation for ListBuilder/ObjectBuilder, but it seems that changes to ObjectBuilder might result in performance regressions.
Thanks for your detailed response. I agree with your point. It's more reliable to fundamentally address the potential for errors; this way, even if there's no test to guarantee this logic for any reason in the future, it will continue to run correctly. I'm still a little confused about the benchmark results, though. @alamb @scovich Finally, thank you for your detailed review and guidance on these PRs. I've learned a lot from them. Thank you again.
Yes I agree with this plan for now
Thank you again for all the help
I suspect it's because the compiler can't "see through" the packed iterator like it can with the append+truncate approach. So the former produces 1-4 individual byte appends, each with its own bounds check and length change, while the latter produces a single 32-bit append followed by a truncate.
If you hit a buffer overflow panic, I think that means you tried to use the "fast" version to write the
The same argument likely applies to the object builder. Should we consider reverting that back to match the list builder, for simplicity and maintainability? I'd have to go back and check the other PR, but was the perf improvement actually significant there? If so => why did it not work for the list builder? Do we not have apples-to-apples benchmarking? If not => probably good to revert.
It would panic in the following test when creating a variant with
Sorry, I've rechecked the current
header_pos = self
    .parent_state
    .buffer()
    .append_offset_array_start_from_buf_pos(
Not related to this PR, but append_offset_array_start_from_buf_pos also writes field ids instead of just offsets.
Will change the function name after this PR is finalized.
}

impl<T: Iterator<Item = [u8; 4]>> PackedU32Iterator<T> {
    fn new(packed_bytes: usize, iterator: T) -> Self {
It's better to add an assert on packed_bytes.
Thanks for pointing it out; adding an assert is indeed better behavior.
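The suggested guard might look like the sketch below (the struct is reproduced from the snippet quoted earlier in the thread; the assert message is illustrative):

```rust
/// An iterator that yields the first `packed_bytes` bytes of each item.
struct PackedU32Iterator<T: Iterator<Item = [u8; 4]>> {
    packed_bytes: usize,
    iterator: T,
}

impl<T: Iterator<Item = [u8; 4]>> PackedU32Iterator<T> {
    fn new(packed_bytes: usize, iterator: T) -> Self {
        // Reject widths outside the valid packed range up front,
        // as suggested in the review.
        assert!(
            (1..=4).contains(&packed_bytes),
            "packed_bytes must be between 1 and 4"
        );
        Self { packed_bytes, iterator }
    }
}
```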
buffer.splice(
    starting_offset..starting_offset,
    std::iter::repeat_n(0u8, header_size),
let fields = PackedU32Iterator::new(
Honestly, the current approach looks easier to understand at first glance. This PackedU32Iterator approach is a bit of an over-abstraction to me. For appending the bytes into the buffer, I'd prefer to keep it simple instead of introducing a new abstraction if there isn't much benefit.
Ah... I didn't think about that. With one-byte offsets,
Converting to a draft as I think we are leaning towards not merging this one and I want to clean up the PR queue
Which issue does this PR close?
This PR wants to optimize the logic of ObjectBuilder::finish
Rationale for this change
This PR wants to optimize the logic of ObjectBuilder::finish
What changes are included in this PR?
This PR wants to optimize ObjectBuilder::finish with a PackedU32 iterator.
Are these changes tested?
This PR is covered by existing tests
Are there any user-facing changes?
No