Skip to content

Hasher::write should clarify its "whole unit" behaviour #94026

Closed
@scottmcm

Description

@scottmcm

Inspired by https://users.rust-lang.org/t/hash-prefix-collisions/71823/10?u=scottmcm

Hash::hash_slice has a bunch of text clarifying that h.hash_slice(&[a, b]); h.hash_slice(&[c]); is not guaranteed to be the same as h.hash_slice(&[a]); h.hash_slice(&[b, c]);.

However, Hasher::write is unclear whether that same rule applies to it. It's very clear that .write(&[a]) is not the same as .write_u8(a), but not whether the same sequence of bytes to write is supposed to be the same thing, even if they're in different groupings, like h.write(&[a, b]); h.write(&[c]); vs h.write(&[a]); h.write(&[b, c]);.

This is important for the same kind of things as the VecDeque example mentioned on hash_slice. If I have a circular byte buffer, is it legal for its Hash to just .write the two parts? Or does it need to write_u8 all the individual bytes since two circular buffers should compare equal regardless of where the split happens to be?

Given that Hash for str and Hash for [T] are doing prefix-freedom already, it feels to me like write should not be doing it again.

Also, our SipHasher implementation is going out of its way to maintain the "different chunking of writes is fine":

fn write(&mut self, msg: &[u8]) {
let length = msg.len();
self.length += length;
let mut needed = 0;
if self.ntail != 0 {
needed = 8 - self.ntail;
// SAFETY: `cmp::min(length, needed)` is guaranteed to not be over `length`
self.tail |= unsafe { u8to64_le(msg, 0, cmp::min(length, needed)) } << (8 * self.ntail);
if length < needed {
self.ntail += length;
return;
} else {
self.state.v3 ^= self.tail;
S::c_rounds(&mut self.state);
self.state.v0 ^= self.tail;
self.ntail = 0;
}
}
// Buffered tail is now flushed, process new input.
let len = length - needed;
let left = len & 0x7; // len % 8
let mut i = needed;
while i < len - left {
// SAFETY: because `len - left` is the biggest multiple of 8 under
// `len`, and because `i` starts at `needed` where `len` is `length - needed`,
// `i + 8` is guaranteed to be less than or equal to `length`.
let mi = unsafe { load_int_le!(msg, i, u64) };
self.state.v3 ^= mi;
S::c_rounds(&mut self.state);
self.state.v0 ^= mi;
i += 8;
}
// SAFETY: `i` is now `needed + len.div_euclid(8) * 8`,
// so `i + left` = `needed + len` = `length`, which is by
// definition equal to `msg.len()`.
self.tail = unsafe { u8to64_le(msg, i, left) };
self.ntail = left;
}

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.

cc #80303 which lead to this text in hash_slice.

Metadata

Metadata

Assignees

No one assigned

    Labels

    T-libs-apiRelevant to the library API team, which will review and decide on the PR/issue.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions