-
Notifications
You must be signed in to change notification settings - Fork 388
Description
Apache Iceberg Rust version
None
Describe the bug
A value like -1 has multiple valid two's complement representations:
- 1 byte: 0xFF
- 2 bytes: 0xFF 0xFF
- 3 bytes: 0xFF 0xFF
- and so on
and the spec requires using the minimum representation for hashing:
Decimal values are hashed using the minimum number of bytes required to hold the unscaled value as a two's complement big-endian
The current implementation converts i128 to bytes using https://doc.rust-lang.org/std/primitive.i128.html#method.to_be_bytes, which always returns a fixed [u8; 16] array. It then tries to find the minimum representation by skipping leading 0x00 bytes:
let bytes = v.to_be_bytes(); // always 16 bytes
if let Some(start) = bytes.iter().position(|&x| x != 0) {
Self::hash_bytes(&bytes[start..])
} ...For -1, to_be_bytes() returns [0xFF; 16]. Since no byte is 0x00, all 16 bytes are hashed instead of just [0xFF].
This causes hash mismatches with Java and Python implementations. For example, rows written by Spark with bucket[N] partitioning on a decimal column will be assigned to different buckets by iceberg-rust, causing incorrect partition pruning or writes to wrong files.
A possible fix is to use num_bigint:
use num_bigint::BigInt;
fn hash_decimal(v: i128) -> i32 {
let bytes = BigInt::from(v).to_signed_bytes_be();
Self::hash_bytes(&bytes)
}But this allocates a Vec<u8> on every call. Would be better to compute the minimum representation directly from the [u8; 16]
To Reproduce
PyIceberg:
from decimal import Decimal
from pyiceberg.utils.decimal import decimal_to_bytes
import mmh3
print(mmh3.hash(decimal_to_bytes(Decimal(1)))) # -463810133
print(mmh3.hash(decimal_to_bytes(Decimal(-1)))) # -43192051iceberg-rust:
#[test]
fn test_hash_decimal_with_negative_value() {
assert_eq!(Bucket::hash_decimal(1), -463810133); // passes
assert_eq!(Bucket::hash_decimal(-1), -43192051); // fails
}Expected behavior
No response
Willingness to contribute
None