Skip to content

bucket[N] transform computes incorrect hash for decimal values #1981

@yshcz

Description

@yshcz

Apache Iceberg Rust version

None

Describe the bug

A value like -1 has multiple valid two's complement representations:

  • 1 byte: 0xFF
  • 2 bytes: 0xFF 0xFF
  • 3 bytes: 0xFF 0xFF
  • and so on

and the spec requires using the minimum representation for hashing:

Decimal values are hashed using the minimum number of bytes required to hold the unscaled value as a two's complement big-endian

The current implementation converts i128 to bytes using https://doc.rust-lang.org/std/primitive.i128.html#method.to_be_bytes, which always returns a fixed [u8; 16] array. It then tries to find the minimum representation by skipping leading 0x00 bytes:

  let bytes = v.to_be_bytes();  // always 16 bytes
  if let Some(start) = bytes.iter().position(|&x| x != 0) {
      Self::hash_bytes(&bytes[start..])
  } ...

For -1, to_be_bytes() returns [0xFF; 16]. Since no byte is 0x00, all 16 bytes are hashed instead of just [0xFF].

This causes hash mismatches with Java and Python implementations. For example, rows written by Spark with bucket[N] partitioning on a decimal column will be assigned to different buckets by iceberg-rust, causing incorrect partition pruning or writes to wrong files.

A possible fix is to use num_bigint:

use num_bigint::BigInt;

fn hash_decimal(v: i128) -> i32 {
    let bytes = BigInt::from(v).to_signed_bytes_be();
    Self::hash_bytes(&bytes)
}

But this allocates a Vec<u8> on every call. Would be better to compute the minimum representation directly from the [u8; 16]

To Reproduce

PyIceberg:

from decimal import Decimal
from pyiceberg.utils.decimal import decimal_to_bytes
import mmh3

print(mmh3.hash(decimal_to_bytes(Decimal(1))))   # -463810133
print(mmh3.hash(decimal_to_bytes(Decimal(-1))))  # -43192051

iceberg-rust:

#[test]
fn test_hash_decimal_with_negative_value() {
    assert_eq!(Bucket::hash_decimal(1), -463810133);  // passes
    assert_eq!(Bucket::hash_decimal(-1), -43192051);  // fails
}

Expected behavior

No response

Willingness to contribute

None

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions