Switch the fingerprint algo to xxh3_128 #1630
Replace the per-row Python loops in the DataFrame fingerprinting paths with single-buffer hashing:

- pandas: hash the `hash_pandas_object(obj).values` uint64 buffer in one shot instead of round-tripping through `.to_dict()` and an ordered `hash_mapping`; fold column names + dtypes (schema) into the hash so frames with identical values but different schemas no longer collide; keep the path order-sensitive.
- polars: hash the `hash_rows().to_numpy()` buffer in one shot instead of `.to_list()` through a per-element `hash_sequence` loop.

Both paths route through the existing `_hash_bytes` chokepoint, so the algorithm is unchanged here.

The DataFrame digest is deliberately not pinned to a literal (it depends on library-version-specific dtype reprs); coverage is via relational schema-collision, dtype-collision and order-sensitivity tests for both backends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
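As a rough illustration of the single-buffer approach this commit describes, the sketch below hashes the `hash_pandas_object(obj).values` buffer in one call and prepends a schema string. The function name `fingerprint_frame`, the schema encoding, and the NUL separator are assumptions made for this example, not the project's actual code; md5 is used because this commit leaves the algorithm unchanged.

```python
import base64
import hashlib

import pandas as pd


def fingerprint_frame(df: pd.DataFrame) -> str:
    """Hypothetical helper: order-sensitive, schema-aware DataFrame digest."""
    # Schema: column names and dtype strings, in column order.
    schema = repr([(str(name), str(dtype)) for name, dtype in df.dtypes.items()])
    # One uint64 hash per row (index included by default), as a numpy buffer.
    row_hashes = pd.util.hash_pandas_object(df).values
    # Hash schema + row-hash buffer in a single call, with no per-row Python loop.
    payload = schema.encode("utf-8") + b"\x00" + row_hashes.tobytes()
    digest = hashlib.md5(payload).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")
```

With this shape, two frames holding equal values under different column names produce different digests, and reversing the row order changes the digest as well.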
Swap the single `_hash_bytes` chokepoint from md5 to the non-cryptographic `xxhash.xxh3_128`. xxh3_128 produces a 16-byte digest (24 base64url chars, identical width to the md5 it replaces), so digest width and collision resistance are preserved while throughput on buffer-bound paths rises substantially.

Declare `xxhash>=0.8.0` as a core runtime dependency (xxh3_128 was added in 0.8.0); fingerprinting is imported eagerly via the caching adapter, so it must be a hard dependency rather than an optional extra. Add the xxhash BSD-2-Clause attribution to LICENSE.

Recompute the portable literal-digest pins (primitives, sequences, mappings, sets, numpy) against xxh3_128.

This is a fingerprint-changing release: prior cached fingerprints no longer match and will be recomputed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
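A minimal sketch of what the swap amounts to, assuming the chokepoint base64url-encodes the raw digest. The helper names are hypothetical and the real `_hash_bytes` signature may differ. The `try`/`except` around the import exists only so the sketch runs where xxhash is not installed; the change itself makes xxhash a hard dependency.

```python
import base64
import hashlib

try:
    import xxhash  # third-party; xxh3_128 is available from 0.8.0
except ImportError:  # sketch-only fallback, not part of the actual change
    xxhash = None


def _encode(digest: bytes) -> str:
    # 16 raw bytes become 24 base64url characters (padding kept).
    return base64.urlsafe_b64encode(digest).decode("ascii")


def hash_bytes_md5(data: bytes) -> str:
    """Before: md5 digest of the buffer."""
    return _encode(hashlib.md5(data).digest())


def hash_bytes_xxh3(data: bytes) -> str:
    """After: xxh3_128 digest of the same buffer (requires xxhash)."""
    return _encode(xxhash.xxh3_128(data).digest())
```

Both functions return strings of the same width, which is why cache keys keep their shape even though every value changes.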
Maybe just me, but this seems to give no benefit for Compared to #1619, it seems like #1628 actually does the lion's share and not the switch to
Yup. I would look at this differently though:
Split 3 of 3 of #1619 (stacked on #1629)
`_hash_bytes` and tag value types #1628). `xxhash.xxh3_128`.

What this does
Change in `_hash_bytes`: `hashlib.md5(data).digest()` → `xxhash.xxh3_128(data).digest()`.

`xxh3_128` is a non-cryptographic hash designed for speed. It produces a 16-byte (128-bit) digest, the same width as md5, so the base64url-encoded fingerprints stay 24 characters and all downstream interfaces (cache keys, `data_version` strings) are unaffected.

Adds `xxhash>=0.8.0` as a runtime dependency in `pyproject.toml`. The BSD 2-Clause license text for python-xxhash is appended to `LICENSE`.

Benchmark
`benchmark_hash_algorithm.py` isolates the algorithm swap from the vectorization (#1629), holding the implementation constant and varying only the hash. The raw algorithm is ~8–27× faster, but the end-to-end gain depends on how much of total time the hash step occupies:

- pandas DataFrames: ~1× (negligible) at all sizes. `hash_pandas_object` takes 1–6,400 ms; the hash step takes 0.006–52 ms with md5. Hashing is ≤1.5% of end-to-end time. The algorithm swap is invisible here.
- polars DataFrames: 1.3–2.3×. `hash_rows` is fast enough that the hash step is a meaningful fraction of end-to-end time. The gain grows with size as the hash step's share increases.
- numpy arrays: 1.4–6.5×. Peaks at mid sizes (~50k) where `tobytes()` is negligible and hashing dominates; at very small sizes per-call overhead limits the gain, at large sizes the memcpy dominates.
- Scalars, sequences, mappings (not benchmarked in isolation): estimated ~1.4–2×, dominated by per-call overhead and base64 encoding. These are the most frequent fingerprint calls in a typical DAG (node inputs, configs, primitives).
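The dependence on the hash step's share of total time is Amdahl's law. The sketch below is a back-of-the-envelope model that plugs in the figures quoted above; it is not output from `benchmark_hash_algorithm.py`, and `end_to_end_speedup` is a hypothetical helper.

```python
def end_to_end_speedup(hash_fraction: float, raw_speedup: float) -> float:
    """Amdahl's law: only the hash step's share of total time gets faster.

    hash_fraction: share of end-to-end time spent in the hash step (0..1).
    raw_speedup:   how much faster the new hash is on that step alone.
    """
    return 1.0 / ((1.0 - hash_fraction) + hash_fraction / raw_speedup)


if __name__ == "__main__":
    # pandas case: hashing is at most ~1.5% of total time, raw gain 8-27x.
    for raw in (8, 27):
        gain = end_to_end_speedup(0.015, raw)
        print(f"hash share 1.5%, raw {raw}x -> end-to-end {gain:.3f}x")
```

With a hash share of 1.5%, even the 27× raw speedup works out to an end-to-end gain of roughly 1.01×, consistent with the "~1×" reported for pandas.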
Why this is still worth landing
- `_hash_bytes` call gets faster; the benefit is broadest on the many small-buffer paths (scalars, sequences, configs) even though it's invisible on pandas DataFrames.
- `_hash_bytes` and tag value types #1628 + Vectorize pandas and polars DataFrame hashing #1629 stand alone on md5 with no loss of correctness or vectorization.

Testing
Pinned-digest literals in `test_fingerprinting.py` are updated to the new xxh3_128 values. Relational tests (must-differ, must-match, cross-type collision) and the full caching suite pass unchanged.

Checklist