Move hashing to crc32

### Description

Back in #375, I added sorting by hash to pytest-randomly, using `hashlib.md5()` without much thought. I think it would be better to move to `zlib.crc32` for speed, as it's about 5x faster while still providing great distribution. We don't need a 128 bit hash, 32 bits will do, and we don't need any level of security (although MD5 is now broken).

<details><summary>Benchmarking script</summary>
<p>

```py
from __future__ import annotations

import time
from hashlib import md5
from zlib import crc32


def md5_hash(text: str) -> int:
    return int.from_bytes(md5(text.encode()).digest(), "big")


def crc32_hash(text: str) -> int:
    return crc32(text.encode())


def generate_test_ids(n: int) -> list[str]:
    return [f"test_{i}" for i in range(1, n + 1)]


def benchmark_speed(hash_func, test_ids: list[str], iterations: int = 100) -> float:
    """Benchmark hash function speed in nanoseconds per hash"""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        for test_id in test_ids:
            hash_func(test_id)
    end = time.perf_counter_ns()

    total_hashes = len(test_ids) * iterations
    return (end - start) / total_hashes


def analyze_distribution(hashes: list[int], name: str) -> dict:
    """Analyze statistical properties of hash distribution"""
    return {
        "name": name,
        "min": min(hashes),
        "max": max(hashes),
        "range": max(hashes) - min(hashes),
        "range_usage": (max(hashes) - min(hashes))
        / (2 ** (32 if name == "CRC32" else 128)),
        "unique_count": len(set(hashes)),
        "collision_rate": 1.0 - (len(set(hashes)) / len(hashes)),
    }


def main():
    print("Hash Function Comparison: MD5 vs CRC32")
    print("=" * 50)

    # Generate test data
    test_ids = generate_test_ids(100_000)

    # Speed benchmarks
    print("Speed Benchmarks (average per hash):")
    md5_speed = benchmark_speed(md5_hash, test_ids)
    crc32_speed = benchmark_speed(crc32_hash, test_ids)

    print(f"MD5:   {md5_speed:.1f} ns")
    print(f"CRC32: {crc32_speed:.1f} ns")
    print(f"CRC32 is {md5_speed / crc32_speed:.1f}x faster than MD5")
    print()

    # Generate hashes for analysis
    md5_hashes = [md5_hash(test_id) for test_id in test_ids]
    crc32_hashes = [crc32_hash(test_id) for test_id in test_ids]

    # Statistical analysis
    md5_stats = analyze_distribution(md5_hashes, "MD5")
    crc32_stats = analyze_distribution(crc32_hashes, "CRC32")

    print("Statistical Analysis:")
    print("-" * 30)

    for key in [
        "range",
        "range_usage",
        "unique_count",
        "collision_rate",
    ]:
        print(f"{key:20}: MD5={md5_stats[key]:>12.2f}, CRC32={crc32_stats[key]:>12.2f}")


if __name__ == "__main__":
    main()
```

</p>
</details> 

**Results:**

```
Hash Function Comparison: MD5 vs CRC32
==================================================
Speed Benchmarks (average per hash):
MD5:   434.5 ns
CRC32: 89.8 ns
CRC32 is 4.8x faster than MD5

Statistical Analysis:
------------------------------
range               : MD5=340276881851018210237546467185869717504.00, CRC32=4294733938.00
range_usage         : MD5=        1.00, CRC32=        1.00
unique_count        : MD5=   100000.00, CRC32=   100000.00
collision_rate      : MD5=        0.00, CRC32=        0.00
```

It seems good, both end up evenly distributing 100k test IDs with no collisions!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Move hashing to crc32 #686

Description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Move hashing to crc32 #686

Description

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions