Franco Corbelli edited this page Sep 11, 2024 · 3 revisions

zpaqfranz lets you choose (almost everywhere) among different algorithms for hashing/checksumming. They differ in speed and reliability, and some can use HW acceleration (if the CPU supports it).

Why so many choices?
Because it essentially depends on the read speed of the media and, possibly, on backward compatibility. With an NVMe drive capable of reading 2000MB/s, using SHA256 will cap throughput at about 300MB/s. But if you use a magnetic disk with a maximum speed of 200MB/s, you will NOT reduce the time by using a faster (but less reliable) algorithm.
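
The reasoning above is simple arithmetic: a read-then-hash pipeline runs at the speed of its slowest stage. A minimal sketch, using only the figures quoted in the paragraph (the function names are made up for illustration and are not part of zpaqfranz):

```python
def effective_speed(media_mb_s, hash_mb_s):
    """Throughput of a read-then-hash pipeline: limited by the slowest stage."""
    return min(media_mb_s, hash_mb_s)


def seconds_to_hash(size_gb, media_mb_s, hash_mb_s):
    """Estimated time to hash size_gb gigabytes (1 GB = 1000 MB here)."""
    return size_gb * 1000 / effective_speed(media_mb_s, hash_mb_s)


# NVMe at 2000 MB/s with SHA-256 at ~300 MB/s: the hash is the bottleneck.
print(effective_speed(2000, 300))   # 300
# Magnetic disk at 200 MB/s: the disk is the bottleneck,
# so a much faster hash does not shorten the run.
print(effective_speed(200, 300))    # 200
print(effective_speed(200, 5000))   # 200
```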

Important note: when using the new zpaqfranz functions, capable of generating one thread per folder to be processed (-all), the overall speed can be much higher than the maximum sustainable from the media
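
To illustrate the idea of one thread per folder, here is a rough sketch in Python. It is NOT how zpaqfranz implements -all; SHA-256 from the standard library is used only as a stand-in hash, and all function names are hypothetical.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, chunk_size=1 << 20):
    """Hash one file in 1 MiB chunks so memory use stays constant."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def hash_folder(folder):
    """Return {relative path: digest} for every file below folder."""
    result = {}
    for root, _dirs, files in os.walk(folder):
        for name in sorted(files):
            path = os.path.join(root, name)
            result[os.path.relpath(path, folder)] = hash_file(path)
    return result


def hash_folders(folders):
    """Process each folder in its own worker thread."""
    with ThreadPoolExecutor(max_workers=max(1, len(folders))) as pool:
        return dict(zip(folders, pool.map(hash_folder, folders)))


def demo():
    """Build three small folders in a temp dir and hash them in parallel."""
    with tempfile.TemporaryDirectory() as base:
        folders = []
        for i in range(3):
            folder = os.path.join(base, f"dir{i}")
            os.mkdir(folder)
            with open(os.path.join(folder, "data.bin"), "wb") as f:
                f.write(bytes([i]) * 4096)
            folders.append(folder)
        results = hash_folders(folders)
        return [results[folder]["data.bin"] for folder in folders]
```

Whether parallel hashing actually helps depends on the media: it tends to pay off on SSD/NVMe, much less on a single magnetic disk.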

Short version (as of mid 2024): I suggest -xxhash for detecting unwanted corruption: fast and small. If you are "paranoid", go cryptographic-level: sha3, sha256, blake3, whirlpool

Rough estimate: the speeds reported are merely indicative, measured on an AMD 7950X3D, Windows 64, v60.7f. Take them only as a rough indication of relative performance.

   WHIRLPOOL:   198.29 MB/s 
     SHA-256:   453.34 MB/s (software implementation)
       SHA-3:   467.22 MB/s 
         MD5:   833.36 MB/s 
       SHA-1:   972.29 MB/s (software implementation)
   HIGHWAY64:     1.50 GB/s (google's)
      BLAKE3:     4.02 GB/s (HW accelerated)
    XXHASH64:     6.27 GB/s (the default)
     CRC-32C:     6.89 GB/s (something like CRC-32)
        XXH3:     7.37 GB/s 
      CRC-32:     8.52 GB/s (done    42.45 GB)
      WYHASH:     8.54 GB/s (experimental, just for reference)
    NILSIMSA:     8.55 GB/s (not an hash)

     SHA-256:     1.96 GB/s (hardware implementation)
       SHA-1:     2.05 GB/s (hardware implementation)

-sha1

Wikipedia
Fair speed (~900MB/s), very reliable.
Collisions have been found, albeit in very special and limited cases.
On CPUs that support hardware acceleration for SHA (the SHA extensions, often called SHA-NI), the new versions of zpaqfranz automatically speed up to around 2GB/s. AMD CPUs, even those from older generations, have these instructions. The same goes for Intel mobile CPUs. However, only the latest generations of Intel desktop CPUs can benefit from them. This is not something that depends on me.

-xxhash

Home
The XXHASH-64 bit, zpaqfranz 52's default (because it is smaller than 128-bit)
Very fast (~5000MB/s), it is thought to be reliable. I preferred it over the 128-bit version because on 32-bit CPUs [or on systems such as ESXi, for example], it is much faster. In short, it's a compromise, not the best choice in every situation

-xxh3

Home
The XXH3-128 bit.
Very fast (~7000MB/s), it is thought to be reliable. It is usually the hash algorithm I prefer, and it is used internally by zpaqfranz when a data verification is requested

-crc32

Wikipedia
The ancient but ubiquitous CRC-32.
Very fast (~9000MB/s), reliable for detecting corruption, not so much for collisions. It has a special feature in zpaqfranz: by default, the CRC-32 of compressed files can be recalculated without actually decompressing them. In short, it's not a very robust algorithm, it's not hacker-proof. However, for 'real' data storage errors, it is still valid and widely used almost everywhere.
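
A small sketch of two CRC-32 properties, using Python's standard zlib module. It does not reproduce the zpaqfranz feature described above; it only shows that a CRC-32 can be computed incrementally, chunk by chunk, and that a single flipped bit changes the result. The helper name is made up for illustration.

```python
import zlib


def crc32_chunks(chunks):
    """Running CRC-32: feed the previous value back in for each new chunk."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
    return crc


data = b"123456789"

# Chunked and one-shot computation give the same value.
print(hex(zlib.crc32(data)))                     # 0xcbf43926
print(hex(crc32_chunks([b"1234", b"56789"])))    # 0xcbf43926

# Flip one bit in the first byte: the CRC-32 no longer matches.
corrupted = bytes([data[0] ^ 0x01]) + data[1:]
print(zlib.crc32(corrupted) != zlib.crc32(data))  # True
```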

-crc32c

Wikipedia
The "Castagnoli" version, with HW acceleration.
Very fast (~7000MB/s), reliable for corruption, not for collisions

-blake3

Wikipedia
CPU intensive (on Win 64 runs with HW acceleration), but very reliable.
On Intel CPUs it can be faster than SHA-256 (without HW).
Please note: current implementation does NOT use multithread. Maybe in the future...

-sha256

Wikipedia
CPU intensive (~290MB/s), but maybe the most reliable.
In Europe it constitutes legal proof.
On CPUs that support hardware acceleration for SHA (the SHA extensions, often called SHA-NI), the new versions of zpaqfranz automatically speed up to around 2GB/s. AMD CPUs, even those from older generations, have these instructions. The same goes for Intel mobile CPUs. However, only the latest generations of Intel desktop CPUs can benefit from them. This is not something that depends on me.

-sha3 (256 bit)

Wikipedia
The latest NIST standard, very different internally from SHA2-256.
Typically faster than software-only SHA-256 (450MB/s). Very, very strong. It is used in zpaqfranz for CPU testing (zpaqfranz b -all -n 999), where it very quickly brings consumption and thermal dissipation to the maximum (in just a few seconds). If your CPU can handle 1000 seconds at maximum power, you almost certainly won't have issues.

UNLESS it is one of the 13th or 14th generation Intel processors with the old, overly aggressive power algorithm (mid-2024). WARNING: this is an extreme test, use it with caution.

The default version runs for 5 seconds and should not cause problems, even on systems not designed for heavy and sustained workloads.

-whirlpool

Wikipedia
Very CPU intensive (~180MB/s), but very, very, very reliable.
512-bit (64 byte) output. NOT made by the NSA (if you do not like it :) Its main characteristic is that it is based on a completely different 'logic' compared to the more or less standard hash algorithms. It is something (vaguely) similar to an AES encryption algorithm. Therefore, being based on a completely different technology, it is a good choice for having TWO hashes of the same data for forensic preservation purposes.
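
The idea of keeping TWO hashes of the same data can be sketched as follows. Whirlpool is not guaranteed to be available in Python's hashlib, so this example pairs SHA-256 with SHA3-256 as stand-ins (two algorithms with different internal designs); the function name is hypothetical and this is not zpaqfranz code.

```python
import hashlib
import io


def dual_digest(stream, chunk_size=1 << 20):
    """Read the data once and feed every chunk to two unrelated hashes."""
    first = hashlib.sha256()
    second = hashlib.sha3_256()
    while chunk := stream.read(chunk_size):
        first.update(chunk)
        second.update(chunk)
    return first.hexdigest(), second.hexdigest()


sha2_hex, sha3_hex = dual_digest(io.BytesIO(b"abc"))
print(sha2_hex)
print(sha3_hex)
```

Reading the data only once matters when the media is the bottleneck: the second hash costs CPU time, not a second pass over the disk.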

-md5

Wikipedia
Today MD5 is broken as a cryptographic hash function, but it still works great as a checksum to verify unintentional corruption. Very common, widespread usage (and that's why it is here, ~800MB/s)

-highway64 -highway128 -highway256

They are various versions of an algorithm written by Google collaborators. They are used in zpaqfranz to test certain FRANZOPACKETs. Essentially for the purpose of studying and developing the program, rather than being useful for the user. There are 3 different versions: 64, 128, and 256 bits. Only the 64-bit version is compatible with BIG ENDIAN, so be careful when using it on 'unusual' CPUs (PowerPC, etc.)
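
Why endianness matters for a hash implementation: many fast hashes read the input as 64-bit words, and the same 8 bytes give a different number depending on the byte order. A generic illustration in Python (it says nothing about the internals of the highway hashes):

```python
import sys

raw = bytes([1, 2, 3, 4, 5, 6, 7, 8])

# The same 8 bytes interpreted as one 64-bit word in both byte orders.
little = int.from_bytes(raw, "little")
big = int.from_bytes(raw, "big")

print(hex(little))    # 0x807060504030201
print(hex(big))       # 0x102030405060708
print(sys.byteorder)  # byte order of the machine running this script
```

Code that assumes one byte order when loading words would therefore produce different hash values on a CPU with the other order, unless it swaps bytes explicitly.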

NILSIMSA

Wikipedia
NILSIMSA is the opposite of a hash (!)
It is a very rough measure of SIMILARITY and not of DIFFERENCE between files. It is used to identify emails that differ by only a few details (the .eml files that contain them). The goal is to determine if, in large quantities of emails (typically hundreds of thousands), there are many duplicates. This cannot be done with a 'true' hash, where even a single bit of difference results in a completely different hash value.
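
The contrast can be shown with a small sketch. It does NOT implement Nilsimsa: difflib's SequenceMatcher from the Python standard library is used as a stand-in similarity measure, and the two sample emails are invented. The point is only that two nearly identical texts get a high similarity score, while their SHA-256 digests differ in a large share of their 256 bits.

```python
import hashlib
from difflib import SequenceMatcher


def differing_bits(a, b):
    """Number of bits (out of 256) that differ between two SHA-256 digests."""
    da = int.from_bytes(hashlib.sha256(a.encode()).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b.encode()).digest(), "big")
    return bin(da ^ db).count("1")


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; this is NOT Nilsimsa."""
    return SequenceMatcher(None, a, b).ratio()


# Two invented emails that differ by a single character.
mail_a = "Subject: invoice\n\nPlease find the invoice attached. Regards, Anna"
mail_b = "Subject: invoice\n\nPlease find the invoice attached. Regards, Anne"

print(similarity(mail_a, mail_b))       # close to 1.0: almost the same text
print(differing_bits(mail_a, mail_b))   # a large share of the 256 bits differ
```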

WYHASH

It is a very weak hash algorithm, with no proof of its actual quality, and a whole series of serious issues in its use. In short, it's essentially for use on really underpowered CPUs. Don't use it for anything more than just for fun
