
Voodoo zpaq level 0


zpaqfranz is a compression program, a fork of zpaq 7.15, that uses deduplication techniques to reduce the size of stored data.

The actual operation is considerably more complex than described here: there is a PDF document in which the author of zpaq (Matt Mahoney) describes it accurately. This is a mere introduction; I may write further levels of detail for those who want to delve deeper.

Level 0

Deduplication is the process of identifying and eliminating duplicate data, which can significantly reduce storage requirements.

zpaqfranz achieves deduplication through a combination of hashing and compression (plus encryption, if selected).
Here's how it works:

  • Hashing: zpaqfranz first divides the input data into fragments (about 64 KB on average) whose boundaries are chosen by a rolling hash function. Each fragment is then hashed with a cryptographic hash function (SHA-1), and the hash value serves as the fragment's unique identifier: if a match is found, the fragment is not stored again. Deduplication requires about 1 MB of memory per GB of deduplicated but uncompressed archive data to update an archive, and about 0.5 MB per GB to list or extract. A simplified chunking sketch appears after this list.

  • Indexing: zpaqfranz maintains an index of all the hash values and the fragments already stored in the archive. This index lets it quickly identify duplicate fragments: same hash => same data. See the index sketch after this list.

  • Compressing: when zpaqfranz encounters a fragment that has already been seen, instead of storing it again it stores only a reference to the previously stored fragment, which significantly reduces the storage required for duplicate data. Otherwise it creates a new compressed "chunk" (with one of various algorithms). Writing to the archive is strictly sequential: only a single seek, at the start, is expected. The write speed is therefore high even on rotating hard disk drives (if the archive is not fragmented). This decision is also shown in the index sketch after this list.

  • Reconstruction: when decompressing, zpaqfranz rebuilds the data by retrieving each block and writing it back at its original position in the output file. In this phase the writing (unless the -stdout switch is used) occurs in random order, i.e. not sequentially: a file grows gradually as reconstructed blocks are added, and when blocks do not arrive in sequence the intermediate gaps are filled with zeros that are later replaced with data. The recovery of large files (e.g. virtual machine disks) on magnetic drives (rotating hard disks) can therefore be slowed down by rotational latency; this does not happen on SSDs, or solid-state drives in general. See the reconstruction sketch after this list.
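
A minimal sketch of the "Hashing" step above: content-defined chunking driven by a rolling-style hash. The hash function, boundary mask and size limits here are illustrative assumptions, not zpaq's actual parameters; they only show how fragment boundaries can depend on the data itself, so that identical runs of bytes produce identical fragments.

```cpp
// Sketch of content-defined chunking. All constants and the hash
// function are ASSUMPTIONS for illustration, not zpaq's real ones.
#include <cstdint>
#include <cstddef>
#include <vector>

// Split "data" into fragments whose boundaries depend on content:
// a boundary is declared whenever the low bits of the running hash
// are all zero, which gives fragments of roughly 64 KB on average.
std::vector<std::vector<unsigned char>> split_fragments(
        const std::vector<unsigned char>& data) {
    const uint32_t BOUNDARY_MASK = 0xFFFF;   // ~64 KB average (assumed)
    const size_t   MAX_FRAGMENT  = 1 << 20;  // hard upper bound (assumed)

    std::vector<std::vector<unsigned char>> fragments;
    std::vector<unsigned char> current;
    uint32_t h = 0;

    for (unsigned char byte : data) {
        current.push_back(byte);
        h = h * 314159265u + byte + 1;       // toy rolling-style hash
        if ((h & BOUNDARY_MASK) == 0 || current.size() >= MAX_FRAGMENT) {
            fragments.push_back(current);
            current.clear();
            h = 0;
        }
    }
    if (!current.empty()) fragments.push_back(current);
    return fragments;
}
```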
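
And a sketch of the "Indexing" and "Compressing" decision: a table mapping fragment hashes to already-stored fragments. zpaqfranz uses SHA-1 for the fragment hash; this sketch substitutes std::hash purely to stay self-contained, and the Archive structure is an invented simplification, not the real archive format.

```cpp
// Sketch of the dedup index: fragment hash -> fragment id.
// std::hash stands in for SHA-1 (an assumption); the data layout
// is invented for illustration, not zpaqfranz's real format.
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct Archive {
    std::unordered_map<std::string, size_t> index;  // hash -> fragment id
    std::vector<std::string> stored_fragments;      // unique fragments (would be compressed)
    std::vector<size_t> file_as_fragment_ids;       // the file = sequence of ids
};

void add_fragment(Archive& a, const std::string& fragment) {
    // Stand-in for SHA-1 of the fragment bytes.
    std::string key = std::to_string(std::hash<std::string>{}(fragment));

    auto it = a.index.find(key);
    if (it != a.index.end()) {
        // Same hash => same data: store only a reference to the old fragment.
        a.file_as_fragment_ids.push_back(it->second);
    } else {
        // New fragment: append it sequentially and remember its id.
        size_t id = a.stored_fragments.size();
        a.stored_fragments.push_back(fragment);     // real code would compress here
        a.index.emplace(key, id);
        a.file_as_fragment_ids.push_back(id);
    }
}
```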
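
Finally, a sketch of the "Reconstruction" behaviour: blocks are written at their original offsets as they are recovered, so writes may be out of order and the gaps stay zero-filled until their data arrives. The file name and block layout are made up for the example.

```cpp
// Sketch of out-of-order reconstruction writes (illustrative only).
#include <cstdio>
#include <string>
#include <vector>

struct Block {
    long        offset;  // original position in the output file
    std::string data;    // reconstructed bytes
};

void write_blocks(const char* path, const std::vector<Block>& blocks) {
    FILE* f = std::fopen(path, "wb");
    if (!f) return;
    for (const Block& b : blocks) {
        // Random-access write: on a rotating disk each seek can cost
        // rotational latency; on an SSD it is essentially free.
        std::fseek(f, b.offset, SEEK_SET);
        std::fwrite(b.data.data(), 1, b.data.size(), f);
    }
    std::fclose(f);
}

int main() {
    // Blocks arrive out of order; the hole between them stays zero-filled
    // until (or unless) a later block covers it.
    std::vector<Block> blocks = { {131072, "second"}, {0, "first"} };
    write_blocks("restored.bin", blocks);
    return 0;
}
```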

Overall, zpaq's deduplication process helps to eliminate redundant data and reduce storage requirements. By identifying and storing only unique blocks, zpaqfranz can achieve high levels of compression and efficient storage utilization.
