Content-dependent chunker #183
Description
There is a need for a content-dependent chunker.
I therefore welcome objective, non-opinionated discussion, proofs of concept, benchmarks, and other input on this subject.
The open questions are which technique, which polynomial (if applicable), and which parameters to use for such a chunker, so that it is performant across the board and effectively helps everyone.
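To make those questions concrete, here is a minimal sketch of a content-defined chunker built on a plain Rabin–Karp style polynomial rolling hash (Go is used only for illustration; this is not tied to any particular implementation). The window size, boundary mask, chunk-size bounds, and polynomial base are placeholder values I made up; they are exactly the parameters that would need benchmarking.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"math/rand"
)

// Placeholder parameters; finding good values is precisely what needs
// benchmarking. These are illustrative, not recommendations.
const (
	windowSize = 48            // bytes covered by the rolling hash
	boundary   = (1 << 13) - 1 // hash&boundary == 0 => ~8 KiB average chunks
	minChunk   = 2 * 1024      // never cut below this size
	maxChunk   = 64 * 1024     // always cut at this size
	base       = 257           // polynomial base of the rolling hash
)

// chunk streams r and calls emit for every content-defined chunk.
// A boundary is declared whenever the rolling hash of the last
// windowSize bytes has its low bits all zero, subject to min/max size.
func chunk(r io.Reader, emit func([]byte)) error {
	br := bufio.NewReader(r)

	// pow = base^(windowSize-1), used to remove the byte leaving the window.
	pow := uint64(1)
	for i := 0; i < windowSize-1; i++ {
		pow *= base
	}

	var (
		buf    []byte
		window [windowSize]byte
		hash   uint64
		n      int // bytes consumed in the current chunk
	)
	reset := func() {
		buf, hash, n = buf[:0], 0, 0
		window = [windowSize]byte{}
	}

	for {
		b, err := br.ReadByte()
		if err == io.EOF {
			break
		} else if err != nil {
			return err
		}

		// Rolling update: drop the byte that falls out of the window, add b.
		idx := n % windowSize
		hash = (hash-uint64(window[idx])*pow)*base + uint64(b)
		window[idx] = b
		buf = append(buf, b)
		n++

		if len(buf) >= minChunk && (hash&boundary == 0 || len(buf) >= maxChunk) {
			emit(append([]byte(nil), buf...)) // copy, buf is reused
			reset()
		}
	}
	if len(buf) > 0 {
		emit(append([]byte(nil), buf...))
	}
	return nil
}

func main() {
	// Deterministic pseudo-random input so the example is reproducible.
	data := make([]byte, 256*1024)
	rand.New(rand.NewSource(1)).Read(data)

	i := 0
	_ = chunk(bytes.NewReader(data), func(c []byte) {
		fmt.Printf("chunk %2d: %5d bytes\n", i, len(c))
		i++
	})
}
```

Note that this sketch resets the hash window at every chunk boundary for simplicity; whether to do that, which polynomial/hash family to use (Rabin fingerprints, buzhash, etc.), and how to pick the mask and size bounds are the design decisions up for discussion.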
There is also the question of which files/data should go through this chunker at all. Compressed data would likely not benefit from content-dependent chunking, unless the file is an archive with non-solid compression or similar. Should this be decided automatically by some heuristic, e.g. use such a chunker for text files that are not minified JS/CSS and fall back to the regular chunker otherwise? By file headers? (One possible shape of such a heuristic is sketched below.)
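As one illustration of what such a heuristic could look like, here is a sketch that combines magic-byte detection with a Shannon-entropy estimate over a file's leading bytes. The signature list and the 7.5 bits/byte threshold are assumptions for the sake of the example, not tuned values.

```go
package main

import (
	"bytes"
	"fmt"
	"math"
)

// Magic bytes of formats that are usually already compressed and thus
// unlikely to benefit from content-defined chunking. Illustrative list.
var compressedMagic = [][]byte{
	{0x1f, 0x8b},               // gzip
	{0x28, 0xb5, 0x2f, 0xfd},   // zstd
	{0xfd, '7', 'z', 'X', 'Z'}, // xz
	{'P', 'K', 0x03, 0x04},     // zip (non-solid/stored archives may still benefit)
	{0xff, 0xd8, 0xff},         // jpeg
}

// shannonEntropy returns the entropy of the sample in bits per byte (0..8).
func shannonEntropy(sample []byte) float64 {
	if len(sample) == 0 {
		return 0
	}
	var counts [256]int
	for _, b := range sample {
		counts[b]++
	}
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / float64(len(sample))
		h -= p * math.Log2(p)
	}
	return h
}

// useContentDefined decides, from a file's leading bytes, whether the
// content-defined chunker is worth running instead of the regular one.
func useContentDefined(head []byte) bool {
	for _, magic := range compressedMagic {
		if bytes.HasPrefix(head, magic) {
			return false // recognizably compressed: regular chunker is cheaper
		}
	}
	// Near-random data (~8 bits/byte) is likely compressed or encrypted
	// even without a recognizable header; 7.5 is an assumed threshold.
	return shannonEntropy(head) < 7.5
}

func main() {
	fmt.Println(useContentDefined([]byte{0x1f, 0x8b, 0x08, 0x00}))           // false: gzip header
	fmt.Println(useContentDefined([]byte("body { margin: 0; padding: 0 }"))) // true: low-entropy text
}
```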
This can have a great impact on (distributed) archival of knowledge (think Archive.org, except with dedup, better compression, and easy distribution). That also raises the question of whether chunks should be stored compressed, but that is partially side-tracking this issue.
One reference implementation with a focus on storage savings (faster convergence of chunk boundaries):
https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape/chunkify.h
Other references:
https://en.wikipedia.org/wiki/MinHash
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
https://en.wikipedia.org/wiki/Rolling_hash
https://moinakg.github.io/pcompress/