This repository was archived by the owner on Feb 8, 2023. It is now read-only.

Content-dependent chunker #183

Open
@donothesitate

Description

There is a need for a content-dependent chunker.

Therefore I welcome objective, non-opinionated discussion, proofs of concept, benchmarks, and other material around this subject.

The open questions are which technique, which polynomial (if applicable), and which parameters to use for a given chunker, so that it is performant across the board and effectively helps everyone.
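To ground the discussion, here is a minimal sketch of a Rabin–Karp-style rolling-hash chunker in Python. The window size, base, boundary mask, and chunk limits below are illustrative placeholders, not benchmarked parameters; they are exactly the knobs this issue asks about.

```python
WINDOW = 48           # bytes in the rolling window (illustrative)
BASE = 257            # polynomial base (hypothetical choice)
MODULUS = 1 << 61     # keep the hash bounded
MASK = (1 << 13) - 1  # boundary when hash & MASK == MASK (~8 KiB average)
MIN_CHUNK = 2048      # suppress pathologically small chunks
MAX_CHUNK = 65536     # force a boundary eventually

def chunk_boundaries(data: bytes):
    """Yield end offsets of content-defined chunks in *data*.

    A boundary is declared wherever the rolling hash of the last WINDOW
    bytes matches MASK, so boundaries depend on local content and
    re-synchronize after insertions/deletions elsewhere in the stream.
    """
    pow_out = pow(BASE, WINDOW - 1, MODULUS)  # weight of the outgoing byte
    h = 0
    start = 0
    for i, b in enumerate(data):
        # Slide the window: drop the byte leaving it, then add the new one.
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * pow_out) % MODULUS
        h = (h * BASE + b) % MODULUS
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == MASK) or length >= MAX_CHUNK:
            yield i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield len(data)  # final partial chunk
```

The average chunk size is governed by the number of bits in MASK (2^13 bytes here), while MIN_CHUNK/MAX_CHUNK bound the variance; how to pick these for the best dedup-vs-overhead trade-off is precisely what benchmarks would need to settle.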

There is also the question of which files/data this chunker should be applied to, given that compressed data would likely not benefit at all from content-dependent chunking — unless the file is an archive with non-solid compression, or similar. Should this be decided automatically by some heuristics, e.g. using such a chunker for text files that are not minified versions of js/css, etc., and a regular chunker otherwise? By file headers?
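One possible shape for such a heuristic, sketched in Python: check file headers for known compressed-format magic bytes, and otherwise estimate byte entropy of a sample. The magic-byte list and the 7.5 bits/byte threshold are illustrative assumptions for discussion, not vetted defaults.

```python
import math
from collections import Counter

# Magic bytes of some common compressed formats (illustrative, not
# exhaustive). Non-solid archives such as zip are deliberately left out,
# since per the discussion above they might still benefit from CDC.
COMPRESSED_MAGICS = [
    b"\x1f\x8b",          # gzip
    b"\x42\x5a\x68",      # bzip2
    b"\xfd7zXZ\x00",      # xz
    b"\x28\xb5\x2f\xfd",  # zstd
]

def shannon_entropy(sample: bytes) -> float:
    """Bits per byte of a sample; near 8.0 suggests compressed/encrypted."""
    if not sample:
        return 0.0
    n = len(sample)
    counts = Counter(sample)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def should_use_cdc(header: bytes, sample: bytes, threshold: float = 7.5) -> bool:
    """Hypothetical heuristic: skip CDC for data that looks compressed."""
    if any(header.startswith(m) for m in COMPRESSED_MAGICS):
        return False
    return shannon_entropy(sample) < threshold
```

Whether header sniffing, entropy, or a combination is the right selector — and whether the cost of misclassification outweighs the savings — is something benchmarks on a real corpus would have to answer.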

This could have a great impact on (distributed) archival of knowledge (think Archive.org, except with dedup, better compression, and easy distribution). That also raises the question of whether chunks should be stored compressed, but that is partially side-tracking this issue.

One reference implementation with a focus on storage savings (faster convergence of chunk boundaries):
https://github.com/Tarsnap/tarsnap/blob/master/tar/multitape/chunkify.h

Other references:
https://en.wikipedia.org/wiki/MinHash
https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
https://en.wikipedia.org/wiki/Rolling_hash
https://moinakg.github.io/pcompress/
