Collection of findings #18
The rabin library we're using has terrible performance but that's unrelated to the algorithm or SIMD: it allocates all over the place. We've added a chunker that uses buzhash and is almost as fast as the default chunker. Given the overhead of generating CIDs, any overhead from chunking using buzhash is now negligible. Try using the latest go-ipfs master and passing
Yes?
Yes?
I'm not sure what you're asking.
Yes. The goal is to have a good general-purpose chunker and a bunch of per-file-type chunkers (tar, media containers, etc.). See:
Is this the best hash algorithm among the basket of alternative chunking algorithms? If so, where is the assessment of existing algorithms and the library benchmarks? If not, is it possible to start a benchmark?
Phasing out both the old method of static chunk sizes and the old chunkers that allow different minimum and maximum chunk sizes based on parameters would help, since otherwise people can chunk the same file in multiple ways (assuming there is one hashing algorithm but multiple chunking algorithms). So IPLD will handle different file types? Okay. In that case we need to cover as many "common" file formats as possible, which might pose a few issues though: file obfuscation could give a single file multiple chunked forms depending on how people interpret the file. Pretty much having fun, nice to know.
Almost certainly not. It was just a reasonably simple algorithm to get the job done. @ribasushi will be working on better chunking algorithms.
We'll eventually be changing the default chunking algorithm but we'll always provide options for users that need them.
Eventually, yes. However, we have a ways to go before we're even going to consider tackling this.
Ask him to skim through all the literature if possible. Better yet, surf GitHub repos for inspiration.
Standardization is my main worry.
@DonaldTsang this is definitely the plan, worry not. Check back at ipfs/specs#227 for various PRs / changes to the issue text to follow progress on scoping the work in the next week or so.
Does anyone know what the impact/cost of different chunk sizes on IPFS is? Smaller chunks give much better de-duplication, but at the cost of adding far more blocks. At some point the de-duplication benefits will be outweighed by the per-block overheads. In the opposite direction, very large blocks also carry overheads from having to handle/transfer them. Regardless of which chunking algorithm we use, we need to know this to tune the chunkers. And every time the chunkers are tuned to different values, they break de-duplication against older chunked data, so it would be good to get these right sooner rather than later.

I believe there is a current max block size of 2MB, and the default chunker uses 256KB. This suggests that 256KB is some kind of sweet spot for block sizes, assuming a fixed-size chunker. For a variable chunker I would expect the sweet-spot average to be smaller than that, to get extra de-duplication benefits. Currently the buzhash chunker is hard-coded to a min/avg/max chunk size of 128KB/256KB/512KB. I suspect 64KB/128KB/512KB or 64KB/192KB/512KB might be better. However, it really does depend on how the rest of IPFS scales with block size. In some ways an upper block size of 2MB is quite small... Google's original GFS cluster filesystem used 64MB as the chunk size.
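To make the trade-off concrete, here is a back-of-the-envelope sketch (simple arithmetic, not a measurement of IPFS itself) of how the leaf-block count for a single 1 GiB file scales with the average chunk size; every halving of the average chunk size doubles the number of blocks that must be hashed, announced and fetched, which is what any extra de-duplication has to pay for:

```go
package main

import "fmt"

func main() {
	const fileSize = 1 << 30 // a single 1 GiB file
	for _, avg := range []int{64 << 10, 128 << 10, 192 << 10, 256 << 10} {
		blocks := (fileSize + avg - 1) / avg // leaf blocks at this average chunk size
		fmt.Printf("avg chunk %3d KiB -> ~%5d leaf blocks\n", avg>>10, blocks)
	}
}
```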
@dbaarda very soon I will be publishing a call to the community to play with exactly this type of tuning. You happen to ask this question exactly at the end of a journey that started back in February to build a platform for evaluating this very question. The tooling for this is nearly ready (I've been saying this for a while, but this time it's for real 😸). You could play with it already, but the repository will be force-pushed a couple of times this week still, you've been warned. Also there are a number of holes in the help text, so check back in a day; plus there is a much more elaborate write-up coming very soon on what is and isn't useful to explore. You can follow the instructions here: https://github.com/ipfs-shipyard/DAGger#readme
Wow, that's an impressive amount of effort to test chunkers. I probably wouldn't have gone that far :-) I suspect that the chunking algorithms themselves won't make much difference and will mostly all perform pretty similarly, so you probably just want to choose the simplest/smallest/fastest and then standardize on it. Too many different chunkers will break de-duplication more than a single standard chunker that's only average.

Far more important is the min/avg/max chunk size they produce. You only add and chunk files once, but then they get fetched and read repeatedly. The chunk size affects the amount of de-duplication and impacts the performance and network overheads every time they are read. IMHO it's far more important to find the sweet-spot chunk size compromise between de-duplication and performance than to pick the best chunker.

Looking at your example stream-dagger runs above I have a few questions:
Having just said I don't think the actual chunker is going to matter much, I was about to contribute a polynomial rolling hash chunker (sometimes AKA RabinKarp, but NOT the same as a Rabin fingerprint) :-) It's very similar to buzhash AKA CyclicPoly but a little simpler because it doesn't need a byte mapping table, and in my testing for librsync it gave a much better hash. I'm unconvinced that it will make much difference performance-wise for chunking, particularly with the tiny 32-byte rolling window (librsync uses it over whole 1K+ blocks), but it has simplicity on its side. I was going to submit this as a pull request against ipfs/go-ipfs-chunker, but I see in your new ipfs-shipyard/DAGger you've changed the API. If I was going to submit this, where would you like me to submit it?
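For reference, a minimal sketch of what such a polynomial rolling hash chunker looks like; the multiplier, the cut condition on the top 17 bits (roughly a 128 KiB average), and the absence of min/max size enforcement are illustrative choices, not the parameters of any existing chunker:

```go
package main

import (
	"fmt"
	"math/rand"
)

const (
	window     = 32         // rolling window size in bytes
	mult       = 0x08104225 // illustrative odd multiplier
	targetBits = 17         // cut probability 2^-17 => ~128 KiB average chunk
)

// cutPoints returns the offsets where the rolling hash of the last
// `window` bytes hits the target; a real chunker would also enforce
// min/max chunk sizes.
func cutPoints(data []byte) []int {
	var multN uint32 = 1 // mult^window, used to drop the outgoing byte
	for i := 0; i < window; i++ {
		multN *= mult
	}

	var h uint32
	var cuts []int
	for i, b := range data {
		h = h*mult + uint32(b) // bring the new byte in
		if i >= window {
			h -= multN * uint32(data[i-window]) // slide the old byte out
		}
		if i >= window && h>>(32-targetBits) == 0 { // judge on the high bits
			cuts = append(cuts, i+1)
		}
	}
	return cuts
}

func main() {
	data := make([]byte, 1<<20) // 1 MiB of pseudo-random bytes
	rand.Read(data)
	fmt.Println("cut points in 1 MiB:", len(cutPoints(data)))
}
```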
It isn't though. First off this doubles as an exploration of how to arrange stream processing to "abstract away" the overhead of chunking. If you do a few tests with larger entities you will notice how things are notably quicker. Additionally it is becoming increasingly clear that for the "thin waist IPLD" concept from way way back to really work, there needs to be a "sufficiently universal" chunking algorithm that ~80% of implementations can simply default to. Global emergent convergence is a cool thing to at least try to attain.
You are correct on algorithms standalone not providing much value. The key here is that combinations of them provide astonishing results. You already intuitively figured out what the padfinder does in the context of tar files, now stretch this further. This is another reason the framework exists: it had to be validated that recursive chunking can work over streams in limited memory AND that it can work with essentially no overhead ( each subchunk happens in a separate thread ) AND that the theoretical results do indeed show up in the aggregation tables.
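As an illustration of the "each subchunk happens in a separate thread" point, here is an in-memory toy sketch (not DAGger's streaming implementation): a first-pass chunker produces outer spans, each span is sub-chunked concurrently, and the results are collected back in stream order. Both chunkers are stand-in fixed-size cutters.

```go
package main

import (
	"fmt"
	"sync"
)

type span struct{ off, size int }

// cutFixed stands in for a real chunker (padfinder, buzhash, ...): it
// simply cuts data into fixed-size spans.
func cutFixed(data []byte, size int) []span {
	var out []span
	for off := 0; off < len(data); off += size {
		n := size
		if off+n > len(data) {
			n = len(data) - off
		}
		out = append(out, span{off, n})
	}
	return out
}

func main() {
	data := make([]byte, 1<<20)
	outer := cutFixed(data, 256<<10) // first-pass chunker

	// each outer span is sub-chunked in its own goroutine; results are
	// written to a per-span slot so stream order is preserved
	results := make([][]span, len(outer))
	var wg sync.WaitGroup
	for i, s := range outer {
		wg.Add(1)
		go func(i int, s span) {
			defer wg.Done()
			sub := cutFixed(data[s.off:s.off+s.size], 64<<10)
			for j := range sub {
				sub[j].off += s.off // rebase sub-chunk offsets to the stream
			}
			results[i] = sub
		}(i, s)
	}
	wg.Wait()

	total := 0
	for _, r := range results {
		total += len(r)
	}
	fmt.Println("outer spans:", len(outer), "sub-chunks:", total)
}
```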
Bingo, but add a sufficiently advanced chunk-stack to that shopping list ;)
Yes, it's the number of nodes you need to traverse to reassemble the stream. I will change the text to read
Not exactly, but it gives me an idea of how to rephrase the option names. TLDR: the first chunker's max-es deal with "pad runs"; anything between those is simply passed on to the next chunker, whatever it may be: https://github.com/ipfs-shipyard/DAGger/blob/824420aeedbc0afbb110ebe179d2532d352f6ebf/internal/dagger/ingest.go#L569-L602
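In other words, the chain looks roughly like the following sketch (illustrative only, not DAGger's actual code): a first-stage pad finder only marks runs of identical padding bytes, and everything between those runs is handed to whatever chunker comes next, with a trivial fixed-size cutter standing in for buzhash here.

```go
package main

import "fmt"

type chunk struct{ off, size int }

// padRuns finds regions consisting of at least minRun repeats of pad.
func padRuns(data []byte, pad byte, minRun int) []chunk {
	var runs []chunk
	start := -1
	for i := 0; i <= len(data); i++ {
		if i < len(data) && data[i] == pad {
			if start < 0 {
				start = i
			}
			continue
		}
		if start >= 0 && i-start >= minRun {
			runs = append(runs, chunk{start, i - start})
		}
		start = -1
	}
	return runs
}

// nextChunker stands in for the second chunker in the chain (buzhash,
// polyhash, ...); here it just cuts at a fixed maximum size.
func nextChunker(off, size int, out *[]chunk) {
	const max = 256 << 10
	for size > 0 {
		n := size
		if n > max {
			n = max
		}
		*out = append(*out, chunk{off, n})
		off, size = off+n, size-n
	}
}

func chainChunk(data []byte) []chunk {
	var out []chunk
	pos := 0
	for _, r := range padRuns(data, 0x00, 512) {
		nextChunker(pos, r.off-pos, &out) // region between pad runs
		out = append(out, r)              // the pad run itself becomes one chunk
		pos = r.off + r.size
	}
	nextChunker(pos, len(data)-pos, &out) // trailing region after the last run
	return out
}

func main() {
	data := make([]byte, 64<<10)
	for i := range data {
		data[i] = byte(i%251) + 1 // non-zero payload
	}
	for i := 20 << 10; i < 24<<10; i++ {
		data[i] = 0 // a 4 KiB zero-padding run, like tar inter-file padding
	}
	fmt.Println("chunks:", len(chainChunk(data)))
}
```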
That's part of this exploration. It's just step+2 (currently in the process of wrapping up the tool so folks like yourself can play with it, and then adding the ingestion framework to go-ipfs to make it work faster on "re-hashing" ops, provided the hypothesized numbers check out in a week or so)
You are the exact customer this project has in mind ;) Unfortunately you are still just a wee bit too early: check back Monday-ish when I will have the repo "open for business". But yes - the chunker interface is what we will end up using in the end, and it is pretty stable at this point. Specifically https://github.com/ipfs-shipyard/DAGger/tree/master/internal/dagger/chunker/fixedsize is all you need to implement and then hook it up here: https://github.com/ipfs-shipyard/DAGger/blob/master/internal/dagger/core.go#L33-L39 Thanks for the questions!
But nearly everything between those "pad runs" is 1.8KB on average, which is about the average file size in the Linux kernel, and with those args buzhash will not break anything less than 128K into smaller chunks. So there is nothing left big enough after the pad-chunking for buzhash to chunk any smaller.

Note that you can also make buzhash (or polyhash) chunk on zero runs without needing a pad-chunker. You just need to set buzhash_hash-table=GoIPFSv0_state-target=0_state-mask-bits=17_min-size=512_max-size=524288 which gives a chunk min/avg/max of 512/128K/512K with the "added bonus" that it will chunk on any run of 32 zeros at least 512 bytes apart, which means it will chunk at (nearly) every file boundary in a tar file.
@dbaarda sorry for the late reply, had to take a pause from all of this.
And this is key: remember, you are not storing "tars". You are first and foremost storing individual files which happen to be organized in higher-order structures ( tar files ). Network-wide convergence will take shape naturally IFF we orient towards "nameable units", as opposed to "just data".
This is why this tool was built - numbers beat a conjecture ;) Original non-tuned padfinder run (same as earlier):
No worries... I'd rather you actually do the work than talk about it anyway :-)
See my summary and thoughts related to that here: https://discuss.ipfs.io/t/draft-common-bytes-standard-for-data-deduplication/6813/10?u=dbaarda TLDR: if you care about identifying/deduping the contents of container files like tar, perhaps the best solution is to make a file-type-aware layer on top that extracts/transcodes/re-packs the contents... but there are reasons why you might not want to do that.
Well... as you pointed out earlier, it's actually more like 500,000 files packed into 3 container files :-) That 7% size saving was at the cost of nearly 50% more logical nodes and unique DAG-PB nodes. If IPFS operations scale O(N^2) on the number of logical nodes, this is not a win. It did also reduce the number of unique leaf nodes by about 7%, though. This is why we need to know the impact of block size and block numbers on IPFS to assess the value of these.
And this 2% saving was at the cost of increasing the logical nodes by nearly 130%.
I worry that too much complexity in the chunkers is going to evolve into a hand-tuned expert system trying to get the best result for each and every different file type. In my experience these evolve to a point where people become too scared to change them, since every attempt to produce better results for one corner case ends up breaking several unrelated others. The end result is barely better than a simple generic solution, and usually some degenerate corner cases are left where it's worse. It's also waaay more complicated.

Having said that, you've sold me on the idea of a zero-padding chunker together with some kind of buzhash/polyhash chunker. There are lots of cases where zero-padding is used and would make good chunking points, like sparse files, filesystem snapshots, etc. For file types without zero-padding it will not make any difference, and it's very low overhead to include it.

Are you ready for me to send a pull request for a polyhash chunker? What should I send it against?
FTR, I've done some testing and analysis of chunker algorithms after reading the FastCDC paper here: https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst The results were a bit surprising to me. TLDR:
[...]
Note that buzhash using a 32-byte sliding window will ALWAYS return either 0 or 0xffffffff for a 32-byte run of any character, depending on whether the hash-table mapping for that character has an even or odd number of 1s. This is one of the buzhash corner cases that makes it weaker than e.g. Rabin fingerprints or PolyHash (AKA RabinKarp). The hash-table mapping used by IPFS seems to aim for an even distribution of 1s and 0s, so most (all?) mapping values have an even number of 1s and will give a zero result for a 32-byte run of most (all?) byte values. This corner case can be avoided by using a sliding window size != 32. Alternatively you can take advantage of it to always or never have break points after runs of characters.

I have also done some comparisons of rollsums, including using 32-byte windows, here: https://github.com/dbaarda/rollsum-tests/blob/master/RESULTS.rst
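A quick way to see why: with a 32-byte window, the buzhash of a run of one repeated byte is the XOR of all 32 rotations of that byte's table entry, so every output bit collapses to the parity of the entry's popcount. A small demonstration (the table entries below are arbitrary, not values from the go-ipfs table):

```go
package main

import (
	"fmt"
	"math/bits"
)

// hashOfRun computes the 32-bit buzhash of 32 identical bytes whose
// table entry is v: rotl(v,31) ^ rotl(v,30) ^ ... ^ rotl(v,0).
// Every rotation appears exactly once, so each output bit is the XOR
// of all bits of v, i.e. the parity of popcount(v).
func hashOfRun(v uint32) uint32 {
	var h uint32
	for i := 0; i < 32; i++ {
		h ^= bits.RotateLeft32(v, i)
	}
	return h
}

func main() {
	for _, v := range []uint32{0xdeadbeef, 0xdeadbeee} { // even / odd popcount
		fmt.Printf("table entry %08x (popcount %2d) -> hash of 32-byte run: %08x\n",
			v, bits.OnesCount32(v), hashOfRun(v))
	}
}
```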
I want to leave a reference to Microsoft's work from 2012 (yes pretty old), which is the basis for the Data Deduplication feature in Windows Server 2016. This feature is pretty successful as a general purpose deduplication:
I did some analysis/comparison of that work against various alternatives here: https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst IMHO that paper has 2 interesting contributions:
This repository is no longer maintained and has been copied over to Boxo. In an effort to avoid noise and crippling in the Boxo repo from the weight of issues of the past, we are closing most issues and PRs in this repo. Please feel free to open a new issue in Boxo (and reference this issue) if resolving this issue is still critical for unblocking or improving your usecase. You can learn more in the FAQs for the Boxo repo copying/consolidation effort.
Here are some of the issues I will reference:
There are five main concerns: