Bloom frequency table based garbage collector #130
Description
Current garbage collector in ipfs is really simple: something either is pinned or thrown out.
It isn't really garbage collector in this state, it is just "clear cache" command.
Here is an idea for a garbage collector that would probabilistically remove unused data leaving used in ipfs repo.
freq
is 1MiB byte array filled with zeros at the start. This array is stored to disk on shutdown and periodically but its loss doesn't really matter, meaning that for the most time the overhead is minimal.
After each garbage collection round (and at the begging) a nonce
is chosen (32bits should be enough) which is random.
When block from disk is accessed its IPFS hash is hashed with nonce using the MurmurHash3 32bit (52 x86 instructions, 5GiB/s throughput- more info here) and truncated to 20 bits. This 20 bits number denotes index in the freq
array. Byte under this index is incremented up to 255 in which case it stays that way.
When repo runs out of space or user triggers manual garbage collection the removal threshold has to be calculated. Zeros are discarded from freq
array, rest is sorted, this way cumulative frequency data is obtained. This allows for calculation which value in freq
array is threshold for given percentage of space freed (there might be better way of doing it). The requested percentage will be depending on user input (the user can decide how much space should be freed) or automatic calculations (settings and space quotas) if automatic garbage collection is being performed.
Then the block storage is being walked and each block is first checked against pins, then its MurMur hash is checked against freq
array. If it is lower than decided threshold the block is discarded.
This garbage collection algorithm has major benefits over our current one. It will leave frequently used data in cache and remove those rarely used. This might not always be what user requests (example: some rarely used manual) but then there is always pining.
It would also make garbage collection with files API less noticeable, files inside this API will usually be used more frequently used so they won't be garbage collected.