This is not so much a bug as general observations and recommendations about using this chunker. I'm putting this here since this chunker is widely copied by other chunker implementations for other languages, and people copying it probably also want to know this. I'm not sure where else this kind of info should go. Apologies if this is annoying. This comes out of some testing and analysis I did of chunker algorithms here:

https://github.com/dbaarda/rollsum-chunking/blob/master/RESULTS.rst
Observations related to this FastCDC implementation are:
- The `average` argument is not really the average block length. The actual average is very close to `minimum + average/2` (for reasons explained below).
- This implementation changes the FastCDC "transition point" to `max(0, average - 1.5*minimum)`. This means the transition point is less than `minimum` unless `minimum < 0.4*average`.
- FastCDC gets most of its speed advantage from "cut-point-skipping" past `minimum`, and recommends `minimum = 0.5*average` or even `minimum = average`. This means that for the `minimum` settings FastCDC recommends for good speed, the first mask is never used; only the second mask is.
- When only the second mask is used, FastCDC's "normalized chunking" feature is not used. Since this implementation uses "NC1" with only one bit of normalization, it behaves like a plain exponential chunker with `average` set to half the value. An exponential chunker's actual average block length (excluding truncation from `maximum`) is `minimum + average`, hence this chunker's actual average being `minimum + average/2` (provided `maximum` is set large enough to minimize truncations). The sketch after this list works through the arithmetic.
- Not using "normalized chunking" is probably a good thing anyway, since my testing suggests it is worse than simple exponential chunking when you compare using the same actual average and minimum chunk sizes.
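To make the arithmetic above concrete, here is a minimal Python sketch of the expected-length math. The function names are mine, and the "second mask cuts with probability `2/average`" step assumes the NC1 one-bit normalization described above; none of this is the chunker's actual API:

```python
def transition_point(minimum, average):
    # This implementation's modified FastCDC transition point.
    return max(0, average - 1.5 * minimum)

def uses_first_mask(minimum, average):
    # The first (harder) mask is only tested for offsets between
    # minimum and the transition point, so it is dead code unless
    # the transition point exceeds minimum, i.e. minimum < 0.4*average.
    return transition_point(minimum, average) > minimum

def actual_average(minimum, average):
    # With only the second (easier) mask active, each byte past
    # minimum cuts with probability ~2/average (NC1 makes the mask
    # one bit easier), so the expected run past minimum is average/2.
    # Truncation at maximum is ignored here.
    assert not uses_first_mask(minimum, average)
    return minimum + average / 2

average = 8192
for minimum in (4096, 8191):
    print(minimum, transition_point(minimum, average),
          actual_average(minimum, average))
# minimum=4096: transition=2048.0 (below minimum), actual average 8192.0
# minimum=8191: transition=0, actual average 12287.0
```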
With all of that in mind, based on my testing, the best `minimum`/`average`/`maximum` settings for this chunker for optimal deduplication with good speed would be `minimum=average/2, maximum>2.5*average`, giving an average chunk length of `average`. For faster chunking speed with slightly reduced deduplication, use `minimum=average, maximum>3*average`, giving an average chunk length of `1.5*average`. Unfortunately this chunker throws an error if `minimum >= average`, so it doesn't let you do that. However, you can get away with setting `minimum = average - 1`.
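As a rough illustration of those recommendations (a sketch only; the `settings_*` helpers and the dict layout are mine, not this chunker's API), here is how you could derive the arguments for a desired actual average chunk length:

```python
def settings_for_dedup(target):
    # Optimal deduplication with good speed: minimum = average/2
    # gives an actual average of `average`, so set average = target.
    average = target
    return dict(minimum=average // 2, average=average,
                maximum=int(2.5 * average) + 1)

def settings_for_speed(target):
    # Faster chunking, slightly reduced deduplication: minimum = average
    # gives an actual average of 1.5*average, so set average = target/1.5.
    # Since this chunker rejects minimum >= average, use average - 1.
    average = int(target / 1.5)
    return dict(minimum=average - 1, average=average,
                maximum=3 * average + 1)

print(settings_for_dedup(8192))  # actual average ~8KiB, best dedup
print(settings_for_speed(8192))  # actual average ~8KiB, faster
```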