Capturing this thread here since Twitter is not the best place for research discussions.
From Brandon Falk (@gamozolabs), Aug 8: https://twitter.com/gamozolabs/status/1292324287324381184
Let's talk a bit about fuzzer benchmarking. I'll start off with some data directly from fuzzbench, of aflfast fuzzing libjpeg-turbo. Here's what the 20 runs of aflfast look like. https://fuzzbench.com/reports/2020-08-03/index.html
According to fuzzbench, the green line here is a better-performing fuzzer. There's a 20x variance in time-to-same-coverage between runs of the exact same fuzzer in our n=20 sample set.
Even though this is 20 runs of the exact same fuzzer against the exact same binary, we see there is catastrophic variance. Further, we're using things like means and standard deviations to find statistical significance between these fuzzers.
It is apparent from this graph that this data (mainly, the last data points) is not even remotely normally distributed. The runs are clustered largely around two different points, ~3400 and ~3800. Using means and standard deviations here is statistically invalid.
This is very common across fuzzing: a hard block randomly prevents the fuzzer from progressing; once the path is randomly taken, coverage avalanches for some time, leading to a new temporary plateau.
The variance in these runs easily exceeds 5-10% in per-run noise alone, and we're using this data to decide that something is significant because it's 5% better. It's important to note that, in this graph, only maybe 2 of these 20 lines are even close to the mean value.
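To make that concrete, here is a minimal sketch with hypothetical final-coverage numbers (not the real fuzzbench data) that mimic the two clusters around ~3400 and ~3800 above; the mean/standard-deviation summary lands in a gap that no actual run ever reaches:

```python
# Minimal sketch: hypothetical final-coverage numbers, NOT real fuzzbench data.
# 20 runs clustered around two plateaus (~3400 and ~3800), like the graph above.
import statistics

runs = [3390, 3405, 3410, 3395, 3400, 3385, 3415, 3400, 3410, 3405,
        3790, 3805, 3810, 3795, 3800, 3785, 3815, 3800, 3790, 3795]

mean = statistics.mean(runs)    # ~3600
stdev = statistics.stdev(runs)  # ~200
print(f"mean={mean:.0f} stdev={stdev:.0f}")

# How many runs actually land anywhere near that mean?
near_mean = [r for r in runs if abs(r - mean) < 100]
print(f"runs within 100 of the mean: {len(near_mean)} of {len(runs)}")  # 0 of 20
```

A summary that assumes a roughly normal spread around the mean describes a coverage value that no run in the sample actually produces.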
Further, since all of this data is measured in wall-clock time, there is variance introduced just from timing differences in the fuzzers and what they feed back. Here's AFL on the same binary 10 times. Just in wall-clock time alone, the runs have a 20% spread.
This means that, just from the random noise of the fuzzer picking up "unlucky" (perhaps expensive) inputs early, the fuzzer's performance is permanently affected for the lifetime of the run. Between fuzzer noise and timing noise alone, there's easily 20-30% variance between runs.
This means a 24-hour run may be equivalent to a 19-hour run of the exact same fuzzer. That isn't even factoring in differences between different fuzzers; when you factor those in, the variance easily jumps above 2x.
Even though you only care about wall-clock time at the end of the day, these fuzzers are so inefficient that it doesn't matter. If AFL gets 20% more performance, it's extremely unlikely that the gain wouldn't be portable to other mutation strategies.
Ultimately, I want to know the algorithmic performance of a fuzzer (e.g., coverage over cases, not time), and from there I can figure out whether the slow-but-better algorithms can simply be upgraded with better code.
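As a sketch of what that comparison could look like, assuming a hypothetical per-run log of (cumulative executions, edge coverage) snapshots (not something fuzzbench currently emits), comparing fuzzers at fixed case counts separates algorithm quality from implementation speed:

```python
# Sketch of "coverage over cases, not time". The snapshot format is hypothetical:
# a sorted list of (cumulative executions, edge coverage) pairs per run.
import bisect

def coverage_at_execs(snapshots, execs):
    """Coverage a run had reached after `execs` executed test cases."""
    idx = bisect.bisect_right([e for e, _ in snapshots], execs) - 1
    return snapshots[idx][1] if idx >= 0 else 0

# Two hypothetical runs: a fast-but-generic fuzzer and a slow-but-smart one.
fast_fuzzer  = [(0, 0), (1_000_000, 2500), (10_000_000, 3100), (50_000_000, 3300)]
smart_fuzzer = [(0, 0), (100_000, 2600), (1_000_000, 3200), (5_000_000, 3600)]

# Per test case, the "slow" fuzzer has the better algorithm; whether its per-case
# cost can be engineered away is a separate question from the benchmark itself.
for execs in (100_000, 1_000_000, 5_000_000):
    print(execs, coverage_at_execs(fast_fuzzer, execs), coverage_at_execs(smart_fuzzer, execs))
```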
We see this with honggfuzz, which uses malloc() during mutation passes where it's entirely avoidable. Most fuzzers do stuff like this; they're not optimized at all. Every fuzzer I know of splices inputs by copying the last part of the input further out.
Using basic data structures like interval trees, you could splice data without having to copy the remainder. There are so many available improvements to fuzzer performance that raw performance should not be a factor in benchmarks.
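Here's a minimal sketch of that idea, using a flat chunk list rather than a real interval tree (so lookups are O(n) instead of O(log n)); the point is just that a splice can be recorded as references into existing buffers instead of copying everything past the insertion point:

```python
# Sketch only: a chunk list standing in for an interval tree. A splice splits one
# view and inserts a reference to the new data, copying nothing, whereas the naive
# approach rebuilds the whole buffer (copying the entire tail) on every splice.

def naive_splice(dst: bytes, src: bytes, at: int) -> bytes:
    # What most fuzzers effectively do: copy the head, the inserted data,
    # and the entire remainder of dst into a fresh buffer.
    return dst[:at] + src + dst[at:]

def chunked_splice(chunks, src, at):
    """chunks: list of (buffer, start, length) views describing the current input.
    Splices `src` in at logical offset `at` by splitting one view."""
    out, offset, inserted = [], 0, False
    for buf, start, length in chunks:
        if not inserted and offset + length >= at:
            cut = at - offset
            if cut:
                out.append((buf, start, cut))                     # head of this chunk
            out.append((src, 0, len(src)))                        # spliced-in data, by reference
            if length - cut:
                out.append((buf, start + cut, length - cut))      # tail of this chunk
            inserted = True
        else:
            out.append((buf, start, length))
        offset += length
    return out

# Usage: only materialize the flat byte string when the target actually needs it.
original = b"HEADER" + b"A" * 1000
chunks = chunked_splice([(original, 0, len(original))], b"<INTERESTING>", 6)
flat = b"".join(buf[start:start + length] for buf, start, length in chunks)
assert flat == naive_splice(original, b"<INTERESTING>", 6)
```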
Case in point: AFL-qemu massively underperforms every other fuzzer according to fuzzbench. There is no data provided, even in the raw CSV, to control for the performance differences.
We shouldn't be using overall performance to determine the statistical significance of mutator/feedback design. We're bringing non-athletes to the Olympics and trying to figure out which country is better. If we find a better fuzzing strategy, the performance problem can likely be solved.
TL;DR: We're comparing fuzzers with 50%+ performance differences, 20%+ same-fuzzer variance, with different feedback mechanisms, often on cloud nodes where I/O is affected by neighbors, with two schedulers (OS+hypervisor), and we're trying to find 5% improvements in mutators...
... using normal-distribution significance statistics on non-normal data. And we're ignoring the first 15 minutes of the fuzzer, where 95% of the coverage is found.
For sure, I'm still gathering evidence to make a much stronger case just on... getting a good feel for fuzzers, I guess? I seem to hit brick walls when trying to convince others that 5% improvements are quite minor, as well as about the general cleanliness of test environments.
I'm trying to put together a more comprehensive list of examples: demonstrating what bespoke fuzzer performance looks like compared to generic mutations (often 2-5x for structured inputs), and how to evaluate fuzzers along various dimensions (speed, complexity, scalability, etc.).
I feel like I'm getting close to being comfortable demonstrating many points that I've taken for granted as intuition for the past ~6 years of working with fuzzers.