| 1 | +--- |
| 2 | +sidebar_position: 2 |
| 3 | +--- |
| 4 | + |
| 5 | +# How Codeflash measures code runtime |
| 6 | + |
| 7 | +Codeflash reports benchmarking results that look like this: |
| 8 | + |
| 9 | +```text |
| 10 | +⏱️ Runtime : 32.8 microseconds → 29.2 microseconds (best of 315 runs) |
| 11 | +``` |
| 12 | + |
| 13 | +To measure runtime, Codeflash runs a function multiple times with several inputs |
| 14 | +and sums the minimum time for each input to get the total runtime. |
| 15 | + |
| 16 | +Simplified pseudocode for Codeflash benchmarking looks like this: |
| 17 | + |
| 18 | +```python |
| 19 | +loops = 0 |
| 20 | +min_input_runtime = [float('inf')] * len(test_inputs) |
| 21 | +start_time = time.time() |
| 22 | +while loops < 5 or time.time() - start_time < 10:  # at least 5 loops and 10 seconds |
| 23 | +    loops += 1 |
| 24 | +    for input_index, test_input in enumerate(test_inputs): |
| 25 | +        t = measure_runtime(function_to_optimize, test_input)  # hypothetical helper: time one call |
| 26 | +        if t < min_input_runtime[input_index]: |
| 27 | +            min_input_runtime[input_index] = t  # keep the fastest run per input |
| 28 | +total_runtime = sum(min_input_runtime) |
| 29 | +number_of_runs = loops |
| 30 | +``` |
| 31 | + |
| 32 | +The above code runs the function multiple times on different inputs and uses the minimum time for each input. |
| 33 | + |
| 34 | +In this document we explain: |
| 35 | +- How we measure the runtime of code |
| 36 | +- How we determine if an optimization is actually faster |
| 37 | +- Why we measure the timing as best of N runs |
| 38 | +- How we measure the runtime when we run on a wide variety of test cases |
| 39 | + |
| 40 | +## Goals of Codeflash auto-benchmarking |
| 41 | + |
| 42 | +A core principle of Codeflash is that it makes no assumptions about which optimizations might be faster. |
| 43 | +Instead, it generates multiple possible optimizations with LLMs and automatically benchmarks the code |
| 44 | +on a variety of inputs to empirically verify if the optimization is actually faster. |
| 45 | + |
| 46 | +The goals of Codeflash auto-benchmarking are: |
| 47 | +- Accurately measure the runtime of code |
| 48 | +- Measure runtime for a wide variety of code |
| 49 | +- Measure runtime on a variety of inputs |
| 50 | +- Do all the above on a real machine, where other processes might be running and causing timing measurement noise |
| 51 | +- Finally, make a binary decision on whether an optimization is faster or not |
| 52 | + |
| 53 | +## Racing Trains as an analogy |
| 54 | + |
| 55 | +Imagine you're a boss at a train company choosing between two trains to run between San Francisco and Los Angeles. |
| 56 | +You want to determine which train is faster. |
| 57 | + |
| 58 | +You can measure their speed by timing how long each takes to travel between the two cities. |
| 59 | + |
| 60 | +However, real-life factors affect train speeds: rail traffic, unfavorable weather, hills, and other obstacles. |
| 61 | +These can slow them down. |
| 62 | + |
| 63 | +To settle the contest, you have a driver race the two trains at maximum possible speed. |
| 64 | +You measure the travel times between the two cities for each train. |
| 65 | + |
| 66 | +Train A took 5% less time than Train B. But the driver points out that Train B encountered poor weather, |
| 67 | +making it impossible to draw firm conclusions. Since it's crucial to know which train is truly faster, you need more data. |
| 68 | + |
| 69 | +You ask the driver to repeat the race multiple times. In this scenario, since they have plenty of time, they repeat the race 50 times. |
| 70 | + |
| 71 | +This gives us timing data (in hours) that looks like the following. |
| 72 | + |
| 73 | + |
| 74 | + |
| 75 | +With 100 data points (50 per train), determining the faster train becomes more complex. |
| 76 | + |
| 77 | +The timing data contains noise from various factors: other trains on the tracks, changing weather, and so on. |
| 78 | +This makes it challenging to determine which train is faster. |
| 79 | + |
| 80 | +Here's the crucial insight: timing noise isn't the train's fault. A train's speed is an intrinsic property, |
| 81 | +independent of external hindrances. The noise only adds time—there's no "negative noise" that makes trains go faster. |
| 82 | +Ideally, we'd measure speed with no hindrances at all, giving us clean, noise-free data that shows true speed. |
| 83 | + |
| 84 | + |
| 85 | +In reality, we can't eliminate all noise. Instead, we minimize it by focusing on the "signal"—the train's intrinsic |
| 86 | +speed—rather than the noise from hindrances. By running multiple races, we get multiple data points. Sometimes conditions |
| 87 | +are nearly perfect, allowing the train to reach maximum speed. These minimal-noise runs produce the smallest times—our |
| 88 | +"signal" that reveals the train's true capabilities. We can compare these best times to determine the faster train. |
| 89 | + |
| 90 | +The key is finding each train's minimum time between cities—this closely approximates its maximum achievable speed. |
| 91 | + |
| 92 | +## How Codeflash benchmarks code |
| 93 | + |
| 94 | +This principle of measuring peak performance while minimizing external noise is exactly how Codeflash measures code runtime. |
| 95 | +Computer processors face various sources of noise that can increase function runtime: |
| 96 | + |
| 97 | +- Hardware: cache misses, CPU frequency scaling, etc. |
| 98 | +- Operating system: context switches, memory allocation, etc. |
| 99 | +- Programming language: garbage collection, thread scheduling, etc. |
| 100 | + |
| 101 | +Codeflash minimizes noise by running functions multiple times and taking the minimum time. |
| 102 | +This minimum typically occurs when there are the fewest hindrances: the processor frequency is maximal, |
| 103 | +cache misses are minimal, and the operating system is not doing context switches. This approaches the function's true speed. |
| 104 | + |
| 105 | +When comparing an optimization to the original function, Codeflash runs both multiple times and compares their |
| 106 | +minimum times. This gives us the most accurate measurement of each function's intrinsic speed, which is our signal, allowing for a |
| 107 | +meaningful comparison. |
| 108 | + |
| 109 | +We've found that running a function multiple times increases the likelihood of getting these "lucky" minimal-noise runs. |
| 110 | +To maximize this, Codeflash runs each function for 10 seconds with a minimum of 5 loops, balancing measurement accuracy with reasonable runtime. |
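The best-of-N idea described above can be sketched in a few lines of Python. This is a minimal illustration, not Codeflash's actual implementation: the name `best_of_n` and its parameters are hypothetical, the timer is assumed to be `time.perf_counter_ns`, and the time budget is shortened so the example finishes quickly.

```python
import time


def best_of_n(func, arg, min_loops=5, min_seconds=0.05):
    """Call func(arg) repeatedly and return (fastest_time_ns, loops).

    Hypothetical helper: keeps looping until BOTH the minimum loop
    count and the minimum time budget have been reached.
    """
    best = float("inf")
    loops = 0
    start = time.perf_counter()
    while loops < min_loops or time.perf_counter() - start < min_seconds:
        loops += 1
        t0 = time.perf_counter_ns()
        func(arg)
        elapsed = time.perf_counter_ns() - t0
        # Noise only ever adds time, so the smallest sample is the
        # closest approximation of the function's intrinsic runtime.
        best = min(best, elapsed)
    return best, loops


best_ns, loops = best_of_n(sorted, list(range(1000)))
print(f"best of {loops} runs: {best_ns} ns")
```

A longer time budget gives more chances to hit a low-noise run, at the cost of a longer benchmark.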
| 111 | + |
| 112 | +## What happens when there are multiple inputs to a function? |
| 113 | + |
| 114 | +While this approach works well for single inputs, what about multiple inputs? |
| 115 | + |
| 116 | +Now the race runs through multiple stations: Seattle to San Francisco to Los Angeles to San Diego. |
| 117 | +We still need to determine the faster train for this route. |
| 118 | + |
| 119 | +We can only measure times between adjacent stations. |
| 120 | + |
| 121 | +Here is what the timing data looks like (in hours): |
| 122 | + |
| 123 | + |
| 124 | + |
| 125 | +With 300 data points (50 runs × 3 segments × 2 trains) and varying conditions on each segment, |
| 126 | +determining the faster train becomes even more challenging. |
| 127 | + |
| 128 | +Which train is faster? |
| 129 | + |
| 130 | +Our insight about measuring peak performance still applies, but we need to measure each segment separately |
| 131 | +since the track differs between segments due to hills and track curves. |
| 132 | + |
| 133 | + |
| 134 | +We divide the route into segments between stations and measure each train's fastest time per segment. |
| 135 | +We find the minimum time for each segment, then sum these minimums to get the total route time. |
| 136 | +The train with the lowest sum of minimum times is fastest. This approach better captures each train's |
| 137 | +intrinsic speed because measuring shorter segments reduces the chance of encountering noise in that segment, compared to measuring the entire route. |
| 138 | +The result is more accurate timing data. |
| 139 | + |
| 140 | +Codeflash applies this same principle to functions with multiple inputs. For workloads with multiple inputs, |
| 141 | +it measures a function's intrinsic speed on each input separately. The total intrinsic runtime is the sum |
| 142 | +of these individual minimums. |
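As a small worked sketch of the sum-of-minimums idea, the snippet below uses made-up timings (in nanoseconds) purely for illustration; the function name `total_intrinsic_runtime` and the data layout are assumptions, not part of Codeflash. It also compares the result with the fastest single loop, to show why taking the minimum per input gives a lower, less noisy total.

```python
def total_intrinsic_runtime(timings_per_input):
    """Sum the fastest observed time for each input.

    timings_per_input[i] holds every measurement taken for input i.
    """
    return sum(min(runs) for runs in timings_per_input)


# Illustrative, made-up measurements: 3 inputs, 3 loops each.
timings = [
    [120, 105, 131],  # input 0
    [48, 52, 47],     # input 1
    [300, 290, 310],  # input 2
]

sum_of_minimums = total_intrinsic_runtime(timings)

# Alternative for comparison: total time of each full loop over all
# inputs, then take the fastest loop.
fastest_loop = min(sum(loop) for loop in zip(*timings))

print(sum_of_minimums)  # 105 + 47 + 290 = 442
print(fastest_loop)     # loop totals are 468, 447, 488, so 447
```

The sum of per-input minimums can never exceed the fastest whole loop, because each input only needs one low-noise run rather than all inputs being lucky in the same loop.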
| 143 | + |
| 144 | + |
| 145 | +This approach proves highly accurate, even on noisy virtual machines. We use a 5% noise floor for runtime |
| 146 | +(10% on GitHub Actions) and only consider optimizations significant if they're at least 5% faster than the original function. |
| 147 | +This technique effectively minimizes measurement noise, giving us an accurate measure of a function's true, noise-free, intrinsic speed. |
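The final decision step can be sketched as a simple threshold check. This is an assumption-laden illustration: the function name `is_significantly_faster` is hypothetical, and "X% faster" is interpreted here as the relative reduction in runtime compared with the original, which may differ from the exact formula Codeflash uses.

```python
def is_significantly_faster(original_ns, optimized_ns, noise_floor=0.05):
    """Return True if the optimized runtime beats the noise floor.

    Assumption: improvement is measured as the fraction of the
    original runtime that was removed.
    """
    if original_ns <= 0:
        raise ValueError("original_ns must be positive")
    improvement = (original_ns - optimized_ns) / original_ns
    return improvement >= noise_floor


# The example from the top of this page: 32.8 µs -> 29.2 µs,
# roughly an 11% reduction in runtime.
print(is_significantly_faster(32_800, 29_200))                    # 5% floor
print(is_significantly_faster(32_800, 29_200, noise_floor=0.10))  # 10% floor
print(is_significantly_faster(1_000, 970))                        # only 3% faster
```

The first two checks pass and the third does not, since a 3% difference falls inside the noise floor.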