benchmarks: make benchmarks more stable #18096
Conversation
@swift-ci smoke test
@swift-ci smoke benchmark
@palimondo I went ahead and fixed some of the benchmark issues.
Build comment file: Optimized (O): Regression (16), Improvement (34), No Changes (402), Added (8), Removed (8). Hardware Overview
@eeckstein Good to see! Let me have a look at these today, but I need to review the changes in more detail. On a cursory glance, I'd lower the workload in CSVParsing even further.
But since that is the single modification to that benchmark, I'd delay changing it until we land the…
Force-pushed from 0ac8dae to bf9256b.
@swift-ci smoke test
I think I can get to look at this in detail only over the weekend, but I had a slightly different pattern in mind for taking the setup overhead out of the main loop… I'm also a bit confused by the use of…
@palimondo Makes sense. I changed the iteration time of the renamed benchmarks (like CSVParsing) to be < 2.5ms. In this PR I addressed all the benchmarks where I saw a problem with reducing the iteration count (and increasing the sample count). So this PR should give us a good starting point.
Global variables are initialized lazily. The blackHole calls in the setup functions make sure the initialization is done before the run functions are called.
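For illustration, here is a minimal sketch of that lazy-global + blackHole setup pattern. The benchmark name, workload, and run function below are hypothetical; BenchmarkInfo, blackHole, and the tags are the suite helpers as they appear elsewhere in this PR:

```swift
import TestsUtils

// Globals are initialized lazily on first use; touching the workload from the
// setUpFunction (through blackHole, so the access isn't optimized away) forces
// that initialization to happen before the driver starts timing run_MyBench.
let myWorkload: [Int] = Array(0..<100_000)

public let MyBench = BenchmarkInfo(
  name: "MyBench",
  runFunction: run_MyBench,
  tags: [.validation, .api, .Array],
  setUpFunction: { blackHole(myWorkload) })

@inline(never)
public func run_MyBench(_ N: Int) {
  for _ in 0..<N {
    blackHole(myWorkload.reduce(0, &+))
  }
}
```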
I see. The pattern I was thinking of was to have an optional global, initialized in…
That's a good point. But I think in most cases the data overhead is not that big, and it does not really matter. Therefore I prefer the simpler method (of just referencing the global instead of an optional). Actually, I have an idea to simplify this setup/teardown procedure altogether, which would solve this problem. I'll prepare a PR for this.
Most of these changes LGTM, but I'd like to see the Array2D initialization with reserved capacity.
I don't think it's necessary to rename tests where the change in runtime is due to fixing bugs by adding blackHole in the right places. Before this, most of these tests were reporting just the timing of the setup overhead anyway. This is really mostly a bugfix PR.
Can you do one more smoke benchmark, please?
  }
  return A
}()
I wasn't sure what the main point of this benchmark was before, when I noticed it has unstable memory consumption. From the robust-microbench report:
The Array2D benchmark has a significant memory use range of 7 MB (3292–5101 pages, or 12.9–19.9 MB)! It creates a 1K by 1K [[Int]] without reserving capacity. The pure Int storage is at least 8 MB, plus there is some constant overhead per Array. I guess the variation depends on how big the contiguous memory regions the allocator gets are, while the arrays are growing as append is called and sometimes need to be copied to a new region. Though I'm not sure this is the point of the test, and it maybe should be rewritten with reserved capacity for stable memory use.
Seeing you've taken the array initialization out into the setUpFunction… would you mind fixing it to use reserved capacity? How about something like this:
return Array(repeating: Array(0 ..< 1024), count: 1024)
(I'm assuming that initializing from a range does pre-reserve the capacity, but I'm not sure...)
I simplified it. But note that your elegant one-liner is not equivalent to the original construction, performance-wise, because it reuses the same inner array. That then triggers copy-on-write in the main benchmark loop.
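For reference, a hedged sketch of a construction that both reserves capacity and keeps each inner array's storage unique, so the benchmark loop doesn't hit copy-on-write (the helper name is hypothetical, not the code that landed in this PR):

```swift
// Build the 1K x 1K [[Int]] with reserved capacity while giving every row its
// own storage, avoiding both reallocation during growth and copy-on-write
// copies when the rows are mutated in the benchmark loop.
func makeArray2D() -> [[Int]] {
  var a: [[Int]] = []
  a.reserveCapacity(1024)
  for _ in 0..<1024 {
    a.append(Array(0..<1024))  // a fresh inner array each time, not a shared one
  }
  return a
}
```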
@@ -15,10 +15,11 @@ import TestsUtils
 public let Array2D = BenchmarkInfo(
   name: "Array2D",
   runFunction: run_Array2D,
-  tags: [.validation, .api, .Array])
+  tags: [.validation, .api, .Array],
+  setUpFunction: { blackHole(Arr) })
I think we should store these workload data structures in Optionals, so that we can release them in the tearDownFunction. Otherwise we're effectively leaking memory as we go through the pre-commit benchmark suite. This single array here is 8 MB.
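A minimal sketch of that idea, assuming the suite's tearDownFunction parameter and using hypothetical names: allocate the workload in setUpFunction, drop it in tearDownFunction so the roughly 8 MB array doesn't outlive the benchmark:

```swift
import TestsUtils

// Optional workload: nil until setUp runs, released again in tearDown so the
// memory is not kept alive for the remainder of the pre-commit benchmark suite.
var array2DWorkload: [[Int]]?

public let Array2DSketch = BenchmarkInfo(
  name: "Array2DSketch",
  runFunction: run_Array2DSketch,
  tags: [.validation, .api, .Array],
  setUpFunction: { array2DWorkload = (0..<1024).map { _ in Array(0..<1024) } },
  tearDownFunction: { array2DWorkload = nil })

@inline(never)
public func run_Array2DSketch(_ N: Int) {
  for _ in 0..<N {
    var sum = 0
    for row in array2DWorkload! {
      for element in row { sum = sum &+ element }
    }
    blackHole(sum)
  }
}
```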
@@ -13,58 +13,61 @@
 import TestsUtils

 public let ReversedCollections = [
-  BenchmarkInfo(name: "ReversedArray", runFunction: run_ReversedArray, tags: [.validation, .api, .Array]),
+  BenchmarkInfo(name: "ReversedArray2", runFunction: run_ReversedArray, tags: [.validation, .api, .Array],
I don't think renaming ReversedCollections is necessary. You are not changing the workload, just fixing a bug in the test. (Unless it was meant as a technique to highlight them for review in the benchmark report...)
@@ -13,9 +13,10 @@
 import TestsUtils

 public let ClassArrayGetter = BenchmarkInfo(
-  name: "ClassArrayGetter",
+  name: "ClassArrayGetter2",
Does a bug fix warrant a name change?
@@ -59,7 +59,7 @@ func genStructArray() {

 @inline(never)
 public func run_ArrayOfGenericPOD(_ N: Int) {
This one is 50% overhead according to my measurements. I guess some blackHoles are needed in there, too?
I didn't see any problems here, but it's a good idea to add blackHoles here.
Force-pushed from bf9256b to 128c66f.
@palimondo Thanks for reviewing. I solved the problem of leaking memory through the global workloads with a new (and much simpler) mechanism for separating setup time from core benchmark time. About renaming benchmarks: the purpose of renaming a benchmark is to avoid introducing false changes in the benchmark history. It has nothing to do with whether it's a bug fix or not. If the benchmark time changes (significantly) because the benchmark itself changed (and not the compiler), then it should be renamed.
@swift-ci smoke benchmark
Build comment file: Optimized (O): Regression (21), Improvement (24), No Changes (405), Added (10), Removed (10). Hardware Overview
I think this is solving the wrong problem. I'll expand in a followup comment.
Otherwise, the blackHole and range fixes, as well as the Array2D initialization, look solid.
@eeckstein, @gottesmm I think there is some misunderstanding of the setup overhead issue I've raised in my report, when I suggested it should be removed:
This, in and of itself, is completely unrelated to the stability of the measurement results. The suggestion to address setup overhead is predicated on the main proposed change: measuring all benchmarks with…
In fact, to be most precise you should compute this with the lowest…
...basically the task list for fixes in your 5th commit here (1b5ad3c). High setup overhead ratio is useful for detecting the case when compiler optimizations kicked in and eliminated some part of the workload, and judicious…
Now, why and how should we exclude the setup overhead once we start to measure with…
So the whole point of moving the setup work that is detected as overhead into the…
Erik's initial attempt here in an earlier, overwritten commit with…
I understand the motivation to make writing of benchmarks as easy as possible, and that setting up an optional global variable that would be cleaned up in the…
Maybe we can go back to Erik's original idea with the lazily initialized globals? What if the pattern was to make them IUOs (so that the main workload body can continue to pretend it's an unchanged local variable) and do something like…
But to reiterate, with the current measurement method, setup overhead plays absolutely no role. It becomes an issue, affecting about 10% of the benchmarks in the suite, only when we start measuring with…
@palimondo The motivation for replacing the setup function is not related to your proposal at all. It's a (more or less) non-functional change which just solves a usability problem: it's easy to get it wrong with a global variable, because…
@swift-ci smoke test
@eeckstein Could you please hold this until after we land #18124? There's a looming merge conflict in…
I understand the usability point of view, and agree that the ability to write benchmarks correctly is very important. But the proposed change (removing…
Ensuring that benchmarks are written correctly, adhering to all the assumptions of our testing process, is the main point of introducing the benchmark validation infrastructure that I'm proposing with…
There are two equally valid approaches to benchmarking frameworks with regards to which part controls the time measurement: the driver or the test. In the Swift Benchmarking Suite as it currently stands, it is the driver that controls the measurement and the benchmark supplies the function to be measured. In the other approach you hand the stopwatch to the test itself, which is exactly what the changes proposed in the first 3 commits here do. This would create a hybrid situation in SBS.
With the test-controls-timing approach it is possible to precisely control what gets measured. That is crucial if you want to measure effects with nanosecond precision. It's the approach taken by @lorentey's Attabench framework. But since everything is controlled from the test, the responsibility for correct use lies completely in the hands of the test author.
With the driver-controls-timing approach we have more control over the measurement strategy as a whole process. The trade-off is that benchmarks have to give up fine-grained control and need to conform to the set of assumptions of this process. This is more like using a unit testing framework, with special affordances for rare but sometimes necessary tasks like…
I believe that the second approach is better for the SBS, because it gives us leverage to measure, improve, and control its quality. When the details of what and how gets measured are left underspecified, it's easy to deviate from the best practices, and the result is a set of benchmarks that take too long to run and report spurious changes.
I don't agree. It is easily possible to measure the setup overhead. You just have to compare, e.g., num_iters=1 with num_iters=10 / 10. IMO that's already sufficient. We can even do better by measuring the run_function time in addition and subtracting the benchmark time.
PS: to clarify: the num-iters method can be used to detect "bad" benchmarks where the setup overhead is not extracted yet. Calculating time(run_function) - benchmarkTime gives the time of the setup for benchmarks where it is already extracted (if we want to measure that for some reason).
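As a worked example of that comparison (the numbers and the helper below are illustrative, not from the PR): with a per-sample setup cost s and per-iteration work w, a sample at num_iters=1 takes s + w, while a sample at num_iters=10 reports (s + 10w)/10 = s/10 + w per iteration, so the difference between the two recovers the setup cost:

```swift
// Estimate the setup cost from two per-iteration timings (hypothetical helper):
//   t1  = s + w            (num_iters = 1)
//   t10 = s/10 + w         (num_iters = 10, divided by 10)
//   t1 - t10 = (9/10) * s  =>  s = (t1 - t10) * 10 / 9
func estimatedSetupTime(t1: Double, t10: Double) -> Double {
  return (t1 - t10) * 10.0 / 9.0
}
```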
Sure, no problem.
Force-pushed from 128c66f to 3a9b763.
@swift-ci smoke test
I don't see how this would work with the change as is. It looks to me like there is only one global variable…
The second global…
Now in the…
Am I misunderstanding how this works?
But even if you worked around this issue by tracking the wall time inside…
If you think the ergonomics around its use in conjunction with…
Ah, now I get your point. That's true, but I have seen that in most benchmarks the setup time is so small that it does not hurt the overall runtime at all (it's just important that the setup time is not counted into the real benchmark time).
Just to be sure I didn't overlook it before, the workaround to maintain the ability to detect overhead in the diff above, with…
I agree the issue of setup overhead is very minor. I have been saying this in the forums ever since Michael got hung up on this point there. Which is why I don't think we have to rush in with a fix, without having the tools that measure it in the first place. It is significant for just a handful of tests, very minor for the vast majority of tests that exhibit the overhead, and it affects less than 10% of all benchmarks in the suite.
I think the current setup works just fine. The move to handing the stopwatch to the test in order to address setup overhead goes against the current design of the SBS and will make it harder to deal with it correctly when the need arises.
Thinking too much about the setup overhead is a case of premature optimization. I think if we stuck to the current design, where you write the benchmark in a single function and don't have to think about it until the overhead gets flagged as an issue by…
I was saying in the discussion of PR #12415 why it's safe to move them outside of the sampling loop: the…
All I was trying to say is that we should maintain the support for…
The pattern with the lazy global variable you came up with was great. The potential memory leak we were discussing also isn't that big of a deal. When running tests via…
To avoid that minor issue, we can use…
Some benchmarks wrongly executed the loop N+1 times ("0...N" instead of "0..<N")
…g important parts of a benchmark
…rks where setup time is significant
…enchmarks. Those are benchmarks which took way too long or short to execute a single iteration or benchmarks which changed in time anyway because of previous fixes. I renamed those benchmarks so that they are now treated as "new" benchmarks.
…ark. Force all globals to be initialized in the setup functions
…oop. This is important to minimize the runtime when many samples are taken.
Force-pushed from 3a9b763 to bd43d54.
OK, I see you have a strong opinion on this. I removed the commits which introduce the new setup mechanism (and updated the other commits). With the old method, I agree with you that the setup/teardown function calls should be moved out of the sample loop. I added a commit for this.
@swift-ci smoke benchmark
@@ -40,7 +40,7 @@ class RefArray<T> {
 // elements should be a nop.
 @inline(never)
 func genEnumArray() {
-  _ = RefArray<Int?>(3)
+  blackHole(RefArray<Int?>(3))
   // should be a nop
I'm confused about all these benchmarks that have the // should be a nop comments. What are they supposed to measure?
I think it means that it's expected that the optimizer removes everything. I'm not sure if this makes sense.
In general, it's better to replace such benchmarks (which only test whether the optimizer removes everything) with a lit test.
My point exactly:
Aside: I did look into fixing some of the benchmarks with setup overhead, and I definitely need help with the ArrayOf* benchmark family. I cannot locate the source of the overhead and I don't understand what they are testing. They are full of methods that create an array and immediately throw it away (assigning to _), followed by the comment // should be a nop. If they are indeed testing that some optimization takes place, I'd argue they should be tests/validation-tests and not benchmarks…
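To illustrate the lit-test alternative being suggested (a hypothetical test, not something proposed in this PR; the %target-swift-frontend and %FileCheck substitutions are the ones Swift's test suite already uses, but the exact CHECK lines here are an assumption about what the optimizer produces):

```swift
// RUN: %target-swift-frontend -O -emit-sil %s | %FileCheck %s

class RefArray<T> {
  var array: [T]
  init(_ i: T, count: Int = 3) { array = [T](repeating: i, count: count) }
}

// Verify that the "should be a nop" claim actually holds: after -O there
// should be no class allocation left in the optimized SIL for this function.
// CHECK-LABEL: sil {{.*}}genEnumArray
// CHECK-NOT: alloc_ref
// CHECK: } // end sil function
@inline(never)
func genEnumArray() {
  _ = RefArray<Int?>(3)  // expected to be optimized away entirely
}
```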
@swift-ci smoke benchmark
}
Workload_${Name} = Workload.${Name}
}
var Workload_${Name}: Workload! = Workload.${Name}
This whole file is absolutely beautiful. I haven't seen this benchmark before, as I was stuck on a November 2017 fork. This is exactly the correct pattern to emulate when you have a significant setup overhead. Not to mention the effective application of GYB 😬 in benchmarks… Excellent engineering all over here!
This LGTM! Thank you for hearing me out!
Build comment file: Optimized (O): Regression (14), Improvement (12), No Changes (424), Added (10), Removed (10). Hardware Overview
@swift-ci smoke test and merge
@swift-ci smoke test and merge
@swift-ci smoke test OS X platform
@swift-ci smoke test macOS
The time per iteration should roughly stay constant and not depend on the number of iterations.
*) for benchmarks with a significant setup time, moved the setup to the setUpFunction
*) fixed benchmarks that wrongly executed the loop N+1 times ("0...N" instead of "0..<N")
*) added some blackHole calls to prevent the optimizer from removing important parts of a benchmark
*) lowered the workload for the CSVParsing benchmarks because a single iteration took too long
*) moved the setup and teardown function calls outside the sample loop (see the sketch below)
For a few benchmarks this changed the benchmark score. I renamed those benchmarks so that they are now treated as "new" benchmarks.
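A hedged sketch of the measurement shape the last two bullets describe (hypothetical driver code, not the actual DriverUtils implementation): setup and teardown run once per benchmark, outside the sample loop, and only the run function is timed, so the time per iteration should not depend on the number of iterations:

```swift
import Foundation

// Minimal stand-in for the benchmark registration (not the real BenchmarkInfo).
struct Benchmark {
  var runFunction: (Int) -> Void
  var setUpFunction: (() -> Void)?
  var tearDownFunction: (() -> Void)?
}

// Returns one per-iteration time per sample.
func measure(_ bench: Benchmark, numSamples: Int, numIters: Int) -> [Double] {
  bench.setUpFunction?()                        // not included in any sample
  var samples: [Double] = []
  for _ in 0..<numSamples {
    let start = Date()
    bench.runFunction(numIters)                 // only the workload is timed
    samples.append(Date().timeIntervalSince(start) / Double(numIters))
  }
  bench.tearDownFunction?()                     // workload released after sampling
  return samples
}
```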