[benchmark] BenchmarkDriver and BenchmarkDoctor #18719
Conversation
Force-pushed from 24400f9 to 46981fe.
Moved result formatting methods from `PerformanceTestResult` and `ResultComparison` to `ReportFormatter`, freeing PTR to take on more computational responsibilities in the future.
Also removed unused imports.
Moved the `captured_output` function to its own file. Added homegrown unit testing helper classes `Stub` and `Mock`: `unittest.mock` was only added in Python 3.3 and we need to run on Python 2.7. `Stub` and `Mock` were organically developed as minimal implementations to support the common testing patterns used on the original branch, but since I’m rewriting the commit history to provide an easily digestible narrative, it makes sense to introduce them here in one step as a custom unit testing library.
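A minimal sketch of what such homegrown helpers can look like (illustrative only, not the exact contents of `test_utils.py`):

```python
class Stub(object):
    """Stands in for a collaborating object, returning canned attributes."""

    def __init__(self, **attributes):
        for name, value in attributes.items():
            setattr(self, name, value)


class Mock(object):
    """Records calls made to it and replies with pre-configured responses."""

    def __init__(self, responses=None):
        self.expected_calls = []
        self.actual_calls = []
        self.respond = dict(responses or [])

    def expect(self, call_args, response):
        call_args = tuple(call_args)
        self.expected_calls.append(call_args)
        self.respond[call_args] = response

    def record_and_respond(self, call_args):
        call_args = tuple(call_args)
        self.actual_calls.append(call_args)
        return self.respond.get(call_args, '')

    def assert_called_with(self, expected_args):
        expected_args = tuple(expected_args)
        assert expected_args in self.actual_calls, (
            'Expected: {0} in Called: {1}'.format(
                expected_args, self.actual_calls))

    def assert_called_all_expected(self):
        assert self.expected_calls == self.actual_calls, (
            '\nExpected: {0}\n  Actual: {1}'.format(
                self.expected_calls, self.actual_calls))
```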
The imports are a bit sketchy because `Benchmark_Driver` doesn’t have a `.py` extension, so they had to be hacked manually. :-/ Extracted `parse_args` from `main` and added test coverage for argument parsing.
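One common Python 2.7 workaround for importing a script without a `.py` extension is `imp.load_source`; a sketch of that kind of manual hack (the path below is illustrative):

```python
import imp
import os

# Benchmark_Driver has no .py extension, so a plain `import` statement
# cannot find it. imp.load_source loads it as a module from an explicit path.
DRIVER_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                           'Benchmark_Driver')
Benchmark_Driver = imp.load_source('Benchmark_Driver', DRIVER_PATH)

# With the module loaded, the extracted parse_args function can be unit
# tested directly, e.g. Benchmark_Driver.parse_args(['check', '--verbose']).
```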
See https://www.martinfowler.com/bliki/StranglerApplication.html for more on the pattern used for refactoring legacy applications. Introduced the `BenchmarkDriver` class as the beginning of a strangler application that will gradually replace the old functions; it is used instead of the `get_tests()` function in `Benchmark_Driver`. The interaction with `Benchmark_O` is simulated through mocking: the `SubprocessMock` class records the invocations of command line processes and responds with canned replies in the format of `Benchmark_O` output. Removed 3 redundant lit tests that are now covered by the unit test `test_gets_list_of_all_benchmarks_when_benchmarks_args_exist`, which saves 3 seconds of test execution. Only a single integration test is kept to verify that the plumbing is connected correctly.
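A rough sketch of that mocking approach, building on the `Mock` helper sketched above (illustrative; the path, flag and canned output are assumptions, not the exact `SubprocessMock` from the tests):

```python
class SubprocessMock(Mock):
    """Stands in for the function that shells out to Benchmark_O: records
    every command invocation and replies with canned Benchmark_O-style output."""

    def __init__(self, responses=None):
        super(SubprocessMock, self).__init__(responses)

        def _check_output(args, stdin=None, stdout=None, stderr=None,
                          shell=False):
            return self.record_and_respond(args)

        self.test_harness = _check_output


# Usage in a test (command and canned reply are illustrative):
subprocess_mock = SubprocessMock()
subprocess_mock.expect(
    ('/benchmarks/Benchmark_O', '--list'),
    '#,TEST,[Tags]\n1,Ackermann,[algorithm]\n2,AngryPhonebook,[String]\n')
output = subprocess_mock.test_harness(('/benchmarks/Benchmark_O', '--list'))
assert 'Ackermann' in output
subprocess_mock.assert_called_all_expected()
```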
`LogParser` doesn’t use `csv.reader` anymore. Parsing is handled by a finite state machine: each line is matched against a set of (mutually exclusive) regular expressions that represent known states, and when a match is found, the corresponding parsing action is taken.
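Conceptually, the state machine looks something like this (a heavily simplified sketch; the actual states, regular expressions and log column layout of `LogParser` differ):

```python
import re


class TinyLogParser(object):
    """Matches each line against mutually exclusive regexes (the states) and
    runs the corresponding parsing action. Patterns are illustrative."""

    # '1,Ackermann,3,92,...'-style result row (comma or tab delimited)
    RESULT = re.compile(r'^(\d+)[,\t]([\w.\-]+)[,\t](\d+)[,\t](\d+)')
    # 'Totals,...'-style summary row
    TOTALS = re.compile(r'^Totals[,\t]')

    def __init__(self):
        self.results = []

    def parse_results(self, lines):
        for line in lines:
            if self.TOTALS.match(line):
                continue  # ignore the summary row
            match = self.RESULT.match(line)
            if match:
                num, name, num_samples, min_runtime = match.groups()
                self.results.append(
                    {'test': name, 'samples': int(num_samples),
                     'min': int(min_runtime)})
        return self.results
```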
Added support for tab-delimited and formatted log output (space-aligned columns as output to the console by `Benchmark_Driver`).
Measure more of the environment during tests: in addition to measuring maximum resident set size, also extract the number of voluntary and involuntary context switches from the verbose mode output.
@eeckstein Please review 🙏
* Moved the functionality to compute median, standard deviation and related statistics from `PerformanceTestResult` into `PerformanceTestSamples`.
* Fixed wrong unit in comments.
Introduce an algorithm for excluding outliers after collecting all samples, using the Interquartile Range rule. The `exclude_outliers` method uses the 1st and 3rd quartiles to compute the interquartile range (IQR), then uses inner fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR to remove samples outside this fence. Based on experiments with collecting hundreds and thousands of samples (`num_samples`) per test with a low iteration count (`num_iters`) and ~1 s runtime, this rule is very effective at producing a much better quality sample population, removing the short environmental fluctuations that were previously averaged into the overall result (by the adaptively determined `num_iters` to run for ~1 s) and inflated it with these measurement errors. This technique can be used for some benchmarks to get more stable results faster than before. The outlier filtering is employed when parsing `--verbose` test results.
Added an option to exclude outliers only from the top of the range, leaving in the outliers on the min side.
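For reference, the Interquartile Range rule described above looks roughly like this (a simplified standalone sketch, not the exact `PerformanceTestSamples.exclude_outliers` implementation; in particular the quartile estimation here is a naive nearest-rank approximation):

```python
def exclude_outliers(samples, top_only=False):
    """Remove samples outside the inner fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    With top_only=True only the high outliers are removed, keeping the
    fastest (minimum) samples in the population.
    """
    values = sorted(samples)

    def quartile(q):
        # naive nearest-rank style quartile estimate
        return values[int(round(q * (len(values) - 1)))]

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    lo_fence = q1 - 1.5 * iqr
    hi_fence = q3 + 1.5 * iqr
    return [s for s in samples
            if s <= hi_fence and (top_only or s >= lo_fence)]


print(exclude_outliers([10, 11, 10, 12, 11, 10, 48]))         # drops 48
print(exclude_outliers([5, 10, 11, 10, 12], top_only=True))   # keeps the low 5
```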
The `run` method on `BenchmarkDriver` invokes the test harness with a specified number of iterations and samples. It supports measuring memory use, and in verbose mode it also collects individual samples and monitors the system load by counting the number of voluntary and involuntary context switches. Output is parsed using `LogParser` from `compare_perf_tests.py`. This makes that file a required dependency of the driver, therefore it is also copied to the bin directory during the build.
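A hedged sketch of such an invocation (only `--verbose` and `--memory` are named in this PR; the `--num-samples`/`--num-iters` flag spellings and the harness path are assumptions):

```python
import subprocess


def run(test, harness='./Benchmark_O', num_samples=None, num_iters=None,
        verbose=False, measure_memory=False):
    """Invoke the benchmark harness once and return its raw log output,
    which would then be fed to LogParser. Flag spellings are assumptions."""
    cmd = [harness, test]
    if num_samples:
        cmd.append('--num-samples={0}'.format(num_samples))
    if num_iters:
        cmd.append('--num-iters={0}'.format(num_iters))
    if verbose:
        cmd.append('--verbose')   # per-sample output, context switch counts
    if measure_memory:
        cmd.append('--memory')    # MAX_RSS measured inside the harness
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT)
```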
Replaced the guts of the `run_benchmarks` function with the implementation from `BenchmarkDriver`. There was only a single client which called it with `verbose=True`, so this parameter could be safely removed. The `instrument_test` function is replaced by running `Benchmark_O` with the `--memory` option, which implements the MAX_RSS measurement while also excluding the overhead of the benchmarking infrastructure. The incorrect computation of standard deviation across measurements of more than one independent sample was simply dropped. The bogus aggregated `Totals` statistics were removed; now only the total number of executed benchmarks is reported.
`BenchmarkDoctor` analyzes performance tests and reports their conformance to a set of desired criteria. The first two rules verify the naming convention. `BenchmarkDoctor` is invoked from `Benchmark_Driver` with the `check` argument.
`BenchmarkDoctor` measures benchmark execution (using `BenchmarkDriver`) and verifies that the runtime of each benchmark stays under 2500 microseconds.
Detect setup overhead in benchmarks and report it when it exceeds 5%.
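For illustration, the shape of these rules as standalone checks (a sketch only: the 40-character limit and the `log_naming` warning appear in the review thread further down, the 2500 µs and 5% thresholds are stated above, but the function names, messages and structure here are assumptions):

```python
import logging

log = logging.getLogger('BenchmarkDoctor')


def check_naming(name):
    # Naming rule sketch: keep names at most 40 characters long
    # (limit taken from the review thread below).
    if len(name) > 40:
        log.warning("'%s' name is longer than 40 characters.", name)


def check_runtime(name, min_runtime_us):
    # Optimized runtime should stay under 2500 microseconds.
    if min_runtime_us > 2500:
        log.warning("'%s' takes %d us; runtime should stay under 2500 us.",
                    name, min_runtime_us)


def check_setup_overhead(name, setup_us, runtime_us):
    # Report setup overhead when it exceeds 5% of the measured runtime.
    if runtime_us > 0 and float(setup_us) / runtime_us > 0.05:
        log.warning("'%s' setup is %.1f%% of runtime.",
                    name, 100.0 * setup_us / runtime_us)
```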
Force-pushed from 9c35c11 to 34db547.
I've pushed fixes for minor formatting issues raised by python lint.
This still needs to be finished with a function approximating the normal range based on the memory used.
Force-pushed from 34db547 to f38e6df.
lgtm.
I didn't review in great detail, but I trust you that the changes will not break anything. It's great that you added all the tests.
Regarding the test in BenchmarkDoctor which checks for the naming convention: we can check for this, but to be clear, just violating naming conventions does not make it worth renaming benchmarks and breaking history.
benchmark/scripts/Benchmark_Driver (outdated)
```python
        if len(test) > 40:
            BenchmarkDoctor.log_naming.warning(
                "'%s' name is longer than 40 characters.", test)

    def _name_is_at_most_40_chars_long(measurements):
```
Not sure if this rule makes sense. If longer names cause trouble in reports, we should rather fix the reports.
I have tried my best to fix the reports before, see the still open #9999. But there is no way to make it work on mobile; GitHub doesn’t give us much control over the CSS, which is the key. So I've resigned to just making the report table not scroll on desktop, where it would be sufficient to shorten the overly long names to the 40 character limit, without introducing all the complexity of word-breaking in the report that's in #9999. That's the motivation for the naming rule.

If we follow your previous rename-benchmark-when-runtime-changes-significantly precedent as a rule, we will be renaming most (all?) of the tests that violate the max-length-40 naming rule while fixing the runtime-under-2500-microseconds violations anyway.

Could you ask @swift-ci to do python lint and benchmark, please?
That's fine. But in general, I really don't want to mass-change/rename benchmarks which are not noisy anyway, even if they have >2.5ms/iteration. It would cause a big disruption in our performance tracking.
@swift-ci benchmark
@swift-ci Please Python lint
Could you explain more about the disruption that would cause, so that we can find a good solution? Getting the runtimes down is quite crucial to driving out the noise, as well as to speeding up the whole measurement. Maybe we should take this discussion to the forum once this lands...
@swift-ci benchmark
@swift-ci Please Python lint
ok, let's discuss it on the forum
Build comment file:
Optimized (O): Regression (1), No Changes (446)
Unoptimized (Onone): Regression (9), Improvement (12), No Changes (426)
Hardware Overview
@swift-ci smoke test and merge
This is a followup to #18124 that makes use of the direct memory measurements in `Benchmark_O` with the `--memory` option from the `Benchmark_Driver`. The PR also introduces thorough unit test coverage of the `Benchmark_Driver` and all newly added functionality. Summary of changes:

* Removed the `Benchmark_Driver` command `submit`, which is no longer used.
* New `test_utils.py` with homegrown unit testing helpers to support mocking.
* New `LogParser` class for reading the output from `Benchmark_O`, including `--verbose` mode.
* New `PerformanceTestSamples` class for computing thorough sample statistics, which can be used to exclude outliers.
* New `BenchmarkDriver` class for running benchmarks. This is invoked as a new `check` command to the `Benchmark_Driver`, with an optional `--verbose` parameter.
* New `BenchmarkDoctor` class for validating that benchmarks comply with the set of requirements (naming, runtime, memory) that ensure high quality of the Swift Benchmark Suite.
* `BenchmarkDriver` is now dependent on `compare_perf_tests.py`, so it is also deployed to `swift-bin-dir` during build.

The incremental rewrite of the `Benchmark_Driver`'s guts, which adds full unit test coverage using the Strangler Application refactoring pattern, maintains the original functionality, with the exception of removing the bogus aggregate statistics from the last Totals row in the log; it now only reports the total number of executed benchmarks. The MAX_RSS column now reports the delta of memory used before and after running the benchmark, excluding the overhead of the measurement infrastructure (i.e. the reported numbers are now lower by about 10 MB).

I've structured the PR as a series of small, incremental changes with detailed descriptions in the commit messages; it is therefore best reviewed sequentially by individual commits.