-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[benchmark] Measure memory with rusage and a TON of gardening #18124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[benchmark] Measure memory with rusage and a TON of gardening #18124
Conversation
The testing of `Benchmark_O` and its public interface needs more documentation. The `lit` tests can be embedded in anything. Lets combine these.
Reformatting the `lit` test file to markdown format: * Removed the unnecessary `// `prefix. * Requires preambule in comment * Test documentation intro
* Added check for running by test number. * Documented “dry run” using `--list`. * Moved real run test to the end. * Added checks for logging benchmarks measurements (header and stats). * Added check that specifying the same test multiple times runs it just once.
Report only the total number of executed tests. Aggregating MIN, MAX and MEAN values for all executed benchmarks together (with microsecond precision!) has no statistical relevance.
Moved the formatting of`BenchmarkResults` into `runBenchmarks` function which already contained the logging of header and the special case of unsupported benchmark.
Test the `--verbose` mode output and the `--num-samples` option.
The `Test`struct was forwarding everything to `BenchmarkInfo` and its only contribution was carrying of the index. Tuple is fine for that.
The indices (test numbers) are strings on both ends of the IO — in user input as well as when printed to the console. We can spare few conversions and just store them directly as `String`.
Measure and report system environment indicators during benchmark execution: * Memory usage with maximum resident set size (MAX_RSS) in bytes Proxy indicators of system load level: * Number of Involuntary Context Switches (ICS) * Number of Voluntary Context Switches (VCS) MAX_RSS delta is always reported in the last column of the log report. The `--verbose` mode additionaly reports full values measured before and after the benchmark execution as well as their delta for MAX_RSS, ICS and VCS.
* Fix: flushing stdout before crashing to enable testing. * Added tests that verify reporting of errors when the parsing of command line arguments fails.
Refactored to use Swift’s idiomatic error handling. In case of invalid argument errors, the message is printed to `stderr` and we exit gracefully with error code 1. We no longer crash the app in most cases.
We no longer crash when the argument value parsing fails, but report an error.
Added handling for arguments that don’t have a value (flags), or whose values are optional (they can be empty). This covers all the argument types currently used in Benchmark_O with the single `optionalArg` function.
* Extracted tag parser * Reordered all parsing of optional arguments to be grouped together.
Processing command line arguments is an integral part of `TestConfig` initialization. Make it explicit.
Moving towards immutable TestConfig. The pattern used is inspired by the`PartialBenchmarkConfig` from [criterion.rs](https://github.com/japaric/criterion.rs/blob/master/src/benchmark.rs). The idea is to have a temporary mutable struct which collects the command line arguments and is filled by the “parser” (which will be extracted later). The partial configuration is used to initialize the immutable `TestConfig`. The optionality of the command line parameters is represented by the corresponding properties being `Optional`. (Accidentaly, all our arguments are optional… but that’s beside the point.) Null coalescing operator is used to fill in the defaults during initialization. For some reason, the compiler found `optionalArg` calls for `tags` and `skip-tags` ambiguous, when types changed to `Set<BenchmarkCategory>?`, which was resolved by providing fully qualified key paths for them (`\PartialTestConfig.tags`).
The `TestConfig` is now completely immutable. Arguments that are not necessary for running benchmarks were moved inside the config initialization, which also subsumed the `printRunInfo` function for displaying the configuration details in verbose mode.
Refactored `filterTests` to use two nested functions that better reveal the underlying logic.
Renamed `optionalArg` to `parseArg` because it can now also handle the positional arguments. Now all the command line arguments are handled through this function. Minor naming improvement of the `filterTests` parameter.
Moved the argument parsing logic into new class `ArgumentParser`. The `checked` function is also moved to the ArgParse.swift, next to the parser.
The `ArgumentParser` now has a configuration phase which specifies the supported arguments and their handling. The configured parser is then executed using the `parse` method which returns the parsed result.
rusage
and a TON of gardeningMoved the printing of help message inside the `ArgumentParser`, which has all the necessary info. Added test that checks the `--help` option.
In case of invalid command line arguments, there is no reasonable recovery, the `ArgumentParser` can exit the program itself. It is therefore no longer necessary to propagate the `ArgumentError`s outside of the parser.
The `parseArgs` funtion is now a private method on `ArgumentParser`. Removed `Arguments` struct and moved the `Argument` as a nested struct into the parser. Adjusted error messages and the corresponding checks.
The `--help` option now prints standard usage description with documentaion for all arguments: ```` $ Benchmark_O --help usage: Benchmark_O [--argument=VALUE] [TEST [TEST ...]] positional arguments: TEST name or number of the benchmark to measure optional arguments: --help show this help message and exit --num-samples number of samples to take per benchmark; default: 1 --num-iters number of iterations averaged in the sample; default: auto-scaled to measure for 1 second --iter-scale number of seconds used for num-iters calculation default: 1 --verbose increase output verbosity --delim value delimiter used for log output; default: , --tags run tests matching all the specified categories --skip-tags don't run tests matching any of the specified categories; default: unstable,skip --sleep number of seconds to sleep after benchmarking --list don't run the tests, just log the list of test numbers, names and tags (respects specified filters) ````
Printing of the MAX_RSS is now hidden behind the optional `--memory` flag.
@eeckstein Please review 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically LGTM.
I have some (mostly stylistic) comments about some minor issues.
case let .invalidType(value, type, argument): | ||
return (argument == nil) | ||
? "'\(value)' is not a valid '\(type)'" | ||
: "'\(value)' is not a valid '\(type)' for '\(argument!)'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could omit printing the type here. IMO it's sufficient to print "'(value)' is not a valid value for '(argument)'". Then you can also omit the (not-trivial) code to extract the type name. Printing the type name (e.g. "Int") is not very readable anyway.
Just a suggestion to make things simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was helpful user feedback. For example, when you use a wrong tag, the message will read: error: 'bogus' is not a valid 'BenchmarkCategory'
. So you'll know what to look for.
The type-checking parser for BenchmarkCategory
(func tags
) is called from a nested context within the attribute parsers for tags
and skip-tags
, where it doesn't have access to the argument name. The best it could do, would be to say 'bogus' is not valid value'
.
I'm not sure that's the right trade-off here. I think it's worth a little bit of code complexity for much better QoL.
benchmark/utils/ArgParse.swift
Outdated
if type.starts(with: "Optional<") { | ||
let s = type.index(after: type.index(of:"<")!) | ||
let e = type.index(before: type.endIndex) // ">" | ||
type = String(type[s..<e]) // strip Optional< > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow a little bit hacky. As I said, you could just omit it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's just how String processing in pure Swift looks like. 😕
benchmark/utils/ArgParse.swift
Outdated
try arguments.forEach { try $0.apply() } // parse all arguments | ||
return result | ||
} catch let error as ArgumentError { | ||
fflush(stdout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed because you exit with exit() and not with fatalError()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I read the man exit(3)
and I think I understand now. Does that mean I should add fflush(stdout)
before the fatalError
in the default catch?
benchmark/utils/ArgParse.swift
Outdated
} catch let error as ArgumentError { | ||
fflush(stdout) | ||
fputs("error: \(error)\n", stderr) | ||
fflush(stderr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here. Also, stderr is flushed by default
benchmark/utils/ArgParse.swift
Outdated
private func parseArgs() throws { | ||
|
||
// For each argument we are passed... | ||
var passThroughArgs = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
passThroughArgs is not used in the benchmarks. Can you remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I wasn't exactly sure what that does… ;-)
benchmark/utils/ArgParse.swift
Outdated
struct Argument { | ||
let name: String? | ||
let help: String? | ||
let apply: () throws -> () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only purpose of this closure is to distinguish between --help and all other arguments. Wouldn't a good old style "if arg == help ... " simplify the whole code and make it more readable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of that closure (introduced in e5cbfcc) is to type-erase the generic T
from parsers, so that they can be stored in []
. The parsing (type-checking) happens as side effect of its execution. It being useful for printing help is just a pleasant accident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't exactly obvious, so I'll add a better documentation for all of this.
benchmark/utils/DriverUtils.swift
Outdated
|
||
/// Is verbose output enabled? | ||
var verbose: Bool = false | ||
let iterationScale: Int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this to something like "sampleTime" and make it a Double, so that times < 1s can be specified? (If you prefer you can do it in a follow-up commit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes perfect sense! Frankly, I wasn't fully aware how it worked until I was writing the help description in the penultimate commit.
benchmark/utils/DriverUtils.swift
Outdated
help: "number of iterations averaged in the sample;\n" + | ||
"default: auto-scaled to measure for 1 second", | ||
parser: { UInt($0) }) | ||
p.addArgument("--iter-scale", \.iterationScale, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here: can you rename the option to e.g. --sample-time? That's actually what it is.
benchmark/utils/DriverUtils.swift
Outdated
let fixedNumIters: UInt | ||
let numSamples: Int | ||
let verbose: Bool | ||
let logMemory: Bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you re-add the comments?
benchmark/utils/DriverUtils.swift
Outdated
self.c = config | ||
} | ||
|
||
private static func usage() -> rusage { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A better name for this function would be memoryUsage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought so too initially, but it is used to also measure system load (ICS, VCS). $ man getrusage
says it means "get information about resource utilization", so I guess more proper name would be getResourceUtilization()
. What do you think?
* Restored property doc comments on `TestConfig` * Better name for func `usage` is `getResourceUtilization`
Sample time is a better name for what was previously called `iter-scale`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eeckstein I'd like to maintain the type checking error in its current form. I also think the separation of parsing into syntax and type-check phases in separate methods is justified. I have added documentation around the need for the Argument.apply()
closure and all the logic of parsing arguments.
I believe I've addressed all your other points (renamed usage
and iter-scale
, cleaned fflush
use, restored docs on TestConfig
properties). Pushing 3 more commits now.
Please review 🙏
Funny story: I got stuck on the type change of the
Turns out But when this parser was specified as inline closure, it wasn't working. For the last few hours I was doing So I guess all that test hardening and precise type checking parsers are paying off:
|
* Improved documentation. * Corrected`fflush` usage in `parse` error handling. * Removed unused `passThroughArgs`.
bff0b23
to
362f925
Compare
Thanks! Looks good now. |
@swift-ci smoke benchmark |
@swift-ci smoke test |
Build comment file:Optimized (O)Regression (21)
Improvement (31)
No Changes (408)
Hardware Overview
|
Created SR-8377 to track the intermittently appearing bug mentioned above. |
This PR adds a feature to measure and report system environment indicators during benchmark execution:
Proxy indicators of system load level:
MAX_RSS delta is reported in the last column of the log report, when enabled using the
--memory
flag.The
--verbose
mode additionaly reports full values measured before and after the benchmark execution as well as their delta for MAX_RSS, ICS and VCS.The PR also introduces thorough test coverage of the
Benchmark_O
's public interface (command line arguments and log format). Thelit
tests were converted into markdown format, to ease their self-documentation (literate testing), resulting in executable specification of the API. See Benchmark_O.test.md.The DriverUtils.swift had a lot of accumulated technical debt, which required a ton of gardening to clean up. First a test coverage of a feature is established, followed by functionally neutral refactoring (labeled
[Gardening]
in the commit message). After the clean up, the new features were added on top of the improved code organization.Major refactoring of the command line argument parsing has improved their error handling. The previously solid argument syntax parsing (
parseArgs
) is joined inside theArgsParse.swift
with type checking of arguments (previously done ad-hoc, per-argument inDriverUtils.swift
) to formArgumentParser
class which handles the complete job of parsing command line arguments. It supports declarative configuration of the parsing, modeled after the interface of python'sargparse
module with the flexibleaddArgument
method.Program's API has remained largely unchanged, with the small exception of removed bogus
Totals
stats (c7fc745). Program's human facing interface was majorly improved to provide full documentation of all arguments with the--help
command.I've structured the PR as a series of small, incremental changes with detailed descriptions in the commit messages, it is therefore best reviewed sequentially by individual commits.