
Add an API to compare benchmark results #5297

Closed
sheremet-va opened this issue Feb 26, 2024 · 17 comments · Fixed by #5398
Labels: feat: benchmark, p3-significant, pr welcome

Comments

@sheremet-va
Member

sheremet-va commented Feb 26, 2024

Clear and concise description of the problem

There is currently no way to compare benchmark results between different runs. For example, it's possible to change implementation details without changing the benchmark, and there is no way to confirm that the change didn't introduce any regressions except by manually taking screenshots of the terminal output.

Suggested solution

Provide a flag to the vitest bench command (like --compare=./bench.json) to compare the current benchmark run with a previous one. We already support dumping the benchmark results as JSON via --reporter=json, so that output can be reused for diffing.
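
For illustration, a minimal sketch of how this could look at the config level, assuming a hypothetical compare field next to the existing benchmark.outputFile option (the compare field does not exist today; it only mirrors the proposed flag):

```ts
// vite.config.ts — illustration only; the `compare` field is hypothetical
import { defineConfig } from 'vitest/config'

export default defineConfig({
  test: {
    benchmark: {
      // already supported: where the json reporter writes its output
      outputFile: './bench/main.json',
      // proposed (does not exist yet): compare the current run against a saved file,
      // mirroring the suggested `--compare=./bench.json` CLI flag
      // compare: './bench/main.json',
    },
  },
})
```

On the command line this would map to something like vitest bench --compare=./bench/main.json.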

Alternative

No response

Additional context

I think it should print results in a table the same way we do now, just also comparing them with the previously stored results.

I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.
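
For illustration, the kind of machine metadata that could be stored alongside the results, as a minimal sketch using Node's built-in os module (the shape and field names are assumptions, not an existing Vitest format):

```ts
// Illustrative sketch: environment info that could be embedded in the saved results.
import os from 'node:os'

export interface BenchEnvironment {
  platform: NodeJS.Platform
  arch: string
  cpuModel: string
  cpuCount: number
  totalMemoryGb: number
  nodeVersion: string
}

export function collectEnvironment(): BenchEnvironment {
  const cpus = os.cpus()
  return {
    platform: process.platform,
    arch: process.arch,
    cpuModel: cpus[0]?.model ?? 'unknown',
    cpuCount: cpus.length,
    totalMemoryGb: Math.round(os.totalmem() / 1024 ** 3),
    nodeVersion: process.version,
  }
}
```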


sheremet-va added the feat: benchmark, p3-significant and pr welcome labels on Feb 26, 2024
@mrazauskas

What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.

@sheremet-va
Member Author

sheremet-va commented Feb 26, 2024

What about https://codspeed.io or https://bencher.dev? I like how Bencher is calling it continuous benchmarking.

These are all external tools and they are meant to be used in CI. Since we don't control what machine is running the benchmark, this issue is only about a manual check on your own machine.

It's about iterating on the same function, not about catching regressions.

@mrazauskas

It's about iterating on the same function, not about catching regressions.

Sort of comparing performance between two branches? Would that be possible without JSON being written to disk?

@sheremet-va
Member Author

Would that be possible without JSON being written to disk?

The idea I am describing in the issue is to use the result from the json reporter, which already dumps the values to disk.

@hi-ogawa
Contributor

Interesting idea!

Speaking of the --compare flag, I remembered this webpack plugin, which has a similar idea and compares the bundle stats between the current run and a baseline: https://github.com/relative-ci/bundle-stats/tree/master/packages/webpack-plugin#bundlestatswebpackpluginoptions

It looks like their approach is to require either baseline: true or BUNDLE_STATS_BASELINE=true to explicitly dump the current result as a baseline; that baseline is then automatically used as the comparison target for following runs, since compare: true is the default.
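
Schematically, that baseline/compare pattern looks something like the sketch below (illustrative only, not bundle-stats' actual code; the env variable name and file path are made up):

```ts
// Illustrative sketch of the "explicit baseline, compare by default" pattern.
// BENCH_BASELINE and the file path are made-up names for this example.
import fs from 'node:fs'

const BASELINE_PATH = './bench-baseline.json'

export function saveOrCompare(current: Record<string, number>): void {
  if (process.env.BENCH_BASELINE === 'true' || !fs.existsSync(BASELINE_PATH)) {
    // explicitly (or initially) record the current run as the baseline
    fs.writeFileSync(BASELINE_PATH, JSON.stringify(current, null, 2))
    return
  }
  // otherwise compare against the stored baseline by default
  const baseline: Record<string, number> = JSON.parse(fs.readFileSync(BASELINE_PATH, 'utf-8'))
  for (const [name, hz] of Object.entries(current)) {
    if (baseline[name])
      console.log(`${name}: ${(hz / baseline[name]).toFixed(2)}x vs baseline`)
  }
}
```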

@sheremet-va
Member Author

We already support benchmark json output. We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.

@antoine-coulon

Hello @sheremet-va

That's actually a great idea. I'm working on skott, and especially on a pull request meant to compare skott's performance over time when running analysis on specific JS/TS projects.

The way I'm currently implementing result comparison is by only using the n and n-1 versions, relying on git diff to see the changes, but it would indeed be great to have a complete history and store more than one previous version. For that I tried codspeed, but the CI just takes too long; I don't know if it's related to the free tier or something else, but the job takes more than 1h50 to complete, while just using the vitest bench API takes a few minutes, so it's not an option for me at all.

I think it might also be nice to store some information about the current machine (OS, cores, anything that can influence the result) in that benchmark and show it in the table.

To be honest, my primary concern was to run the benchmarks in CI to limit hardware-related variations, even though adding OS-related information could be useful when running benchmarks from a local machine.

I feel like trying to diff a set of benchmarks and adding OS properties in the output might be confusing; you might want to do that at the filename level so that each OS combination gets its own diff?

Currently I'm rewriting the outputFile generated by vitest into something like result-node-18.x.json, and I'm not mentioning the OS because this CI job only runs on ubuntu for now, even though it could become something like node-18-ubuntu.json. Also, I found it useful to add things like the git hash and the branch the benchmark was generated from.
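
As an illustration of that kind of post-processing (a simplified sketch, not skott's actual CI script; the input path and metadata fields are assumptions):

```ts
// Illustrative sketch: enrich and rename the benchmark output produced by vitest.
// './bench-output.json' and the metadata shape are assumptions for this example.
import { execSync } from 'node:child_process'
import fs from 'node:fs'

const results = JSON.parse(fs.readFileSync('./bench-output.json', 'utf-8'))

const enriched = {
  meta: {
    node: process.version,
    platform: process.platform,
    gitHash: execSync('git rev-parse --short HEAD').toString().trim(),
    gitBranch: execSync('git rev-parse --abbrev-ref HEAD').toString().trim(),
    generatedAt: new Date().toISOString(),
  },
  results,
}

const major = process.versions.node.split('.')[0]
fs.writeFileSync(`result-node-${major}.x.json`, JSON.stringify(enriched, null, 2))
```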

So at some point I was kind of wondering whether I should create a vitest plugin and do something like codspeed, but instead of storing the data in a private cloud, just emit the data files to a dedicated location and provide a way to compare them over time. But if you're willing to integrate that into the core, it might not be that relevant. What do you think?

@sheremet-va
Member Author

To be honest, my primary concern was to run the benchmarks in CI to limit hardware-related variations, even though adding OS-related information could be useful when running benchmarks from a local machine.

Adding OS-related information is just to give a clear message that the benchmark might be off because it was running on another machine, or even throw an error.

Currently I'm rewriting the outputFile generated by vitest into something like result-node-18.x.json, and I'm not mentioning the OS because this CI job only runs on ubuntu for now, even though it could become something like node-18-ubuntu.json. Also, I found it useful to add things like the git hash and the branch the benchmark was generated from.

How do you store the generated output in CI?

@antoine-coulon

Adding OS-related information is just to give a clear message that the benchmark might be off because it was running on another machine, or even throw an error.

Yeah, better to have more information than not enough. Also, I'm not sure how stable the GitHub Actions hardware is when not using custom pools; default pools might have agents with more or fewer cores, so variations can indeed happen there as well.

How do you store the generated output in CI?

For now it's only a JSON file written and committed by the CI in my repo (at a specific location near the benchmark files themselves), but it could be more sophisticated.

Consequently, for now I'm not storing the history of all benchmarks; I'm just overwriting it each time a new PR runs with the new results, which allows me to diff the n and n-1 versions. But it could still be great to keep track of all the benchmarks to spot big regressions/improvements over time, since n and n-1 is a very small dataset.

@hi-ogawa
Contributor

hi-ogawa commented Mar 17, 2024

Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter:
https://stackblitz.com/edit/vitest-dev-vitest-mm6syc?file=vite.config.ts
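
In rough terms, such a reporter loads a previously saved result and prints the ratio for each benchmark at the end of the run. The sketch below is only an illustration of the idea, not the StackBlitz code; the baseline file shape (a map of "suite > benchmark" keys to hz values) is an assumption:

```ts
// Simplified sketch of a comparison reporter (illustration only).
import fs from 'node:fs'
import type { File, Reporter, Task } from 'vitest'

function* walkTasks(tasks: Task[]): Generator<Task> {
  for (const task of tasks) {
    yield task
    if ('tasks' in task)
      yield* walkTasks(task.tasks)
  }
}

export default class CompareReporter implements Reporter {
  private baseline: Record<string, number> = {}

  onInit() {
    const file = process.env.VITEST_BENCH_COMPARE
    if (file && fs.existsSync(file))
      this.baseline = JSON.parse(fs.readFileSync(file, 'utf-8'))
  }

  onFinished(files: File[] = []) {
    for (const file of files) {
      for (const task of walkTasks(file.tasks)) {
        const bench = task.result?.benchmark
        if (!bench)
          continue
        const key = `${task.suite?.name} > ${task.name}`
        const prev = this.baseline[key]
        if (prev)
          console.log(`${key}: ${bench.hz.toFixed(3)}hz [baseline: ${prev.toFixed(3)}hz] [change: ${(bench.hz / prev).toFixed(2)}x]`)
      }
    }
  }
}
```

A reporter like this could be registered alongside the default one through the benchmark reporters option.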

demo:

```sh
# I used VITEST_BENCH_xxx env vars for the prototype

# save bench data on the main branch
# --benchmark.outputFile=main.json
$ VITEST_BENCH_OUTPUT_FILE=main.json npx vitest bench --run
...

# suppose you switched branches and want to compare against main
# --benchmark.compare=main.json
$ VITEST_BENCH_COMPARE=main.json npx vitest bench --run
...
 RUN  v1.4.0 /home/projects/vitest-dev-vitest-mm6syc

[... current default reporter output ...]
 ✓ test/basic.bench.ts (2) 2483ms
   ✓ sort (2) 2471ms
     name          hz      min      max     mean      p75      p99     p995     p999      rme  samples
   · normal   58.7590  13.7350  40.1650  17.0187  16.4350  40.1650  40.1650  40.1650  ±13.39%       30   fastest
   · reverse  10.6565  74.6000   115.93  93.8395   114.16   115.93   115.93   115.93  ±14.26%       10

[... custom compare reporter output ...]
 [BENCH] Comparison
   current  : bench-default.json
   baseline : main.json

sort
  normal 58.759hz [baseline: 73.419hz] [change: 0.80x ⇓]
  reverse 10.656hz [baseline: 10.870hz] [change: 0.98x ⇓]
```

A few things I noticed:

  • Isn't this technically a reporter feature? If so, having a flag only for this reporter seems unusual. But being able to quickly pass --compare from the CLI would be important, so I think it's fine to make it a special flag.
  • Currently the json reporter somehow lacks the bench filename (the data is only { suitename: BenchmarkResult[] }). So, to persist a previous run, we can probably forget about the current json reporter and start with a better internal format (a rough sketch of one possible format is below). For this, we can also exclude the gigantic samples array, because I don't think it's needed for the comparison report.

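For example, a comparison-friendly format keyed by file path could look roughly like this (a hypothetical shape for discussion, not an existing Vitest type):

```ts
// Hypothetical comparison-friendly output shape (not an existing Vitest type)
export interface BenchmarkEntry {
  name: string
  hz: number
  mean: number
  rme: number
  // the large `samples` array is intentionally left out to keep the file small
}

export interface BenchmarkOutput {
  // keyed by benchmark file path, then by suite name
  [filepath: string]: Record<string, BenchmarkEntry[]>
}
```
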
FYI, I was also looking around at prior art, and for example criterion.rs (https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines) has three flags for comparison purposes:

  • --save-baseline: like --benchmark.outputFile
  • --baseline: like a proposed --compare
  • --load-baseline: this allows skipping the benchmark run and loading an existing result as the current run.

@hi-ogawa
Contributor

We decided to remove the samples from it and introduce a new API to tinybench that allows comparing results with the provided benchmark run without actually running the code.

@sheremet-va Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but you want to move it to tinybench?

@sheremet-va
Member Author

sheremet-va commented Mar 18, 2024

Assuming that the use case is mostly a local comparison, I made a prototype as a custom reporter

If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.

Can you elaborate on this? I thought this was a Vitest benchmark reporter feature, but you want to move it to tinybench?

I wanted to make sure that tinybench supports providing results based on benchmarks that are not actually running.

  • Isn't this technically a reporter feature? If so, having a flag only for this reporter seems unusual.

If this is a first-class feature, I think it is fine. We already expose some flags that are only relevant for reporters.

  • Currently the json reporter somehow lacks the bench filename

We can change the format of json. Benchmark is an experimental feature and doesn't follow semver.

@hi-ogawa
Contributor

If I remember correctly, the reporter already gets the result sorted - so we can't do this in our own reporter because custom reporters would have to reimplement it.

What do you mean by "sorted"? I don't know how the current default TTY reporter works, but if the comparison summary is only required at the end, like in my prototype, then the reporter has complete information in onFinished(files: File[]), where Test.result.benchmark holds the full tinybench TaskResult output.

@sheremet-va
Member Author

sheremet-va commented Mar 18, 2024

What do you mean by "sorted"?

The result in suite.result.benchmark is sorted based on the median result (fastest first). I expect previous benchmarks to appear in the same table, sorted and marked like ${name} (previous) or something.

This is just my expectation based on what we already do. If other frameworks do it differently and more ergonomically, then we can follow them, but I would expect to see the difference in all table columns, no?

@hi-ogawa
Contributor

hi-ogawa commented Mar 18, 2024

I expect previous benchmarks to appear in the same table, sorted and marked like ${name} (previous) or something.

I might still be missing something, but let me confirm it in my PR. My proof of concept currently shows the comparison separately at the end, but I think it's possible to show it together in the current table, as you expect, without changing anything in tinybench. Let me try that in the PR.

@hi-ogawa
Contributor

It's a rough mockup, but it's possible to create a table like this:

[screenshot: mockup of the comparison table]

@sheremet-va Is the direction okay or did you have something else in your mind?

@sheremet-va
Member Author

sheremet-va commented Mar 26, 2024

Is the direction okay or did you have something else in your mind?

Yes, this is how I imagined it. I am not qualified enough to tell if it's better this way, so I would wait for more feedback.

What we can also do is reverse the table and duplicate the column instead:

https://github.com/google/benchmark/blob/main/docs/tools.md#modes-of-operation

Also, maybe we should print two tables: the first one the same as we print now, and the second one showing the difference in each field:

[screenshot: example of a second table with per-field differences]

Side note: it would be awesome if we could create a graph 👀
https://pytest-benchmark.readthedocs.io/en/latest/comparing.html

@github-actions github-actions bot locked and limited conversation to collaborators May 18, 2024