Description
This issue discusses what we need and what we plan to do for performance regression CI and stress test CI. They share infrastructure, so I discuss them together.
Requirements
- Performance tests include benchmark runs, start-up time, min heap values, allocation/collection speed, the number of instructions in the allocation sequence, etc. Basically, any test that returns a metric may be included in the future.
- Performance tests per commit, per day, per week, etc. We need a plot to show performance over time.
- Performance tests per pull request (before merging). We need a plot, or at least a table, to show the performance diff.
- Stress tests should be running all the time. We aim to run stress tests for each benchmark on the most recent commit. If a newer commit appears after one run, we switch to the new commit. If we finish running all the benchmarks for the most recent commit, we go back and run the missing benchmarks for previous commits (backtracking for N days); see the scheduling sketch after this list.
- We set up the tests for mmtk-core (including testing mmtk-core with OpenJDK), but the infrastructure and code/scripts need to be reusable by the bindings (e.g. testing the Julia/Ruby bindings' performance over time, etc.).
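A minimal sketch of the stress test scheduling policy described above, assuming we can enumerate the commits in the backtracking window and the (commit, benchmark) pairs that have already been run. The function and parameter names are hypothetical, not an existing API:

```python
# Hypothetical sketch: pick the next (commit, benchmark) stress test to run.
# `commits` is newest-first within the N-day backtracking window;
# `completed` is the set of (commit, benchmark) pairs already run.
from typing import List, Optional, Set, Tuple

def next_stress_job(
    commits: List[str],
    benchmarks: List[str],
    completed: Set[Tuple[str, str]],
) -> Optional[Tuple[str, str]]:
    # Calling this before every run means a newly pushed commit is picked up
    # as soon as it appears at the head of `commits`.
    for commit in commits:
        for bm in benchmarks:
            if (commit, bm) not in completed:
                return (commit, bm)
    return None  # every benchmark in the window has been covered
```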
Non-goals
- This issue only talks about our setup and framework for performance CI and stress test CI. Other correctness CI is a non-goal for this issue.
- We do not plan to run the full cross product of bindings x plans x benchmarks all the time, which is infeasible for us. We may prioritize the tests and run the important ones more frequently.
- The tests only serve the purpose of maintaining MMTk's performance and robustness. Checking the overall performance and robustness of a language is a non-goal.
Design
Job Triggering
- GitHub Actions: it is easiest to just use GitHub Actions to trigger jobs, e.g. for new commits, for pull requests, for pull request labels, etc. It works well in most cases. A key shortcoming of GitHub Actions is that it cannot queue a job for more than 24 hours. If we have a set of long-running jobs, we cannot use GitHub Actions to trigger them; otherwise the queued jobs will be cancelled after 24 hours.
- GitHub bot: it is flexible, as we can code our own logic for how to trigger CI jobs.
  - We once set up a GitHub bot on Heroku with a free-tier account, but the Heroku free tier was discontinued. We need a host to run the bot.
  - For our stress test CI, it is convenient to let the GitHub bot trigger the work: it can figure out which tests have been run and which tests should be run next (see the dispatch sketch after this list).
- We probably need both.
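For the bot-triggered path, one option is for the bot to dispatch an existing workflow through the GitHub REST API (`workflow_dispatch`). A rough sketch, where the repository name, workflow file, branch, and inputs are placeholders rather than our actual setup:

```python
# Hypothetical sketch: the bot asks GitHub Actions to run a stress-test
# workflow for a chosen commit/benchmark via the workflow_dispatch API.
import os
import requests

def dispatch_stress_test(commit: str, benchmark: str) -> None:
    owner, repo = "mmtk", "mmtk-core"        # placeholder repository
    workflow = "stress-test.yml"             # placeholder workflow file
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow}/dispatches")
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"ref": "master", "inputs": {"commit": commit, "benchmark": benchmark}},
    )
    resp.raise_for_status()                  # GitHub returns 204 on success
```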
Job execution
- GitHub Actions runners: we have set up our own runners. We should keep using them.
- GitHub Actions workflows: we organise our CI jobs as GitHub Actions workflows. The workflows may be triggered by GitHub events, a cron timer, or the bot.
- running-ng (https://github.com/anupli/running-ng): we will use it for running performance benchmarks. I would suggest we rely on this tool for running all the stress tests and performance tests, and add features to it when we need them.
- We should avoid building VMs in the performance CI. Instead, we should download the build from somewhere else (see the sketch after this list).
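For example, the performance CI could fetch a prebuilt OpenJDK build instead of compiling it. This is a sketch only; the download URL and archive layout are assumptions about wherever the binding's build jobs publish their artifacts:

```python
# Hypothetical sketch: download and unpack a prebuilt VM for a given commit
# instead of building it inside the performance CI job.
import tarfile
import urllib.request
from pathlib import Path

def fetch_prebuilt_vm(commit: str, dest: Path) -> Path:
    # Placeholder URL; in practice this points at wherever the binding's
    # build workflow publishes its release archives.
    url = f"https://builds.example.org/openjdk-mmtk-{commit}.tar.gz"
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"openjdk-mmtk-{commit}.tar.gz"
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)                 # assumed to contain the VM image
    return dest / f"openjdk-mmtk-{commit}"
```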
Results storage
- Logs: Logs could be expensive to store over a long time.
  - Git repo: we currently push logs to a git repo. We can keep doing that, as it costs nothing. The shortcomings are that it is slow to read, and we cannot upload results concurrently. As we currently only have two CI machines, rebase-and-push as the retry mechanism works fine. It has a size limit of 25 MB per file, which is enough for us.
  - GitHub artifacts: GitHub Actions has a 90-day retention period for artifacts. We can upload logs as artifacts, and they will be retained for 90 days (enough for us to debug any recent issue), but we will lose them after that. We cannot view logs for an old issue, and we cannot retrospectively build data or graphs.
  - External storage: we can use any paid file storage service, or any machine we have, to store the logs. It may not be worth it.
- Structured data: we should parse the logs once the tests finish, and store the useful information in a structured way. Visualization should only use structured data, not raw logs (see the parsing sketch after this list).
  - File: CSV, JSON, or whatever format. We can store it in a git repo or with external storage.
  - Database: it requires a host, but it can be used directly as a data source for visualization tools.
  - Externally managed: managed by a visualization framework such as Codespeed.
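As an illustration of the "parse once, store structured" step. The DaCapo-style "PASSED in ... msec" line is just an example metric, and the schema and file layout are assumptions:

```python
# Hypothetical sketch: turn raw benchmark logs into a CSV of structured
# results right after a run, so visualization never touches raw logs.
import csv
import re
from pathlib import Path

PASSED = re.compile(r"===== DaCapo .* (?P<bm>\S+) PASSED in (?P<ms>\d+) msec =====")

def logs_to_csv(log_dir: Path, out_csv: Path, commit: str) -> None:
    with out_csv.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["commit", "benchmark", "time_ms"])
        for log in sorted(log_dir.glob("*.log")):
            for line in log.read_text(errors="replace").splitlines():
                m = PASSED.search(line)
                if m:
                    writer.writerow([commit, m.group("bm"), m.group("ms")])
```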
Visualization
- Performance
  - Performance timeline frameworks: we do not need to maintain the code, but we may not be able to customize much of the visualization (you get what you get -- unless we would like to contribute to those projects, which is unlikely). They all require a web host to run.
  - Other data visualization/monitoring frameworks.
  - Static page generation with our own script: we need to maintain the script, but we can generate static pages and host them with GitHub Pages. This is what we currently do (link); see the sketch at the end of this section.
  - Repurpose plotty: we should be able to use plotty to generate graphs that compare two commits easily. Once we have a solution for the timeline, we can use this approach to get a graph comparing performance for pull requests.
- Stress test results: as stress tests take a long time, we may not be able to run all the tests for every commit. We opportunistically run as many tests as we can. We need a way to report and track which tests have been run for each commit, and their results.
  - Static page generation with our own script.
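A rough sketch of what the static page generation could look like, reading the structured CSV from the previous section and emitting per-benchmark timeline plots plus an index page for GitHub Pages. Column names and file layout are assumptions, not the current script:

```python
# Hypothetical sketch: render a per-benchmark timeline and a static index.html
# from the structured CSV, ready to be pushed to GitHub Pages.
import csv
from collections import defaultdict
from pathlib import Path

import matplotlib
matplotlib.use("Agg")                        # headless rendering on a CI runner
import matplotlib.pyplot as plt

def build_timeline_pages(csv_path: Path, out_dir: Path) -> None:
    series = defaultdict(list)               # benchmark -> [(commit, time_ms), ...]
    with csv_path.open() as f:
        for row in csv.DictReader(f):
            series[row["benchmark"]].append((row["commit"][:8], float(row["time_ms"])))

    out_dir.mkdir(parents=True, exist_ok=True)
    for bm, points in series.items():
        commits, times = zip(*points)
        plt.figure(figsize=(8, 3))
        plt.plot(range(len(times)), times, marker="o")
        plt.xticks(range(len(commits)), commits, rotation=90, fontsize=6)
        plt.ylabel("time (ms)")
        plt.title(bm)
        plt.tight_layout()
        plt.savefig(out_dir / f"{bm}.png")
        plt.close()

    body = "\n".join(f'<h2>{bm}</h2><img src="{bm}.png">' for bm in sorted(series))
    (out_dir / "index.html").write_text(f"<html><body>{body}</body></html>")
```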