## Description

### TL;DR
I propose to unify the three different version numbers for the codebase, leaderboard, and rules/documentation into a single semantic versioning scheme, `Major.Minor.Patch`, where `Major.Minor` is the same for all three objects. The `Patch` version can be incremented for each object independently to allow flexibility. This makes it easier to have a simple statement like "we compare our method to the submissions of AlgoPerf version 0.5" that clearly defines which rules, codebase, and baseline submissions were used. Starting now, we will retrofit a (slightly modified) version of the inaugural competition to be v0.5.0 and develop the next iteration as v0.6.0 (or v0.7.0).
### The problem with our current solution
Currently, we use three different versions across our codebase, leaderboard, and rules/documentation. This can be confusing, especially for external researchers.
- Codebase: The benchmark codebase (i.e., this repository) currently has version `0.1.6` (untagged). This version is accessible via `import algoperf; print(algoperf.__version__)`. Tagged releases can be found here.
- Documentation: The documentation for the benchmark rules uses a different version, currently `0.0.22`, as seen in `docs/DOCUMENTATION.md`.
- Leaderboard: Currently on version `0.6`, as shown in the submissions_algorithms repository.
This split makes it very difficult to answer questions such as:
- Which version of the rules or codebase was used to generate a specific leaderboard?
- Which codebase, rules, and leaderboard should a researcher use to compare their results against?
- How should researchers describe in a paper which AlgoPerf version they used? All three versions?
- How can we succinctly suggest to researchers which version of AlgoPerf they should use, e.g., for new external tuning submissions?
### Proposed scheme: Unified `Major.Minor.Patch`
I propose to unify all three versions (i.e., codebase, leaderboard, documentation/rules) under a single, consistent `Major.Minor.Patch` versioning scheme with the following guidelines:
- `Major.Minor`: This will be the primary benchmark version and will be consistent across the leaderboard, codebase, and rules/documentation.
  - Example: If the current benchmark version is `0.6`, then the codebase, rules, and leaderboard will all be of version `0.6.x`.
  - All results generated under the same `Major.Minor` version should be (mostly) comparable. Someone writing a paper using benchmark version `0.6` should compare their work against submissions from leaderboard version `0.6`, using codebase version `0.6` and the rules of version `0.6`.
  - Following the suggestion from MLCommons, we can consider `0.5` as the version for the inaugural competition (maybe plus a few changes, see the open question below).
- `Patch`: This part of the version can be incremented independently for each component to reflect smaller, non-breaking changes, allowing some flexibility:
  - Leaderboard: New submissions or minor fixes to the leaderboard could increment its `Patch` version (e.g., `0.6.0` -> `0.6.1`), as shown in the leaderboard repo.
  - Codebase: API improvements, bug fixes, or small non-breaking changes in the benchmark code could increment its `Patch` version, as reflected in the `algoperf` package version.
  - Documentation/Rules: Clarifications, typo fixes, or minor updates to the rules/documentation could increment its `Patch` version, as shown in the documentation file.
We could reserve `Major` version bumps (e.g., `0.6` -> `1.0`) for larger, more significant benchmark changes, such as adding a workload.
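
To make the invariant concrete, here is a minimal sketch (the version strings and the helper function are hypothetical and not part of the current codebase) of a check that all three components share the same `Major.Minor` while their `Patch` levels may differ:

```python
# Sketch only: these version strings are hypothetical placeholders, not the
# actual versions of the three components.
component_versions = {
    "codebase": "0.6.2",     # e.g., algoperf.__version__
    "rules": "0.6.0",        # e.g., the version stated in docs/DOCUMENTATION.md
    "leaderboard": "0.6.5",  # e.g., the version of the leaderboard repository
}


def major_minor(version: str) -> tuple[int, int]:
    """Extracts the (Major, Minor) pair from a 'Major.Minor.Patch' string."""
    major, minor, _patch = version.split(".", maxsplit=2)
    return int(major), int(minor)


# The invariant of the proposed scheme: all components agree on Major.Minor,
# even though their Patch numbers may differ.
assert len({major_minor(v) for v in component_versions.values()}) == 1
```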
### Suggested workflow
The suggested workflow depends on whether we are working on a released version or developing a new version:
- Working on a released version (e.g., patch releases like `0.5.0` -> `0.5.1`)
  - Changes:
    - Codebase: Implement bug fixes or minor, non-breaking improvements (e.g., changes to the plotting code). Updating the git tag automatically updates the `algoperf.__version__` of the package.
    - Documentation/Rules: Minor modifications like clarifications or typo fixes. Update the version in `docs/DOCUMENTATION.md` to the new patch version.
    - Leaderboard: For example, adding a new submission, correcting typos, or adding details could result in updating the patch version, as documented in the `submissions_algorithms` repo.
  - Changelog: Document all relevant codebase changes in the `CHANGELOG.md`.
- Developing a new `Major.Minor` version (e.g., working towards `0.6.0`)
  - Development Branch:
    - All changes will be on the `dev` (or `dev-0.6` or similar) branch. Only merge to `main` once we release.
    - For internal milestones, we could use pre-release labels like `-alpha.N`, `-beta.N`, or `-rc.N` (see the example after this list).
    - Iterative changes here do not increment the `Minor` version, since we are working towards `0.6.0`.
    - All changes should be documented in the `CHANGELOG.md` for the upcoming `Minor` version release. This includes changes in the code and the rules.
  - Release new version:
    - Check that `CHANGELOG.md` is up-to-date and complete.
    - Merge `dev` (or `dev-0.6`) into `main`.
    - Tag the release with the new version.
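
As a side note on the pre-release labels mentioned above: one option (an assumption, not existing tooling in our repositories) is to rely on standard version parsing, e.g., via Python's `packaging` library, which normalizes semver-style labels and orders them before the final release:

```python
from packaging.version import Version

# "-alpha.N", "-beta.N", and "-rc.N" are normalized (to aN, bN, rcN) and
# sort before the final release, so internal milestones order correctly.
milestones = ["0.6.0-alpha.1", "0.6.0-beta.1", "0.6.0-rc.1", "0.6.0"]
parsed = [Version(v) for v in milestones]
assert parsed == sorted(parsed)  # alpha < beta < rc < final release
```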
### Open questions
My main open question is how we tag "older" versions. Since MLCommons suggested using `0.5` for the inaugural competition, I could see the following process to retrofit our repositories with the suggested versioning:
1. Version `0.5.0` - The inaugural competition
   - This uses exactly the version used in the competition, i.e., the currently tagged `0.1.5`.
   - This includes batch norm bugs, etc.
   - The corresponding rules include held-out workloads, 5 studies, etc.
   - The leaderboard contains all submissions from the competition, e.g., the winners are Shampoo and Schedule-Free.
2. Version `0.6.0` - The modified (external tuning) version
   - Modification of `0.5.0`.
   - Includes all current bug fixes and API changes (e.g., batch norm, `prepare_for_eval`, etc.).
   - Updated rules: no held-out workloads, 3 studies, etc.
   - Same runtime budgets as `0.5.0`!
   - Suggested version for external tuning submissions.
   - Could use the leaderboard from `0.5.0` with a different scoring procedure (no held-out workloads, use only the first 3 studies). Our scoring code could have a `scoring_version` option that determines the precise scoring procedure (see the sketch at the end of this section).
3. Version `0.7.0` - The future (self-tuning) version
   - Modification of `0.6.0`.
   - Modified runtime budgets.
   - Suggested version for new self-tuning submissions.
   - This leaderboard is currently empty.
We could consider merging either 1. & 2. or 2. & 3. However, I think this would be imprecise.
It is an open question whether we should integrate changes like the upcoming `pmap`-to-`jit` modification into `0.6.0` and `0.7.0`.
Alternatively, we could combine 2. & 3. and hard-code different runtime budgets for the tuning tracks.
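
To illustrate the `scoring_version` idea from option 2, here is a rough sketch; the function, config values, and result structure are hypothetical and only meant to show how a single flag could switch between the `0.5` and `0.6` scoring procedures:

```python
# Hypothetical sketch of a scoring_version switch; the actual scoring code and
# its data structures may look different.
SCORING_CONFIGS = {
    "0.5": {"num_studies": 5, "drop_held_out_workloads": False},  # inaugural competition scoring
    "0.6": {"num_studies": 3, "drop_held_out_workloads": True},   # proposed external tuning scoring
}


def score_submission(results, scoring_version="0.6"):
    """Scores a submission's results under the rules of `scoring_version`."""
    config = SCORING_CONFIGS[scoring_version]
    # Keep only the first `num_studies` studies (e.g., studies 1-3 for 0.6).
    studies = results["studies"][: config["num_studies"]]
    if config["drop_held_out_workloads"]:
        # Remove held-out workloads before computing performance profiles.
        studies = [
            {name: runs for name, runs in study.items() if not runs.get("held_out", False)}
            for study in studies
        ]
    ...  # Compute benchmark scores / performance profiles from `studies`.
```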