## Description

### TL;DR
I propose to unify the three different version numbers for the codebase, leaderboard, and rules/documentation into a single semantic versioning scheme, `Major.Minor.Patch`, where `Major.Minor` is the same for all three objects. The `Patch` version can be incremented for each object independently to allow flexibility. This makes it easier to have a simple statement like "we compare our method to the submissions of AlgoPerf version 0.5" that clearly defines which rules, codebase, and baseline submissions were used. Starting now, we will retrofit a (slightly modified) version of the inaugural competition to be v0.5.0 and develop the next iteration as v0.6.0 (or v0.7.0).
### The problem with our current solution
Currently, we use three different versions across our codebase, leaderboard, and rules/documentation. This can be confusing, especially for external researchers.
- Codebase: The benchmark codebase (i.e., this repository) currently has version `0.1.6` (untagged). This version is accessible via `import algoperf; print(algoperf.__version__)`. Tagged releases can be found here.
- Documentation: The documentation for the benchmark rules uses a different version, currently `0.0.22`, as seen in `docs/DOCUMENTATION.md`.
- Leaderboard: Currently on version `0.6`, as shown in the submissions_algorithms repository.
This split makes it very difficult to answer questions such as:
- Which version of the rules or codebase was used to generate a specific leaderboard?
- Which codebase, rules, and leaderboard should a researcher use to compare their results against?
- How should researchers describe in a paper which AlgoPerf version they used? All three versions?
- How can we succinctly suggest to researchers which version of AlgoPerf they should use, e.g., for new external tuning submissions?
### Proposed scheme: Unified `Major.Minor.Patch`
I propose to unify all three versions (i.e., codebase, leaderboard, documentation/rules) under a single, consistent `Major.Minor.Patch` versioning scheme with the following guidelines:
- `Major.Minor`: This will be the primary benchmark version and will be consistent across the leaderboard, codebase, and rules/documentation.
  - Example: If the current benchmark version is `0.6`, then the codebase, rules, and leaderboard will all be of version `0.6.x`.
  - All results generated under the same `Major.Minor` version should be (mostly) comparable. Someone writing a paper using benchmark version `0.6` should compare their work against submissions from leaderboard version `0.6`, using codebase version `0.6` and the rules of version `0.6`.
  - Following the suggestion from MLCommons, we can consider `0.5` as the version for the inaugural competition (maybe plus a few changes, see the open question below).
- `Patch`: This part of the version can be incremented independently for each component to reflect smaller, non-breaking changes, allowing some flexibility:
  - Leaderboard: New submissions or minor fixes to the leaderboard could increment its `Patch` version (e.g., `0.6.0` -> `0.6.1`), as shown in the leaderboard repo.
  - Codebase: API improvements, bug fixes, or small non-breaking changes in the benchmark code could increment its `Patch` version, as reflected in the `algoperf` package version.
  - Documentation/Rules: Clarifications, typo fixes, or minor updates to the rules/documentation could increment its `Patch` version, as shown in the documentation file.
We could reserve `Major` version bumps (e.g., `0.6` -> `1.0`) for larger, more significant benchmark changes, such as adding a workload.
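
To make the invariant concrete, here is a minimal sketch (the version strings and the helper function are hypothetical and not part of the current codebase) of a check that all three components share the same `Major.Minor` while their `Patch` levels may differ:

```python
# Sketch only: these version strings are hypothetical placeholders, not the
# actual versions of the three components.
component_versions = {
    "codebase": "0.6.2",     # e.g., algoperf.__version__
    "rules": "0.6.0",        # e.g., the version stated in docs/DOCUMENTATION.md
    "leaderboard": "0.6.5",  # e.g., the version of the leaderboard repository
}


def major_minor(version: str) -> tuple[int, int]:
    """Extracts the (Major, Minor) pair from a 'Major.Minor.Patch' string."""
    major, minor, _patch = version.split(".", maxsplit=2)
    return int(major), int(minor)


# The invariant of the proposed scheme: all components agree on Major.Minor,
# even though their Patch numbers may differ.
assert len({major_minor(v) for v in component_versions.values()}) == 1
```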
### Suggested workflow
The suggested workflow depends on whether we are working on a released version or developing a new version:
- Working on a released version (e.g., patch releases like `0.5.0` -> `0.5.1`)
  - Changes:
    - Codebase: Implement bug fixes or minor, non-breaking improvements (e.g., changes to the plotting code). Updating the git tag automatically updates the `algoperf.__version__` of the package.
    - Documentation/Rules: Minor modifications like clarifications or typo fixes. Update the version in `docs/DOCUMENTATION.md` to the new patch version.
    - Leaderboard: For example, adding a new submission, correcting typos, or adding details could result in updating the patch version, as documented in the `submissions_algorithms` repo.
  - Changelog: Document all relevant codebase changes in the `CHANGELOG.md`.
- Developing a new `Major.Minor` version (e.g., working towards `0.6.0`)
  - Development Branch:
    - All changes will be on the `dev` (or `dev-0.6` or similar) branch. Only merge to `main` once we release.
    - For internal milestones, we could use pre-release labels like `-alpha.N`, `-beta.N`, or `-rc.N` (see the example after this list).
    - Iterative changes here do not increment the `Minor` version, since we are working towards `0.6.0`.
    - All changes should be documented in the `CHANGELOG.md` for the upcoming `Minor` version release. This includes changes in the code and the rules.
  - Release new version:
    - Check that `CHANGELOG.md` is up-to-date and complete.
    - Merge `dev` (or `dev-0.6`) into `main`.
    - Tag the release with the new version.
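
As a side note on the pre-release labels mentioned above: one option (an assumption, not existing tooling in our repositories) is to rely on standard version parsing, e.g., via Python's `packaging` library, which normalizes semver-style labels and orders them before the final release:

```python
from packaging.version import Version

# "-alpha.N", "-beta.N", and "-rc.N" are normalized (to aN, bN, rcN) and
# sort before the final release, so internal milestones order correctly.
milestones = ["0.6.0-alpha.1", "0.6.0-beta.1", "0.6.0-rc.1", "0.6.0"]
parsed = [Version(v) for v in milestones]
assert parsed == sorted(parsed)  # alpha < beta < rc < final release
```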
### Open questions
My main open question is how we tag "older" versions. Since MLCommons suggested using `0.5` for the inaugural competition, I could see the following process to retrofit our repositories with the suggested versioning:
1. Version `0.5.0` - The inaugural competition
   - This uses exactly the version used in the competition, i.e., the currently tagged `0.1.5`.
   - This includes batch norm bugs, etc.
   - The corresponding rules include held-out workloads, 5 studies, etc.
   - The leaderboard contains all submissions from the competition, e.g., the winners are Shampoo and Schedule-Free.
2. Version `0.6.0` - The modified (external tuning) version
   - Modification of `0.5.0`.
   - Includes all current bug fixes and API changes (e.g., batch norm, `prepare_for_eval`, etc.).
   - Updated rules: no held-out workloads, 3 studies, etc.
   - Same runtime budgets as `0.5.0`!
   - Suggested version for external tuning submissions.
   - Could use the leaderboard from `0.5.0` with a different scoring procedure (no held-out workloads, use only the first 3 studies). Our scoring code could have a `scoring_version` option that determines the precise scoring procedure (see the sketch at the end of this section).
3. Version `0.7.0` - The future (self-tuning) version
   - Modification of `0.6.0`.
   - Modified runtime budgets.
   - Suggested version for new self-tuning submissions.
   - This leaderboard is currently empty.
We could consider merging either 1. & 2. or 2. & 3. However, I think this would be imprecise.
It is an open question whether we should integrate changes like the upcoming `pmap`-to-`jit` modification into `0.6.0` and `0.7.0`.
Alternatively, we could combine 2. & 3. and hard-code different runtime budgets for the tuning tracks.
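
To illustrate the `scoring_version` idea from option 2, here is a rough sketch; the function, config values, and result structure are hypothetical and only meant to show how a single flag could switch between the `0.5` and `0.6` scoring procedures:

```python
# Hypothetical sketch of a scoring_version switch; the actual scoring code and
# its data structures may look different.
SCORING_CONFIGS = {
    "0.5": {"num_studies": 5, "drop_held_out_workloads": False},  # inaugural competition scoring
    "0.6": {"num_studies": 3, "drop_held_out_workloads": True},   # proposed external tuning scoring
}


def score_submission(results, scoring_version="0.6"):
    """Scores a submission's results under the rules of `scoring_version`."""
    config = SCORING_CONFIGS[scoring_version]
    # Keep only the first `num_studies` studies (e.g., studies 1-3 for 0.6).
    studies = results["studies"][: config["num_studies"]]
    if config["drop_held_out_workloads"]:
        # Remove held-out workloads before computing performance profiles.
        studies = [
            {name: runs for name, runs in study.items() if not runs.get("held_out", False)}
            for study in studies
        ]
    ...  # Compute benchmark scores / performance profiles from `studies`.
```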