
Proposal: Unify Versioning for Codebase, Leaderboard, and Rules #870

@fsschneider

Description

TL;DR
I propose to unify the three different version numbers for codebase, leaderboard, and rules/documentation into a single semantic versioning scheme, Major.Minor.Patch, where Major.Minor is the same for all three objects. The Patch version can be incremented for each object independently to allow flexibility. This makes it easier to have a simple statement like "we compare our method to the submissions of AlgoPerf version 0.5" that clearly defines which rules, codebase, and baseline submissions were used. Starting now, we will retrofit a (slightly modified) version of the inaugural competition to be v0.5.0 and develop the next iteration as v0.6.0 (or v0.7.0).

The problem with our current solution

Currently, we use three different versions across our codebase, leaderboard, and rules/documentation. This can be confusing, especially for external researchers.

  1. Codebase: The benchmark codebase (i.e., this repository) currently has version 0.1.6 (untagged). This version is accessible via import algoperf; print(algoperf.__version__). Tagged releases can be found here.
  2. Documentation: The documentation for the benchmark rules uses a different version, currently 0.0.22, as seen in docs/DOCUMENTATION.md.
  3. Leaderboard: Currently on version 0.6, as shown in the submissions_algorithms repository.

This split makes it very difficult to determine:

  • Which version of the rules or codebase was used to generate a specific leaderboard?
  • Which codebase, rules, and leaderboard should a researcher use to compare their results against?
  • How can they describe in a paper which AlgoPerf version they used? Do they need to list all three versions?
  • How can we succinctly suggest to researchers which version of AlgoPerf they should use, e.g., for new external tuning submissions?

Proposed scheme: Unified Major.Minor.Patch

I propose to unify all three versions (i.e., codebase, leaderboard, documentation/rules) under a single, consistent Major.Minor.Patch versioning scheme with the following guidelines:

  • Major.Minor: This will be the primary benchmark version and will be consistent across the leaderboard, codebase, and rules/documentation.
    • Example: If the current benchmark version is 0.6, then the codebase, rules, and leaderboard will all be of version 0.6.x.
    • All results generated under the same Major.Minor version should be (mostly) comparable. Someone writing a paper using benchmark version 0.6 should compare their work against submissions from leaderboard version 0.6 using the codebase version 0.6 and the rules of version 0.6.
    • Following the suggestion from MLCommons, we can consider 0.5 as the version for the inaugural competition (maybe plus a few changes, see the open question below).
  • Patch: This part of the version can be incremented independently for each component to reflect smaller, non-breaking changes (see the code sketch at the end of this section):
    • Leaderboard: New submissions or minor fixes to the leaderboard could increment its Patch version (e.g., 0.6.0 -> 0.6.1) as shown in the leaderboard repo.
    • Codebase: API improvements, bug fixes, or small non-breaking changes in the benchmark code could increment its Patch version as reflected in the algoperf package version.
    • Documentation/Rules: Clarifications, typo fixes, or minor updates to the rules/documentation could increment its Patch version as shown in the documentation file.

We could reserve Major version bumps (e.g., 0.6 -> 1.0) for larger, more significant benchmark changes, such as adding a workload.
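To illustrate the intended coupling, here is a minimal sketch of a consistency check that could, for example, run in CI: all components must agree on the Major.Minor benchmark version, while their Patch versions may differ. The component names and version strings below are placeholders, not the actual current versions or any existing tooling.

```python
# Hypothetical consistency check: all components must share the same
# Major.Minor version; the Patch version may differ per component.

def major_minor(version: str) -> tuple[int, int]:
    """Extract the (Major, Minor) part of a 'Major.Minor.Patch' string."""
    major, minor, _ = version.split(".", maxsplit=2)
    return int(major), int(minor)

# Example versions (placeholders only).
component_versions = {
    "codebase": "0.6.2",       # e.g., algoperf.__version__
    "documentation": "0.6.0",  # e.g., from docs/DOCUMENTATION.md
    "leaderboard": "0.6.1",    # e.g., from the submissions_algorithms repo
}

benchmark_versions = {major_minor(v) for v in component_versions.values()}
assert len(benchmark_versions) == 1, (
    f"Components disagree on the benchmark version: {component_versions}"
)
print("Benchmark version: {}.{}".format(*benchmark_versions.pop()))
```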

Suggested workflow

The suggested workflow depends on whether we are working on a released version or developing a new version:

  1. Working on a released version (e.g., patch releases like 0.5.0 -> 0.5.1)
  • Changes:
    • Codebase: Implement bug fixes or minor, non-breaking improvements (e.g., changes to the plotting code). Updating the git tag automatically updates the algoperf.__version__ of the package (see the sketch after this list).
    • Documentation/Rules: Minor modifications like clarifications or typo fixes. Update the version in docs/DOCUMENTATION.md with the new patch version.
    • Leaderboard: For example, adding a new submission, correcting typos, or adding details could result in updating the patch version as documented in the submissions_algorithms repo.
  • Changelog: Document all relevant codebase changes in the CHANGELOG.md.
  2. Developing a new Major.Minor version (e.g., working towards 0.6.0)
  • Development Branch:
    • All changes will be on the dev (or dev-0.6 or similar) branch. Only merge to main once we release.
    • For internal milestones, we could use pre-release labels like -alpha.N, -beta.N or -rc.N.
    • Iterative changes here do not increment the Minor version, since we are working towards 0.6.0.
    • All changes should be documented in the CHANGELOG.md for the upcoming Minor version release. This includes changes in the code and the rules.
  • Release new version:
    • Check that CHANGELOG.md is up-to-date and complete.
    • Merge dev or dev-0.6 into main.
    • Tag release with new version.
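For the codebase, one possible way to keep the git tag and algoperf.__version__ in sync (an assumption about the packaging setup, not necessarily how the repository is configured today) is to read the version from the installed package metadata, provided the build backend derives the package version from the git tag (e.g., via setuptools-scm):

```python
# Hypothetical algoperf/__init__.py snippet: read the version of the installed
# package so that tagging a release (and rebuilding/installing the package)
# automatically updates `import algoperf; print(algoperf.__version__)`.
from importlib.metadata import PackageNotFoundError, version

try:
    __version__ = version("algoperf")
except PackageNotFoundError:
    # Running from a source checkout that has not been installed as a package.
    __version__ = "0.0.0+unknown"
```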

Open questions

My main open question is how we tag "older" versions. Since MLCommons suggested using 0.5 for the inaugural competition, I could see the following process to retrofit our repositories with the suggested versioning:

  1. Version 0.5.0 - The inaugural competition
    • This uses exactly the code version used in the competition, i.e., the release currently tagged 0.1.5.
    • This includes batch norm bugs, etc.
    • The corresponding rules include held-out workloads, 5 studies, etc.
    • The leaderboard contains all submissions from the competition, with Shampoo and Schedule-Free as the winners.
  2. Version 0.6.0 - The modified (external tuning) version
    • Modification of 0.5.0.
    • Includes all current bug fixes and API changes (e.g., the batch norm fixes, prepare_for_eval, etc.).
    • Updated rules: No held-out workloads, 3 studies, etc.
    • Same runtime budgets as 0.5.0!
    • Suggested version for external tuning submissions.
    • Could use the leaderboard from 0.5.0 with a different scoring procedure (no held-out workloads, use only the first 3 studies). Our scoring code could have a scoring_version option that determines the precise scoring procedure (see the sketch after this list).
  3. Version 0.7.0 - The future (self-tuning) version
    • Modification of 0.6.0
    • Modified runtime budgets
    • Suggested version for new self-tuning submissions.
    • This leaderboard is currently empty.
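To make the scoring_version idea under 0.6.0 concrete, here is a hypothetical sketch of how the scoring code could select studies and workloads per benchmark version. The function name, config keys, and the held-out naming convention are placeholders, not the actual scoring API.

```python
# Hypothetical sketch of a `scoring_version` switch in the scoring code.
# Names, signatures, and conventions are placeholders only.
SCORING_CONFIGS = {
    "0.5": {"num_studies": 5, "held_out_workloads": True},
    "0.6": {"num_studies": 3, "held_out_workloads": False},
}

def select_results(per_study_results: list[dict], scoring_version: str = "0.6") -> list[dict]:
    """Select which studies and workloads enter scoring for a given benchmark version."""
    config = SCORING_CONFIGS[scoring_version]
    # Use only the first `num_studies` studies (e.g., the first 3 for 0.6).
    studies = per_study_results[: config["num_studies"]]
    if not config["held_out_workloads"]:
        # Drop held-out workloads before computing performance profiles.
        # (The "_heldout" suffix is a made-up naming convention for this sketch.)
        studies = [
            {workload: result for workload, result in study.items()
             if not workload.endswith("_heldout")}
            for study in studies
        ]
    return studies
```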

We could consider merging either 1. & 2. or 2. & 3.; however, I think this would be imprecise.
It is also an open question whether we should integrate upcoming changes like the pmap-to-jit modification into 0.6.0 and 0.7.0.
Alternatively, we could combine 2. & 3. and hard-code different runtime budgets for the two tuning tracks.
