Ranking system for ladder rankings where games are won in best out of X rounds.
The input is supplied to stdin in a comma-separated format:
user_a_id,user_b_id,user_a_rounds,user_b_rounds
You can provide one game per line or have rounds between two users already accumulated.
It is important to end with a new line, otherwise the last line is ignored
$ printf "1,2,3,0\n2,1,3,2\n3,1,2,1\n" | go run .
3,38.7940217999804307425334034878722275274658
1,94.4733759805052229464094241552471635973446
2,54.2679612557235192475522802224982500417932
It returns the user ID and its calculated rating as in user,rating
per line.
The ranking system relies on rounds won and doesn’t care about games. You can have even drawn games. \
There are multiple environment variables to tweak the execution:
DEBUG=1
logs debug information to stderr.
Set RELRANK_RELREL
, RELRANK_RELSTEPS
to change relRel
, relSteps
settings.
RELRANK_PREC
to change decimal precision.
RELRANK_SCALE_MAX
to rescale the ratings by setting the maximum rating.
The detailed configuration of each of these settings can be understood by reading on.
An original implementation has been as part of Worms League. It was battle-tested and performed convincingly.
Now there is a self-contained CLI application written in Go. This application should allow for usage outside of Worms League or any particular league so that it can supplied only with the games that were played and then output the ranking.
Go is also faster than PHP and a better suit for a standalone CLI application.
In the public world one is accustomed to absolute ranking system. These are self-contained perfect. Everyone plays everyone the same amount of times. Therefore there’s no need in complexity to assess a points pattern other than addition. The fixtures are inherently perfect.
This is not typically the case in online games where there’s a fixed schedule and not everyone is necessarily playing everyone the same amount of times. So players may never clash, others multiple times. An absolute ranking would make players end up top who win most of their games. In other words, possibly the most active or the one playing the weakest players.
Therefore the series of a user has to be evaluated within a context that’s relative to all the other contenders in.
The most blatant example is a victory against a higher ranked other player should get one more points than against a lower ranked player.
In a relative ranking system the points of a user are determined by the points of the opponents one has played against. The question is how to you initially determine the points of another user when they are not yet ranked.
The most fundamental information that an absolute ranking could be generated from is the total won rounds. One gets one point per won round.
This is now an absolute state. The idea of the relative ranking system is to put this absolute rating of each players into perspective — into context.
Looking at the number of won rounds of each player under which circumstances
were they actually attained?
One is more likely to add won rounds based on these factors:
- level of opponents (
quality
) - number of rounds (
farming
) - total number of rounds (
effort
)
These three factors are relative among themselves again.
For instance the more number of rounds against the same opponent of a weaker
skill level, the more likely one is to gain won rounds and vice versa.
If one is generally playing more rounds, it’s more likely to accumulate won
rounds, too.
These are the three main factors used in the relative ranking system. These can be referenced to put the absolute value of won rounds into a context. Therefore they relativize an absolute number and are thus called “relativizers.”
The quality
relativizer relativizes won rounds in the context of how well the
opponents of that user are rated.
This is perhaps the most obvious relativizer valuing won rounds according to the
rating of the respective opponent.
This is averaged so that number of rounds per opponent are accounted for proportionally.
Here is the opponent’s position to relativization plot for a total of
The highest rated player — being at position 16 in
Exponential makes sense here because ratings for players at the top tend to be more robust. For example the difference in skill of a player is likely to be greater between first and third place than it is for 20th and 23rd.
One fact special to this relativizer is that its input parameters possibly change with each step. Since an input parameter is the opponent’s rating and it may change with each step.
The farming
relativizer relativizes won rounds in the context of how often
that user played against the same opponent.
Each won round weighing less than the previous.
This retaliates the infamous “noob bashing” where a better
player repeatedly defeats a very low rated player to farm points from.
Additionally this promotes users to have diverse pairings.
Certainly a user’s rating is easier to assess the more data is available.
Theoretically you can gain more points from defeating a low rated player for the
first time than gaining a hundredth won round against a high rated player.
The relativizer is made so that per opponent the first round is always worth farming
relativizer.
It is given to the round of the maximum won rounds any player has against another.
This value is
This is the important concept here.
Scaling the value of won rounds from
Starting with the fundamental logarithmic function:
Adding
Now the curve needs to be adjusted so that
The formula ends up like this:
The boundaries are for the first round
and maximum round respectively
with everything in between decreasing towars
This formula is applied to each round per opponent so that
This is then averaged across the total number of won rounds. The final formula to output the factor of relativization:
The quality
relativizer relativizes won rounds in the context of how many
rounds in total were played to attain the portion of won rounds.
The total rounds played of each user
Each calculation of a relative ranking starts with an absolute ranking. They’re put into relation using relativizers.
Since two perspectives — an absolute and a relative — are put against each other, it is necessary to balance them out.
The more relativity is “applied” to the absolute information, the less the original absolute value has any weight in the final outcome. Potentially leading to an entirely unfounded result as the relativization process to relativize itself more and more.
The relative ranking system applies relativization steps. With each new step the previous value is relativized further. Each output is the input of the next step. As aforementioned the initial input is the absolute value of won rounds.
Each step is the same relativization algorithm applied to the input.
To further fine-tune the balancing of relativization relative ranking system uses two configuration parameters.
relSteps
is used in a formula to determine the total number of relativization
steps.
Consider relSteps
and
S
is solved for the number of relativization steps to take.
Assuming relSteps
is
The user with the maximum number of rounds played on the y-axis and what number of relativization steps it would cause on the y-axis.
The graph shows how the number of steps rises logarithmically.
The greater relSteps
the steeper the rise.
Hence a greater relSteps
factor causes the relativization to be more effect
thus weighing in more in the balancing ob the absolute number versus the
relativized result.
There has got to be at least one relativization step and
relRel
is used in a formula to influence the impact of each individual
relativization step.
A relativizer returns a decimal that the number to be relativized is multiplied with. The number to be relativized is always the current rating of that user. In the first relativization step that is the absolute number of won rounds of that user in all subsequent steps it is the rating of that user.
In other words: Prior to any relativization, the rating of a user is simply that user’s won rounds, in all subsequent steps it is the relativization result of the previous step. In the first step the won rounds are relativized and that result is to be relativized in the next step and so on and so forth.
In that example the three won rounds of a user are passed through four relativization steps determining the final rating of the user.
Here the decimals
Here relRel
configuration comes into play.
It is added to the relativization and then the average of these result in the
final relativization.
In practice the average of three relativizers (as mentioned earlier) and
relRel
form the final decimal that the rating of the current step is
relativized by.
Therefore within reach step the rating of the previous round is multiplied by
the average of all the relativizers including relRel
to form the new rating
which is the final rating depending on whether all steps have completed.
The algorithm produces rather insane numbers. For example a user with
and even more decimal places depending on the scale that is set for decimals.
To make the numbers more pleasing for displaying to the end-user in a ladder the numbers can be rescaled using min-max normalization.
Consider
To make numbers increase as more rounds are played one can make
In the relative ranking system everyone’s points may change with just one new round played.
The maximum rating of a use according to which to scale all other users’
points can be set with the RELRANK_SCALE_MAX
environment variable.
The algorithm is tested against another ranking system that is regarded as the best in Worms Armageddon online play.
The current specifics is that the root mean-squared error is
Many of the magic numbers in the formulas provided in this documentation originate from experimentation to drive down the root mean-squared error toward that target system.
This metric is interesting and human-readable. This is the average difference of a user’s rating compared to the target system.
This metric isn’t really human-readable because like mean absolute error but
it’s defining metric to measure how well the algorithm performs.
The edge it has over MAE that due to it weighs big differences heavier than
small ones.
There is testing in place making sure RMSE doesn’t increase.
Tests are run by the validate.py
script.
Here is is run with ranking of users with at least one won round (MINW=1
) for
target seasons 39, 40, 41 and 42.
Note that the number of games played per season is very different. Seasons with many games played generally have greater ratings and lead to greater MAE and RMSE.
$ DEBUG=0 PERM=0 MINW=1 python3 validate.py
minw 1
env DEBUG=0 RELRANK_ROUND=2
season 39
39 MAE 25.55278493305227
39 RMSE 36.456688788966446
season 40
40 MAE 14.376024564718678
40 RMSE 20.091997886647718
season 41
41 MAE 0.7738193555490799
41 RMSE 0.8859498890892478
season 42
42 MAE 25.358800743463515
42 RMSE 36.73211211228584
average
MAE 16.515357399195885
RMSE 23.54168716924731