Both OpenBench and CuteChess use Bayesian Elo estimates (rather than the classical formula)
in order to "properly" adjust for draws; CuteChess takes this further by applying
an additional scaling, based on draw rate, in its SPRT calculation. The draw elo used for this is
$$
y = 200 \log_{10}\left(\frac{1 - w'}{w'} \cdot \frac{1 - l'}{l'}\right)
$$
where $w'$ and $l'$ are the win and loss probabilities from a large sample of games.
Where does the draw elo come from?
Note: This is more of an educated guess.
The classical elo formula gives the expected score, $s$, of a player with a relative elo advantage
of $e$ over an opponent, as
$$
s = f(e) = \frac{1}{1 + 10^{-e/400}}
$$
Rearranging this to work out instead the expected elo advantage, given score, gives
$$
e = -400 \log_{10}(\frac{1}{s} - 1)
$$
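The two formulas above are inverses of each other, which is easy to check in a quick Python sketch (function names are my own):

```python
import math

def expected_score(elo_diff: float) -> float:
    """Classical Elo: expected score of a player `elo_diff` elo above their opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score: float) -> float:
    """Inverse: the elo advantage implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)
```

A 100-elo advantage gives an expected score of about 0.64, and feeding that score back through `elo_from_score` recovers 100.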
and now it's easy to see that the draw elo is given by
$$
y = -\frac{1}{2} (e_w + e_l)
$$
where $e_w = -400 \log_{10}(\frac{1}{w'} - 1)$ is the elo advantage from the perspective of the
first player when all draws are counted as losses (so their score is $w'$), and similarly
$e_l = -400 \log_{10}(\frac{1}{l'} - 1)$ is the elo advantage from the perspective of the
opponent, where all draws are counted as losses (score $l'$).
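As a sanity check, here is the derivation in Python; the symmetric form $-\frac{1}{2}(e_w + e_l)$ agrees with the closed form $200 \log_{10}(\frac{1-w'}{w'} \cdot \frac{1-l'}{l'})$ (function names are my own):

```python
import math

def elo_from_score(score: float) -> float:
    # e = -400 * log10(1/s - 1)
    return -400.0 * math.log10(1.0 / score - 1.0)

def draw_elo(w: float, l: float) -> float:
    """Draw elo from win and loss probabilities w and l,
    with draws counted as losses from each side's perspective."""
    e_w = elo_from_score(w)  # first player's elo, draws as losses
    e_l = elo_from_score(l)  # opponent's elo, draws as losses
    return -0.5 * (e_w + e_l)
```

For example, with $w' = l' = 0.25$ (so half the games are drawn), both forms give $200 \log_{10} 9 \approx 190.8$.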
How SPRT actually works
So you have your old engine, and have made a change, resulting in a new engine.
Now, you make a null and an alternative hypothesis about the elo change $\Delta e$,
as with any other statistical test:
H0: $\Delta e = e_0$
H1: $\Delta e = e_1$
You of course have to decide on acceptable error rates, otherwise your test would never end;
the standard choices are $\alpha = 0.05$ (the probability of a false positive, i.e. accepting H1
when H0 is true) and $\beta = 0.05$ (the probability of a false negative), which corresponds to
a test that concludes with 95% confidence.
These values define two bounds, $a = \log(\frac{\beta}{1 - \alpha})$ and
$b = \log(\frac{1 - \beta}{\alpha})$ (natural logarithm). With the two aforementioned values
for $\alpha$ and $\beta$ this gives $a \approx -2.94$ and $b \approx 2.94$.
Now we start playing games.
After each game, we calculate the Log Likelihood Ratio (LLR) of the two hypotheses, and if it
exceeds $b$ we accept H1, if it is lower than $a$ we accept H0, and otherwise we continue
playing.
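The test loop above can be sketched as follows (a minimal sketch: here `llr` is just a number, computed per game as described in the next section):

```python
import math

def sprt_bounds(alpha: float = 0.05, beta: float = 0.05) -> tuple[float, float]:
    """Lower and upper stopping bounds for the SPRT."""
    a = math.log(beta / (1.0 - alpha))
    b = math.log((1.0 - beta) / alpha)
    return a, b

def sprt_decision(llr: float, a: float, b: float) -> str:
    """Accept H1 above b, accept H0 below a, otherwise keep playing."""
    if llr >= b:
        return "accept H1"
    if llr <= a:
        return "accept H0"
    return "continue"
```

With the default $\alpha = \beta = 0.05$, `sprt_bounds()` returns the $\pm 2.94$ bounds quoted above.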
Calculating LLR
Definition
The formula for LLR is as follows
$$
LLR = LL(e_1) - LL(e_0)
$$
where $LL(x)$ is the Log-Likelihood of elo $x$.
Given a random variable $\mathbf{X} = (X_1, X_2, ...)$ with some distribution with parameter
$\theta$, the Likelihood Function of $\mathbf{X}$, based on a sample with outcome
$\mathbf{X} = \mathbf{x}$, is
$$
\mathcal{L}(\theta \mid \mathbf{x}) = P(\mathbf{X} = \mathbf{x} \mid \theta)
$$
and the Log-Likelihood is simply its logarithm, $LL = \log \mathcal{L}$.
Now let's calculate the Log-Likelihood of the outcome after $N$ games: say we have
$W$ wins, $L$ losses and $D = N - W - L$ draws, with an elo parameter of $e$.
Firstly we calculate the draw elo, as above, using our games
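This first step can be sketched in Python, assuming the empirical rates $w' = W/N$ and $l' = L/N$ (draws counted as losses from each side's perspective) stand in for the win and loss probabilities:

```python
import math

def draw_elo_from_games(wins: int, losses: int, draws: int) -> float:
    """Estimate the draw elo from observed game counts.

    Counting draws as losses, the first player's score is W/N and the
    opponent's is L/N; the draw elo then follows from the earlier formula
    y = 200 * log10((1-w')/w' * (1-l')/l').
    """
    n = wins + losses + draws
    w = wins / n    # first player's score, draws counted as losses
    l = losses / n  # opponent's score, draws counted as losses
    return 200.0 * math.log10((1.0 - w) / w * (1.0 - l) / l)
```

With no draws and equal wins and losses the estimate is 0, and it grows as the draw fraction grows.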