
update WDL model and normalize eval dynamically #4920

Closed

Conversation

robertnurnberg
Contributor

Ensure that an evaluation of 100 centipawns always corresponds to a 50% win probability at fishtest LTC, irrespective of the move number.

This PR is the culmination of recent work by @Disservin, @vondele and myself on https://github.com/official-stockfish/WDL_model, as well as recent changes to https://github.com/official-stockfish/books and https://github.com/official-stockfish/fishtest.

The new model was fitted based on about 500M positions extracted from 7.9M fishtest LTC games from the last three weeks, involving SF versions from b59786e to current master.

A summary of the changes to the WDL model itself:

  • an incorrect 8-move shift in master's WDL model has been fixed
  • the polynomials p_a and p_b are fitted over the move range [8, 120]
  • the coefficients for p_a and p_b are optimized by maximizing the probability of predicting the observed outcome (credits to @vondele)

A summary of the changes to the SF code:

  • the internal evaluation is no longer normalized by p_a(32) (aka NormalizeToPawnValue), but by p_a(max(8, min(120, move))) (credits to @vondele, see https://github.com/vondele/Stockfish/tree/wdlScore)
  • the above means we can now retire NormalizeToPawnValue
  • in win_rate_model() we no longer clamp the internal eval to [-4000,4000]
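For illustration, the move-dependent normalization described above could look roughly like the following minimal sketch. It uses the fitted coefficients quoted later in this thread; the identifiers (`p_a`, `to_cp`) are illustrative, not the actual Stockfish source:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Fitted coefficients for p_a, as quoted later in this thread.
constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};

// p_a evaluated at the full move counter, clamped to [8, 120].
double p_a(int move) {
    double m = std::clamp(move, 8, 120) / 32.0;
    return ((as[0] * m + as[1]) * m + as[2]) * m + as[3];
}

// Internal eval -> centipawns, so that 100cp corresponds to a 50% win
// probability at any move number, not just at move 32 as in master.
int to_cp(int internalEval, int move) {
    return int(std::round(100.0 * internalEval / p_a(move)));
}
```

Note how `p_a(32)` recovers (approximately) the old constant `NormalizeToPawnValue`, since `32/32 = 1` reduces the polynomial to the sum of its coefficients.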

This PR is in draft status for now, to allow for some discussion of the proposed changes.

No functional change.

@robertnurnberg
Contributor Author

Here is the corresponding visualization of the new WDL model. The new model and the graphics were produced with the help of the updateWDL.sh script from https://github.com/official-stockfish/WDL_model

The output of the script is as follows:

Look recursively in directory pgns for games from SPRT tests using books matching "UHO_4060_v..epd|UHO_Lichess_4852_v1.epd" for SF revisions between b59786e750a59d3d7cff2630cf284553f607ed29 (from 2023-11-20 19:00:47 +0100) and HEAD (from 2023-12-10 23:23:28 +0100).
Based on 492831440 positions, NormalizeToPawnValue should stay at 328.

The output of scoreWDL.py is:

Converting evals with NormalizeToPawnValue = 328.
Reading eval stats from updateWDL.json.
Retained (W,D,L) = (106769609, 276733947, 109327884) positions.
Fit WDL model based on move.
Initial objective function:  0.3283141930912294
Final objective function:    0.32830783432056276
Optimization terminated successfully.
const int NormalizeToPawnValue = 328;
Corresponding spread = 61;
Corresponding normalized spread = 0.18721849951279587;
Draw rate at 0.0 eval at move 32 = 0.9904668804343536;
Parameters in internal value units:
p_a = ((-1.832 * x / 32 + 12.999) * x / 32 + -14.953) * x / 32 + 332.187
p_b = ((-5.256 * x / 32 + 38.191) * x / 32 + -84.760) * x / 32 + 113.308
   constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};
   constexpr double bs[] = {-5.25625834, 38.19089529, -84.75989479, 113.30788898};
Preparing plots.
Saved graphics to updateWDL.png.

[updateWDL.png: visualization of the new WDL model]
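For context, the fitted p_a and p_b printed above feed a logistic model mapping an internal eval to a win probability. A hedged sketch of that mapping, using the coefficients above (illustrative names, not the exact Stockfish implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Coefficients as printed by scoreWDL.py above.
constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};
constexpr double bs[] = {-5.25625834, 38.19089529, -84.75989479, 113.30788898};

// Win probability in per mille for internal eval v at the given full move
// counter; a sketch of the logistic WDL model, with move clamped to [8, 120].
int win_rate_model(int v, int move) {
    double m = std::clamp(move, 8, 120) / 32.0;
    double a = ((as[0] * m + as[1]) * m + as[2]) * m + as[3];
    double b = ((bs[0] * m + bs[1]) * m + bs[2]) * m + bs[3];
    return int(0.5 + 1000 / (1 + std::exp((a - v) / b)));
}
```

At move 32, an internal eval near a = p_a(32) ≈ 328 yields roughly a 50% win probability, which is exactly the "100cp = 50%" anchoring this PR is about.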

@robertnurnberg
Contributor Author

Some points to be discussed @vondele :

  • are we happy to discard NormalizeToPawnValue? (for statistical purposes, external tools can always compute the (rounded) sum of the coefficients in as)
  • not clamping the internal eval to [-4000,4000] did not give any overflow errors in my local tests, so is it OK to drop it?
  • is uci.hpp the right place for the "100cp = 50% win probability" comment, or should it be moved to uci.cpp?
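Regarding the first point, a sketch of how an external tool could recover the value from the coefficients in as (using the values quoted above; names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Coefficients of p_a as quoted above.
constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};

// A NormalizeToPawnValue-like constant is the rounded sum of the
// coefficients, i.e. p_a evaluated at move 32 (where move/32 = 1).
int normalize_to_pawn_value() {
    return int(std::round(as[0] + as[1] + as[2] + as[3]));
}
```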

@dubslow
Contributor

dubslow commented Dec 14, 2023

are we happy to discard NormalizeToPawnValue? (for statistical purposes, external tools can always compute the (rounded) sum of the coefficients in as)

perhaps the code can be done away with, but leave a comment next to as with the value so that us silly humans can eyeball the average scale factor

@robertnurnberg
Contributor Author

perhaps the code can be done away with, but leave a comment next to as with the value so that us silly humans can eyeball the average scale factor

Hm, a comment would be quite fragile and in the long run could get out of sync with the actual code. With git blame the "offending" PR should be easy to locate, and good practice would be to mention the value in the PR message or comments. As in #4920 (comment) above. So IMO it's fine as it is, but happy to hear from maintainers.

@cmwetherell

cmwetherell commented Dec 18, 2023

Cool PR! Funny enough, I was poking through the code today wondering how and why "move 32" was picked, instead of just normalizing in some fashion so all moves would be 50% @ 100 CP.

Has there been research on why the Internal Value units drift higher for the same win-rate later in the game? I think this result implies the model error changes by move number. Why?

Perhaps SF would be better if this were not the case - right now, it may be overestimating win chances in deeper search trees relative to shallow search trees.

So this comment isn't necessarily related to merging the PR, but the diagnostic charts above do give some insights that are worth researching. Let me know if you have thoughts or would prefer I move this to a "Discussion".

edit: In case it's not clear what was meant by "Internal Value units drift higher", I mean that in the diagnostic heat maps, we can see that at move 120 it takes more "IVUs" to achieve a certain win probability than at move 30.

@Craftyawesome

Should there be any effort to try to handle fens with wrong move numbers? A lot will report move 1 despite being middlegame/endgame. Maybe fall back to fixed ply if fen is move 1 and not a DFRC starting position? Or maybe fitting by piece count has improved since it was last tried?

@cmwetherell

Should there be any effort to try to handle fens with wrong move numbers? A lot will report move 1 despite being middlegame/endgame. Maybe fall back to fixed ply if fen is move 1 and not a DFRC starting position? Or maybe fitting by piece count has improved since it was last tried?

This is tangentially related to my questions above. If the expected score of a position was consistent, regardless of move number, why would we need any of this? Does SFNNv8 use move number as an input? If the evaluation function isn’t aware of the move number, should it be?

@robertnurnberg
Contributor Author

robertnurnberg commented Dec 18, 2023

Has there been research on why the Internal Value units drift higher for the same win-rate later in the game? I think this result implies the model error changes by move number. Why?

A possible explanation is that at high move counters, say near move 120, the proportion of fortress-type positions is higher, and that these may have many mis-evals at fishtest LTC, leading to a larger value of a.

@robertnurnberg
Contributor Author

Should there be any effort to try to handle fens with wrong move numbers? A lot will report move 1 despite being middlegame/endgame. Maybe fall back to fixed ply if fen is move 1 and not a DFRC starting position? Or maybe fitting by piece count has improved since it was last tried?

I think handling FENs with wrong move numbers belongs to the user space, and is not the responsibility of SF. I agree that a material based model would avoid this issue completely. Unfortunately, the graphs for a and b based on material count at first glance are less smooth.

Maybe a better place to discuss the merits of the material based fitting is on the WDL repo, e.g. here: official-stockfish/WDL_model#152

@Craftyawesome

A possible explanation is that at high move counters, say near move 120, the proportion of fortress-type positions is higher, and that these may have many mis-evals at fishtest LTC, leading to a larger value of a.

Yes, I suspect this is the majority of the reason. This may be testable? What if we filter out games with repeated evals for 3+ moves?

On the other hand, part of the dataset for training the net is sf data, so if there is bias there it would still likely affect new nets. Also AFAIK even for leela data the formula used to convert is not sharpness independent, so there could be some bias there too.

I think handling FENs with wrong move numbers belongs to the user space, and is not the responsibility of SF.

IDK, a lot of fens include the 0 1 part regardless of if that's accurate. I don't think you can reasonably expect a GUI to disregard an explicit move number. And for things like manually placing pieces the move number is often shown and editable and therefore explicit. I could technically see things like fens without movenumbers and board scanners guessing the movenumber, but I can't imagine most if any would bother. IMO leaving it to user space is just accepting a regression in interpretability for many fens. If you think this is worth it for the increase of interpretability for correct fens/pgns then fine, but I wouldn't expect user space to help this.

@robertnurnberg
Contributor Author

A possible explanation is that at high move counters, say near move 120, the proportion of fortress-type positions is higher, and that these may have many mis-evals at fishtest LTC, leading to a larger value of a.

Yes, I suspect this is the majority of the reason. This may be testable? What if we filter out games with repeated evals for 3+ moves?

Where do you want to apply the filter? When we fit the WDL model? That would be wrong, because the model should match the playing SF, and these fortress mis-evals are part of it. For the example at hand, it would be wrong not to assign 100cp to an internal eval of 350 at move 120, because the data says that SF only wins 50% of those games at fishtest LTC against a similarly strong opponent.

I think handling FENs with wrong move numbers belongs to the user space, and is not the responsibility of SF.

IDK, a lot of fens include the 0 1 part regardless of if that's accurate. I don't think you can reasonably expect a GUI to disregard an explicit move number. And for things like manually placing pieces the move number is often shown and editable and therefore explicit. I could technically see things like fens without movenumbers and board scanners guessing the movenumber, but I can't imagine most if any would bother. IMO leaving it to user space is just accepting a regression in interpretability for many fens. If you think this is worth it for the increase of interpretability for correct fens/pgns then fine, but I wouldn't expect user space to help this.

I strongly believe that SF internally changing the fullmove number given in the FEN, or assigning anything else than 0 1 to incomplete FENs, is a complete no-no.

@Craftyawesome

Where do you want to apply the filter? When we fit the WDL model? That would be wrong, because the model should match the playing SF, and these fortress mis-evals are part of it. For the example at hand, it would be wrong not to assign 100cp to an internal eval of 350 at move 120, because the data says that SF only wins 50% of those games at fishtest LTC against a similarly strong opponent.

I wasn't proposing doing this for the actual model, only for getting information. I was just curious whether a increasing with ply is primarily/fully a result of fortress positions. As in, remove all games with flatlines and see if the resulting WDL model has a much flatter a.

I strongly believe that SF internally changing the fullmove number given in the FEN, or assigning anything else than 0 1 to incomplete FENs, is a complete no-no.

That's an alternative, but I was actually thinking if something is done in SF it could just fall back to NormalizeToPawnValue when the fen is move 1 and not DFRC starting pos. If done in user space outside of SF then changing the move number is the only option.

@robertnurnberg
Contributor Author

That's an alternative, but I was actually thinking if something is done in SF it could just fall back to NormalizeToPawnValue when the fen is move 1 and not DFRC starting pos. If done in user space outside of SF then changing the move number is the only option.

Ok, that is an interesting proposal. I will wait and see what SF maintainers think about that.

@robertnurnberg
Contributor Author

robertnurnberg commented Jan 10, 2024

In discussion with @vondele on discord it was decided to stick with the static eval renormalization in master for now. I will be opening a corresponding PR in time for the planned 16.1 release, which will have the following changes:

Model:

  • an incorrect 8-move shift in master's WDL model has been fixed
  • the polynomials p_a and p_b are fitted over the move range [8, 120]
  • the coefficients for p_a and p_b are optimized by maximizing the probability of predicting the observed outcome

SF code:

  • for wdl values, move will be clamped to max(8, min(120, move))
  • no longer clamp the internal eval to [-4000,4000]
  • compute NormalizeToPawnValue with round, not trunc

Closing this PR for now. It can be re-opened after the 16.1 release, or be replaced with a dynamic eval renormalization based on material count (rather than full move counter).

Disservin pushed a commit that referenced this pull request Mar 20, 2024
This PR proposes to change the parameter dependence of Stockfish's
internal WDL model from full move counter to material count. In addition
it ensures that an evaluation of 100 centipawns always corresponds to a
50% win probability at fishtest LTC, whereas for master this holds only
at move number 32. See also
#4920 and the
discussion therein.

The new model was fitted based on about 340M positions extracted from
5.6M fishtest LTC games from the last three weeks, involving SF versions
from e67cc97 (SF 16.1) to current
master.

The involved commands for
[WDL_model](https://github.com/official-stockfish/WDL_model) are:
```
./updateWDL.sh --firstrev e67cc97
python scoreWDL.py updateWDL.json --plot save --pgnName update_material.png --momType "material" --momTarget 58 --materialMin 10 --modelFitting optimizeProbability
```

The anchor `58` for the material count value was chosen to be as close
as possible to the observed average material count of fishtest LTC games
at move 32 (`43`), while not changing the value of
`NormalizeToPawnValue` compared to the move-based WDL model by more than
1.

The patch only affects the displayed cp and wdl values.

closes #5121

No functional change
linrock pushed a commit to linrock/Stockfish that referenced this pull request Mar 27, 2024
(same commit message as above)