
update WDL model and normalize eval dynamically #4920

Closed

Conversation

robertnurnberg
Contributor

Ensure that an evaluation of 100 centipawns always corresponds to a 50% win probability at fishtest LTC, irrespective of the move number.

This PR is the culmination of recent work by @Disservin, @vondele and myself on https://github.com/official-stockfish/WDL_model, as well as recent changes to https://github.com/official-stockfish/books and https://github.com/official-stockfish/fishtest.

The new model was fitted based on about 500M positions extracted from 7.9M fishtest LTC games from the last three weeks, involving SF versions from b59786e to current master.

A summary of the changes to the WDL model itself:

  • an incorrect 8-move shift in master's WDL model has been fixed
  • the polynomials p_a and p_b are fitted over the move range [8, 120]
  • the coefficients for p_a and p_b are optimized by maximizing the probability of predicting the observed outcome (credits to @vondele)

A summary of the changes to the SF code:

  • the internal evaluation is no longer normalized by p_a(32) (aka NormalizeToPawnValue), but by p_a(max(8, min(120, move))) (credits to @vondele, see https://github.com/vondele/Stockfish/tree/wdlScore)
  • the above means we can now retire NormalizeToPawnValue
  • in win_rate_model() we no longer clamp the internal eval to [-4000,4000]
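For illustration, the move-dependent normalization described above could look roughly like the following minimal sketch. It uses the fitted coefficients quoted later in this thread; the identifiers (`p_a`, `to_cp`) are illustrative, not the actual Stockfish source:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Fitted coefficients for p_a, as quoted later in this thread.
constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};

// p_a evaluated at the full move counter, clamped to [8, 120].
double p_a(int move) {
    double m = std::clamp(move, 8, 120) / 32.0;
    return ((as[0] * m + as[1]) * m + as[2]) * m + as[3];
}

// Internal eval -> centipawns, so that 100cp corresponds to a 50% win
// probability at any move number, not just at move 32 as in master.
int to_cp(int internalEval, int move) {
    return int(std::round(100.0 * internalEval / p_a(move)));
}
```

Note how `p_a(32)` recovers (approximately) the old constant `NormalizeToPawnValue`, since `32/32 = 1` reduces the polynomial to the sum of its coefficients.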

This PR is in draft status for now, to allow for some discussion of the proposed changes.

No functional change.

@robertnurnberg
Contributor Author

Here is the corresponding visualization of the new WDL model. The new model and the graphics were produced with the help of the updateWDL.sh script from https://github.com/official-stockfish/WDL_model

The output of the script is as follows:

Look recursively in directory pgns for games from SPRT tests using books matching "UHO_4060_v..epd|UHO_Lichess_4852_v1.epd" for SF revisions between b59786e750a59d3d7cff2630cf284553f607ed29 (from 2023-11-20 19:00:47 +0100) and HEAD (from 2023-12-10 23:23:28 +0100).
Based on 492831440 positions, NormalizeToPawnValue should stay at 328.

The output of scoreWDL.py is:

Converting evals with NormalizeToPawnValue = 328.
Reading eval stats from updateWDL.json.
Retained (W,D,L) = (106769609, 276733947, 109327884) positions.
Fit WDL model based on move.
Initial objective function:  0.3283141930912294
Final objective function:    0.32830783432056276
Optimization terminated successfully.
const int NormalizeToPawnValue = 328;
Corresponding spread = 61;
Corresponding normalized spread = 0.18721849951279587;
Draw rate at 0.0 eval at move 32 = 0.9904668804343536;
Parameters in internal value units:
p_a = ((-1.832 * x / 32 + 12.999) * x / 32 + -14.953) * x / 32 + 332.187
p_b = ((-5.256 * x / 32 + 38.191) * x / 32 + -84.760) * x / 32 + 113.308
   constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};
   constexpr double bs[] = {-5.25625834, 38.19089529, -84.75989479, 113.30788898};
Preparing plots.
Saved graphics to updateWDL.png.

[updateWDL.png: visualization of the new WDL model]
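For context, the fitted p_a and p_b printed above feed a logistic model mapping an internal eval to a win probability. A hedged sketch of that mapping, using the coefficients above (illustrative names, not the exact Stockfish implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Coefficients as printed by scoreWDL.py above.
constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};
constexpr double bs[] = {-5.25625834, 38.19089529, -84.75989479, 113.30788898};

// Win probability in per mille for internal eval v at the given full move
// counter; a sketch of the logistic WDL model, with move clamped to [8, 120].
int win_rate_model(int v, int move) {
    double m = std::clamp(move, 8, 120) / 32.0;
    double a = ((as[0] * m + as[1]) * m + as[2]) * m + as[3];
    double b = ((bs[0] * m + bs[1]) * m + bs[2]) * m + bs[3];
    return int(0.5 + 1000 / (1 + std::exp((a - v) / b)));
}
```

At move 32, an internal eval near a = p_a(32) ≈ 328 yields roughly a 50% win probability, which is exactly the "100cp = 50%" anchoring this PR is about.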

@robertnurnberg
Contributor Author

Some points to be discussed @vondele :

  • are we happy to discard NormalizeToPawnValue? (for statistical purposes, external tools can always compute the (rounded) sum of the coefficients in as)
  • not clamping the internal eval to [-4000,4000] did not give any overflow errors in my local tests, so is it OK to drop it?
  • is uci.hpp the right place for the "100cp = 50% win probability" comment, or should it be moved to uci.cpp?
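Regarding the first point, a sketch of how an external tool could recover the value from the coefficients in as (using the values quoted above; names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Coefficients of p_a as quoted above.
constexpr double as[] = {-1.83236796, 12.99881028, -14.95254605, 332.18650913};

// A NormalizeToPawnValue-like constant is the rounded sum of the
// coefficients, i.e. p_a evaluated at move 32 (where move/32 = 1).
int normalize_to_pawn_value() {
    return int(std::round(as[0] + as[1] + as[2] + as[3]));
}
```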

@dubslow
Contributor

dubslow commented Dec 14, 2023

are we happy to discard NormalizeToPawnValue? (for statistical purposes, external tools can always compute the (rounded) sum of the coefficients in as)

perhaps the code can be done away with, but leave a comment next to as with the value so that us silly humans can eyeball the average scale factor

@robertnurnberg
Contributor Author

perhaps the code can be done away with, but leave a comment next to as with the value so that us silly humans can eyeball the average scale factor

Hm, a comment would be quite fragile and in the long run could get out of sync with the actual code. With git blame the "offending" PR should be easy to locate, and good practice would be to mention the value in the PR message or comments. As in #4920 (comment) above. So IMO it's fine as it is, but happy to hear from maintainers.

@cmwetherell

cmwetherell commented Dec 18, 2023

Cool PR! Funny enough, I was poking through the code today wondering how and why "move 32" was picked, instead of just normalizing in some fashion so all moves would be 50% @ 100 CP.

Has there been research on why the Internal Value units drift higher for the same win-rate later in the game? I think this result implies the model error changes by move number. Why?

Perhaps SF would be better if this were not the case - right now, it may be overestimating win chances in deeper search trees relative to shallow search trees.

So this comment isn't necessarily related to merging the PR, but the diagnostic charts above do give some insights that are worth researching. Let me know if you have thoughts or would prefer I move this to a "Discussion".

edit: In case it's not clear what was meant by "Internal Value units drift higher", I mean that in the diagnostic heat maps, we can see that at move 120 it takes more "IVUs" to achieve a certain win probability than at move 30.

@Craftyawesome

Should there be any effort to try to handle fens with wrong move numbers? A lot will report move 1 despite being middlegame/endgame. Maybe fall back to fixed ply if fen is move 1 and not a DFRC starting position? Or maybe fitting by piece count has improved since it was last tried?

@cmwetherell

Should there be any effort to try to handle fens with wrong move numbers? A lot will report move 1 despite being middlegame/endgame. Maybe fall back to fixed ply if fen is move 1 and not a DFRC starting position? Or maybe fitting by piece count has improved since it was last tried?

This is tangentially related to my questions above. If the expected score of a position was consistent, regardless of move number, why would we need any of this? Does SFNNv8 use move number as an input? If the evaluation function isn’t aware of the move number, should it be?

@robertnurnberg
Contributor Author

robertnurnberg commented Dec 18, 2023

Has there been research on why the Internal Value units drift higher for the same win-rate later in the game? I think this result implies the model error changes by move number. Why?

A possible explanation is that at high move counters, say near move 120, the proportion of fortress-type positions is higher, and that these may have many mis-evals at fishtest LTC, leading to a larger value of a.

@robertnurnberg
Contributor Author

Should there be any effort to try to handle fens with wrong move numbers? A lot will report move 1 despite being middlegame/endgame. Maybe fall back to fixed ply if fen is move 1 and not a DFRC starting position? Or maybe fitting by piece count has improved since it was last tried?

I think handling FENs with wrong move numbers belongs to the user space, and is not the responsibility of SF. I agree that a material based model would avoid this issue completely. Unfortunately, the graphs for a and b based on material count at first glance are less smooth.

Maybe a better place to discuss the merits of the material based fitting is on the WDL repo, e.g. here: official-stockfish/WDL_model#152

@Craftyawesome

A possible explanation is that at high move counters, say near move 120, the proportion of fortress-type positions is higher, and that these may have many mis-evals at fishtest LTC, leading to a larger value of a.

Yes, I suspect this is the majority of the reason. This may be testable? What if we filter out games with repeated evals for 3+ moves?

On the other hand, part of the dataset for training the net is sf data, so if there is bias there it would still likely affect new nets. Also AFAIK even for leela data the formula used to convert is not sharpness independent, so there could be some bias there too.

I think handling FENs with wrong move numbers belongs to the user space, and is not the responsibility of SF.

IDK, a lot of fens include the 0 1 part regardless of if that's accurate. I don't think you can reasonably expect a GUI to disregard an explicit move number. And for things like manually placing pieces the move number is often shown and editable and therefore explicit. I could technically see things like fens without movenumbers and board scanners guessing the movenumber, but I can't imagine most if any would bother. IMO leaving it to user space is just accepting a regression in interpretability for many fens. If you think this is worth it for the increase of interpretability for correct fens/pgns then fine, but I wouldn't expect user space to help this.

@robertnurnberg
Contributor Author

A possible explanation is that at high move counters, say near move 120, the proportion of fortress-type positions is higher, and that these may have many mis-evals at fishtest LTC, leading to a larger value of a.

Yes, I suspect this is the majority of the reason. This may be testable? What if we filter out games with repeated evals for 3+ moves?

Where do you want to apply the filter? When we fit the WDL model? That would be wrong, because the model should match the playing SF, and these fortress mis-evals are part of it. For the example at hand, it would be wrong not to assign 100cp to an internal eval of 350 at move 120, because the data says that SF only wins 50% of those games at fishtest LTC against a similarly strong opponent.

I think handling FENs with wrong move numbers belongs to the user space, and is not the responsibility of SF.

IDK, a lot of fens include the 0 1 part regardless of if that's accurate. I don't think you can reasonably expect a GUI to disregard an explicit move number. And for things like manually placing pieces the move number is often shown and editable and therefore explicit. I could technically see things like fens without movenumbers and board scanners guessing the movenumber, but I can't imagine most if any would bother. IMO leaving it to user space is just accepting a regression in interpretability for many fens. If you think this is worth it for the increase of interpretability for correct fens/pgns then fine, but I wouldn't expect user space to help this.

I strongly believe that SF internally changing the fullmove number given in the FEN, or assigning anything else than 0 1 to incomplete FENs, is a complete no-no.

@Craftyawesome

Where do you want to apply the filter? When we fit the WDL model? That would be wrong, because the model should match the playing SF, and these fortress mis-evals are part of it. For the example at hand, it would be wrong not to assign 100cp to an internal eval of 350 at move 120, because the data says that SF only wins 50% of those games at fishtest LTC against a similarly strong opponent.

I wasn't proposing doing this for the actual model, only for getting information. I was just curious whether a increasing with ply is primarily/fully a result of fortress positions. As in, remove all games with flatlines and see if the resulting WDL model has a much flatter a.

I strongly believe that SF internally changing the fullmove number given in the FEN, or assigning anything else than 0 1 to incomplete FENs, is a complete no-no.

That's an alternative, but I was actually thinking if something is done in SF it could just fall back to NormalizeToPawnValue when the fen is move 1 and not DFRC starting pos. If done in user space outside of SF then changing the move number is the only option.

@robertnurnberg
Contributor Author

That's an alternative, but I was actually thinking if something is done in SF it could just fall back to NormalizeToPawnValue when the fen is move 1 and not DFRC starting pos. If done in user space outside of SF then changing the move number is the only option.

Ok, that is an interesting proposal. I will wait and see what SF maintainers think about that.

@robertnurnberg
Contributor Author

robertnurnberg commented Jan 10, 2024

In discussion with @vondele on discord it was decided to stick with the static eval renormalization in master for now. I will be opening a corresponding PR in time for the planned 16.1 release, which will have the following changes:

Model:

  • an incorrect 8-move shift in master's WDL model has been fixed
  • the polynomials p_a and p_b are fitted over the move range [8, 120]
  • the coefficients for p_a and p_b are optimized by maximizing the probability of predicting the observed outcome

SF code:

  • for wdl values, move will be clamped to max(8, min(120, move))
  • no longer clamp the internal eval to [-4000,4000]
  • compute NormalizeToPawnValue with round, not trunc

Closing this PR for now. It can be re-opened after the 16.1 release, or be replaced with a dynamic eval renormalization based on material count (rather than full move counter).

Disservin pushed a commit that referenced this pull request Mar 20, 2024
This PR proposes to change the parameter dependence of Stockfish's
internal WDL model from full move counter to material count. In addition
it ensures that an evaluation of 100 centipawns always corresponds to a
50% win probability at fishtest LTC, whereas for master this holds only
at move number 32. See also
#4920 and the
discussion therein.

The new model was fitted based on about 340M positions extracted from
5.6M fishtest LTC games from the last three weeks, involving SF versions
from e67cc97 (SF 16.1) to current
master.

The involved commands for
[WDL_model](https://github.com/official-stockfish/WDL_model) are:
```
./updateWDL.sh --firstrev e67cc97
python scoreWDL.py updateWDL.json --plot save --pgnName update_material.png --momType "material" --momTarget 58 --materialMin 10 --modelFitting optimizeProbability
```

The anchor `58` for the material count value was chosen to be as close
as possible to the observed average material count of fishtest LTC games
at move 32 (`43`), while not changing the value of
`NormalizeToPawnValue` compared to the move-based WDL model by more than
1.

The patch only affects the displayed cp and wdl values.

closes #5121

No functional change
linrock pushed a commit to linrock/Stockfish that referenced this pull request Mar 27, 2024
(same commit message as above)