update WDL model and normalize eval dynamically #4920
Conversation
Here is the corresponding visualization of the new WDL model. The new model and the graphics were produced with the help of the script; its output is as follows:
Some points to be discussed @vondele:

- perhaps the code can be done away with, but leave a comment next to
Hm, a comment would be quite fragile and in the long run could get out of sync with the actual code. With git blame the "offending" PR should be easy to locate, and good practice would be to mention the value in the PR message or comments, as in #4920 (comment) above. So IMO it's fine as it is, but happy to hear from maintainers.
Cool PR! Funny enough, I was poking through the code today wondering how and why "move 32" was picked, instead of just normalizing in some fashion so all moves would be 50% @ 100 CP. Has there been research on why the Internal Value units drift higher for the same win-rate later in the game? I think this result implies the model error changes by move number. Why? Perhaps SF would be better if this were not the case; right now, it may be overestimating win chances in deeper search trees relative to shallow search trees. So this comment isn't necessarily related to merging the PR, but the diagnostic charts above do give some insights that are worth researching. Let me know if you have thoughts or would prefer I move this to a "Discussion". edit: In case it's not clear what was meant by "Internal Value units drift higher", I mean that in the diagnostic heat maps, we can see that at move 120 it takes more "IVUs" to achieve a certain win probability than at move 30.
Should there be any effort to try to handle FENs with wrong move numbers? A lot will report move 1 despite being middlegame/endgame. Maybe fall back to a fixed ply if the FEN is move 1 and not a DFRC starting position? Or maybe fitting by piece count has improved since it was last tried?
This is tangentially related to my questions above. If the expected score of a position was consistent, regardless of move number, why would we need any of this? Does SFNNv8 use move number as an input? If the evaluation function isn't aware of the move number, should it be?
A possible explanation is that at high move counters, say near move 120, the proportion of fortress-type positions is higher, and that these may have many mis-evals at fishtest LTC, leading to a larger value of
I think handling FENs with wrong move numbers belongs to user space, and is not the responsibility of SF. I agree that a material based model would avoid this issue completely. Unfortunately, the graphs for

Maybe a better place to discuss the merits of the material based fitting is on the WDL repo, e.g. here: official-stockfish/WDL_model#152
Yes, I suspect this is the majority of the reason. This may be testable? What if we filter out games with repeated evals for 3+ moves? On the other hand, part of the dataset for training the net is SF data, so if there is bias there it would likely still affect new nets. Also, AFAIK even for Leela data the formula used to convert is not sharpness independent, so there could be some bias there too.
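The filtering idea mentioned above could be sketched roughly as follows. This is only an illustration: the 3-move window, and using exact equality of consecutive evals as a proxy for fortress-like games, are assumptions made here, not an agreed definition from this thread.

```python
def has_fortress_signature(evals, window=3):
    """Heuristic sketch: flag a game whose eval sequence repeats the same
    value for `window` or more consecutive moves (assumed fortress proxy)."""
    run = 1
    for prev, cur in zip(evals, evals[1:]):
        run = run + 1 if cur == prev else 1
        if run >= window:
            return True
    return False


# Hypothetical usage: drop flagged games before inspecting the data.
games = [[120, 120, 120, 120], [35, 60, 80, 150]]
filtered = [g for g in games if not has_fortress_signature(g)]
```

Whether such a filter is appropriate for the actual model fit is exactly what is debated below.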
IDK, a lot of FENs include the 0 1 part regardless of whether that's accurate. I don't think you can reasonably expect a GUI to disregard an explicit move number. And for things like manually placing pieces, the move number is often shown and editable, and therefore explicit. I could technically see things like FENs without move numbers and board scanners guessing the move number, but I can't imagine most, if any, would bother. IMO leaving it to user space is just accepting a regression in interpretability for many FENs. If you think this is worth it for the increase in interpretability for correct FENs/PGNs then fine, but I wouldn't expect user space to help with this.
Where do you want to apply the filter? When we fit the WDL model? That would be wrong, because the model should match the playing SF, and these fortress mis-evals are part of it. For the example at hand, it would be wrong not to assign 100cp to an internal eval of 350 at move 120, because the data says that SF only wins 50% of those games at fishtest LTC against a similarly strong opponent.
I strongly believe that SF internally changing the fullmove number given in the FEN, or assigning anything other than
I wasn't proposing doing this for the actual model, only for getting information. I was just curious if
That's an alternative, but I was actually thinking that if something is done in SF, it could just fall back to
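The fallback idea being discussed could look roughly like this. A sketch only: the `fallback=32` default (the anchor move of the master model) and the exact start-position check are illustrative assumptions, not a concrete proposal from this thread.

```python
# First four FEN fields of the standard starting position (assumption: DFRC
# start positions would need a separate check, omitted here for brevity).
STARTPOS_FIELDS = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -"


def effective_move(fen, fallback=32):
    """Sketch of the fallback: trust the FEN's fullmove counter unless it
    claims move 1 from a position that is not the starting position."""
    fields = fen.split()
    fullmove = int(fields[5]) if len(fields) >= 6 else 1
    if fullmove == 1 and " ".join(fields[:4]) != STARTPOS_FIELDS:
        return fallback
    return fullmove
```

The design question raised above remains: whether second-guessing an explicit FEN field belongs in SF at all.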
Ok, that is an interesting proposal. I will wait and see what SF maintainers think about that.
In discussion with @vondele on Discord it was decided to stick with the static eval renormalization in master for now. I will be opening a corresponding PR in time for the planned 16.1 release, which will have the following changes:

Model:

SF code:
Closing this PR for now. It can be re-opened after the 16.1 release, or be replaced with a dynamic eval renormalization based on material count (rather than full move counter).
This PR proposes to change the parameter dependence of Stockfish's internal WDL model from full move counter to material count. In addition it ensures that an evaluation of 100 centipawns always corresponds to a 50% win probability at fishtest LTC, whereas for master this holds only at move number 32. See also #4920 and the discussion therein.

The new model was fitted based on about 340M positions extracted from 5.6M fishtest LTC games from the last three weeks, involving SF versions from e67cc97 (SF 16.1) to current master. The involved commands for [WDL_model](https://github.com/official-stockfish/WDL_model) are:

```
./updateWDL.sh --firstrev e67cc97
python scoreWDL.py updateWDL.json --plot save --pgnName update_material.png --momType "material" --momTarget 58 --materialMin 10 --modelFitting optimizeProbability
```

The anchor `58` for the material count value was chosen to be as close as possible to the observed average material count of fishtest LTC games at move 32 (`43`), while not changing the value of `NormalizeToPawnValue` compared to the move-based WDL model by more than 1. The patch only affects the displayed cp and wdl values.

closes #5121

No functional change
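For illustration, the material count used as the model input could be computed from a FEN along these lines. The conventional piece weights 9/5/3/3/1 summed over both sides are an assumption here, not taken from the fitting code; kings carry no weight.

```python
# Hypothetical piece weights (Q=9, R=5, B=N=3, P=1); an assumption for
# illustration, not the definition used by the WDL_model scripts.
WEIGHTS = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}


def material_count(fen):
    """Sum the weighted material of both sides from the board field of a FEN."""
    board = fen.split()[0]
    return sum(WEIGHTS.get(ch.lower(), 0) for ch in board if ch.isalpha())
```

Under these weights the standard starting position has a material count of 78, which then decreases monotonically as pieces come off the board.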
Ensure that an evaluation of 100 centipawns always corresponds to a 50% win probability at fishtest LTC, irrespective of the move number.
This PR is the culmination of recent work by @Disservin, @vondele and myself on https://github.com/official-stockfish/WDL_model, as well as recent changes to https://github.com/official-stockfish/books and https://github.com/official-stockfish/fishtest.
The new model was fitted based on about 500M positions extracted from 7.9M fishtest LTC games from the last three weeks, involving SF versions from b59786e to current master.
A summary of the changes to the WDL model itself:
- `p_a` and `p_b` are fitted over the move range [8, 120]
- `p_a` and `p_b` are optimized by maximizing the probability of predicting the observed outcome (credits to @vondele)

A summary of the changes to the SF code:

- the internal eval is normalized not by `p_a(32)` (aka `NormalizeToPawnValue`), but by `p_a(max(8, min(120, move)))` (credits to @vondele, see https://github.com/vondele/Stockfish/tree/wdlScore)
- the value of `NormalizeToPawnValue` is updated accordingly
- in `win_rate_model()` we no longer clamp the internal eval to [-4000, 4000]

This PR is in draft status for now, to allow for some discussion of the proposed changes.
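A minimal sketch of the dynamic normalization described above, assuming the usual cubic-in-`move/32` form of `p_a` and `p_b`. The coefficient values below are placeholders chosen for illustration, not the ones fitted in this PR.

```python
import math

# Placeholder cubic coefficients for p_a and p_b (highest degree first);
# NOT the fitted values from this PR.
AS = (-3.68389304, 30.07065921, -60.52878723, 149.53378557)
BS = (-2.01818570, 15.85685038, -29.83452023, 47.59078827)


def poly(coeffs, m):
    """Evaluate a cubic polynomial in m = move / 32 via Horner's scheme."""
    r = 0.0
    for c in coeffs:
        r = r * m + c
    return r


def win_rate(v, move):
    """Win probability (per mille) of the side to move, from internal eval v.
    The move is clamped to [8, 120], matching the model's fit range."""
    m = max(8, min(120, move)) / 32.0
    a = poly(AS, m)  # a == p_a(move); at move 32 this is NormalizeToPawnValue
    b = poly(BS, m)
    return 1000.0 / (1.0 + math.exp((a - v) / b))


def to_cp(v, move):
    """Displayed centipawn value: divide by the move-dependent p_a instead of
    the fixed p_a(32), so 100cp always means a 50% win probability."""
    m = max(8, min(120, move)) / 32.0
    return round(100 * v / poly(AS, m))
```

By construction, an internal eval equal to `p_a(move)` maps to exactly 500 per mille and 100cp at every move number, which is the invariant this PR is after.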
No functional change.