Generate focused policy using distance to mate #2383

Menkib64 wants to merge 2 commits into LeelaChessZero:master from
Conversation
Pull request overview
This PR adjusts rescoring so that, when tablebases provide distance-to-mate (DTM), policy targets can be reallocated to concentrate probability mass on the quickest-mate winning move(s), making training/search more “mate-focused” in simple TB endgames.
Changes:
- Adds a new `--winning_policy_share` option and threads it through the rescoring pipeline.
- Introduces DTM-based policy target rewriting (using Gaviota probes) to rank winning moves by DTM and assign them a fixed share schedule.
- Updates internal processing function signatures to carry the new parameter.
```diff
-                         float distTemp, float distOffset, float dtzBoost) {
+                         float distTemp, float distOffset, float dtzBoost,
+                         float winningMovePolicyShare) {
   if (distTemp == 1.0f && distOffset == 0.0f && dtzBoost == 0.0f) {
```
The new `winningMovePolicyShare` parameter is only used in the DTM rewrite path, which is currently gated by `dtzBoost != 0.0f` further below. As a result, `--winning_policy_share` becomes a no-op unless dtz boosting is also enabled, which is surprising given the option description. Consider either decoupling the DTM-policy logic from `dtzBoost`, or explicitly documenting/enforcing that `winning_policy_share` requires `dtz_policy_boost` to be non-zero.
Suggested change:
```diff
-  if (distTemp == 1.0f && distOffset == 0.0f && dtzBoost == 0.0f) {
+  if (distTemp == 1.0f && distOffset == 0.0f && dtzBoost == 0.0f &&
+      winningMovePolicyShare == 0.0f) {
```
```diff
       unsigned int info;
       unsigned int dtm;
       gaviota_tb_probe_hard(next_pos, info, dtm);
-      dtms.push_back(dtm);
+      dtms.emplace_back(
+          dtm, chunk.probabilities[MoveToNNIndex(move, transform)], move);
```
`gaviota_tb_probe_hard(next_pos, info, dtm)` is used without checking `info` (e.g., `tb_WMATE`/`tb_BMATE`) or probe success. If `info` indicates a draw or unknown result, `dtm` may be meaningless and will corrupt the move ordering/distribution. Add an `info` check, and skip the move (or fall back to the existing dtz-boost path) when it is not a mate score.
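A minimal sketch of such a guard, as a fragment in the context of the loop shown above; it reuses only names that appear in the diff and in this comment (`gaviota_tb_probe_hard`, `tb_WMATE`, `tb_BMATE`), and the exact wrapper signature and constant names in the lc0 tree are assumptions:

```cpp
unsigned int info;
unsigned int dtm;
gaviota_tb_probe_hard(next_pos, info, dtm);
// Only trust dtm when the probe reports a proven mate for either side;
// on a draw, an unknown result, or a failed probe the value is meaningless.
if (info != tb_WMATE && info != tb_BMATE) {
  continue;  // Skip this move; the existing dtz-boost path still applies.
}
dtms.emplace_back(
    dtm, chunk.probabilities[MoveToNNIndex(move, transform)], move);
```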
```cpp
      unsigned int mininum_dtm = 1000;
      // Only safe moves being considered, boost the smallest dtm
      // amongst them.
```
`mininum_dtm` is now written (`if (dtm < mininum_dtm) ...`) but never read after the policy rewrite refactor. This is dead code, and the surrounding comment about boosting the smallest DTM is no longer accurate. Remove the variable and update the comment to match the new distribution logic.
```cpp
      for (unsigned i = 0; i < std::size(chunk.probabilities); i++) {
        auto& prob = chunk.probabilities[i];
        if (prob < 0 || std::isnan(prob)) continue;
        auto iter = std::find_if(
            dtms.begin(), dtms.end(), [i, transform](const MateScore& ms) {
              return i == MoveToNNIndex(ms.move, transform);
            });
        if (iter == dtms.end()) {
          prob = 0.0f;
          continue;
```
The rewrite of `chunk.probabilities` does a `std::find_if` over `dtms` for every legal NN index, and recomputes `MoveToNNIndex(ms.move, transform)` repeatedly inside the predicate. This is O(legal_moves * dtms) per position and adds avoidable overhead. Consider precomputing the NN index for each `MateScore` and building a small lookup (e.g., `std::array<float, 1858>` initialized to 0 / -1) to assign probabilities in one pass without repeated searches.
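A sketch of that one-pass rewrite, as a fragment in the same context as the loop above; it assumes the policy head has 1858 outputs (matching `std::size(chunk.probabilities)`) and that a hypothetical `shares` vector already holds the per-rank policy share derived from the DTM ordering:

```cpp
// Precompute the target probability per NN index in O(dtms) time.
std::array<float, 1858> new_probs;
new_probs.fill(0.0f);
for (size_t rank = 0; rank < dtms.size(); ++rank) {
  // MoveToNNIndex is now evaluated once per winning move, not once per
  // (legal move, winning move) pair as in the find_if version.
  new_probs[MoveToNNIndex(dtms[rank].move, transform)] = shares[rank];
}
for (unsigned i = 0; i < std::size(chunk.probabilities); i++) {
  auto& prob = chunk.probabilities[i];
  // Leave the illegal-move markers (negative or NaN entries) untouched.
  if (prob < 0 || std::isnan(prob)) continue;
  prob = new_probs[i];
}
```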
src/trainingdata/rescorer.cc (Outdated)
```cpp
                                 float distTemp, float distOffset, float dtzBoost,
                                 float winningMovePolicyShare, int newInputFormat) {
  FileData data = ProcessFileInternal(std::move(fileContents), tablebase,
                                      distTemp, distOffset, dtzBoost,
                                      winningMovePolicyShare, newInputFormat);
```
`RescoreTrainingData`'s definition now takes `float winningMovePolicyShare`, but the declaration in `src/trainingdata/rescorer.h` still has the old signature (no `winningMovePolicyShare`). This will cause a build failure / ODR mismatch for any TU including the header. Update the header declaration (and any callers) to match, ideally with a default value if you want to keep source compatibility.
Suggested change:
```diff
-                                 float distTemp, float distOffset, float dtzBoost,
-                                 float winningMovePolicyShare, int newInputFormat) {
-  FileData data = ProcessFileInternal(std::move(fileContents), tablebase,
-                                      distTemp, distOffset, dtzBoost,
-                                      winningMovePolicyShare, newInputFormat);
+                                 float distTemp, float distOffset, float dtzBoost, int newInputFormat) {
+  // Use a default value for winningMovePolicyShare to maintain behavior
+  // without exposing this parameter in the public API.
+  constexpr float kDefaultWinningMovePolicyShare = 0.0f;
+  FileData data = ProcessFileInternal(std::move(fileContents), tablebase,
+                                      distTemp, distOffset, dtzBoost,
+                                      kDefaultWinningMovePolicyShare,
+                                      newInputFormat);
```
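The other route the comment mentions, keeping the parameter public but defaulted, would only touch the header. A sketch of what the declaration in `src/trainingdata/rescorer.h` could look like; the return type, the `SyzygyTablebase` parameter type, and the `-1` default are assumptions based on the call sites visible in the diff:

```cpp
// Declaration sketch. Because winningMovePolicyShare precedes
// newInputFormat in the PR's definition, C++ requires newInputFormat to be
// defaulted too (or the parameters reordered) for this to compile.
void RescoreTrainingData(std::string fileContents, SyzygyTablebase* tablebase,
                         float distTemp, float distOffset, float dtzBoost,
                         float winningMovePolicyShare = 0.0f,
                         int newInputFormat = -1);
```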
This aims to address the problem that policy is too flat when the position is completely winning. I don't know how much it is going to help, and I don't know what a good distribution for policy is either. Search should be able to reach at least a similar level of focus towards a few best moves in endgames; I haven't yet figured out how to do that. I thought this change might help improve networks before search manages to reach the required level of focus. These positions don't affect playing strength when there is a TB. I'm hoping we could test it in the current training run to learn how it affects networks. That could be valuable information when trying to adjust search to improve endgame training.
I'm thinking that policy should be much more focused towards one winning move when the rescorer knows the distance to mate. This proposal implements rules where moves are ranked by distance to mate and by policy preference in the training data. The best move candidate gains the `kWinningPolicyShareId` share of policy. Each following move gets `kWinningPolicyShareId` of the remaining free policy share, and the best move gets all of the remaining share once all winning moves have been processed. This aims to make search more focused towards finding the mate when in a simple endgame like KNBvK.
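A toy, self-contained illustration of that schedule (not code from this PR), assuming a configured share of 0.5 and three winning moves already ranked by DTM:

```cpp
#include <cstdio>
#include <vector>

int main() {
  const float kShare = 0.5f;  // Stand-in for the --winning_policy_share value.
  const int kNumWinningMoves = 3;
  std::vector<float> shares(kNumWinningMoves, 0.0f);
  float remaining = 1.0f;
  // Each move, in DTM order, takes a fixed fraction of the policy share
  // that is still unassigned.
  for (int i = 0; i < kNumWinningMoves; ++i) {
    shares[i] = kShare * remaining;
    remaining -= shares[i];
  }
  // Whatever is left over after all winning moves is handed back to the
  // quickest mate.
  shares[0] += remaining;
  for (int i = 0; i < kNumWinningMoves; ++i) {
    std::printf("move %d: %.4f\n", i + 1, shares[i]);  // 0.6250 0.2500 0.1250
  }
}
```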
NOTE: This has only been tested with 3-piece TBs.