Overhaul of PUCT formula (and ramifications) #818

Closed
kk2796 opened this issue Mar 27, 2019 · 6 comments

@kk2796

kk2796 commented Mar 27, 2019

Both PR #816 and PR #796 have become stepping stones toward a vision I have for a broader PUCT overhaul. I wanted to use this issue to lay out that idea for feedback (and to gather ideas for how we might approach implementation).

My biggest pet peeve with the A0 / DeepMind PUCT formula is its "fuzziness". It seems to be the result of countless empirical optimizations of educated guesses at good ways to describe the "search-worthiness" of a given action a in a given state s. In the statement "whichever action a gives the greatest value of Q(s,a)+U(s,a) will make 'something' better than actions which give smaller values of Q+U", there is no coherent definition of that "something".

As a result of this fuzziness, and the lack of a coherent "something" we're seeking to improve, it is impossible to determine whether any modification to Q+U (or any of its underlying hyper-parameters) is an objective improvement in measuring that "something". So the only means of checking whether a change is useful is trial and error (and the only means of guessing at what good changes might be is, literally, guessing).

My aim is to put in place a PUCT formula that's tied to a real "something" that can be measured for each action. And, moreover, a real "something" for which it actually makes sense to seek the action that maximizes it.

My proposed "something" is based on the calculations outlined in this paper:

Exact Distribution of the Max/Min of Two Gaussian Random Variables

If we can calculate a reasonable expected value EV for each child node, along with a meaningful estimate of the uncertainty/stderr of that EV, then this paper shows that it is possible to calculate:

  • the EV of the max value of all children
  • the uncertainty of the EV of the max of all children
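As a rough illustration of the math involved, here is a minimal Python sketch of the closed-form mean and standard deviation of max(X,Y) for two Gaussians, along the lines of the formulas in that paper. The independence assumption and the function name are mine; the mu/sigma inputs stand for a child's EV and its stderr.

```python
# Minimal sketch: closed-form moments of max(X, Y) for two independent Gaussians
# (Clark-style formulas, as in the referenced paper). Independence is assumed;
# the function name and interface are only for illustration.
from math import sqrt, erf, exp, pi

def phi(x):   # standard normal pdf
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def Phi(x):   # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def max_of_two_gaussians(mu1, s1, mu2, s2):
    """Return (mean, stddev) of max(X, Y) for X ~ N(mu1, s1^2), Y ~ N(mu2, s2^2)."""
    theta = sqrt(s1 * s1 + s2 * s2)                 # combined spread (rho = 0)
    a = (mu1 - mu2) / theta
    m1 = mu1 * Phi(a) + mu2 * Phi(-a) + theta * phi(a)              # E[max]
    m2 = ((mu1 * mu1 + s1 * s1) * Phi(a)
          + (mu2 * mu2 + s2 * s2) * Phi(-a)
          + (mu1 + mu2) * theta * phi(a))                           # E[max^2]
    return m1, sqrt(max(m2 - m1 * m1, 0.0))
```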

The "Golden Ticket" PUCT formula - G(s,a) for lack of better alternative - I envision would assess for each action (s,a), "If this action a is taken in state s, and we assume the next new measure of the associated child node would match its current EV, then we can predict how much that measurement=EV would reduce our uncertainty in that child's EV. From that, we can predict what the new uncertainty of the max of all EVs of all children would be, given the reduction of uncertainty of that one child. The PUCT formula would return the reduction from the original pre-action-a uncertainty in max(EV) to the estimated post-action-a uncertainty in max(EV)."

I think it's easier to follow this formula, and appreciate its elegance, with a hypothetical example. Imagine a state s with two child nodes A and B, with current estimated Q values (and 95% confidence intervals based on the stderr, given the current stdev and number of visits to each):
A=0.4 +/- 0.1
B=-0.3 +/- 0.5

With that information alone, the math of the paper allows us to compute the expected value EV(max(A,B)), which should be roughly 0.41. This should be a little higher than EV(A)=0.4 because of the slight chance that B could be better than A. That math might also show the uncertainty of this max as 0.11, which should be slightly higher than uncertainty(A)=0.1 due to the uncertainty B brings. Overall, we should get something like:

EV(max(A,B)) = 0.41 +/- 0.11

Without actually visiting A, we can look at the current visits to A and compute the new uncertainty we'd get if we "simulated" a new visit to A that returned a simulated measurement of 0.4. Assuming the small uncertainty in A is due to an already large number of visits, one more visit should only give a slight reduction in uncertainty; perhaps EV(A') = 0.4 +/- 0.099. Using the new EV(A'), along with the unmodified EV(B), we might find the new expected max value becomes:

EV(max(A',B)) = 0.409 +/- 0.108

Conversely, the initial large uncertainty in B, +/- 0.5, is likely to be based on B having relatively few visits. In the case of B having just two visits, we might find that a third measurement of B (with a simulated value of -0.3 = EV(B)) would drastically cut its uncertainty, perhaps yielding EV(B') = -0.3 +/- 0.25.
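To make the "simulated visit" step concrete, here is a hedged sketch of one way the stderr update could be computed, assuming each node keeps a running mean, stderr, and visit count (bookkeeping I'm assuming for illustration, not something already in place): if the extra observation equals the current mean, the sum of squared deviations is unchanged and only the divisors grow.

```python
# Hypothetical helper: stderr after one extra visit whose backed-up value equals
# the current mean. Assumes stderr = sample_std / sqrt(n).
from math import sqrt

def simulate_visit(mean, stderr, n):
    if n < 2:
        return mean, stderr                        # too little data to update sensibly
    ss = stderr * stderr * n * (n - 1)             # sum of squared deviations is unchanged
    new_var = ss / n                               # n + 1 samples -> divisor n
    return mean, sqrt(new_var / (n + 1))           # new stderr

# Illustrative only (reading the +/- 0.5 above loosely as a stderr, 2 visits):
# one more visit at the mean shrinks it to roughly 0.29.
print(simulate_visit(-0.3, 0.5, 2))
```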

While this is a big reduction in the uncertainty of B, it really just makes B slightly less likely to be bigger than A. Thus the effect on the expected max value of A and B would be a slightly smaller EV(max(A,B')) and a slightly smaller uncertainty(max(A,B')). Running the numbers might give:

EV(max(A,B')) = 0.408 +/- 0.107

The "golden ticket" PUCT formula G(s,a) would thus return, for
G(s,a[A]) = uncertainty(max(A,B))) - uncertainty(max(A',B))) = 0.11 - 0.108 = 0.002
G(s,a[B]) = uncertainty(max(A,B))) - uncertainty(max(A,B'))) = 0.11 - 0.107 = 0.003
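Putting the pieces together, here is a sketch of G(s,a) for this two-child example, reusing max_of_two_gaussians and simulate_visit from the sketches above. The visit counts, and the reading of the +/- intervals as stderrs, are assumptions made purely for illustration, so the printed numbers won't exactly match the hypothetical figures above; qualitatively, though, the low-visit child B comes out as the more informative one to visit.

```python
# Hypothetical sketch of G(s, a) for two children, reusing the helpers above.
def G(children):
    """children: {name: (mean, stderr, visits)} with exactly the keys "A" and "B".
    Returns the predicted reduction in uncertainty of max(EV) for each action."""
    mu_a, se_a, n_a = children["A"]
    mu_b, se_b, n_b = children["B"]
    _, unc0 = max_of_two_gaussians(mu_a, se_a, mu_b, se_b)    # pre-visit uncertainty

    g = {}
    mu, se = simulate_visit(mu_a, se_a, n_a)                  # pretend the next visit goes to A
    g["A"] = unc0 - max_of_two_gaussians(mu, se, mu_b, se_b)[1]
    mu, se = simulate_visit(mu_b, se_b, n_b)                  # pretend the next visit goes to B
    g["B"] = unc0 - max_of_two_gaussians(mu_a, se_a, mu, se)[1]
    return g

# Illustrative inputs: +/- values above read as stderrs; visit counts made up.
print(G({"A": (0.4, 0.1, 50), "B": (-0.3, 0.5, 2)}))          # B yields the larger reduction
```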

In this scenario, then, we could say that a visit to B would yield more information than a visit to A in obtaining the most accurate assessment of max(A,B) for state s. This is vitally important because max(A,B) and Q(parent(A,B)) ultimately mean the same thing (and will converge to the same value as the number of visits to s increases).

To me, the most compelling aspect of this approach is its parsimony. Though it's a bit of a mouthful (threadful?) to enunciate, and it'll definitely be challenging to implement, it really is just a PUCT formula built on a single atomic idea. Nevertheless, it yields all the desired intuition-based behaviors of Q+U that have required countless tweaks of the cPuct "constant", complex FPU logic, and arbitrary-feeling terms like sqrt(Σb N(s,b)) and 1/(1+N(s,a)) to get decent Elo out of Q+U.

I'm not naive enough to present this proposal with the idea that we can whip up a quick prototype to test it out. However, I do plan to work on putting a functional implementation together (possibly using several incremental PRs as additional stepping stones to the final G(s,a) idea) over the coming days/weeks - and welcome input from and assistance of any other developers interested in contributing.

@oscardssmith
Contributor

Good issue. I hadn't seen that paper before. One downside is that it's not clear that any of the distributions are normal, or how correlated they are with each other. I've been working for a long time to try to find a more grounded formula. If you want help, I have a lot of free time this week. Feel free to PM me on Discord (username oscardssmith).

@kk2796
Author

kk2796 commented Mar 28, 2019

True. I definitely think it's going to be a lot of heavy lifting just to get something reasonably effective in place. And I sincerely doubt it will be competitive with the existing PUCT formula right off the bat (after all, that PUCT formula has been massively tuned and CLOP'ed ad nauseam - even though it's built on an ugly model, I do not question that it's been hammered into a decent approximation of any superior model). However, once something functional is in place, the real beauty of the change should become apparent. We'll be able to tweak/tune the PUCT formula with an eye toward "accurately modeling a real phenomenon" (e.g. finding a version of the statistical methods of the paper that applies to a better-suited non-Gaussian distribution)... rather than relying on the "smack the television in the back and check if the picture got clearer" approach necessitated by the existing PUCT formula.

@MelleKoning
Contributor

Does #317 relate?

@Veedrac

Veedrac commented Apr 7, 2019

I've tried something like this and I couldn't get it to help. The issues are that

  1. You don't want the expectation of the maximum. EV(max(A,B)) should be between 0.3 and 0.4, because the 0.4 already accounts for the fact that future moves could be surprisingly good, so the 0.3 doesn't actually add to that.

  2. Much more surprisingly, you don't want to underweight low values you'd expect to be minimax-pruned, because they are indicative of the general riskiness of the move. UCT already underweights them by an appropriate amount by just visiting them less.

It's #2 that killed my idea.

@oscardssmith
Contributor

@Veedrac the problem with 2 is that by relying on the node distribution for weighting, you remove the possibility of lots of search optimizations such as graph (rather than tree) search.

@Naphthalin
Contributor

This is relevant to the discussion collected in #1231 about changing the PUCT formula.

LeelaChessZero locked and limited conversation to collaborators Jun 27, 2021

This issue was moved to a discussion.

