Open
Description
Currently, the policy network is actively trained to output 0 on a move whenever it's invalid. The move is never taken, so the corresponding target_pi entry is zero, and that zero target enters the loss function. As a result, the policy network ends up encoding as much about "How often is this move legal" as it does about "Is this a good move" -- which is almost certainly hurting performance.
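For concreteness, here is a minimal sketch of the kind of cross-entropy policy loss in play (the `-sum(target_pi * log_pi)` form is an assumption about the training code; the function name is illustrative, not the repo's actual API):

```python
import torch

# Hypothetical policy loss of the usual AlphaZero cross-entropy form.
# target_pis: [batch, action_size], zero at invalid moves.
# log_pis:    [batch, action_size], output of F.log_softmax.
def loss_pi(target_pis: torch.Tensor, log_pis: torch.Tensor) -> torch.Tensor:
    # Even though target_pis is zero at invalid moves, those entries of
    # log_pis come from the same softmax as the valid ones, so gradients
    # still push the invalid-move logits down during training.
    return -torch.sum(target_pis * log_pis) / target_pis.size(0)
```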
The fix would be to manually mask the output in NNet.py, right after getting it back from OthelloNNet.py: OthelloNNet.py returns F.log_softmax(pis), and the entries corresponding to invalid moves should then be manually set to zero. This prevents gradients from propagating back through those entries.
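A sketch of what that masking could look like (assuming a boolean `valids` mask of shape [batch, action_size]; the helper name is made up for illustration):

```python
import torch

# Zero out log-probabilities at invalid moves after the forward pass.
# masked_fill writes a constant into the masked positions, so the
# gradient of the result w.r.t. log_pis is zero there -- no gradient
# flows back through invalid moves.
def mask_invalid_moves(log_pis: torch.Tensor, valids: torch.Tensor) -> torch.Tensor:
    return log_pis.masked_fill(~valids, 0.0)
```

Since target_pi is already zero at those entries, overwriting them with a constant leaves the loss value unchanged while cutting the gradient path through them.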