How MuZero works
MuZero is a state-of-the-art RL algorithm for board games (Chess, Go, ...) and Atari games. It is the successor to AlphaZero, but without any knowledge of the environment's underlying dynamics. MuZero learns a model of the environment and uses an internal representation that contains only the information useful for predicting the reward, value, policy and transitions. MuZero is also closely related to Value Prediction Networks.
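As a rough picture of those learned components, here is a minimal sketch of the three functions MuZero combines (a representation, a dynamics and a prediction network). It assumes a fully connected architecture with made-up layer sizes and is only an illustration, not the repository's models.py.

```python
import torch
import torch.nn as nn

class Representation(nn.Module):
    """h: encodes a raw observation into a hidden state."""
    def __init__(self, obs_dim=8, hidden_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
    def forward(self, observation):
        return self.net(observation)

class Dynamics(nn.Module):
    """g: predicts the next hidden state and the reward for a given action."""
    def __init__(self, hidden_dim=32, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim + action_dim, hidden_dim), nn.ReLU())
        self.reward_head = nn.Linear(hidden_dim, 1)
    def forward(self, hidden_state, action_one_hot):
        next_state = self.net(torch.cat([hidden_state, action_one_hot], dim=-1))
        return next_state, self.reward_head(next_state)

class Prediction(nn.Module):
    """f: predicts the policy and the value from a hidden state."""
    def __init__(self, hidden_dim=32, action_dim=4):
        super().__init__()
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)
    def forward(self, hidden_state):
        return self.policy_head(hidden_state), self.value_head(hidden_state)

# One imagined step: encode an observation, apply an action, read policy/value.
h, g, f = Representation(), Dynamics(), Prediction()
state = h(torch.zeros(1, 8))
next_state, reward = g(state, torch.eye(4)[0].unsqueeze(0))
policy_logits, value = f(next_state)
```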
It is the most recent algorithm in a long line of reinforcement learning algorithms developed by researchers at DeepMind. The line starts in 2014 with the creation of the famous AlphaGo, which defeated Go champion Lee Sedol in 2016.
AlphaGo uses an older technique called Monte Carlo Tree Search (MCTS) to plan the rest of the game as a decision tree. The subtlety is to add a neural network that intelligently selects the moves to explore, which reduces the number of future moves that have to be simulated.
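To illustrate how the network guides the search, here is a minimal sketch of a simplified PUCT selection rule of the kind used by AlphaGo-style MCTS. The function names, the move statistics and the exploration constant are assumptions for the example, not code from this repository.

```python
import math

def puct_score(prior, value_estimate, visit_count, parent_visits, c_puct=1.25):
    """Mix the network's prior with the search statistics: rarely visited
    moves with a high prior receive an exploration bonus."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visit_count)
    return value_estimate + exploration

def select_move(children, parent_visits):
    """children maps a move to (prior, value_estimate, visit_count)."""
    return max(
        children,
        key=lambda m: puct_score(children[m][0], children[m][1], children[m][2], parent_visits),
    )

# Toy example: three candidate moves with their current search statistics.
children = {"e4": (0.6, 0.10, 10), "d4": (0.3, 0.20, 3), "c4": (0.1, 0.00, 1)}
print(select_move(children, parent_visits=14))
```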
AlphaGo's neural network is initially trained on a history of games played by humans. In 2017, AlphaGo was replaced by AlphaGo Zero, which no longer uses this history. For the first time, the algorithm learns to play on its own, without any prior knowledge of Go strategies. It plays against itself (with MCTS) and then learns from these games (by training the neural network).
AlphaZero is a version of AlphaGo Zero in which all the little tricks specific to the game of Go were removed, so that it can be generalized to other board games.
In November 2019, the DeepMind team released a new, even more general version of AlphaZero called MuZero. Its particularity is to be model-based, which means that the algorithm builds its own understanding of the game: the rules of the game are learned by the neural network instead of being provided. MuZero can imagine for itself what the game will look like if it performs a given action, whereas before MuZero the effect of an action on the game was hard-coded. As a result, MuZero can learn to play any game, like a human discovering it for the first time.
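To make the idea concrete, below is a minimal sketch of planning by "imagination", assuming a toy stand-in for the learned dynamics model; a real planner would use the trained network and MCTS instead of random actions.

```python
import random

def learned_dynamics(hidden_state, action):
    """Toy stand-in for the learned model: returns (next_hidden_state, reward)."""
    return hidden_state + [action], 0.0

def imagined_rollout(initial_hidden_state, legal_actions, depth=5):
    """Unroll the learned model for a few imagined steps without ever touching
    the real environment or hard-coded game rules."""
    state, total_reward = initial_hidden_state, 0.0
    for _ in range(depth):
        action = random.choice(legal_actions)   # MuZero would pick actions with MCTS
        state, reward = learned_dynamics(state, action)
        total_reward += reward
    return state, total_reward

print(imagined_rollout([], legal_actions=[0, 1, 2, 3]))
```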
To understand the technical parts, we found the following articles to be interesting.
- MuZero
- AlphaZero
- ELF Open Go (Facebook implementation of AlphaZero)
- AlphaGo
- Value Prediction Networks
- SimPLe
- World Models
This repository is based on the MuZero pseudocode. We have tried to keep the code well organized and well commented to make it easy to understand how MuZero works. The code works asynchronously (parallel / multi-threaded). We have added a version with a fully connected network (MLP), as well as tools such as a visualization tool to understand the model (diagnose_model) and a hyperparameter search tool. For other features, please refer to the README.
There are four components, each implemented as a class running simultaneously in a dedicated thread. The shared storage holds the latest neural network weights, and the self-play workers use those weights to generate self-play games and store them in the replay buffer. Finally, those played games are used to train the network, and the new weights are stored back in the shared storage. The circle is complete.
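A highly simplified, single-threaded sketch of that loop is shown below. The names used here (SharedStorage, play_one_game, train_on) are placeholders, not the repository's actual classes, and the real code runs each component asynchronously in its own thread.

```python
import collections

class SharedStorage:
    """Holds the latest neural network weights."""
    def __init__(self, weights):
        self.weights = weights
    def get(self):
        return self.weights
    def set(self, weights):
        self.weights = weights

# The replay buffer stores finished self-play games.
replay_buffer = collections.deque(maxlen=1000)
storage = SharedStorage(weights={"step": 0})

def play_one_game(weights):
    """Self-play: use the latest weights (with MCTS) to produce one game."""
    return {"played_with_step": weights["step"]}

def train_on(games, weights):
    """Training: improve the weights using games sampled from the buffer."""
    return {"step": weights["step"] + 1}

# The circle: play -> store in the buffer -> train -> publish new weights.
for _ in range(3):
    replay_buffer.append(play_one_game(storage.get()))
    storage.set(train_on(list(replay_buffer), storage.get()))
    print("training step", storage.get()["step"])
```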
Those components are launched and managed from the MuZero class in muzero.py, and the structure of the neural network is defined in models.py.
Here is a diagram of the MuZero network applied to an Atari game: