feature(pu): add lightzero mujoco env and related sampled efficientzero configs #50

Merged (2 commits, Jul 1, 2023)
34 changes: 20 additions & 14 deletions README.md
@@ -116,14 +116,15 @@ The environments and algorithms currently supported by LightZero are shown in the table below:

| Env./Alg. | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero |
| ------------- | --------- | ------ | ------------- | --------------------- | ------------- |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| tictactoe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| lunarlander | --- | ✔ | ✔ | ✔ | ✔ |
| bipedalwalker | --- | ✔ | ✔ | ✔ | 🔒 |
| cartpole | --- | ✔ | ✔ | ✔ | ✔ |
| pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| TicTacToe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| LunarLander | --- | ✔ | ✔ | ✔ | ✔ |
| BipedalWalker | --- | ✔ | ✔ | ✔ | 🔒 |
| CartPole | --- | ✔ | ✔ | ✔ | ✔ |
| Pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| MuJoCo | --- | 🔒 | 🔒 | ✔ | 🔒 |

<sup>(1): "✔" means that the corresponding item is finished and well-tested.</sup>

@@ -181,13 +182,18 @@ Below are the benchmark results of [MuZero](https://github.com/opendilab/LightZe
</p>


Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on three continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py), [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py).
> Where ``Factored Policy`` indicates that the agent learns a policy network that outputs a categorical distribution, the dimensions of the action space for the three environments are 11, 49 (7^2), and 256 (4^4), respectively, after manual discretization. On the other hand, ``Gaussian Policy`` indicates that the agent learns a policy network that outputs parameters (mu and sigma) for a Gaussian distribution.
Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on three classic continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py), [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py),
and two MuJoCo continuous action space games: [Hopper-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py) and [Walker2d-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py).
> Here, ``Factored Policy`` means the agent learns a policy network that outputs a categorical distribution; after manual discretization, the action space dimensions of the five environments are 11, 49 (7^2), 256 (4^4), 64 (4^3), and 4096 (4^6), respectively. ``Gaussian Policy`` means the agent learns a policy network that directly outputs the parameters (mu and sigma) of a Gaussian distribution.
<p align="center">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>
<p align="center">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/hopper_main.pdf" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/walker2d_main.pdf" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>
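
The action-space sizes quoted in the note above follow directly from discretizing each continuous action dimension into a fixed number of bins. A minimal sketch (the helper name is ours, not part of LightZero) that reproduces those counts:

```python
# Each continuous action dimension is discretized into `bins` values, so a
# `dims`-dimensional action space yields bins ** dims candidate actions.
def discretized_action_count(bins: int, dims: int) -> int:
    return bins ** dims

print(discretized_action_count(11, 1))  # Pendulum-v1 (1-D action): 11
print(discretized_action_count(7, 2))   # LunarLanderContinuous-v2 (2-D): 7^2 = 49
print(discretized_action_count(4, 4))   # BipedalWalker-v3 (4-D): 4^4 = 256
print(discretized_action_count(4, 3))   # Hopper-v3 (3-D): 4^3 = 64
print(discretized_action_count(4, 6))   # Walker2d-v3 (6-D): 4^6 = 4096
```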

Below are the benchmark results of [AlphaZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/alphazero.py) and [MuZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/muzero.py) on two board_games: [TicTacToe](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/tictactoe/envs/tictactoe_env.py), [Gomoku](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/gomoku/envs/gomoku_env.py).
38 changes: 22 additions & 16 deletions README.zh.md
@@ -76,16 +76,17 @@ LightZero is an MCTS algorithm library built on [PyTorch](https://pytorch.org/),

The environments and algorithms currently supported by LightZero are shown in the table below:

| Env./Alg. | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero |
| ------------- | --------- | ------ | ------------- | --------------------- | ------------- |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| tictactoe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| lunarlander | --- | ✔ | ✔ | ✔ | ✔ |
| bipedalwalker | --- | ✔ | ✔ | ✔ | 🔒 |
| cartpole | --- | ✔ | ✔ | ✔ | ✔ |
| pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| Env./Alg. | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero |
|---------------| --------- |--------| ------- | --------------------- | ------------ |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| TicTacToe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| LunarLander | --- | ✔ | ✔ | ✔ | ✔ |
| BipedalWalker | --- | ✔ | ✔ | ✔ | 🔒 |
| CartPole | --- | ✔ | ✔ | ✔ | ✔ |
| Pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| MuJoCo | --- | 🔒 | 🔒 | ✔ | 🔒 |


<sup>(1): "✔" means that the corresponding item is finished and well-tested.</sup>
@@ -140,14 +141,19 @@ python3 -u zoo/board_games/tictactoe/config/tictactoe_muzero_bot_mode_config.py
<img src="assets/benchmark/ablation/mspacman_sez_K.png" alt="Image Description 4" width="23%" height="auto" style="margin: 0 1%;">
</p>

Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on three continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py) and [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py).
> Here, ``Factored Policy`` means the agent learns a policy network that outputs a categorical distribution; after manual discretization, the action space dimensions of the three environments above are 11, 49 (7^2), and 256 (4^4), respectively. ``Gaussian Policy`` means the agent learns a policy network that directly outputs the parameters (mu and sigma) of a Gaussian distribution.
Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on five continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py), [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py), [Hopper-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py) and [Walker2d-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py).
> Here, ``Factored Policy`` means the agent learns a policy network that outputs a categorical distribution; after manual discretization, the action space dimensions of the five environments above are 11, 49 (7^2), 256 (4^4), 64 (4^3), and 4096 (4^6), respectively. ``Gaussian Policy`` means the agent learns a policy network that directly outputs the parameters (mu and sigma) of a Gaussian distribution.

<p align="center">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>

<p align="center">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/hopper_main.pdf" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/walker2d_main.pdf" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>

Below are the benchmark results of [AlphaZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/alphazero.py) and [MuZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/muzero.py) on two board games: [TicTacToe](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/tictactoe/envs/tictactoe_env.py) and [Gomoku](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/gomoku/envs/gomoku_env.py):
4 changes: 4 additions & 0 deletions lzero/policy/efficientzero.py
@@ -90,6 +90,10 @@ class EfficientZeroPolicy(Policy):
augmentation=['shift', 'intensity'],

# ******* learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...
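
The `ignore_done` field added above mainly matters for time-limited environments, where the final `done` only marks truncation rather than a true terminal state. A minimal, hypothetical user-side override (the variable name and the other values are illustrative only, not taken from this PR), assuming the usual EasyDict-based config style:

```python
from easydict import EasyDict

# Sketch under stated assumptions: for an environment whose episodes always stop at a
# fixed step limit, ignoring the time-limit `done` keeps the bootstrapped value targets
# from being cut off artificially at truncation.
pendulum_policy_config = EasyDict(dict(
    ignore_done=True,        # treat the time-limit `done` as non-terminal during learning
    update_per_collect=100,  # illustrative values only
    batch_size=256,
))
```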
4 changes: 4 additions & 0 deletions lzero/policy/gumbel_muzero.py
@@ -91,6 +91,10 @@ class GumeblMuZeroPolicy(Policy):
augmentation=['shift', 'intensity'],

# ******* learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...
8 changes: 6 additions & 2 deletions lzero/policy/muzero.py
@@ -73,9 +73,9 @@ class MuZeroPolicy(Policy):
collector_env_num=8,
# (int) The number of environments used in evaluating policy.
evaluator_env_num=3,
# (str) The type of environment. Options is ['not_board_games', 'board_games'].
# (str) The type of environment. Options are ['not_board_games', 'board_games'].
env_type='not_board_games',
# (str) The type of battle mode. Options is ['play_with_bot_mode', 'self_play_mode'].
# (str) The type of battle mode. Options are ['play_with_bot_mode', 'self_play_mode'].
battle_mode='play_with_bot_mode',
# (bool) Whether to monitor extra statistics in tensorboard.
monitor_extra_statistics=True,
@@ -91,6 +91,10 @@ class MuZeroPolicy(Policy):
augmentation=['shift', 'intensity'],

# ******* learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...
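
For reference, the `env_type` / `battle_mode` pair documented in the hunk above distinguishes board games from single-agent tasks. A small illustrative sketch (names and pairings are our assumptions, not part of this PR):

```python
from easydict import EasyDict

# Board games pair env_type='board_games' with one of the two battle modes, while
# single-agent tasks such as the new MuJoCo environments keep the defaults shown above.
tictactoe_policy = EasyDict(dict(
    env_type='board_games',
    battle_mode='self_play_mode',      # or 'play_with_bot_mode'
))

hopper_policy = EasyDict(dict(
    env_type='not_board_games',
    battle_mode='play_with_bot_mode',  # default; effectively unused for single-agent control
))
```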
6 changes: 5 additions & 1 deletion lzero/policy/sampled_efficientzero.py
@@ -96,7 +96,11 @@ class SampledEfficientZeroPolicy(Policy):
# (list) The style of augmentation.
augmentation=['shift', 'intensity'],

# ******* learn ******
# ****** learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...