feature(pu): add lightzero mujoco env and related sampled efficientzero configs #50

Merged (2 commits, Jul 1, 2023)
34 changes: 20 additions & 14 deletions README.md
@@ -116,14 +116,15 @@ The environments and algorithms currently supported by LightZero are shown in the table below:

| Env./Alg. | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero |
| ------------- | --------- | ------ | ------------- | --------------------- | ------------- |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| tictactoe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| lunarlander | --- | ✔ | ✔ | ✔ | ✔ |
| bipedalwalker | --- | ✔ | ✔ | ✔ | 🔒 |
| cartpole | --- | ✔ | ✔ | ✔ | ✔ |
| pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| TicTacToe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| LunarLander | --- | ✔ | ✔ | ✔ | ✔ |
| BipedalWalker | --- | ✔ | ✔ | ✔ | 🔒 |
| CartPole | --- | ✔ | ✔ | ✔ | ✔ |
| Pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| MuJoCo | --- | 🔒 | 🔒 | ✔ | 🔒 |

<sup>(1): "✔" means that the corresponding item is finished and well-tested.</sup>

@@ -181,13 +182,18 @@ Below are the benchmark results of [MuZero](https://github.com/opendilab/LightZe
</p>


Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on three continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py), [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py).
> Where ``Factored Policy`` indicates that the agent learns a policy network that outputs a categorical distribution, the dimensions of the action space for the three environments are 11, 49 (7^2), and 256 (4^4), respectively, after manual discretization. On the other hand, ``Gaussian Policy`` indicates that the agent learns a policy network that outputs parameters (mu and sigma) for a Gaussian distribution.
Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on three classic continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py), [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py),
and two MuJoCo continuous action space games: [Hopper-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py) and [Walker2d-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py).
> Here, ``Factored Policy`` means the agent learns a policy network that outputs a categorical distribution; after manual discretization, the action space dimensions of the five environments are 11, 49 (7^2), 256 (4^4), 64 (4^3), and 4096 (4^6), respectively. ``Gaussian Policy`` means the agent learns a policy network that directly outputs the parameters (mu and sigma) of a Gaussian distribution.
<p align="center">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>
<p align="center">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/hopper_main.pdf" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/walker2d_main.pdf" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>
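
The action-space sizes quoted in the note above follow directly from discretizing each continuous action dimension into a fixed number of bins. A minimal sketch (the helper name is ours, not part of LightZero) that reproduces those counts:

```python
# Each continuous action dimension is discretized into `bins` values, so a
# `dims`-dimensional action space yields bins ** dims candidate actions.
def discretized_action_count(bins: int, dims: int) -> int:
    return bins ** dims

print(discretized_action_count(11, 1))  # Pendulum-v1 (1-D action): 11
print(discretized_action_count(7, 2))   # LunarLanderContinuous-v2 (2-D): 7^2 = 49
print(discretized_action_count(4, 4))   # BipedalWalker-v3 (4-D): 4^4 = 256
print(discretized_action_count(4, 3))   # Hopper-v3 (3-D): 4^3 = 64
print(discretized_action_count(4, 6))   # Walker2d-v3 (6-D): 4^6 = 4096
```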

Below are the benchmark results of [AlphaZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/alphazero.py) and [MuZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/muzero.py) on two board_games: [TicTacToe](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/tictactoe/envs/tictactoe_env.py), [Gomoku](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/gomoku/envs/gomoku_env.py).
38 changes: 22 additions & 16 deletions README.zh.md
@@ -76,16 +76,17 @@ LightZero is an MCTS algorithm library built on [PyTorch](https://pytorch.org/),

The environments and algorithms currently supported by LightZero are shown in the table below:

| Env./Alg. | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero |
| ------------- | --------- | ------ | ------------- | --------------------- | ------------- |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| tictactoe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| lunarlander | --- | ✔ | ✔ | ✔ | ✔ |
| bipedalwalker | --- | ✔ | ✔ | ✔ | 🔒 |
| cartpole | --- | ✔ | ✔ | ✔ | ✔ |
| pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| Env./Alg. | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero |
|---------------| --------- |--------| ------- | --------------------- | ------------ |
| Atari | --- | ✔ | ✔ | ✔ | ✔ |
| TicTacToe | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Gomoku | ✔ | ✔ | 🔒 | 🔒 | ✔ |
| Go | 🔒 | 🔒 | 🔒 | 🔒 | 🔒 |
| LunarLander | --- | ✔ | ✔ | ✔ | ✔ |
| BipedalWalker | --- | ✔ | ✔ | ✔ | 🔒 |
| CartPole | --- | ✔ | ✔ | ✔ | ✔ |
| Pendulum | --- | ✔ | ✔ | ✔ | ✔ |
| MuJoCo | --- | 🔒 | 🔒 | ✔ | 🔒 |


<sup>(1): "✔" means that the corresponding item is finished and well-tested.</sup>
@@ -140,14 +141,19 @@ python3 -u zoo/board_games/tictactoe/config/tictactoe_muzero_bot_mode_config.py
<img src="assets/benchmark/ablation/mspacman_sez_K.png" alt="Image Description 4" width="23%" height="auto" style="margin: 0 1%;">
</p>

Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on three continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py) and [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py).
> Here, ``Factored Policy`` means the agent learns a policy network that outputs a categorical distribution; after manual discretization, the action space dimensions of the three environments above are 11, 49 (7^2), and 256 (4^4), respectively. ``Gaussian Policy`` means the agent learns a policy network that directly outputs the parameters (mu and sigma) of a Gaussian distribution.
Below are the benchmark results of [Sampled EfficientZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/sampled_efficientzero.py) with ``Factored/Gaussian`` policy representation on five continuous action space games: [Pendulum-v1](https://github.com/opendilab/LightZero/blob/main/zoo/classic_control/pendulum/envs/pendulum_lightzero_env.py), [LunarLanderContinuous-v2](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/lunarlander/envs/lunarlander_env.py), [BipedalWalker-v3](https://github.com/opendilab/LightZero/blob/main/zoo/box2d/bipedalwalker/envs/bipedalwalker_env.py), [Hopper-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py) and [Walker2d-v3](https://github.com/opendilab/LightZero/blob/main/zoo/mujoco/envs/mujoco_lightzero_env.py).
> Here, ``Factored Policy`` means the agent learns a policy network that outputs a categorical distribution; after manual discretization, the action space dimensions of the five environments above are 11, 49 (7^2), 256 (4^4), 64 (4^3), and 4096 (4^6), respectively. ``Gaussian Policy`` means the agent learns a policy network that directly outputs the parameters (mu and sigma) of a Gaussian distribution.

<p align="center">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="23%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/pendulum_main.png" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/ablation/pendulum_sez_K.png" alt="Image Description 2" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/lunarlander_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>

<p align="center">
<img src="assets/benchmark/main/bipedalwalker_main.png" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/hopper_main.pdf" alt="Image Description 1" width="33%" height="auto" style="margin: 0 1%;">
<img src="assets/benchmark/main/walker2d_main.pdf" alt="Image Description 3" width="33%" height="auto" style="margin: 0 1%;">
</p>

Below are the benchmark results of [AlphaZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/alphazero.py) and [MuZero](https://github.com/opendilab/LightZero/blob/main/lzero/policy/muzero.py) on two board games: [TicTacToe](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/tictactoe/envs/tictactoe_env.py) and [Gomoku](https://github.com/opendilab/LightZero/blob/main/zoo/board_games/gomoku/envs/gomoku_env.py):
4 changes: 4 additions & 0 deletions lzero/policy/efficientzero.py
@@ -90,6 +90,10 @@ class EfficientZeroPolicy(Policy):
augmentation=['shift', 'intensity'],

# ******* learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...
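
The `ignore_done` field added above mainly matters for time-limited environments, where the final `done` only marks truncation rather than a true terminal state. A minimal, hypothetical user-side override (the variable name and the other values are illustrative only, not taken from this PR), assuming the usual EasyDict-based config style:

```python
from easydict import EasyDict

# Sketch under stated assumptions: for an environment whose episodes always stop at a
# fixed step limit, ignoring the time-limit `done` keeps the bootstrapped value targets
# from being cut off artificially at truncation.
pendulum_policy_config = EasyDict(dict(
    ignore_done=True,        # treat the time-limit `done` as non-terminal during learning
    update_per_collect=100,  # illustrative values only
    batch_size=256,
))
```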
4 changes: 4 additions & 0 deletions lzero/policy/gumbel_muzero.py
@@ -91,6 +91,10 @@ class GumeblMuZeroPolicy(Policy):
augmentation=['shift', 'intensity'],

# ******* learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...
8 changes: 6 additions & 2 deletions lzero/policy/muzero.py
@@ -73,9 +73,9 @@ class MuZeroPolicy(Policy):
collector_env_num=8,
# (int) The number of environments used in evaluating policy.
evaluator_env_num=3,
# (str) The type of environment. Options is ['not_board_games', 'board_games'].
# (str) The type of environment. Options are ['not_board_games', 'board_games'].
env_type='not_board_games',
# (str) The type of battle mode. Options is ['play_with_bot_mode', 'self_play_mode'].
# (str) The type of battle mode. Options are ['play_with_bot_mode', 'self_play_mode'].
battle_mode='play_with_bot_mode',
# (bool) Whether to monitor extra statistics in tensorboard.
monitor_extra_statistics=True,
@@ -91,6 +91,10 @@ class MuZeroPolicy(Policy):
augmentation=['shift', 'intensity'],

# ******* learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...
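
For reference, the `env_type` / `battle_mode` pair documented in the hunk above distinguishes board games from single-agent tasks. A small illustrative sketch (names and pairings are our assumptions, not part of this PR):

```python
from easydict import EasyDict

# Board games pair env_type='board_games' with one of the two battle modes, while
# single-agent tasks such as the new MuJoCo environments keep the defaults shown above.
tictactoe_policy = EasyDict(dict(
    env_type='board_games',
    battle_mode='self_play_mode',      # or 'play_with_bot_mode'
))

hopper_policy = EasyDict(dict(
    env_type='not_board_games',
    battle_mode='play_with_bot_mode',  # default; effectively unused for single-agent control
))
```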
6 changes: 5 additions & 1 deletion lzero/policy/sampled_efficientzero.py
@@ -96,7 +96,11 @@ class SampledEfficientZeroPolicy(Policy):
# (list) The style of augmentation.
augmentation=['shift', 'intensity'],

# ******* learn ******
# ****** learn ******
# (bool) Whether to ignore the done flag in the training data. Typically, this value is set to False.
# However, for some environments with a fixed episode length, to ensure the accuracy of Q-value calculations,
# we should set it to True to avoid the influence of the done flag.
ignore_done=False,
# (int) How many updates(iterations) to train after collector's one collection.
# Bigger "update_per_collect" means bigger off-policy.
# collect data -> update policy-> collect data -> ...