feature(xcy): add ReZero algo. and related configs (#238)
* feature(xcy): print tree and reuse the node

* polish(xcy): change test file

* feature(xcy): add node graph and reuse to test

* feature(xcy): add my algorithm

* feature(xcy): add my algorithm and the config file

* feature(xcy): add reuse and search_and_save

* feature(xcy): add breakout configs

* feature(xcy): add pong config

* feature(xcy): add buffer time log

* feature(xcy): add big batch code

* feature(xcy): add big batch and speed test

* polish(xcy): polish speed test

* feature(xcy): add speed and memory log

* feature(xcy): change to final framework

* feature(xcy): add configs and reanalyze freq

* Added Untitled Diagram.drawio

* feature(xcy): add MCTS collect and change cnode

* feature(pu): add rezero configs for lunarlander/connect4/gomoku

* feature(xcy): add qbert and upndown

* feature(xcy): add reez

* polish(xcy): little typos

* polish(xcy): requirements

* polish(pu): polish configs

* polish(xcy): test ez ratio1

* feature(xcy): return to gym atari

* polish(xcy): change configs

* polish(xcy): change config

* polish(xcy): change config

* polish(pu): polish configs

* feature(xcy): fix ma and add nrma

* polish(xcy): add config

* polish(xcy): polish data process for reuse search

* feature(xcy): refactor the code

* feature(xcy): back to gym

* add some config

* feature(xcy): refactor policy collect and reuse

* polish(pu): polish game_buffer_rezero and configs

* polish(pu): polish search_with_reuse and policy

* polish(pu): polish muzero_collector

* polish(pu): polish rezero configs

* fix(pu): fix muzero_collector.py

* polish(pu): polish train_rezero.py

* polish(pu): polish buffer_reanalyze_freq

* polish(pu): polish padding action comments

* polish(pu): rename model_update_ratio to replay_ratio

* polish(pu): polish comments

* polish(pu): polish rezero configs

* polish(pu): add rezero in readme algo/env tables

---------

Co-authored-by: HarryXuancy <lhsxxcy@126.com>
Co-authored-by: HarryXuancy <52876902+HarryXuancy@users.noreply.github.com>
Co-authored-by: jiayilee65 <jiayilee65@163.com>
4 people authored Jun 28, 2024
1 parent 61e8960 commit 87de8f9
Showing 82 changed files with 3,363 additions and 313 deletions.
36 changes: 19 additions & 17 deletions README.md
@@ -122,23 +122,25 @@ LightZero is a library with a [PyTorch](https://pytorch.org/) implementation of

The environments and algorithms currently supported by LightZero are shown in the table below:

| Env./Algo.    | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero | Stochastic MuZero | UniZero |
|---------------|-----------|--------|---------------|-----------------------|---------------|-------------------|---------|
| TicTacToe     | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       |
| Gomoku        | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       |
| Connect4      | ✔         | ✔      | 🔒            | 🔒                    | 🔒            | 🔒                | ✔       |
| 2048          | ---       | ✔      | 🔒            | 🔒                    | 🔒            | ✔                 | ✔       |
| Chess         | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      |
| Go            | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      |
| CartPole      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       |
| Pendulum      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | 🔒      |
| LunarLander   | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       |
| BipedalWalker | ---       | ✔      | ✔             | ✔                     | ✔             | 🔒                | 🔒      |
| Atari         | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       |
| MuJoCo        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | 🔒      |
| MiniGrid      | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       |
| Bsuite        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       |
| Memory        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       |

| Env./Algo.    | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero | Stochastic MuZero | UniZero | ReZero |
|---------------|-----------|--------|---------------|-----------------------|---------------|-------------------|---------|--------|
| TicTacToe     | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       | 🔒     |
| Gomoku        | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       | ✔      |
| Connect4      | ✔         | ✔      | 🔒            | 🔒                    | 🔒            | 🔒                | ✔       | ✔      |
| 2048          | ---       | ✔      | 🔒            | 🔒                    | 🔒            | ✔                 | ✔       | 🔒     |
| Chess         | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      | 🔒     |
| Go            | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      | 🔒     |
| CartPole      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       | ✔      |
| Pendulum      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | 🔒      | 🔒     |
| LunarLander   | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       | 🔒     |
| BipedalWalker | ---       | ✔      | ✔             | ✔                     | ✔             | 🔒                | 🔒      | 🔒     |
| Atari         | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       | ✔      |
| MuJoCo        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | 🔒      | 🔒     |
| MiniGrid      | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       | 🔒     |
| Bsuite        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       | 🔒     |
| Memory        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       | 🔒     |


<sup>(1): "✔" means that the corresponding item is finished and well-tested.</sup>

34 changes: 17 additions & 17 deletions README.zh.md
@@ -110,23 +110,23 @@ LightZero is an MCTS algorithm library implemented with [PyTorch](https://pytorch.org/),

The environments and algorithms currently supported by LightZero are shown in the table below:

| Env./Algo.    | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero | Stochastic MuZero | UniZero |
|---------------|-----------|--------|---------------|-----------------------|---------------|-------------------|---------|
| TicTacToe     | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       |
| Gomoku        | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       |
| Connect4      | ✔         | ✔      | 🔒            | 🔒                    | 🔒            | 🔒                | ✔       |
| 2048          | ---       | ✔      | 🔒            | 🔒                    | 🔒            | ✔                 | ✔       |
| Chess         | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      |
| Go            | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      |
| CartPole      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       |
| Pendulum      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | 🔒      |
| LunarLander   | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       |
| BipedalWalker | ---       | ✔      | ✔             | ✔                     | ✔             | 🔒                | 🔒      |
| Atari         | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       |
| MuJoCo        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | 🔒      |
| MiniGrid      | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       |
| Bsuite        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       |
| Memory        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       |
| Env./Algo.    | AlphaZero | MuZero | EfficientZero | Sampled EfficientZero | Gumbel MuZero | Stochastic MuZero | UniZero | ReZero |
|---------------|-----------|--------|---------------|-----------------------|---------------|-------------------|---------|--------|
| TicTacToe     | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       | 🔒     |
| Gomoku        | ✔         | ✔      | 🔒            | 🔒                    | ✔             | 🔒                | ✔       | ✔      |
| Connect4      | ✔         | ✔      | 🔒            | 🔒                    | 🔒            | 🔒                | ✔       | ✔      |
| 2048          | ---       | ✔      | 🔒            | 🔒                    | 🔒            | ✔                 | ✔       | 🔒     |
| Chess         | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      | 🔒     |
| Go            | 🔒        | 🔒     | 🔒            | 🔒                    | 🔒            | 🔒                | 🔒      | 🔒     |
| CartPole      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       | ✔      |
| Pendulum      | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | 🔒      | 🔒     |
| LunarLander   | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       | 🔒     |
| BipedalWalker | ---       | ✔      | ✔             | ✔                     | ✔             | 🔒                | 🔒      | 🔒     |
| Atari         | ---       | ✔      | ✔             | ✔                     | ✔             | ✔                 | ✔       | ✔      |
| MuJoCo        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | 🔒      | 🔒     |
| MiniGrid      | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       | 🔒     |
| Bsuite        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       | 🔒     |
| Memory        | ---       | ✔      | ✔             | ✔                     | 🔒            | 🔒                | ✔       | 🔒     |

<sup>(1): "✔" means that the corresponding item is finished and well-tested.</sup>

4 changes: 2 additions & 2 deletions lzero/agent/alphazero.py
@@ -198,9 +198,9 @@ def train(
new_data = sum(new_data, [])

if self.cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = len(new_data)
update_per_collect = int(collected_transitions_num * self.cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * self.cfg.policy.replay_ratio)
replay_buffer.push(new_data, cur_collector_envstep=collector.envstep)

# Learn policy from collected data
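
Editor's note: the hunk above (and the analogous hunks in the other agent and entry files below) only renames `model_update_ratio` to `replay_ratio`; the fallback logic is unchanged. A minimal, self-contained sketch of that fallback, with purely illustrative numbers, is:

```python
# Minimal sketch of the update_per_collect fallback shown in the diff above.
# The numbers are illustrative, not taken from any LightZero config.
from typing import Optional


def resolve_update_per_collect(
    update_per_collect: Optional[int],
    collected_transitions_num: int,
    replay_ratio: float,
) -> int:
    """Return the number of gradient steps to run after one collection phase."""
    if update_per_collect is None:
        # e.g. 400 collected transitions * replay_ratio 0.25 -> 100 updates
        return int(collected_transitions_num * replay_ratio)
    return update_per_collect


# Example: 400 new transitions with a replay ratio of 0.25 yields 100 updates.
assert resolve_update_per_collect(None, 400, 0.25) == 100
# An explicit update_per_collect overrides the ratio.
assert resolve_update_per_collect(50, 400, 0.25) == 50
```
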
4 changes: 2 additions & 2 deletions lzero/agent/efficientzero.py
@@ -228,9 +228,9 @@ def train(
# Collect data by default config n_sample/n_episode.
new_data = collector.collect(train_iter=learner.train_iter, policy_kwargs=collect_kwargs)
if self.cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = sum([len(game_segment) for game_segment in new_data[0]])
update_per_collect = int(collected_transitions_num * self.cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * self.cfg.policy.replay_ratio)
# save returned new_data collected by the collector
replay_buffer.push_game_segments(new_data)
# remove the oldest data if the replay buffer is full.
4 changes: 2 additions & 2 deletions lzero/agent/gumbel_muzero.py
@@ -228,9 +228,9 @@ def train(
# Collect data by default config n_sample/n_episode.
new_data = collector.collect(train_iter=learner.train_iter, policy_kwargs=collect_kwargs)
if self.cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = sum([len(game_segment) for game_segment in new_data[0]])
update_per_collect = int(collected_transitions_num * self.cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * self.cfg.policy.replay_ratio)
# save returned new_data collected by the collector
replay_buffer.push_game_segments(new_data)
# remove the oldest data if the replay buffer is full.
4 changes: 2 additions & 2 deletions lzero/agent/muzero.py
@@ -228,9 +228,9 @@ def train(
# Collect data by default config n_sample/n_episode.
new_data = collector.collect(train_iter=learner.train_iter, policy_kwargs=collect_kwargs)
if self.cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = sum([len(game_segment) for game_segment in new_data[0]])
update_per_collect = int(collected_transitions_num * self.cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * self.cfg.policy.replay_ratio)
# save returned new_data collected by the collector
replay_buffer.push_game_segments(new_data)
# remove the oldest data if the replay buffer is full.
4 changes: 2 additions & 2 deletions lzero/agent/sampled_alphazero.py
@@ -198,9 +198,9 @@ def train(
new_data = sum(new_data, [])

if self.cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = len(new_data)
update_per_collect = int(collected_transitions_num * self.cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * self.cfg.policy.replay_ratio)
replay_buffer.push(new_data, cur_collector_envstep=collector.envstep)

# Learn policy from collected data
4 changes: 2 additions & 2 deletions lzero/agent/sampled_efficientzero.py
@@ -228,9 +228,9 @@ def train(
# Collect data by default config n_sample/n_episode.
new_data = collector.collect(train_iter=learner.train_iter, policy_kwargs=collect_kwargs)
if self.cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = sum([len(game_segment) for game_segment in new_data[0]])
update_per_collect = int(collected_transitions_num * self.cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * self.cfg.policy.replay_ratio)
# save returned new_data collected by the collector
replay_buffer.push_game_segments(new_data)
# remove the oldest data if the replay buffer is full.
3 changes: 2 additions & 1 deletion lzero/entry/__init__.py
@@ -4,4 +4,5 @@
from .train_muzero_with_reward_model import train_muzero_with_reward_model
from .eval_muzero import eval_muzero
from .eval_muzero_with_gym_env import eval_muzero_with_gym_env
from .train_muzero_with_gym_env import train_muzero_with_gym_env
from .train_muzero_with_gym_env import train_muzero_with_gym_env
from .train_rezero import train_rezero
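
Editor's note: a hedged usage sketch for the newly exported `train_rezero` entry point. The config module path, the `buffer_reanalyze_freq` placement and value, and the `max_env_step` keyword are assumptions modeled on the existing `train_muzero` convention, not verified against the files added in this commit.

```python
# Hedged usage sketch for the new entry point. The config module name and the
# keyword arguments besides the (main_config, create_config) pair are assumptions
# following the train_muzero convention.
from lzero.entry import train_rezero

if __name__ == "__main__":
    # Hypothetical config module; the ReZero configs added in this PR live under
    # zoo/, but their exact file names are not reproduced here.
    from zoo.atari.config.pong_rezero_config import main_config, create_config  # hypothetical path

    # Assumed knob for ReZero's periodic buffer reanalyze; the field name comes from
    # the commit messages ("polish buffer_reanalyze_freq"), the value is illustrative.
    main_config.policy.buffer_reanalyze_freq = 1

    train_rezero([main_config, create_config], seed=0, max_env_step=int(1e6))
```
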
4 changes: 2 additions & 2 deletions lzero/entry/train_alphazero.py
@@ -119,9 +119,9 @@ def train_alphazero(
new_data = collector.collect(train_iter=learner.train_iter, policy_kwargs=collect_kwargs)
new_data = sum(new_data, [])
if cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = len(new_data)
update_per_collect = int(collected_transitions_num * cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * cfg.policy.replay_ratio)
replay_buffer.push(new_data, cur_collector_envstep=collector.envstep)

# Learn policy from collected data
10 changes: 5 additions & 5 deletions lzero/entry/train_muzero.py
@@ -8,12 +8,12 @@
from ding.envs import create_env_manager
from ding.envs import get_vec_env_setting
from ding.policy import create_policy
from ding.utils import set_pkg_seed, get_rank
from ding.rl_utils import get_epsilon_greedy_fn
from ding.utils import set_pkg_seed, get_rank
from ding.worker import BaseLearner
from tensorboardX import SummaryWriter

from lzero.entry.utils import log_buffer_memory_usage
from lzero.entry.utils import log_buffer_memory_usage, log_buffer_run_time
from lzero.policy import visit_count_temperature
from lzero.policy.random_policy import LightZeroRandomPolicy
from lzero.worker import MuZeroCollector as Collector
@@ -69,7 +69,6 @@ def train_muzero(
cfg = compile_config(cfg, seed=seed, env=None, auto=True, create_cfg=create_cfg, save_cfg=True)
# Create main components: env, policy
env_fn, collector_env_cfg, evaluator_env_cfg = get_vec_env_setting(cfg.env)

collector_env = create_env_manager(cfg.env.manager, [partial(env_fn, cfg=c) for c in collector_env_cfg])
evaluator_env = create_env_manager(cfg.env.manager, [partial(env_fn, cfg=c) for c in evaluator_env_cfg])

@@ -138,6 +137,7 @@ def train_muzero(

while True:
log_buffer_memory_usage(learner.train_iter, replay_buffer, tb_logger)
log_buffer_run_time(learner.train_iter, replay_buffer, tb_logger)
collect_kwargs = {}
# set temperature for visit count distributions according to the train_iter,
# please refer to Appendix D in MuZero paper for details.
@@ -172,9 +172,9 @@ def train_muzero(
# Collect data by default config n_sample/n_episode.
new_data = collector.collect(train_iter=learner.train_iter, policy_kwargs=collect_kwargs)
if cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = sum([len(game_segment) for game_segment in new_data[0]])
update_per_collect = int(collected_transitions_num * cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * cfg.policy.replay_ratio)
# save returned new_data collected by the collector
replay_buffer.push_game_segments(new_data)
# remove the oldest data if the replay buffer is full.
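
Editor's note: `log_buffer_run_time` is imported alongside `log_buffer_memory_usage` and called once per training-loop iteration (see the earlier hunk in this file). Its implementation is not part of the shown hunks; the sketch below is only a guess at the general shape such a TensorBoard timing logger could take, and the buffer attribute names it reads are hypothetical.

```python
# Illustrative sketch of a buffer run-time logger; the attribute names read
# from the buffer are hypothetical and may differ from the actual ReZero buffer.
from tensorboardX import SummaryWriter


def log_buffer_run_time_sketch(train_iter: int, buffer, writer: SummaryWriter) -> None:
    """Write buffer timing statistics to TensorBoard, if the buffer tracks them."""
    # Guard with getattr so the sketch degrades gracefully when a field is absent.
    for tag, attr in [
        ('buffer/sample_time', 'sample_time'),
        ('buffer/reanalyze_time', 'reanalyze_time'),
        ('buffer/origin_search_time', 'origin_search_time'),
    ]:
        value = getattr(buffer, attr, None)
        if value is not None:
            writer.add_scalar(tag, value, train_iter)
```
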
4 changes: 2 additions & 2 deletions lzero/entry/train_muzero_with_gym_env.py
@@ -136,9 +136,9 @@ def train_muzero_with_gym_env(
# Collect data by default config n_sample/n_episode.
new_data = collector.collect(train_iter=learner.train_iter, policy_kwargs=collect_kwargs)
if cfg.policy.update_per_collect is None:
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the model_update_ratio.
# update_per_collect is None, then update_per_collect is set to the number of collected transitions multiplied by the replay_ratio.
collected_transitions_num = sum([len(game_segment) for game_segment in new_data[0]])
update_per_collect = int(collected_transitions_num * cfg.policy.model_update_ratio)
update_per_collect = int(collected_transitions_num * cfg.policy.replay_ratio)
# save returned new_data collected by the collector
replay_buffer.push_game_segments(new_data)
# remove the oldest data if the replay buffer is full.