Description
Version 3 is now online: https://github.com/DLR-RM/stable-baselines3
Hello,
Before starting the migration to tf2 for stable baselines v3, I would like to discuss some design points we should agree on.
Which tf paradigm should we use?
I would go for a pytorch-like "eager mode", wrapping the methods with `tf.function` to improve performance (as it is done here).
Define-by-run is usually easier to read and debug (and I can compare it to my internal pytorch version), and wrapping it with `tf.function` should preserve performance.
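To make this concrete, here is a minimal sketch of that style (the network and loss are purely illustrative, not an actual SB algorithm): the update is written define-by-run and then compiled with `tf.function`.

```python
import tensorflow as tf

class PolicyNet(tf.keras.Model):
    """Toy policy network, just to illustrate the eager + tf.function style."""
    def __init__(self, n_actions):
        super().__init__()
        self.dense = tf.keras.layers.Dense(64, activation="tanh")
        self.out = tf.keras.layers.Dense(n_actions)

    def call(self, obs):
        return self.out(self.dense(obs))

policy = PolicyNet(n_actions=2)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-4)

@tf.function  # compiles the eager code into a graph while keeping eager-style readability
def train_step(obs, actions, returns):
    with tf.GradientTape() as tape:
        logits = policy(obs)
        # Simple policy-gradient-style loss, for illustration only
        neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=actions, logits=logits)
        loss = tf.reduce_mean(neg_log_prob * returns)
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
    return loss
```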
What is the roadmap?
My idea would be:
- Refactor common folder (as done by @Miffyli in Refactor common #540)
- Implement one on-policy and one off-policy algorithm: I would go for PPO/TD3 and I can be in charge of that. This would allow us to discuss concrete implementation details.
- Implement the rest, in order:
  - SAC
  - A2C
  - DQN
  - DDPG
  - HER
  - TRPO
- Implement the recurrent versions?
I'm afraid the remaining ones (ACKTR, GAIL and ACER) are not the easiest ones to implement.
And for GAIL, we can refer to https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.
Are there other breaking changes we should make? Changes to the interface?
Some answers to these questions are linked here: #366
There are different things that I would like to change/add.
First, I would add evaluation to the training loop. That is to say, we allow the user to pass an `eval_env` on which the agent will be evaluated every `eval_freq` steps for `n_eval_episodes`. This is a true measure of the agent's performance, compared to the training reward.
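In code, that could look roughly like the sketch below, run inside `learn()` every `eval_freq` steps. The `evaluate` helper and its exact signature are my own placeholder (only `eval_env`, `eval_freq` and `n_eval_episodes` come from the proposal), and it assumes a plain `gym.Env` as evaluation env for simplicity.

```python
import numpy as np

def evaluate(model, eval_env, n_eval_episodes=5):
    """Run the current policy on the evaluation env (deterministic actions)
    and return the mean/std of the undiscounted episode returns."""
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs = eval_env.reset()
        done, total_reward = False, 0.0
        while not done:
            action, _states = model.predict(obs, deterministic=True)
            obs, reward, done, _info = eval_env.step(action)
            total_reward += reward
        episode_rewards.append(total_reward)
    return np.mean(episode_rewards), np.std(episode_rewards)
```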
I would like to manipulate only `VecEnv` in the algorithms (and wrap the `gym.Env` automatically if necessary); this simplifies things, since we don't have to think about what type the env is. Currently, we are using an `UnVecEnvWrapper`, which makes things complicated for DQN for instance.
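The wrapping itself could be as simple as this sketch (the `wrap_if_needed` helper name is just illustrative):

```python
from stable_baselines.common.vec_env import DummyVecEnv, VecEnv

def wrap_if_needed(env):
    # The algorithm only ever manipulates a VecEnv; a bare gym.Env
    # gets wrapped into a single-environment DummyVecEnv.
    if not isinstance(env, VecEnv):
        env = DummyVecEnv([lambda: env])
    return env
```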
Should we maintain MPI support? I would favor switching to `VecEnv` too: this removes a dependency and unifies the rest (and would maybe allow an easy way to multiprocess SAC/DDPG or TD3, cf #324). This would mean that we remove PPO1 too.
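For reference, the `VecEnv`-based route to multiprocessing would be the usual pattern below (sketch only, with a toy env id), with no MPI involved:

```python
import gym
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env(env_id, rank, seed=0):
    def _init():
        env = gym.make(env_id)
        env.seed(seed + rank)
        return env
    return _init

# 4 environment copies, each running in its own process
train_env = SubprocVecEnv([make_env("CartPole-v1", i) for i in range(4)])
```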
The next thing I would like to make default is the `Monitor` wrapper. This allows retrieving statistics about the training and would remove the need for a buggy version of `total_episode_reward_logger` for computing the reward (cf #143).
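That is, something along these lines would happen automatically when the env is set up (sketch):

```python
import gym
from stable_baselines.bench import Monitor

# Wrapping in Monitor by default: episode rewards and lengths are recorded
# by the wrapper itself, so no separate reward-logging code is needed.
env = Monitor(gym.make("CartPole-v1"), filename=None)
```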
As discussed in another issue, I would like to unify the learning rate schedules too (that would not be too difficult).
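One possible unification (sketch, the `get_schedule_fn` name is just illustrative): accept either a constant or a callable mapping the remaining training progress to the current value, and normalize everything to the callable form.

```python
def get_schedule_fn(value):
    """Turn a constant or a callable into a schedule function.

    The callable maps the remaining training progress (from 1 down to 0)
    to the current learning rate.
    """
    if callable(value):
        return value
    return lambda _progress: value

constant_lr = get_schedule_fn(3e-4)                            # always 3e-4
linear_lr = get_schedule_fn(lambda progress: 3e-4 * progress)  # linear decay
```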
I would also like to unify the parameter names (e.g. `ent_coef` vs `ent_coeff`).
Anyway, I plan to open a PR and we can then discuss it there.
Regarding the transition
As we will be switching to the keras interface (at least for most of the layers), this will break previously saved models. I propose creating scripts that convert old models to the new SB version rather than trying to be backward compatible.
Pinging @hill-a @erniejunior @AdamGleave @Miffyli
PS: I hope I did not forget any important points.
EDIT: the draft repo is here: https://github.com/Stable-Baselines-Team/stable-baselines-tf2 (PPO and TD3 included for now)