Description
Describe the bug
I have compared the original SAC results from https://arxiv.org/abs/1812.05905 (Figure 1) and https://arxiv.org/abs/1801.01290 (Figure 1) to those from stable-baselines. There are two kinds of SAC implementations: one uses two Q-functions, the other (like the stable-baselines implementation) uses a Q-function and a V-function. Both should work. However, the stable-baselines version (I have tried 3 seeds: 0, 1, 2) does not reach the results reported in the papers. Furthermore, why are the curves so jittery/wobbly (e.g. the episode reward of Walker2d-v2), as if there were no convergence? (Is #726 the reason?)
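(For reference, my reading of the two papers: the Q-and-V variant trains a separate value network toward the soft value target V(s_t) = E_{a_t ~ pi}[Q(s_t, a_t) - log pi(a_t | s_t)] (cf. Eq. 3 of https://arxiv.org/abs/1801.01290), while the later two-Q variant drops the V network and bootstraps from the minimum of the two target Q networks.)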
Below are the episode-reward results of the stable-baselines SAC (seeds 0, 1, 2), logged by the default stable-baselines TensorBoard logging via tensorboard_log="./sac/{}_tensorboard/":
HalfCheetah-v2 (seeds 0, 1, 2): [episode-reward plot]
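Side note on the jitter: the TensorBoard episode reward logs the stochastic rollout return, so part of the wobble may just be action-sampling noise (see #726). Below is a minimal sketch of a separate deterministic evaluation I use to double-check convergence; the function name and episode count are my own choices, not part of stable-baselines:

import gym
import numpy as np

def evaluate(model, env_id='Walker2d-v2', n_episodes=10):
    # Roll out the deterministic policy and return the mean episode reward
    env = gym.make(env_id)
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            # deterministic=True takes the mean action instead of sampling,
            # which removes the action-sampling noise from the curve
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return np.mean(returns)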

Code example
reproduce.py:
import gym
import sys, os
sys.path.append('..')
from stable_baselines import GAIL, SAC
from stable_baselines.gail import ExpertDataset, generate_expert_traj

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--env_id', type=str, default='Walker2d-v2', help='Environment name, e.g. Walker2d-v2')
parser.add_argument('--expert_data_dir', type=str, default='gail_expert', help='Directory to store expert data')
parser.add_argument('--sac_ckpt_path', type=str, default=None, help='Path of a SAC checkpoint to load')
args = parser.parse_args()

# Generate expert trajectories (train expert)
print('Generating expert dataset ...')
model = SAC('MlpPolicy', args.env_id, verbose=1, tensorboard_log="./sac/{}_tensorboard/".format(args.env_id))
# create the output directory for the expert data
if not os.path.exists(args.expert_data_dir):
    os.makedirs(args.expert_data_dir)
if args.sac_ckpt_path:
    # SAC.load is a classmethod that returns a new model, so rebind it
    # instead of calling model.load(...) (which would discard the result)
    model = SAC.load(args.sac_ckpt_path, env=gym.make(args.env_id))
# generate_expert_traj first trains the model for n_timesteps, then records n_episodes
generate_expert_traj(model, os.path.join(args.expert_data_dir, 'expert_{}'.format(args.env_id)),
                     n_timesteps=10000000, n_episodes=10)

run.sh:
for ENV_ID in 'Walker2d-v2' 'Hopper-v2' 'Humanoid-v2' 'Ant-v2' 'HalfCheetah-v2'
do
    python reproduce.py --env_id $ENV_ID &
done

I use the default hyperparameters without changing them:
gamma=0.99,
learning_rate=3e-4,
buffer_size=50000,
learning_starts=100,
train_freq=1,
batch_size=64,
tau=0.005,
ent_coef='auto',
target_update_interval=1,
gradient_steps=1,
target_entropy='auto',
random_exploration=0.0
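For comparison, the papers use a much larger replay buffer (1e6) and batch size (256) than these defaults. A minimal sketch of overriding the defaults to be closer to the paper's setup; whether this closes the gap is an assumption I have not verified:

# same MlpPolicy setup as above, but with paper-style hyperparameters
model = SAC('MlpPolicy', args.env_id, verbose=1,
            buffer_size=1000000,  # paper uses 1e6, stable-baselines default is 50000
            batch_size=256,       # paper uses 256, stable-baselines default is 64
            tensorboard_log="./sac/{}_tensorboard/".format(args.env_id))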
System Info
Describe the characteristics of your environment:
- Ubuntu 16.04
- GPU: GTX Titan, CUDA 10.1, Driver Version: 418.56
- Python 3.7
- Tensorflow 1.15
- stable-baselines ~= 2.10.0
- mujoco-py 2.0.2.10, installed via git clone https://github.com/openai/mujoco-py and python setup.py install
- MuJoCo 200
Additional context
The performance on the MuJoCo benchmarks {HalfCheetah, Walker2d, Ant, ...}-v2 should be similar to the performance on {HalfCheetah, Walker2d, Ant, ...}-v1 (openai/gym#1293, openai/gym#834), so the environment version difference should not explain the gap.

