Description
Describe the bug
I have compared the original SAC results from https://arxiv.org/abs/1812.05905 (Figure 1) and https://arxiv.org/abs/1801.01290 (Figure 1) to those from stable-baselines. There are two kinds of SAC implementations: one uses two Q-functions, the other (like the stable-baselines implementation) uses a Q-function and a V-function. Both should work. However, the stable-baselines version (I have tried 3 seeds: 0, 1, 2) does not reach the results reported in the papers. Furthermore, why are the curves so jittery/wobbly (e.g. the episode reward of Walker2d-v2), as if there were no convergence? (Is #726 the reason?)
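(For reference, my reading of the two papers: the Q-and-V variant trains a separate value network toward the soft value target V(s_t) = E_{a_t ~ pi}[Q(s_t, a_t) - log pi(a_t | s_t)] (cf. Eq. 3 of https://arxiv.org/abs/1801.01290), while the later two-Q variant drops the V network and bootstraps from the minimum of the two target Q networks.)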
Below are the episode-reward results of the stable-baselines SAC (seeds 0, 1, 2), logged by the default stable-baselines TensorBoard logging via tensorboard_log="./sac/{}_tensorboard/":
HalfCheetah-v2 (seeds 0, 1, 2): [episode-reward plot]
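Side note on the jitter: the TensorBoard episode reward logs the stochastic rollout return, so part of the wobble may just be action-sampling noise (see #726). Below is a minimal sketch of a separate deterministic evaluation I use to double-check convergence; the function name and episode count are my own choices, not part of stable-baselines:

import gym
import numpy as np

def evaluate(model, env_id='Walker2d-v2', n_episodes=10):
    # Roll out the deterministic policy and return the mean episode reward
    env = gym.make(env_id)
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            # deterministic=True takes the mean action instead of sampling,
            # which removes the action-sampling noise from the curve
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return np.mean(returns)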

Code example
reproduce.py:
import gym
import sys, os
sys.path.append('..')
from stable_baselines import GAIL, SAC
from stable_baselines.gail import ExpertDataset, generate_expert_traj

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--env_id', type=str, default='Walker2d-v2', help='Environment name, e.g. Walker2d-v2')
parser.add_argument('--expert_data_dir', type=str, default='gail_expert', help='Directory to store expert data')
parser.add_argument('--sac_ckpt_path', type=str, default=None, help='Path of a SAC checkpoint to load')
args = parser.parse_args()

# Generate expert trajectories (train expert)
print('Generating expert dataset ...')
model = SAC('MlpPolicy', args.env_id, verbose=1, tensorboard_log="./sac/{}_tensorboard/".format(args.env_id))
# create the output directory for the expert data
if not os.path.exists(args.expert_data_dir):
    os.makedirs(args.expert_data_dir)
if args.sac_ckpt_path:
    # SAC.load is a classmethod that returns a new model, so rebind it
    # instead of calling model.load(...) (which would discard the result)
    model = SAC.load(args.sac_ckpt_path, env=gym.make(args.env_id))
# generate_expert_traj first trains the model for n_timesteps, then records n_episodes
generate_expert_traj(model, os.path.join(args.expert_data_dir, 'expert_{}'.format(args.env_id)),
                     n_timesteps=10000000, n_episodes=10)

run.sh:
for ENV_ID in 'Walker2d-v2' 'Hopper-v2' 'Humanoid-v2' 'Ant-v2' 'HalfCheetah-v2'
do
    python reproduce.py --env_id $ENV_ID &
done

I use the default hyperparameters without changing them:
gamma=0.99,
learning_rate=3e-4,
buffer_size=50000,
learning_starts=100,
train_freq=1,
batch_size=64,
tau=0.005,
ent_coef='auto',
target_update_interval=1,
gradient_steps=1,
target_entropy='auto',
random_exploration=0.0
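For comparison, the papers use a much larger replay buffer (1e6) and batch size (256) than these defaults. A minimal sketch of overriding the defaults to be closer to the paper's setup; whether this closes the gap is an assumption I have not verified:

# same MlpPolicy setup as above, but with paper-style hyperparameters
model = SAC('MlpPolicy', args.env_id, verbose=1,
            buffer_size=1000000,  # paper uses 1e6, stable-baselines default is 50000
            batch_size=256,       # paper uses 256, stable-baselines default is 64
            tensorboard_log="./sac/{}_tensorboard/".format(args.env_id))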
System Info
Describe the characteristics of your environment:
- Ubuntu 16.04
- GPU: GTX Titan, CUDA 10.1, Driver Version: 418.56
- Python 3.7
- Tensorflow 1.15
- stable-baselines ~= 2.10.0
- mujoco-py 2.0.2.10, installed via git clone https://github.com/openai/mujoco-py and python setup.py install
- MuJoCo 200
Additional context
The performance on the MuJoCo benchmarks {HalfCheetah, Walker2d, Ant, ...}-v2 should be similar to the performance on {HalfCheetah, Walker2d, Ant, ...}-v1 (openai/gym#1293, openai/gym#834), so the environment version difference should not explain the gap.

