
The performance of stable baselines implementation of SAC [question] #861

@Jiankai-Sun

Description

Describe the bug
I compared the original SAC results (Figure 1 of https://arxiv.org/abs/1812.05905 and Figure 1 of https://arxiv.org/abs/1801.01290) with those from stable-baselines. Yes, there are two kinds of SAC implementations: one uses two Q-functions, the other (like the stable-baselines implementation) uses a Q-function and a V-function. Both should work. However, the stable-baselines version (I tried 3 seeds: 0, 1, 2) does not reach the results reported in the papers. Furthermore, why are the curves so jittery/wobbly (e.g. the episode reward of Walker2d-v2), as if there is no convergence? (Is the reason #726?)

Below are the episode-reward curves (logged by the default stable-baselines TensorBoard logger via tensorboard_log="./sac/{}_tensorboard/") for the stable-baselines SAC implementation (seeds 0, 1, 2):

Half-Cheetah-v2 (seeds 0, 1, 2):
[episode-reward plot]

Hopper-v2 (seeds 0, 1, 2):
[episode-reward plot]

Walker2d-v2 (seeds 0, 1, 2):
[episode-reward plot]
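As an aside on the jitter: the default TensorBoard curve logs the raw per-episode reward, which is inherently noisy; a moving average makes the underlying trend much easier to judge. A minimal sketch (the `moving_average` helper is hypothetical, not part of stable-baselines):

```python
import numpy as np

def moving_average(values, window=50):
    """Smooth a noisy per-episode reward series with a simple moving average."""
    weights = np.ones(window) / window
    # 'valid' mode only keeps positions where the window fully overlaps the data
    return np.convolve(values, weights, mode='valid')

# Example: smooth a synthetic noisy reward curve of 1000 episodes
rewards = np.random.randn(1000).cumsum()
smoothed = moving_average(rewards, window=50)
print(smoothed.shape)  # (951,)
```

The smoothed series can then be plotted alongside the raw one to see whether the run is actually failing to converge or is just noisy episode-to-episode.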

Code example
Please try to provide a minimal example to reproduce the bug. Error messages and stack traces are also helpful.

reproduce.py:

import gym
import sys, os
sys.path.append('..')

from stable_baselines import GAIL, SAC
from stable_baselines.gail import ExpertDataset, generate_expert_traj

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--env_id', type=str, default='Walker2d-v2', help='Environment Name, e.g. (Walker2d-v2)')
parser.add_argument('--expert_data_dir', type=str, default='gail_expert', help='Directory to store expert data')
parser.add_argument('--sac_ckpt_path', type=str, default=None, help='Directory to load SAC ckpt')
args = parser.parse_args()

# Generate expert trajectories (train expert)
print('Generating expert dataset ...')
model = SAC('MlpPolicy', '{}'.format(args.env_id), verbose=1, tensorboard_log="./sac/{}_tensorboard/".format(args.env_id))
if not os.path.exists(args.expert_data_dir):
    os.makedirs(args.expert_data_dir)
if args.sac_ckpt_path:
    # SAC.load is a classmethod returning a new model; calling it on an
    # instance would silently discard the loaded weights
    model = SAC.load(args.sac_ckpt_path)
generate_expert_traj(model, os.path.join(args.expert_data_dir, 'expert_{}'.format(args.env_id)), n_timesteps=10000000, n_episodes=10)

run.sh (launches one run per environment in the background):

for ENV_ID in 'Walker2d-v2' 'Hopper-v2' 'Humanoid-v2' 'Ant-v2' 'HalfCheetah-v2'
do
  python reproduce.py --env_id $ENV_ID &
done

I use the default hyperparameters without any changes:

gamma=0.99, 
learning_rate=3e-4, 
buffer_size=50000,
learning_starts=100, 
train_freq=1, 
batch_size=64,
tau=0.005, 
ent_coef='auto', 
target_update_interval=1,
gradient_steps=1, 
target_entropy='auto', 
random_exploration=0.0
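Part of the gap may be explained by the defaults listed above: the cited papers use a replay buffer of 1e6 transitions and a batch size of 256, while the stable-baselines defaults are 50,000 and 64. A minimal sketch of overriding them (keyword names follow the SAC arguments listed above; the values are the ones reported in the papers):

```python
# Hypothetical override reproducing the papers' settings; the remaining
# keyword arguments keep the stable-baselines defaults shown above.
paper_kwargs = dict(
    gamma=0.99,
    learning_rate=3e-4,
    buffer_size=1_000_000,  # stable-baselines default: 50_000
    batch_size=256,         # stable-baselines default: 64
    tau=0.005,
    ent_coef='auto',
    train_freq=1,
    gradient_steps=1,
)
# model = SAC('MlpPolicy', args.env_id, verbose=1, **paper_kwargs)
```

With the default 50k buffer, old transitions are discarded long before the 1M-step runs in the papers finish, which can plausibly cap final performance on the harder MuJoCo tasks.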

System Info
Describe the characteristics of your environment:

  • Ubuntu 16.04
  • GPU: GTX Titan, CUDA 10.1, Driver Version: 418.56
  • Python 3.7
  • Tensorflow 1.15
  • stable-baselines ~= 2.10.0
  • mujoco-py 2.0.2.10 (installed from source: git clone https://github.com/openai/mujoco-py and python setup.py install)
  • mujoco 200

Additional context
The performance of the MuJoCo benchmarks {HalfCheetah, Walker2d, Ant, ...}-v2 should be similar to the corresponding -v1 versions (openai/gym#1293, openai/gym#834), so results against the papers' -v1 curves should be comparable.

Metadata

Labels: RTFM (Answer is the documentation), question (Further information is requested)
Assignees: none
Projects: none
Milestone: none