- TensorFlow-GPU > 1.13 with eager execution, or TensorFlow 2.x
- TensorFlow Probability 0.6.0
- OpenAI Baselines
- OpenAI Gym
- Bootstrapped EBU (the base algorithm for BEBU-UCB, BEBU-IDS, and OB2I)
- Bootstrapped DQN
- EBU (we provide a clean implementation at https://github.com/Baichenjia/EBU; a sketch of the backward update appears after this list)
- UCB-Bonus (a sketch of the UCB action selection appears after this list)
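For reference, EBU (Episodic Backward Update) computes learning targets by sampling a whole episode and propagating the targets from the last transition back to the first, mixing them with the target-network estimates through a diffusion coefficient beta. A minimal NumPy sketch of that target computation, following the published EBU algorithm rather than the exact code in this repository (the function name and array layout are illustrative):

```python
import numpy as np

def compute_ebu_targets(rewards, actions, next_q_target, gamma=0.99, beta=0.5):
    """Episodic Backward Update targets for one sampled episode.

    rewards:       (T,)   rewards r_1..r_T of the episode
    actions:       (T,)   actions a_1..a_T taken in the episode
    next_q_target: (T, A) target-network Q-values for the next state of every step
    Returns y:     (T,)   regression targets for Q(s_k, a_k)
    """
    T = len(rewards)
    q_tilde = next_q_target.copy()    # temporary table updated backwards
    y = np.zeros(T)
    y[T - 1] = rewards[T - 1]         # terminal transition: no bootstrap
    for k in range(T - 2, -1, -1):
        # Diffuse the newer backward target into the entry for the action
        # actually taken at step k+1, then bootstrap from the updated row.
        a_next = actions[k + 1]
        q_tilde[k, a_next] = beta * y[k + 1] + (1.0 - beta) * q_tilde[k, a_next]
        y[k] = rewards[k] + gamma * q_tilde[k].max()
    return y
```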
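The UCB bonus selects actions optimistically over the bootstrapped heads, picking the action that maximizes the ensemble mean plus a multiple of the ensemble standard deviation. A hedged NumPy sketch of this idea (the coefficient value and function name are assumptions, not the hyper-parameters used here):

```python
import numpy as np

def ucb_action(head_q_values, ucb_coef=0.1):
    """Optimistic action selection over bootstrapped Q-heads.

    head_q_values: (K, A) array with one row of Q-values per bootstrap head.
    Returns the index of the action maximizing mean + ucb_coef * std.
    """
    mean_q = head_q_values.mean(axis=0)   # ensemble mean per action
    std_q = head_q_values.std(axis=0)     # ensemble disagreement per action
    return int(np.argmax(mean_q + ucb_coef * std_q))
```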
The following command trains an agent on "Breakout".
python run_atari.py --env BreakoutNoFrameskip-v4 --reward-type ucb --ebu
The following commands train an agent on "Breakout" with the other baseline methods.
python run_atari.py --env BreakoutNoFrameskip-v4 --ebu
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ucb --ebu
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ids --ebu
(vote action selection is used for evaluation)
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection vote --ebu
python run_atari.py --env BreakoutNoFrameskip-v4
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection vote
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ucb
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ids
Any method can be combined with the Randomized Prior Function by adding the --prior flag.
For example, run Bootstrapped DQN + Randomized Prior Function as
python run_atari.py --env BreakoutNoFrameskip-v4 --prior
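Randomized Prior Functions pair each trainable Q-head with a fixed, randomly initialized prior network whose scaled output is added to the head's output; only the trainable part receives gradients. A minimal tf.keras sketch of the idea (the layer sizes, prior_scale value, and class name are illustrative, not the exact architecture in models.py):

```python
import tensorflow as tf

class QHeadWithPrior(tf.keras.Model):
    """A single Q-head plus a frozen randomized prior network."""

    def __init__(self, num_actions, prior_scale=3.0):
        super().__init__()
        self.trainable_head = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(num_actions),
        ])
        # The prior has the same shape but is never trained.
        self.prior_head = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(num_actions),
        ])
        self.prior_head.trainable = False
        self.prior_scale = prior_scale

    def call(self, features):
        # Gradients flow only through the trainable head; the frozen prior
        # gives each head a persistent, randomized offset.
        prior_q = tf.stop_gradient(self.prior_head(features))
        return self.trainable_head(features) + self.prior_scale * prior_q
```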
- deepq.py contains the code for stepping the environment, storing experience, and saving models.
- deepq_learner.py contains the action-selection methods, the UCB bonus, and the Bootstrapped DQN/EBU training.
- replay_buffer.py contains two replay-buffer classes, for BDQN and BEBU respectively. The memory consumption has been heavily optimized (see the sketch after this list).
- models.py contains the Q-network, the Bootstrapped Q-network with multiple heads, and the Bootstrapped Q-network with the Randomized Prior Function.
- run_atari.py contains the hyper-parameter settings. Running this file starts training.
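One common way to keep an Atari replay buffer's memory footprint low is to store each 84x84 frame once as uint8 and rebuild the 4-frame stacks only when a batch is sampled. The sketch below illustrates that idea as an assumption about the kind of optimization used here; it is not a copy of replay_buffer.py, and it ignores episode boundaries and buffer wrap-around for brevity.

```python
import numpy as np

class CompactFrameBuffer:
    """Stores single uint8 frames; frame stacks are rebuilt at sample time."""

    def __init__(self, capacity, frame_shape=(84, 84), stack=4):
        self.capacity, self.stack = capacity, stack
        self.frames = np.zeros((capacity, *frame_shape), dtype=np.uint8)
        self.actions = np.zeros(capacity, dtype=np.int32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.bool_)
        self.next_idx, self.size = 0, 0

    def add(self, frame, action, reward, done):
        i = self.next_idx
        self.frames[i], self.actions[i] = frame, action
        self.rewards[i], self.dones[i] = reward, done
        self.next_idx = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def _stacked(self, idx):
        # Gather the `stack` most recent frames ending at idx (clipped at 0);
        # episode boundaries and wrap-around are ignored in this sketch.
        idxs = [max(idx - o, 0) for o in reversed(range(self.stack))]
        return np.stack([self.frames[j] for j in idxs], axis=-1)

    def sample(self, batch_size):
        idxs = np.random.randint(self.stack, self.size - 1, size=batch_size)
        obs = np.array([self._stacked(i) for i in idxs])
        next_obs = np.array([self._stacked(i + 1) for i in idxs])
        return obs, self.actions[idxs], self.rewards[idxs], next_obs, self.dones[idxs]
```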
The data for each run is stored on disk under the result directory, in a sub-directory named <env-id>-<algorithm>-<date>-<time>. Each run directory contains:
- log.txt records the episode, exploration rate, episodic rewards during training (after normalization, as used for training), episodic scores (raw scores), current timesteps, and percentage completed.
- monitor.csv is the environment monitor file produced by the logger from OpenAI Baselines.
- parameters.txt lists all hyper-parameters used in training.
- progress.csv contains the same data as log.txt, but in csv format.
- evaluate scores.txt records the evaluation of the policy for 108000 frames every 1e5 training steps with the 30 no-op evaluation protocol (see the sketch after this list).
- model_10M.h5, model_20M.h5, model_best_10M.h5, model_best_20M.h5 are the saved policy files.
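The 30 no-op protocol starts each evaluation episode with a random number of no-op actions (up to 30) before the agent takes control, and caps the episode at 108000 frames. A hedged sketch of this protocol using the old gym step API (the function name and the greedy-policy callable act_greedy are illustrative, not part of this repository):

```python
import numpy as np
import gym

def evaluate_noop_starts(env_id, act_greedy, episodes=10, max_noops=30, frame_limit=108000):
    """Average raw score with random no-op starts and a 108000-frame cap."""
    env = gym.make(env_id)
    scores = []
    for _ in range(episodes):
        obs = env.reset()
        score, frames, done = 0.0, 0, False
        # Random number of no-op actions before the agent takes control.
        for _ in range(np.random.randint(1, max_noops + 1)):
            obs, reward, done, _ = env.step(0)   # action 0 is NOOP in ALE
            score += reward
            frames += 1
            if done:
                obs, done = env.reset(), False
        while not done and frames < frame_limit:
            obs, reward, done, _ = env.step(act_greedy(obs))
            score += reward
            frames += 1
        scores.append(score)
    return float(np.mean(scores))
```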