atari-py==0.1.1
gym==0.10.5
matplotlib==2.2.2
numpy==1.14.3
opencv-python==3.4.0.12
scipy==1.1.0
tensorflow-gpu==1.6.0
- Game Playing: Pong
- Implement an agent to play Atari games using Deep Reinforcement Learning.
- In this homework, you are required to implement Policy Gradient.
- The Pong environment is used in this homework.
- Improvements to Policy Gradient:
- Variance Reduction
- Natural Policy Gradient
- Trust Region Policy Optimization
- Proximal Policy Optimization
- Training Hints
- Reward normalization makes training more stable (see the sketch after this list)
- Action space reduction (use only the UP and DOWN actions)
- Baseline: the average reward over 30 episodes should exceed 3 in Pong
- Evaluated without OpenAI’s Atari wrapper and without reward clipping
- Improvements to Policy Gradient are allowed
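A minimal sketch of the reward-normalization hint, assuming per-episode rewards are collected in a Python list; the function name is illustrative, and resetting the running return at non-zero rewards is a common Pong-specific trick rather than something the assignment mandates:

```python
import numpy as np

def discount_and_normalize_rewards(rewards, gamma=0.99):
    """Compute discounted returns for one episode and normalize them.

    Normalizing the returns to zero mean and unit variance is the
    reward-normalization hint above: it reduces the variance of the
    policy gradient and makes training on Pong noticeably more stable.
    """
    returns = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:          # in Pong a non-zero reward ends a rally
            running = 0.0
        running = running * gamma + rewards[t]
        returns[t] = running
    returns -= returns.mean()
    returns /= returns.std() + 1e-8  # avoid division by zero
    return returns
```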
$ python3.6 test.py --test_pg
ep 0, reward: 18.000000
ep 1, reward: 17.000000
ep 2, reward: 18.000000
ep 3, reward: 19.000000
ep 4, reward: 13.000000
ep 5, reward: 17.000000
ep 6, reward: 9.000000
ep 7, reward: 20.000000
ep 8, reward: 16.000000
ep 9, reward: 17.000000
ep 10, reward: 12.000000
ep 11, reward: 19.000000
ep 12, reward: 16.000000
ep 13, reward: 18.000000
ep 14, reward: 18.000000
ep 15, reward: 19.000000
ep 16, reward: 15.000000
ep 17, reward: 12.000000
ep 18, reward: 15.000000
ep 19, reward: 21.000000
ep 20, reward: 17.000000
ep 21, reward: 18.000000
ep 22, reward: 13.000000
ep 23, reward: 18.000000
ep 24, reward: 19.000000
ep 25, reward: 20.000000
ep 26, reward: 17.000000
ep 27, reward: 15.000000
ep 28, reward: 14.000000
ep 29, reward: 14.000000
Run 30 episodes
Mean: 16.466666666666665
- Learning Curve of Original Policy Gradient
- Learning Curve of Policy Gradient with Proximal Policy Optimization (PPO)
- Comparison of Original PG and PG with PPO (a sketch of the PPO clipped objective follows)
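A minimal sketch of the PPO clipped surrogate objective, assuming the log-probabilities and advantage estimates are already available as NumPy arrays; the function and argument names are illustrative only:

```python
import numpy as np

def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO-Clip surrogate objective (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and the behaviour (pre-update) policy.
    advantages: estimated advantages for those actions.
    """
    ratio = np.exp(new_logp - old_logp)                         # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()                # pessimistic bound
```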
- Game Playing: Breakout
- Implement an agent to play Atari games using Deep Reinforcement Learning.
- In this homework, you are required to implement Deep Q-Learning (DQN).
- The Breakout environment is used in this homework.
- Improvements to DQN:
- Double Q-Learning
- Dueling Network
- Prioritized Replay Memory
- Noisy DQN
- Distributional DQN
- Training Hints
- Actions should be selected ε-greedily (see the sketch after this list)
- Take a random action with probability ε
- Apply this in testing as well
- Linearly decay ε from 1.0 to some small value, e.g. 0.025
- Decay it once per step
- The randomness is for exploration, since the agent is weak at the start of training
- Hyperparameters
- Replay memory size: 10000
- Update the online (current) network every 4 steps
- Update the target network every 1000 steps
- Learning rate: 1e-4
- Batch size: 32
- Baseline: the average reward over 100 episodes should exceed 40 in Breakout
- Trained with OpenAI’s Atari wrapper & reward clipping
- The reward is unclipped when testing
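A minimal sketch of the ε-greedy schedule and the update cadence implied by the hints and hyperparameters above; the decay length, function names, and the `q_values` argument are assumptions made for illustration, not part of the provided starter code:

```python
import random

import numpy as np

EPS_START, EPS_END = 1.0, 0.025
DECAY_STEPS = 1000000            # assumption: length of the linear decay
UPDATE_ONLINE_EVERY = 4          # update the online network every 4 steps
UPDATE_TARGET_EVERY = 1000       # update the target network every 1000 steps

def epsilon_at(step):
    """Linearly decay epsilon from EPS_START to EPS_END, once per step."""
    frac = min(step / DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, step, num_actions):
    """Epsilon-greedy action selection over the online network's Q-values."""
    if random.random() < epsilon_at(step):
        return random.randrange(num_actions)     # explore
    return int(np.argmax(q_values))              # exploit

# Inside the training loop one would typically:
#   every UPDATE_ONLINE_EVERY steps, sample a batch of 32 transitions from the
#     10000-transition replay memory and take one gradient step (learning rate 1e-4);
#   every UPDATE_TARGET_EVERY steps, copy the online weights to the target network.
```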
$ python3.6 test.py --test_dqn
ep 0, reward: 1.000000
ep 1, reward: 8.000000
ep 2, reward: 8.000000
ep 3, reward: 361.000000
ep 4, reward: 7.000000
ep 5, reward: 10.000000
ep 6, reward: 111.000000
ep 7, reward: 4.000000
ep 8, reward: 195.000000
ep 9, reward: 54.000000
ep 10, reward: 0.000000
ep 11, reward: 310.000000
ep 12, reward: 8.000000
ep 13, reward: 0.000000
ep 14, reward: 0.000000
ep 15, reward: 8.000000
ep 16, reward: 65.000000
ep 17, reward: 51.000000
ep 18, reward: 238.000000
ep 19, reward: 0.000000
ep 20, reward: 8.000000
ep 21, reward: 65.000000
ep 22, reward: 51.000000
ep 23, reward: 238.000000
ep 24, reward: 0.000000
ep 25, reward: 8.000000
ep 26, reward: 65.000000
ep 27, reward: 51.000000
ep 28, reward: 238.000000
ep 29, reward: 0.000000
ep 30, reward: 1.000000
ep 31, reward: 8.000000
ep 32, reward: 8.000000
ep 33, reward: 361.000000
ep 34, reward: 7.000000
ep 35, reward: 1.000000
ep 36, reward: 8.000000
ep 37, reward: 8.000000
ep 38, reward: 361.000000
ep 39, reward: 7.000000
ep 40, reward: 8.000000
ep 41, reward: 65.000000
ep 42, reward: 51.000000
ep 43, reward: 238.000000
ep 44, reward: 0.000000
ep 45, reward: 10.000000
ep 46, reward: 111.000000
ep 47, reward: 4.000000
ep 48, reward: 195.000000
ep 49, reward: 54.000000
ep 50, reward: 1.000000
ep 51, reward: 8.000000
ep 52, reward: 8.000000
ep 53, reward: 361.000000
ep 54, reward: 7.000000
ep 55, reward: 0.000000
ep 56, reward: 310.000000
ep 57, reward: 8.000000
ep 58, reward: 0.000000
ep 59, reward: 0.000000
ep 60, reward: 10.000000
ep 61, reward: 111.000000
ep 62, reward: 4.000000
ep 63, reward: 195.000000
ep 64, reward: 54.000000
ep 65, reward: 8.000000
ep 66, reward: 65.000000
ep 67, reward: 51.000000
ep 68, reward: 238.000000
ep 69, reward: 0.000000
ep 70, reward: 8.000000
ep 71, reward: 65.000000
ep 72, reward: 51.000000
ep 73, reward: 238.000000
ep 74, reward: 0.000000
ep 75, reward: 10.000000
ep 76, reward: 111.000000
ep 77, reward: 4.000000
ep 78, reward: 195.000000
ep 79, reward: 54.000000
ep 80, reward: 8.000000
ep 81, reward: 65.000000
ep 82, reward: 51.000000
ep 83, reward: 238.000000
ep 84, reward: 0.000000
ep 85, reward: 10.000000
ep 86, reward: 111.000000
ep 87, reward: 4.000000
ep 88, reward: 195.000000
ep 89, reward: 54.000000
ep 90, reward: 8.000000
ep 91, reward: 65.000000
ep 92, reward: 51.000000
ep 93, reward: 238.000000
ep 94, reward: 0.000000
ep 95, reward: 10.000000
ep 96, reward: 111.000000
ep 97, reward: 4.000000
ep 98, reward: 195.000000
ep 99, reward: 54.000000
Run 100 episodes
Mean: 73.16
- Learning Curve of DQN
- Learning Curve of Dueling DQN
- Learning Curve of Double DQN (a sketch of the Double DQN target follows)
- Learning Curve of Double Dueling DQN
- Comparison of DQN, Dueling DQN, Double DQN and Double Dueling DQN
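A minimal sketch of the Double DQN bootstrap target, assuming batched Q-values from the online and target networks are available as NumPy arrays (`dones` as 0/1 floats); all names are illustrative:

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Bootstrapped targets for Double DQN.

    The online network chooses the argmax action at the next state;
    the target network evaluates that action. This decoupling is what
    reduces the overestimation bias of vanilla DQN.
    """
    best_actions = np.argmax(q_online_next, axis=1)                      # select
    next_q = q_target_next[np.arange(len(best_actions)), best_actions]  # evaluate
    return rewards + gamma * (1.0 - dones) * next_q
```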
- Game Playing: Pong and Breakout
- Implement an agent to play Atari games using Actor-Critic (a minimal advantage actor-critic sketch appears at the end of this section).
- Improvements to Actor-Critic:
- DDPG (Deep Deterministic Policy Gradient)
- ACER (Sample Efficient Actor-Critic with Experience Replay)
- A3C (Asynchronous Advantage Actor-Critic)
- A2C (Synchronous Advantage Actor Critic)
- ACKTR (Actor Critic using Kronecker-Factored Trust Region)
- Pong
- Learning Curve (Reward vs. Episode) of Actor-Critic
- Learning Curve (Reward vs. Episode) of A3C
- Comparison (Reward vs. Episode) of Actor-Critic and A3C
- Learning Curve (Reward vs. Time) of Actor-Critic
- Learning Curve (Reward vs. Time) of A3C
- Comparison (Reward vs. Time) of Actor-Critic and A3C
- Breakout
- Learning Curve (Reward vs. Episode) of Actor-Critic
- Learning Curve (Reward vs. Episode) of A3C
- Comparison (Reward vs. Episode) of Actor-Critic and A3C
- Learning Curve (Reward vs. Time) of Actor-Critic
- Learning Curve (Reward vs. Time) of A3C
- Comparison (Reward vs. Time) of Actor-Critic and A3C
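A minimal sketch of the per-batch losses for a simple advantage actor-critic update, assuming log-probabilities, critic values, and returns are already computed as NumPy arrays; the names are illustrative, and in a real framework the advantage would be detached from the critic's gradient:

```python
import numpy as np

def actor_critic_losses(log_probs, values, returns):
    """One-step losses for a simple advantage actor-critic update.

    log_probs: log pi(a_t | s_t) for the actions actually taken
    values:    critic estimates V(s_t)
    returns:   empirical (or bootstrapped) returns R_t
    """
    advantages = returns - values                    # A_t = R_t - V(s_t)
    actor_loss = -(log_probs * advantages).mean()    # policy gradient term
    critic_loss = np.mean((returns - values) ** 2)   # value regression term
    return actor_loss, critic_loss
```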