MLDS2018SPRING/hw4

Table of Contents

  4-0. Requirements
  4-1. Policy Gradient
  4-2. Deep Q Learning
  4-3. Actor-Critic

4-0. Requirements

atari-py==0.1.1
gym==0.10.5
matplotlib==2.2.2
numpy==1.14.3
opencv-python==3.4.0.12
scipy==1.1.0
tensorflow-gpu==1.6.0

4-1. Policy Gradient

Introduction

  • Game Playing: Pong
  • Implement an agent to play Atari games using Deep Reinforcement Learning.
  • In this homework, you are required to implement Policy Gradient on the Pong environment.
  • Improvements to Policy Gradient:
    • Variance Reduction
    • Natural Policy Gradient
    • Trust Region Policy Optimization
    • Proximal Policy Optimization
  • Training Hints (see the sketch after this list)
    • Reward normalization (makes training more stable)
    • Action space reduction (use only the up and down actions)
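
A minimal sketch of the two training hints, assuming a policy network with a single sigmoid output for the probability of moving up (all names and the discount factor are illustrative, not taken from this repo):

import numpy as np

GAMMA = 0.99  # discount factor (assumed value)

def discount_and_normalize(rewards, gamma=GAMMA):
    """Discounted returns plus the 'reward normalization' hint."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:          # a non-zero reward ends a Pong rally, so reset the sum
            running = 0.0
        running = running * gamma + rewards[t]
        returns[t] = running
    returns -= returns.mean()
    returns /= (returns.std() + 1e-8)
    return returns

# Action space reduction: keep only the two useful Pong actions
# (ids 2 and 3 correspond to up and down in gym's Pong).
UP, DOWN = 2, 3

def sample_action(prob_up, rng=np.random):
    """Sample UP with probability prob_up, otherwise DOWN."""
    return UP if rng.uniform() < prob_up else DOWN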

Baseline

  • Achieve an average reward over 30 episodes of at least 3 in Pong
  • Without OpenAI’s Atari wrapper & reward clipping
  • Improvements to Policy Gradient are allowed

Testing Policy Gradient

$ python3.6 test.py --test_pg

Rewards in 30 Episodes

ep 0, reward: 18.000000
ep 1, reward: 17.000000
ep 2, reward: 18.000000
ep 3, reward: 19.000000
ep 4, reward: 13.000000
ep 5, reward: 17.000000
ep 6, reward: 9.000000
ep 7, reward: 20.000000
ep 8, reward: 16.000000
ep 9, reward: 17.000000
ep 10, reward: 12.000000
ep 11, reward: 19.000000
ep 12, reward: 16.000000
ep 13, reward: 18.000000
ep 14, reward: 18.000000
ep 15, reward: 19.000000
ep 16, reward: 15.000000
ep 17, reward: 12.000000
ep 18, reward: 15.000000
ep 19, reward: 21.000000
ep 20, reward: 17.000000
ep 21, reward: 18.000000
ep 22, reward: 13.000000
ep 23, reward: 18.000000
ep 24, reward: 19.000000
ep 25, reward: 20.000000
ep 26, reward: 17.000000
ep 27, reward: 15.000000
ep 28, reward: 14.000000
ep 29, reward: 14.000000
Run 30 episodes
Mean: 16.466666666666665

Learning Curve

  • Learning Curve of Original Policy Gradient

  • Learning Curve of Policy Gradient with Proximal Policy Optimization (PPO)

  • Comparison of Original PG and PG with PPO (the clipped PPO objective is sketched below)
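
For reference, the clipped surrogate objective that distinguishes the PPO run from the original policy gradient can be sketched as follows (a minimal illustration; the clipping range and tensor names are assumptions, not values from this repo):

import tensorflow as tf

CLIP_EPS = 0.2  # PPO clipping range (common default, assumed here)

def ppo_surrogate_loss(log_prob_new, log_prob_old, advantages, clip_eps=CLIP_EPS):
    """Clipped surrogate objective: take the pessimistic minimum of the
    unclipped and clipped probability-ratio terms, negated for minimization."""
    ratio = tf.exp(log_prob_new - log_prob_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -tf.reduce_mean(tf.minimum(unclipped, clipped))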

4-2. Deep Q Learning

Introduction

  • Game Playing: Breakout
  • Implement an agent to play Atari games using Deep Reinforcement Learning.
  • In this homework, you are required to implement Deep Q-Learning (DQN) on the Breakout environment.
  • Improvements to DQN:
    • Double Q-Learning
    • Dueling Network
    • Prioritized Replay Memory
    • Noisy DQN
    • Distributional DQN
  • Training Hints (see the sketch after this list)
    • The agent should act ε-greedily
      • Take a random action with probability ε
      • Also during testing
    • Linearly decay ε from 1.0 to some small value, e.g. 0.025
      • Decay once per step
      • The randomness drives exploration; the agent is weak at the start
    • Hyperparameters
      • Replay memory size: 10000
      • Update the current (online) network every 4 steps
      • Update the target network every 1000 steps
      • Learning rate: 1e-4
      • Batch size: 32
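
A minimal sketch of the ε schedule and update cadence described above (only the numbers come from the hints; the decay horizon and all names are illustrative):

import random

EPS_START, EPS_END = 1.0, 0.025
DECAY_STEPS = 100000           # steps over which ε decays (assumed, not specified above)
UPDATE_CURRENT_EVERY = 4       # train the current network every 4 environment steps
UPDATE_TARGET_EVERY = 1000     # sync the target network every 1000 steps

def epsilon_at(step):
    """Linearly decay ε from EPS_START to EPS_END over DECAY_STEPS steps."""
    frac = min(step / DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, step, num_actions):
    """ε-greedy action selection: random action with probability ε."""
    if random.random() < epsilon_at(step):
        return random.randrange(num_actions)
    return int(max(range(num_actions), key=lambda a: q_values[a]))

def should_update_current(step):
    return step % UPDATE_CURRENT_EVERY == 0

def should_update_target(step):
    return step % UPDATE_TARGET_EVERY == 0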

Baseline

  • Achieve an average reward over 100 episodes of at least 40 in Breakout
  • With OpenAI’s Atari wrapper & reward clipping
    • We will unclip the reward when testing

Testing Deep Q Learning

$ python3.6 test.py --test_dqn

Rewards in 100 Episodes

ep 0, reward: 1.000000
ep 1, reward: 8.000000
ep 2, reward: 8.000000
ep 3, reward: 361.000000
ep 4, reward: 7.000000
ep 5, reward: 10.000000
ep 6, reward: 111.000000
ep 7, reward: 4.000000
ep 8, reward: 195.000000
ep 9, reward: 54.000000
ep 10, reward: 0.000000
ep 11, reward: 310.000000
ep 12, reward: 8.000000
ep 13, reward: 0.000000
ep 14, reward: 0.000000
ep 15, reward: 8.000000
ep 16, reward: 65.000000
ep 17, reward: 51.000000
ep 18, reward: 238.000000
ep 19, reward: 0.000000
ep 20, reward: 8.000000
ep 21, reward: 65.000000
ep 22, reward: 51.000000
ep 23, reward: 238.000000
ep 24, reward: 0.000000
ep 25, reward: 8.000000
ep 26, reward: 65.000000
ep 27, reward: 51.000000
ep 28, reward: 238.000000
ep 29, reward: 0.000000
ep 30, reward: 1.000000
ep 31, reward: 8.000000
ep 32, reward: 8.000000
ep 33, reward: 361.000000
ep 34, reward: 7.000000
ep 35, reward: 1.000000
ep 36, reward: 8.000000
ep 37, reward: 8.000000
ep 38, reward: 361.000000
ep 39, reward: 7.000000
ep 40, reward: 8.000000
ep 41, reward: 65.000000
ep 42, reward: 51.000000
ep 43, reward: 238.000000
ep 44, reward: 0.000000
ep 45, reward: 10.000000
ep 46, reward: 111.000000
ep 47, reward: 4.000000
ep 48, reward: 195.000000
ep 49, reward: 54.000000
ep 50, reward: 1.000000
ep 51, reward: 8.000000
ep 52, reward: 8.000000
ep 53, reward: 361.000000
ep 54, reward: 7.000000
ep 55, reward: 0.000000
ep 56, reward: 310.000000
ep 57, reward: 8.000000
ep 58, reward: 0.000000
ep 59, reward: 0.000000
ep 60, reward: 10.000000
ep 61, reward: 111.000000
ep 62, reward: 4.000000
ep 63, reward: 195.000000
ep 64, reward: 54.000000
ep 65, reward: 8.000000
ep 66, reward: 65.000000
ep 67, reward: 51.000000
ep 68, reward: 238.000000
ep 69, reward: 0.000000
ep 70, reward: 8.000000
ep 71, reward: 65.000000
ep 72, reward: 51.000000
ep 73, reward: 238.000000
ep 74, reward: 0.000000
ep 75, reward: 10.000000
ep 76, reward: 111.000000
ep 77, reward: 4.000000
ep 78, reward: 195.000000
ep 79, reward: 54.000000
ep 80, reward: 8.000000
ep 81, reward: 65.000000
ep 82, reward: 51.000000
ep 83, reward: 238.000000
ep 84, reward: 0.000000
ep 85, reward: 10.000000
ep 86, reward: 111.000000
ep 87, reward: 4.000000
ep 88, reward: 195.000000
ep 89, reward: 54.000000
ep 90, reward: 8.000000
ep 91, reward: 65.000000
ep 92, reward: 51.000000
ep 93, reward: 238.000000
ep 94, reward: 0.000000
ep 95, reward: 10.000000
ep 96, reward: 111.000000
ep 97, reward: 4.000000
ep 98, reward: 195.000000
ep 99, reward: 54.000000
Run 100 episodes
Mean: 73.16

Learning Curve

  • Learning Curve of DQN

  • Learning Curve of Dueling DQN

  • Learning Curve of Double DQN

  • Learning Curve of Double Dueling DQN

  • Comparison of DQN, Dueling DQN, Double DQN and Double Dueling DQN (both ideas are sketched below)
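
As a reference for the variants compared above, a minimal sketch of the two ideas, assuming numpy arrays of Q-values (names and the discount factor are illustrative, not taken from this repo's code):

import numpy as np

GAMMA = 0.99  # discount factor (assumed)

def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=GAMMA):
    """Double DQN: the online network chooses the next action,
    the target network evaluates it."""
    best_action = int(np.argmax(q_next_online))
    bootstrap = 0.0 if done else q_next_target[best_action]
    return reward + gamma * bootstrap

def dueling_q_values(value, advantages):
    """Dueling network: combine a scalar state value V(s) and per-action
    advantages A(s, a) into Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=np.float32)
    return value + advantages - advantages.mean()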

4-3. Actor-Critic

Introduction

  • Game Playing: Pong and Breakout
  • Implement an agent to play Atari games using Actor-Critic (a minimal advantage actor-critic sketch follows this list).
  • Improvements to Actor-Critic:
    • DDPG (Deep Deterministic Policy Gradient)
    • ACER (Sample Efficient Actor-Critic with Experience Replay)
    • A3C (Asynchronous Advantage Actor-Critic)
    • A2C (Synchronous Advantage Actor Critic)
    • ACKTR (Actor Critic using Kronecker-Factored Trust Region)
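
A minimal sketch of the one-step advantage actor-critic loss behind the curves below, assuming the actor's log-probabilities, the critic's value estimates, the observed returns, and the policy entropy are already available as tensors (all names and coefficients are illustrative):

import tensorflow as tf

VALUE_COEF = 0.5      # weight of the critic loss (assumed)
ENTROPY_COEF = 0.01   # entropy bonus weight for exploration (assumed)

def actor_critic_loss(log_prob, value, returns, entropy,
                      value_coef=VALUE_COEF, entropy_coef=ENTROPY_COEF):
    """Advantage actor-critic loss:
    - the actor follows log pi(a|s) weighted by the advantage (returns - V(s)),
    - the critic regresses V(s) toward the observed returns,
    - an entropy bonus discourages premature convergence."""
    advantage = returns - value
    actor_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(advantage))
    critic_loss = tf.reduce_mean(tf.square(advantage))
    return actor_loss + value_coef * critic_loss - entropy_coef * tf.reduce_mean(entropy)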

Learning Curve of Actor-Critic and A3C on Pong

  • Learning Curve (Reward vs. Episode) of Actor-Critic

  • Learning Curve (Reward vs. Episode) of A3C

  • Comparison (Reward vs. Episode) of Actor-Critic and A3C

  • Learning Curve (Reward vs. Time) of Actor-Critic

  • Learning Curve (Reward vs. Time) of A3C

  • Comparison (Reward vs. Time) of Actor-Critic and A3C

Learning Curve of Actor-Critic and A3C on Breakout

  • Learning Curve (Reward vs. Episode) of Actor-Critic

  • Learning Curve (Reward vs. Episode) of A3C

  • Comparison (Reward vs. Episode) of Actor-Critic and A3C

  • Learning Curve (Reward vs. Time) of Actor-Critic

  • Learning Curve (Reward vs. Time) of A3C

  • Comparison (Reward vs. Time) of Actor-Critic and A3C