Implement PPO-DNA algorithm for Atari #234

Open · wants to merge 47 commits into `master`

Commits:
- `dc7d1f2` First draft of PPO-DNA (jseppanen, Jul 12, 2022)
- `a9e50b8` Fix distillation learning rate decay (jseppanen, Jul 12, 2022)
- `b0c4c45` Add argument for envpool threads (jseppanen, Jul 12, 2022)
- `76b2943` Add exponential averaging to obs normalization (jseppanen, Jul 12, 2022)
- `0c5230c` Seed envpool environment explicitly (jseppanen, Jul 12, 2022)
- `ee4d7a7` Bump default number of environments to 128 (jseppanen, Jul 13, 2022)
- `b440b4f` Log gradients & upload final model to w&b (jseppanen, Jul 13, 2022)
- `22a16e2` Fix wandb logging to follow --track option (jseppanen, Jul 15, 2022)
- `b2486cc` Log environment step count (jseppanen, Jul 15, 2022)
- `eb279a9` Remove unused --capture-video option (jseppanen, Jul 15, 2022)
- `7550a5f` Fix distillation batch size argument (jseppanen, Jul 19, 2022)
- `bbbdf2e` Fix deprecation warning on np.bool_ (jseppanen, Jul 19, 2022)
- `e20da82` Use correct frame skip from env (jseppanen, Jul 19, 2022)
- `f384cbb` Add docs for PPO-DNA (jseppanen, Jul 19, 2022)
- `3d1c5ba` Blacken (jseppanen, Jul 19, 2022)
- `7a66401` build docs (vwxyzjn, Jul 19, 2022)
- `89c7f9d` format table (vwxyzjn, Jul 20, 2022)
- `981201f` Add a note on environment preprocessing (vwxyzjn, Jul 20, 2022)
- `0fe7b1f` Update hyperparam defaults to match paper (jseppanen, Jul 20, 2022)
- `e50bf9e` Change order of learning stages (jseppanen, Jul 20, 2022)
- `1b91a46` Revert entropy coefficient back to 0.01 (jseppanen, Jul 20, 2022)
- `cde9ada` Replace DIY obs. normalization with gym wrapper (jseppanen, Jul 20, 2022)
- `1bd2785` Disable TF32 multiplication on Ampere devices (jseppanen, Jul 20, 2022)
- `7bd5244` Remove reward clipping & add reward normalization (jseppanen, Jul 20, 2022)
- `d2d6c11` First pass to remove differences from baseline PPO (jseppanen, Jul 21, 2022)
- `87ca49d` Revert "Change order of learning stages" (jseppanen, Jul 21, 2022)
- `88464bb` Revert "Update hyperparam defaults to match paper" (jseppanen, Jul 21, 2022)
- `94fc331` Minimize differences to baseline PPO code (jseppanen, Jul 21, 2022)
- `e014b14` Re-run experiments with code from commit 94fc331 (jseppanen, Jul 31, 2022)
- `75a37c9` Remove main() function (jseppanen, Jul 31, 2022)
- `b76da96` Merge branch 'master' into ppo-dna (vwxyzjn, Aug 1, 2022)
- `710f85c` minor refactor (vwxyzjn, Aug 1, 2022)
- `4fb584c` Add benchmark script for ppo_dna_atari_envpool.py (jseppanen, Sep 25, 2022)
- `4b09c88` Fix typo (jseppanen, Sep 25, 2022)
- `d259368` Run value network in eval mode in distillation (jseppanen, Sep 27, 2022)
- `c071298` Do rollouts in eval mode (jseppanen, Sep 28, 2022)
- `5d913a6` Merge branch 'master' into ppo-dna (vwxyzjn, Nov 20, 2022)
- `c052f44` Remove duplicate adv calculation (see #287) (vwxyzjn, Nov 20, 2022)
- `f4501be` remove `OMP_NUM_THREADS=1` (vwxyzjn, Nov 20, 2022)
- `684001a` Merge branch 'master' into ppo-dna (vwxyzjn, Nov 20, 2022)
- `95bca3b` push changes (vwxyzjn, Nov 20, 2022)
- `3e35b8a` Try matching the env initializion in the paper (vwxyzjn, Nov 20, 2022)
- `b156f7d` revert change (vwxyzjn, Nov 20, 2022)
- `f89e68d` revert changes (vwxyzjn, Nov 20, 2022)
- `6595b4c` bug fix (vwxyzjn, Nov 22, 2022)
- `52c9c55` update script (vwxyzjn, Nov 22, 2022)
- `caabea4` Merge branch 'master' into ppo-dna (vwxyzjn, Jan 12, 2023)
17 changes: 17 additions & 0 deletions benchmark/ppo_dna.sh

```bash
# export WANDB_ENTITY=openrlbenchmark

# comparison with PPO-DNA paper results on "Atari-5" envs
poetry install -E envpool
poetry run python -m cleanrl_utils.benchmark \
    --env-ids BattleZone-v5 DoubleDunk-v5 NameThisGame-v5 Phoenix-v5 Qbert-v5 \
    --command "poetry run python cleanrl/ppo_dna_atari_envpool.py --anneal-lr False --total-timesteps 50000000 --track" \
    --num-seeds 3 \
    --workers 1

# comparison with CleanRL ppo_atari_envpool.py
poetry install -E envpool
poetry run python -m cleanrl_utils.benchmark \
    --env-ids Pong-v5 BeamRider-v5 Breakout-v5 Tennis-v5 \
    --command "poetry run python cleanrl/ppo_dna_atari_envpool.py --track" \
    --num-seeds 3 \
    --workers 1
```
2 changes: 0 additions & 2 deletions cleanrl/ppo_atari_envpool.py

```diff
@@ -33,8 +33,6 @@ def parse_args():
         help="the wandb's project name")
     parser.add_argument("--wandb-entity", type=str, default=None,
         help="the entity (team) of wandb's project")
-    parser.add_argument("--capture-video", type=lambda x: bool(strtobool(x)), default=False, nargs="?", const=True,
-        help="whether to capture videos of the agent performances (check out `videos` folder)")
 
     # Algorithm specific arguments
     parser.add_argument("--env-id", type=str, default="Pong-v5",
```
423 changes: 423 additions & 0 deletions cleanrl/ppo_dna_atari_envpool.py

Large diffs are not rendered by default.

101 changes: 101 additions & 0 deletions docs/rl-algorithms/ppo_dna.md
# Proximal Policy Gradient with Dual Network Architecture (PPO-DNA)

## Overview

PPO-DNA is a more sample-efficient variant of PPO, based on separate optimizers and hyperparameters for the actor (policy) and critic (value) networks.

Original paper:

* [DNA: Proximal Policy Optimization with a Dual Network Architecture](https://arxiv.org/abs/2206.10027)
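
In outline, each update splits into a policy phase (PPO on the policy network), a value phase (return regression on the value network), and a distillation phase that fits the policy network's value head to the value network's estimates under a KL constraint on the policy. The following is a minimal, self-contained sketch of that structure, using a toy MLP and synthetic data; it is not the implementation in `ppo_dna_atari_envpool.py`, and the network sizes, learning rates, epoch counts, and `distill_beta` are illustrative placeholders only.

```python
# Minimal sketch of the DNA update structure (illustrative, not the script's code).
# Assumptions: discrete actions, toy MLPs, synthetic rollout data; the real script
# uses the Atari CNN torso, GAE, minibatching, and value-loss clipping.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

obs_dim, n_actions, batch_size = 8, 4, 64

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.actor = nn.Linear(64, n_actions)   # policy head
        self.value_head = nn.Linear(64, 1)      # value head trained during distillation

    def forward(self, x):
        h = self.torso(x)
        return self.actor(h), self.value_head(h).squeeze(-1)

policy_net = PolicyNet()
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))  # separate critic
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=2.5e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)  # critic gets its own optimizer/lr

# synthetic rollout batch standing in for real observations, actions, advantages, returns
obs = torch.randn(batch_size, obs_dim)
actions = torch.randint(n_actions, (batch_size,))
old_logprobs = torch.randn(batch_size)
advantages = torch.randn(batch_size)
returns = torch.randn(batch_size)
clip_coef, distill_beta = 0.1, 1.0

# 1) policy phase: PPO clipped objective, updates the policy network only
for _ in range(2):  # each phase can use its own number of epochs
    logits, _ = policy_net(obs)
    new_logprobs = Categorical(logits=logits).log_prob(actions)
    ratio = torch.exp(new_logprobs - old_logprobs)
    pg_loss = torch.max(-advantages * ratio,
                        -advantages * ratio.clamp(1 - clip_coef, 1 + clip_coef)).mean()
    policy_opt.zero_grad()
    pg_loss.backward()
    policy_opt.step()

# 2) value phase: return regression, updates the value network only
for _ in range(1):
    v_loss = F.mse_loss(value_net(obs).squeeze(-1), returns)
    value_opt.zero_grad()
    v_loss.backward()
    value_opt.step()

# 3) distillation phase: fit the policy network's value head to the value network's
#    estimates while a KL penalty keeps the policy distribution from drifting
with torch.no_grad():
    target_values = value_net(obs).squeeze(-1)
    old_logits, _ = policy_net(obs)
for _ in range(2):
    logits, values = policy_net(obs)
    kl = F.kl_div(F.log_softmax(logits, dim=-1), F.softmax(old_logits, dim=-1),
                  reduction="batchmean")
    distill_loss = F.mse_loss(values, target_values) + distill_beta * kl
    policy_opt.zero_grad()
    distill_loss.backward()
    policy_opt.step()
```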

## Implemented Variants


| Variants Implemented | Description |
| ----------- | ----------- |
| :material-github: [`ppo_dna_atari_envpool.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py), :material-file-document: [docs](/rl-algorithms/ppo_dna/#ppo_dna_atari_envpoolpy) | Uses the blazing fast Envpool Atari vectorized environment. |

Below are our single-file implementations of PPO-DNA:

## `ppo_dna_atari_envpool.py`

The [ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) has the following features:

* Uses the blazing fast [Envpool](https://github.com/sail-sg/envpool) vectorized environment.
* For Atari games: it uses convolutional layers and common Atari preprocessing techniques.
* Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
* Works with the `Discrete` action space

???+ warning

    Note that `ppo_dna_atari_envpool.py` does not work on Windows :fontawesome-brands-windows: or macOS :fontawesome-brands-apple:. See envpool's built wheels here: [https://pypi.org/project/envpool/#files](https://pypi.org/project/envpool/#files)


### Usage

```bash
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
```

### Explanation of the logged metrics

See [related docs](/rl-algorithms/ppo/#explanation-of-the-logged-metrics) for `ppo.py`.

### Implementation details

[ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) uses a customized `RecordEpisodeStatistics` wrapper to work with envpool, but otherwise shares the implementation details of `ppo_atari.py` (see [related docs](/rl-algorithms/ppo/#implementation-details_1)).
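
The custom wrapper is needed because envpool's vectorized gym interface returns batched arrays (one entry per sub-environment) and a dict-style `info`, which the stock gym wrapper does not handle directly. The sketch below shows the idea of accumulating per-env returns and lengths by hand; it is a simplification, not the exact wrapper in the script, and it assumes envpool's gym-style API where `info["reward"]` carries the raw per-step reward.

```python
# Simplified sketch of an episode-statistics wrapper for a batched envpool env.
# Not the script's exact wrapper; assumes the classic 4-tuple gym step API.
import gym
import numpy as np

class RecordEpisodeStatistics(gym.Wrapper):
    def __init__(self, env):
        super().__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)

    def reset(self, **kwargs):
        obs = super().reset(**kwargs)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        self.returned_episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.returned_episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        return obs

    def step(self, action):
        obs, rewards, dones, infos = super().step(action)
        self.episode_returns += infos["reward"]  # assumed: raw (unclipped) reward from envpool's info
        self.episode_lengths += 1
        # freeze statistics of finished episodes so they can be logged, then reset them
        self.returned_episode_returns = np.where(dones, self.episode_returns, self.returned_episode_returns)
        self.returned_episode_lengths = np.where(dones, self.episode_lengths, self.returned_episode_lengths)
        self.episode_returns *= 1 - dones
        self.episode_lengths *= 1 - dones
        infos["r"] = self.returned_episode_returns
        infos["l"] = self.returned_episode_lengths
        return obs, rewards, dones, infos
```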

Note that the original DNA implementation uses the `StickyAction` environment preprocessing wrapper (Machado et al., 2018)[^1], but we did not implement it in [ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) because envpool does not currently support `StickyAction`.
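
For reference, sticky actions repeat the agent's previous action with a fixed probability (0.25 in Machado et al., 2018), injecting stochasticity into the otherwise deterministic ALE. A minimal single-environment sketch of such a wrapper, shown here only to illustrate what is missing (it is not part of this PR), could look like:

```python
# Illustrative sticky-action wrapper (Machado et al., 2018); assumes a single-env gym API.
import gym
import numpy as np

class StickyAction(gym.Wrapper):
    """With probability `repeat_prob`, ignore the new action and repeat the previous one."""

    def __init__(self, env, repeat_prob=0.25):
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return super().reset(**kwargs)

    def step(self, action):
        if np.random.random() < self.repeat_prob:
            action = self.last_action  # repeat the previous action
        self.last_action = action
        return super().step(action)
```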


### Experiment results

Below are the average episodic returns for `ppo_dna_atari_envpool.py` compared to `ppo_atari_envpool.py`.


| Environment | `ppo_dna_atari_envpool.py` | `ppo_atari_envpool.py` |
| ----------- | ----------- | ----------- |
| BattleZone-v5 (40M steps) | 74000 ± 15300 | 28700 ± 6300 |
| BeamRider-v5 (10M steps) | 5200 ± 900 | 1900 ± 530 |
| Breakout-v5 (10M steps) | 319 ± 63 | 349 ± 42 |
| DoubleDunk-v5 (40M steps) | -4.1 ± 1.0 | -2.0 ± 0.8 |
| NameThisGame-v5 (40M steps) | 19100 ± 2300 | 4400 ± 1200 |
| Phoenix-v5 (45M steps) | 186000 ± 67000 | 9900 ± 2700 |
| Pong-v5 (3M steps) | 19.5 ± 1.0 | 16.6 ± 2.4 |
| Qbert-v5 (45M steps) | 12800 ± 4200 | 11400 ± 3600 |
| Tennis-v5 (10M steps) | 19.6 ± 0.0 | -12.4 ± 2.9 |

Review comments on the table:

On the `BattleZone-v5` row:

> **vwxyzjn (Owner):** Do you mean 50M steps?
>
> **jseppanen (Contributor, Author):** Actually, my baseline `ppo_atari_envpool.py` experiments crashed before reaching the full 50M steps, but I think the results are still valuable because they quite clearly show the performance difference between the CleanRL baseline PPO and PPO-DNA.

On the `Breakout-v5` row:

> **vwxyzjn (Owner), Aug 1, 2022:** I suggest being consistent and always using 50M steps. It's OK to compare against the `ppo_atari_envpool.py` runs that only have 10M steps. If you think a 50M-step comparison is warranted, please start a `ppo_atari_envpool_50M` wandb project in the `openrlbenchmark` entity and run `ppo_atari_envpool.py` for 50M steps there. All in all, I recommend using a single set of hyperparameters to run all experiments.
>
> **jseppanen (Contributor, Author):** Sure, though I'm using rented cloud GPUs, so re-running has a direct cost to me. Also, I think there are already a couple of layers of evidence for PPO-DNA without running any more experiments:
>
> 1. Primarily, I was interested in reproducing the paper's results on the five Atari envs shown in Fig. 6 of the paper. The paper used 50M steps for its experiments, so that's what I did as well. My results are not exactly the same, but they are in the same ballpark (one env performs better, two are comparable, and two are worse).
> 2. Secondarily, I wanted an apples-to-apples comparison against the baseline `ppo_atari_envpool.py` implementation, because that way the environment settings and other implementation details wouldn't distort the results. The CleanRL documentation has learning curves for Breakout, Pong, and BeamRider at 10M steps, so that's what I ran (plus I noticed some Tennis runs in the W&B project). I attempted to run the comparisons for 50M steps as well, but the baseline runs crashed (my cloud account ran dry), so I thought 40M steps was already more than enough.
>
> Therefore, I'm not sure what the benefit of re-running `ppo_atari_envpool.py` for 50M steps would be. With regard to hyperparameters, they are otherwise the same, except for `total_timesteps`.

On the `Pong-v5` row:

> **vwxyzjn (Owner):** Same comment as above. Please run using 50M steps.
>
> **jseppanen (Contributor, Author):** What I could do instead is maybe show the results at the 10M-step mark, if 40M is a problem?

Learning curves:

<div class="grid-container">
<img src="../ppo_dna/BattleZone-v5-50m-steps.png">
<img src="../ppo_dna/BattleZone-v5-50m-time.png">
<img src="../ppo_dna/BeamRider-v5-10m-steps.png">
<img src="../ppo_dna/BeamRider-v5-10m-time.png">
<img src="../ppo_dna/Breakout-v5-10m-steps.png">
<img src="../ppo_dna/Breakout-v5-10m-time.png">
<img src="../ppo_dna/DoubleDunk-v5-50m-steps.png">
<img src="../ppo_dna/DoubleDunk-v5-50m-time.png">
<img src="../ppo_dna/NameThisGame-v5-50m-steps.png">
<img src="../ppo_dna/NameThisGame-v5-50m-time.png">
<img src="../ppo_dna/Phoenix-v5-50m-steps.png">
<img src="../ppo_dna/Phoenix-v5-50m-time.png">
<img src="../ppo_dna/Pong-v5-3m-steps.png">
<img src="../ppo_dna/Pong-v5-3m-time.png">
<img src="../ppo_dna/Qbert-v5-50m-steps.png">
<img src="../ppo_dna/Qbert-v5-50m-time.png">
<img src="../ppo_dna/Tennis-v5-10m-steps.png">
<img src="../ppo_dna/Tennis-v5-10m-time.png">
</div>


Tracked experiments:

<iframe src="https://wandb.ai/jseppanen/cleanrl/reports/PPO-DNA-vs-PPO-on-Atari-Envpool--VmlldzoyMzM5Mjcw" style="width:100%; height:500px" title="PPO-DNA vs PPO on Atari Envpool"></iframe>




[^1]: Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents." Journal of Artificial Intelligence Research 61 (2018): 523-562.
Binary file added docs/rl-algorithms/ppo_dna/Pong-v5-3m-steps.png
Binary file added docs/rl-algorithms/ppo_dna/Pong-v5-3m-time.png
Binary file added docs/rl-algorithms/ppo_dna/Qbert-v5-50m-time.png
1 change: 1 addition & 0 deletions mkdocs.yml

```diff
@@ -21,6 +21,7 @@ nav:
 - rl-algorithms/ppg.md
 - rl-algorithms/ppo-rnd.md
 - rl-algorithms/rpo.md
+- rl-algorithms/ppo_dna.md
 - Advanced:
 - advanced/hyperparameter-tuning.md
 - advanced/resume-training.md
```