Implement PPO-DNA algorithm for Atari #234
@@ -0,0 +1,17 @@
# export WANDB_ENTITY=openrlbenchmark

# comparison with PPO-DNA paper results on "Atari-5" envs
poetry install -E envpool
poetry run python -m cleanrl_utils.benchmark \
    --env-ids BattleZone-v5 DoubleDunk-v5 NameThisGame-v5 Phoenix-v5 Qbert-v5 \
    --command "poetry run python cleanrl/ppo_dna_atari_envpool.py --anneal-lr False --total-timesteps 50000000 --track" \
    --num-seeds 3 \
    --workers 1

# comparison with CleanRL ppo_atari_envpool.py
poetry install -E envpool
poetry run python -m cleanrl_utils.benchmark \
    --env-ids Pong-v5 BeamRider-v5 Breakout-v5 Tennis-v5 \
    --command "poetry run python cleanrl/ppo_dna_atari_envpool.py --track" \
    --num-seeds 3 \
    --workers 1
@@ -0,0 +1,101 @@
# Proximal Policy Gradient with Dual Network Architecture (PPO-DNA)

## Overview

PPO-DNA is a more sample-efficient variant of PPO, based on using separate optimizers and hyperparameters for the actor (policy) and critic (value) networks.

Original paper:

* [DNA: Proximal Policy Optimization with a Dual Network Architecture](https://arxiv.org/abs/2206.10027)
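
To make the core idea concrete, here is a minimal, hypothetical sketch (not taken from `ppo_dna_atari_envpool.py`) of how separate optimizers and hyperparameters for the two networks can be set up in PyTorch; the network sizes, learning rates, and epoch counts are illustrative assumptions only:

```python
# Minimal sketch of the DNA idea: the actor and critic are separate networks,
# each trained with its own Adam optimizer and its own hyperparameters.
# All sizes and hyperparameter values below are illustrative assumptions.
import torch
import torch.nn as nn

obs_dim, n_actions = 8, 6  # assumed dimensions for illustration

policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Separate optimizers: the critic can be trained more aggressively
# (e.g., more update epochs or a larger learning rate) without destabilizing the policy.
policy_optimizer = torch.optim.Adam(policy_net.parameters(), lr=2.5e-4, eps=1e-5)
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3, eps=1e-5)

policy_update_epochs = 2  # policy phase: PPO clipped-objective updates
value_update_epochs = 4   # value phase: value-regression updates
# (The full algorithm in the paper also adds a distillation phase; see the link above.)
```
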
## Implemented Variants

| Variants Implemented | Description |
| ----------- | ----------- |
| :material-github: [`ppo_dna_atari_envpool.py`](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py), :material-file-document: [docs](/rl-algorithms/ppo_dna/#ppo_dna_atari_envpoolpy) | Uses the blazing fast Envpool Atari vectorized environment. |

Below are our single-file implementations of PPO-DNA:
## `ppo_dna_atari_envpool.py`

The [ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) has the following features:

* Uses the blazing fast [Envpool](https://github.com/sail-sg/envpool) vectorized environment.
* For Atari games: it uses convolutional layers and common Atari pre-processing techniques (a typical encoder is sketched below).
* Works with Atari's pixel `Box` observation space of shape `(210, 160, 3)`
* Works with the `Discrete` action space
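
For reference, here is a sketch of the kind of convolutional encoder commonly used for Atari pixel observations in CleanRL-style agents (the "Nature CNN"); the exact layers in `ppo_dna_atari_envpool.py` may differ, so treat this as an assumption rather than the file's actual architecture:

```python
# Sketch of the standard "Nature CNN" encoder for Atari observations, assuming
# 4 stacked 84x84 grayscale frames after the usual pre-processing.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512),  # 84x84 input -> 7x7 feature map
    nn.ReLU(),
)

features = encoder(torch.zeros(1, 4, 84, 84))  # -> shape (1, 512)
```
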
???+ warning

    Note that `ppo_dna_atari_envpool.py` does not work on Windows :fontawesome-brands-windows: or macOS :fontawesome-brands-apple:. See envpool's built wheels here: [https://pypi.org/project/envpool/#files](https://pypi.org/project/envpool/#files)
### Usage

```bash
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
```
### Explanation of the logged metrics

See [related docs](/rl-algorithms/ppo/#explanation-of-the-logged-metrics) for `ppo.py`.
### Implementation details

[ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) uses a customized `RecordEpisodeStatistics` to work with envpool, but otherwise has the same implementation details as `ppo_atari.py` (see [related docs](/rl-algorithms/ppo/#implementation-details_1)).
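
Below is a simplified, hypothetical sketch of what such a vectorized episode-statistics wrapper can look like; it is not the exact wrapper in `ppo_dna_atari_envpool.py`, but it shows the idea of tracking per-environment returns and lengths as arrays because envpool steps a whole batch of environments at once:

```python
# Simplified sketch (not the exact wrapper in ppo_dna_atari_envpool.py):
# episode returns and lengths are tracked as arrays, one entry per environment.
import gym
import numpy as np

class VectorRecordEpisodeStatistics(gym.Wrapper):  # hypothetical name for illustration
    def __init__(self, env):
        super().__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)

    def reset(self, **kwargs):
        obs = super().reset(**kwargs)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)
        return obs

    def step(self, action):
        obs, rewards, dones, infos = super().step(action)
        self.episode_returns += rewards
        self.episode_lengths += 1
        # report statistics for environments whose episode just finished,
        # then reset their counters
        finished = dones.astype(bool)
        infos["episode_return"] = np.where(finished, self.episode_returns, 0.0)
        infos["episode_length"] = np.where(finished, self.episode_lengths, 0)
        self.episode_returns[finished] = 0.0
        self.episode_lengths[finished] = 0
        return obs, rewards, dones, infos
```
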
Note that the original DNA implementation uses the `StickyAction` environment pre-processing wrapper (see Machado et al., 2018[^1]), but we did not implement it in [ppo_dna_atari_envpool.py](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_dna_atari_envpool.py) because envpool does not currently support `StickyAction`.
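
For completeness, here is a generic, hypothetical sketch of a sticky-action wrapper in the spirit of Machado et al. (2018) for a single (non-vectorized) Gym environment; it is not part of this PR, precisely because envpool does not expose such a wrapper:

```python
# Hypothetical sticky-action wrapper (illustration only, not part of this PR):
# with probability `repeat_action_probability`, the previously executed action
# is repeated instead of the action the agent selected.
import gym
import numpy as np

class StickyActionWrapper(gym.Wrapper):  # illustrative name
    def __init__(self, env, repeat_action_probability=0.25):
        super().__init__(env)
        self.repeat_action_probability = repeat_action_probability
        self.last_action = 0

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if np.random.random() < self.repeat_action_probability:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```
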
### Experiment results

Below are the average episodic returns for `ppo_dna_atari_envpool.py` compared to `ppo_atari_envpool.py`.
| Environment | `ppo_dna_atari_envpool.py` | `ppo_atari_envpool.py` |
| ----------- | ----------- | ----------- |
| BattleZone-v5 (40M steps) | 74000 ± 15300 | 28700 ± 6300 |
| BeamRider-v5 (10M steps) | 5200 ± 900 | 1900 ± 530 |
| Breakout-v5 (10M steps) | 319 ± 63 | 349 ± 42 |
| DoubleDunk-v5 (40M steps) | -4.1 ± 1.0 | -2.0 ± 0.8 |
| NameThisGame-v5 (40M steps) | 19100 ± 2300 | 4400 ± 1200 |
| Phoenix-v5 (45M steps) | 186000 ± 67000 | 9900 ± 2700 |
| Pong-v5 (3M steps) | 19.5 ± 1.0 | 16.6 ± 2.4 |
| Qbert-v5 (45M steps) | 12800 ± 4200 | 11400 ± 3600 |
| Tennis-v5 (10M steps) | 19.6 ± 0.0 | -12.4 ± 2.9 |

Review discussion on the step counts in the table above:

**Reviewer:** I suggest being consistent and always using 50M steps. It's ok to compare the […] All in all, I recommend using a single set of hyperparameters to run all experiments.

**Author:** Sure, though I'm using rented cloud GPUs, so re-running has a direct cost to me. Also, I think there's already a couple of layers of evidence for PPO-DNA without running any more experiments: […] Therefore, I'm not sure what the benefit from re-running […]. With regards to hyperparameters, they are otherwise the same, except for `total_timesteps`.

**Reviewer:** Same comment as above. Please run using 50M steps.

**Author:** What I could do instead is maybe show the results at the 10M steps mark, if 40M is a problem?
Learning curves:

<div class="grid-container">
<img src="../ppo_dna/BattleZone-v5-50m-steps.png">
<img src="../ppo_dna/BattleZone-v5-50m-time.png">
<img src="../ppo_dna/BeamRider-v5-10m-steps.png">
<img src="../ppo_dna/BeamRider-v5-10m-time.png">
<img src="../ppo_dna/Breakout-v5-10m-steps.png">
<img src="../ppo_dna/Breakout-v5-10m-time.png">
<img src="../ppo_dna/DoubleDunk-v5-50m-steps.png">
<img src="../ppo_dna/DoubleDunk-v5-50m-time.png">
<img src="../ppo_dna/NameThisGame-v5-50m-steps.png">
<img src="../ppo_dna/NameThisGame-v5-50m-time.png">
<img src="../ppo_dna/Phoenix-v5-50m-steps.png">
<img src="../ppo_dna/Phoenix-v5-50m-time.png">
<img src="../ppo_dna/Pong-v5-3m-steps.png">
<img src="../ppo_dna/Pong-v5-3m-time.png">
<img src="../ppo_dna/Qbert-v5-50m-steps.png">
<img src="../ppo_dna/Qbert-v5-50m-time.png">
<img src="../ppo_dna/Tennis-v5-10m-steps.png">
<img src="../ppo_dna/Tennis-v5-10m-time.png">
</div>

Tracked experiments:

<iframe src="https://wandb.ai/jseppanen/cleanrl/reports/PPO-DNA-vs-PPO-on-Atari-Envpool--VmlldzoyMzM5Mjcw" style="width:100%; height:500px" title="PPO-DNA vs PPO on Atari Envpool"></iframe>

[^1]: Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents." Journal of Artificial Intelligence Research 61 (2018): 523-562.
**Reviewer:** Do you mean 50M steps?

**Author:** Actually, my baseline `ppo_atari_envpool.py` experiments crashed before reaching the full 50M steps, but I think the results are still valuable because they quite clearly show the performance difference between CleanRL baseline PPO and PPO-DNA.