Implement PPO-DNA algorithm for Atari #234

Open · wants to merge 47 commits into master

Conversation

@jseppanen (Contributor) commented Jul 19, 2022

Description

Add implementation of PPO-DNA algorithm for Atari Envpool.

Paper reproduction (attempt)

Here are the episodic rewards after 200M environment steps (50M environment interactions, given the frame skip of 4), compared to Fig. 6 in the original paper:

  • BattleZone: 82 000 ± 19 000 (roughly matches the ~60 000 in the paper)
  • DoubleDunk: -3.5 ± 1.1 (worse than the ~1.0 in the paper)
  • NameThisGame: 21 700 ± 2 500 (roughly matches the ~20 000 in the paper)
  • Phoenix: 225 000 ± 76 000 (better than the ~80 000 in the paper)
  • Qbert: 12 000 ± 5 000 (worse than the ~30 000 in the paper)

However, I used the default networks and environments from the CleanRL PPO Atari implementation, so they probably differ from those used in the original paper. In summary, compared with the paper's results, this implementation gets better returns on one task, comparable returns on two tasks, and worse returns on two of the five tasks.

Results from Figure 6 in the paper:
[figure: episodic return curves from Fig. 6 of the original paper]

When compared against the CleanRL PPO Atari Envpool implementation, this implementation performs better on six out of nine tasks. See the detailed learning curves here:
PPO-DNA vs PPO on Atari Envpool
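
For background on the algorithm itself: DNA keeps the policy and value functions in two separate networks and optimizes them in separate phases, then distills the value network into the policy network's value head while a KL penalty keeps the policy from drifting. The sketch below only illustrates that structure and is not the code in this PR; the tiny MLP networks, the hyperparameter names, and the assumption that advantages/returns are precomputed are all placeholders.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Policy network with an auxiliary value head that is trained only by distillation.
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, x):
        h = self.body(x)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

class ValueNet(nn.Module):
    # Separate value network; acts as the "teacher" during distillation.
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def dna_update(policy_net, value_net, pi_opt, v_opt,
               obs, actions, old_logits, advantages, returns,
               clip_coef=0.2, beta=1.0,
               policy_epochs=2, value_epochs=1, distill_epochs=2):
    # One simplified DNA iteration on a single batch of rollout data
    # (a real implementation shuffles mini-batches inside each phase).
    old_dist = torch.distributions.Categorical(logits=old_logits)
    old_logprobs = old_dist.log_prob(actions)

    # Phase 1: policy phase -- PPO clipped objective on the policy network only.
    for _ in range(policy_epochs):
        logits, _ = policy_net(obs)
        dist = torch.distributions.Categorical(logits=logits)
        ratio = (dist.log_prob(actions) - old_logprobs).exp()
        pg_loss = -torch.min(ratio * advantages,
                             ratio.clamp(1 - clip_coef, 1 + clip_coef) * advantages).mean()
        pi_opt.zero_grad()
        pg_loss.backward()
        pi_opt.step()

    # Phase 2: value phase -- regress the value network towards the returns.
    for _ in range(value_epochs):
        v_loss = 0.5 * (value_net(obs) - returns).pow(2).mean()
        v_opt.zero_grad()
        v_loss.backward()
        v_opt.step()

    # Phase 3: distillation phase -- train the policy network's value head to match
    # the value network, with a KL penalty keeping the policy close to its old outputs.
    for _ in range(distill_epochs):
        with torch.no_grad():
            targets = value_net(obs)
        logits, values = policy_net(obs)
        kl = torch.distributions.kl_divergence(
            old_dist, torch.distributions.Categorical(logits=logits)).mean()
        d_loss = 0.5 * (values - targets).pow(2).mean() + beta * kl
        pi_opt.zero_grad()
        d_loss.backward()
        pi_opt.step()

In the actual implementation each phase has its own epoch and mini-batch settings, the networks are the default CleanRL Atari CNNs mentioned above, and the advantages/returns are computed with the value network before the update.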

Reference

DNA: Proximal Policy Optimization with a Dual Network Architecture (Aitchison et al., 2022)

cc @maitchison

Types of changes

  • Bug fix
  • New feature
  • New algorithm
  • Documentation

Checklist:

  • I've read the CONTRIBUTION guide (required).
  • I have ensured pre-commit run --all-files passes (required).
  • I have updated the documentation.
  • I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run the tracked experiments. See #137 as an example PR.

  • I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
  • I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
  • I have added additional documentation and previewed the changes via mkdocs serve.
    • I have explained note-worthy implementation details.
    • I have explained the logged metrics.
    • I have added links to the original paper and related papers (if applicable).
    • I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
    • I have added the learning curves (in PNG format with width=500 and height=300).
    • I have added links to the tracked experiments.
    • I have updated the overview sections at the docs and the repo
    • I have added the commands used to run experiments in the benchmark folder like benchmark/ppo.sh
  • I have updated the tests accordingly (if applicable).
  • determine if torch.backends.cuda.matmul.allow_tf32 = False @vwxyzjn

@vwxyzjn (Owner) commented Jul 20, 2022

@maitchison has expressed interest in helping review this PR. Thank you, Matthew! I will also try to read the paper and add some comments.

@maitchison commented

Small thing

Here's the episodic rewards after 200M environment steps (50M gradient updates), compared to Fig. 6 in the original paper:

should be

Here's the episodic rewards after 200M environment steps (50M environment interactions), compared to Fig. 6 in the original paper:

The algorithm will make many more than 50M gradient updates due to the number of mini-batches.

@jseppanen (Contributor, Author) commented

@vwxyzjn sure, I added benchmarks/ppo_dna.sh. Also, maybe I haven't communicated my experiment results clearly enough; I could try to consolidate them into one place.

Disable dropout etc. in teacher network during distillation
Disable dropout etc. during rollouts
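
These two commits toggle evaluation mode so that layers such as dropout behave deterministically whenever a network is only being queried (as the distillation teacher, or during rollouts). A minimal sketch of the idea, not the PR's exact code, assuming a generic PyTorch value_net used as the teacher:

import torch
import torch.nn as nn

def distillation_targets(value_net: nn.Module, obs: torch.Tensor) -> torch.Tensor:
    # Put the teacher into eval mode so dropout / batch-norm layers are deterministic,
    # and disable gradient tracking while computing the distillation targets.
    was_training = value_net.training
    value_net.eval()
    with torch.no_grad():
        targets = value_net(obs)
    value_net.train(was_training)  # restore the previous mode for the next value phase
    return targets

The same eval()/train() bracketing applies when the networks are queried during rollouts.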
@bragajj (Contributor) commented Nov 2, 2022

@jseppanen wondering if you're still interested in this PR. I took a quick look at your wandb and see runs in your account for Phoenix-v5, NameThisGame-v5, and DoubleDunk-v5 at 50M steps, just not Pong. Trying to understand where we ended up here.

Was it that you were reporting results at different step counts despite having 50M step runs completed? Or that we just need ppo_envpool 50M results for comparison? Just want to see if I can provide any help or guidance.

@maitchison commented

Hi @vwxyzjn, I now have some time I can put into this and would be happy to finish off the last few things that need doing. It looks like @jseppanen has got it mostly there, so it shouldn't take too long. I have access to a cluster where I can run any additional experiments if needed.

@vwxyzjn (Owner) commented Nov 20, 2022

Hey, sorry folks for not replying sooner. I looked into the PR a bit more, and it looks like @jseppanen has already addressed my comments on the 50M steps. Thanks a lot. I generated the following plots using https://github.com/vwxyzjn/ppo-atari-metrics/blob/main/rlops.py:

python rlops.py --wandb-project-name envpool-atari \
    --wandb-entity openrlbenchmark \
    --filters 'ppo_dna_atari_envpool_94fc331?wpn=cleanrl&we=jseppanen' 'ppo_atari_envpool_xla_jax_truncation?metric=charts/avg_episodic_return'   \
    --env-ids BattleZone-v5 DoubleDunk-v5 NameThisGame-v5 Phoenix-v5 Qbert-v5 Pong-v5 BeamRider-v5 Breakout-v5 Tennis-v5 \
    --output-filename compare.png --scan-history

[figure: learning curves for BattleZone-v5, DoubleDunk-v5, NameThisGame-v5, Phoenix-v5, Qbert-v5, Pong-v5, BeamRider-v5, Breakout-v5, and Tennis-v5, comparing ppo_dna_atari_envpool against ppo_atari_envpool_xla_jax_truncation]

The results look good. I will go ahead and run the Pong-v5 experiments to match the 10M steps, and that should complete the set of experiments. @jseppanen would you mind moving the runs from your entity to openrlbenchmark? You can move them as shown in the video below:

Screen.Recording.2022-11-19.at.9.17.22.PM.mov

A note on environment preprocessing

The preprocessing steps should look like the snippet below, but full_action_space=True is not currently supported by EnvPool (sail-sg/envpool#220). Let's put this note in the documentation and not block this PR any longer.

envs = envpool.make(
    args.env_id,
    env_type="gym",
    num_envs=args.num_envs,
    episodic_life=False,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 6
    repeat_action_probability=0.25,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 12
    noop_max=1,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 12 (no-op is deprecated in favor of sticky actions)
    max_episode_steps=int(108000 / 4),  # Hessel et al. 2018 (Rainbow DQN), Table 3, max frames per episode
    reward_clip=True,
    seed=args.seed,
    # full_action_space=True,  # not currently supported by EnvPool; Machado et al. 2017 (Revisiting ALE: Eval protocols), Table 5
)
