Implement PPO-DNA algorithm for Atari #234
base: master

Conversation
* improves performance
* matches with DNA paper
@maitchison has expressed interest in helping review this PR. Thank you, Matthew! I will also try to read the paper and add some comments.
Small thing
should be
The algorithm will make many more than 50M gradient updates due to the number of mini-batches.
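For context, a rough back-of-the-envelope sketch of where the multiplier over rollout iterations comes from; the epoch and mini-batch counts below are illustrative assumptions, not the values used in this PR:

```python
# Each collected rollout batch is split into mini-batches and reused for several
# epochs, separately for the policy, value, and distillation phases, so the number
# of optimizer steps grows far faster than the number of rollout iterations.
# Hyperparameter values here are assumptions for illustration only.
num_minibatches = 4
policy_epochs, value_epochs, distill_epochs = 2, 4, 2

optimizer_steps_per_rollout = num_minibatches * (policy_epochs + value_epochs + distill_epochs)
print(optimizer_steps_per_rollout)  # 32 optimizer steps for every rollout collected
```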
@vwxyzjn sure, I added benchmarks/ppo_dna.sh. Also, maybe I haven't communicated my experiment results clearly enough; I could also try to consolidate them into one place.
Disable dropout etc. in teacher network during distillation
Disable dropout etc. during rollouts
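A minimal sketch of what these two suggestions amount to in PyTorch; the module and variable names are illustrative placeholders, not the ones used in the PR:

```python
# Put the network into eval mode so dropout/batch-norm layers behave
# deterministically, and wrap forward passes that only produce rollout actions
# or distillation targets in torch.no_grad() so no gradients are tracked.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(64, 1))
obs = torch.randn(8, 4)

# Rollouts, or the teacher (value network) forward pass during distillation:
net.eval()
with torch.no_grad():
    targets = net(obs)

# Switch back to train mode before taking gradient steps on this network.
net.train()
```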
@jseppanen wondering if you're still interested in this PR. I took a quick look at your wandb and see runs in your account for Phoenix-v5, NameThisGame-v5, and DoubleDunk-v5 at 50M steps, just not Pong. Trying to understand where we ended up here: were you reporting results at different step counts despite having 50M-step runs completed, or do we just need ppo_envpool 50M results for comparison? Just want to see if I can provide any help or guidance.
Hi @vwxyzjn, I now have some time I can put into this and would be happy to finish off the last few things that need doing. It looks like @jseppanen has got it mostly there, so it shouldn't take too long. I have access to a cluster on which I can run any additional experiments if needed.
Hey, sorry folks for not replying sooner. I looked into the PR a bit more, and it looks like @jseppanen has already addressed my comments on 50M steps. Thanks a lot. I generated the following plots using https://github.com/vwxyzjn/ppo-atari-metrics/blob/main/rlops.py
The results look good. I will go ahead and run the Pong-v5 experiments to match the 10M steps, and that should be all of the experiments. @jseppanen would you mind moving the runs from your entity to …?

Screen.Recording.2022-11-19.at.9.17.22.PM.mov

A note on environment preprocessing

The preprocessing steps should be like below, but …

```python
envs = envpool.make(
    args.env_id,
    env_type="gym",
    num_envs=args.num_envs,
    episodic_life=False,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 6
    repeat_action_probability=0.25,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 12
    noop_max=1,  # Machado et al. 2017 (Revisiting ALE: Eval protocols) p. 12 (no-op is deprecated in favor of sticky actions)
    max_episode_steps=int(108000 / 4),  # Hessel et al. 2018 (Rainbow DQN), Table 3, max frames per episode
    reward_clip=True,
    seed=args.seed,
    # full_action_space=True,  # currently not supported by EnvPool; Machado et al. 2017 (Revisiting ALE: Eval protocols) Table 5
)
```
Description
Add implementation of PPO-DNA algorithm for Atari Envpool.
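For reviewers unfamiliar with the method: PPO-DNA trains separate policy and value networks and adds a distillation phase in which the value network teaches the policy network's value head, while a KL penalty keeps the policy distribution from drifting. Below is a hedged, self-contained sketch of that distillation loss as I understand it from the DNA paper; all module names, shapes, and the `beta` coefficient are illustrative and not taken from this PR's code.

```python
# Sketch of the PPO-DNA distillation phase: regress the policy network's value
# head onto the separate value network's estimates, with a KL penalty against a
# frozen snapshot of the policy so the policy itself stays (almost) unchanged.
# Names and values are illustrative assumptions only.
import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence

obs_dim, n_actions, beta = 8, 4, 1.0
policy_logits = nn.Linear(obs_dim, n_actions)  # stand-in for the policy network's action head
policy_value_head = nn.Linear(obs_dim, 1)      # value head attached to the policy network
value_net = nn.Linear(obs_dim, 1)              # separate value network (the "teacher")

obs = torch.randn(32, obs_dim)
with torch.no_grad():
    teacher_values = value_net(obs)                    # distillation targets
    old_dist = Categorical(logits=policy_logits(obs))  # frozen pre-distillation policy snapshot

new_dist = Categorical(logits=policy_logits(obs))
value_loss = ((policy_value_head(obs) - teacher_values) ** 2).mean()
kl_loss = kl_divergence(old_dist, new_dist).mean()
(value_loss + beta * kl_loss).backward()
```

The policy and value updates themselves follow the usual PPO clipped-surrogate and value-regression losses, just applied to the two separate networks with their own epoch and mini-batch settings.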
Paper reproduction (attempt)
Here are the episodic rewards after 200M environment frames (50M agent-environment interactions with the frame skip of 4), compared to Fig. 6 in the original paper:
However, I used the default networks and environment preprocessing from the CleanRL PPO Atari implementation, so they probably differ from the original paper's setup. In summary, compared against the paper's results, this implementation gets better returns on one task, comparable returns on two tasks, and worse returns on the remaining two of the five tasks.
Results from figure 6 in the paper:
Compared against the CleanRL PPO Atari Envpool implementation, this implementation performs better on six out of nine tasks. See the detailed learning curves below:
PPO-DNA vs PPO on Atari Envpool
Reference

DNA: Proximal Policy Optimization with a Dual Network Architecture

cc @maitchison
Types of changes
Checklist:
- `pre-commit run --all-files` passes (required).
- If you are adding new algorithms or your change could result in a performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.
  - tracked experiments with the `--capture-video` flag toggled on (required)
  - documentation updated and previewed via `mkdocs serve`
  - learning curves added (with `width=500` and `height=300`)
  - benchmark script added in the `benchmark` folder like `benchmark/ppo.sh`
  - `torch.backends.cuda.matmul.allow_tf32 = False`