[rllib] Basic IMPALA implementation (using deepmind's reference vtrace.py) #2504
@@ -0,0 +1,189 @@
"""This is an variant of A3CPolicyGraph that uses V-trace for loss calc.
Adapted from a3c_policy_graph.py
"an variant" -> "a variant"
@@ -0,0 +1,300 @@
# Copyright 2018 Google LLC
Code inlined verbatim
@@ -0,0 +1,295 @@
"""Implements Distributed Prioritized Experience Replay.
Moved from async_sample_optimizer
@@ -1,108 +1,28 @@
-"""Implements Distributed Prioritized Experience Replay.
+"""Implements the IMPALA architecture.
Adapted from #2147
Are the 16 workers A3C? Did you try A3C with vectorized envs too?
No, this is all IMPALA. A3C takes hours to get to the same point.
On Mon, Jul 30, 2018, 10:01 PM Richard Liaw wrote:
> Is the 16 workers A3C? Did you try A3C with vectorized envs too?
tf.float32))

# The policy gradients loss
self.pi_loss = -tf.reduce_sum(
These should be reduce_mean now that concatenation is removed.
The DeepMind implementation seems to use reduce_sum; we can either keep it or change it together with A3C.
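For context on the trade-off between the two reductions, here is a minimal NumPy sketch. It is not the RLlib or DeepMind code: `pg_loss` and the toy data are hypothetical and only illustrate that a sum-reduced loss grows with the number of timesteps in the batch, while a mean-reduced loss does not.

```python
import numpy as np


def pg_loss(log_probs, advantages, reduce="sum"):
    """Toy policy-gradient loss (hypothetical helper, for illustration only)."""
    per_step = -log_probs * advantages
    if reduce == "sum":
        return per_step.sum()
    return per_step.mean()


rng = np.random.default_rng(0)
log_probs = np.log(rng.uniform(0.1, 1.0, size=64))
advantages = rng.normal(size=64)

# Repeat the same batch twice: same data, twice as many timesteps.
big_log_probs = np.tile(log_probs, 2)
big_advantages = np.tile(advantages, 2)

sum_small = pg_loss(log_probs, advantages, "sum")
sum_big = pg_loss(big_log_probs, big_advantages, "sum")
mean_small = pg_loss(log_probs, advantages, "mean")
mean_big = pg_loss(big_log_probs, big_advantages, "mean")

# The summed loss doubles with the batch; the mean loss is unchanged.
print(sum_big / sum_small, mean_big / mean_small)
```

With a sum reduction, the effective step size therefore depends on the batch size, so the learning rate would likely need retuning whenever the batch size changes. That is one reason to switch both losses together rather than only one.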
.. code-block:: bash

    python ray/python/ray/rllib/train.py -f /path/to/tuned/example.yaml
rllib train -f ...
I don't think this works if you don't pip install, so I'm going to keep it for now.
Everything else looks good, merging after yapf.
What do these changes do?
PongNoFrameskip-v4 on IMPALA scaling from 16 to 128 workers, solving Pong in <10 min. For reference, solving this env takes ~40 minutes for Ape-X and several hours for A3C.
cc @joneswong
Related issue number
#1924