[rllib] Pull out multi-gpu optimizer as a generic class #1313
Conversation
Some comments; this introduces some nice abstractions.
```python
def concat(self, other):
    """Returns a new SampleBatch with each data column concatenated.

    >>> b1 = SampleBatch({"a": [1, 2]})
```
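The doctest is cut off in the diff view. A hedged sketch of how the completed example might read (the `b2` value and printed output are illustrative assumptions, not the PR's verbatim code):

```python
>>> b1 = SampleBatch({"a": [1, 2]})
>>> b2 = SampleBatch({"a": [3, 4, 5]})
>>> print(b1.concat(b2))
{"a": [1, 2, 3, 4, 5]}
```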
Should you label this with `Examples:`?
Done!
"""Wrapper around a dictionary with string keys and array-like values. | ||
|
||
For example, {"obs": [1, 2, 3], "reward": [0, -1, 1]} is a batch of three | ||
samples, each with an "obs" and "reward" attribute. |
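A minimal sketch of the column/row duality this docstring describes; the dict-style column access is an assumption, while `rows()` appears later in this diff:

```python
batch = SampleBatch({"obs": [1, 2, 3], "reward": [0, -1, 1]})

# Columns are array-like values keyed by string (dict-style access assumed).
assert batch["reward"][1] == -1

# rows() (used later in this diff) yields one dict per sample.
for row in batch.rows():
    print(row)  # e.g. {"obs": 1, "reward": 0}
```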
Should `shuffle` also be a method here?
Hmm, possibly. Right now the multi-GPU optimizer does a more efficient shuffle that only permutes indices, which seems just as good.
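For readers following along, index-permutation shuffling amounts to something like this (a sketch, not the optimizer's actual code):

```python
import numpy as np

columns = {"obs": [1, 2, 3], "reward": [0, -1, 1]}

# Shuffle rows by drawing one permutation of indices and fancy-indexing
# each column, rather than copying or rebuilding the batch structure.
perm = np.random.permutation(len(columns["obs"]))
shuffled = {k: np.asarray(v)[perm] for k, v in columns.items()}
```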
```python
@@ -55,20 +55,23 @@ def update_target(self):
        self.dqn_graph.update_target(self.sess)

    def sample(self):
        output = []
        obs, actions, rewards, new_obs, dones = [], [], [], [], []
```
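The hunk is truncated here; presumably the rest of `sample()` packs these per-step lists into a `SampleBatch`, roughly like this (an assumed sketch, with the column names taken from the replay code later in the diff):

```python
# ... after stepping the environment and appending to each list:
return SampleBatch({
    "obs": obs, "actions": actions, "rewards": rewards,
    "new_obs": new_obs, "dones": dones})
```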
Does it make sense to replace this with a `SyncSampler`? A little modification would be needed to support exploration, but this class would end up cleaner. Can be done in another PR, I guess.
Sure, I think it should work.
```python
import numpy as np


class SampleBatch(object):
```
So this is a nice abstraction; perhaps it makes sense to have `PartialRollout` extend this, which would provide the ability to support truncated rollouts. If so, we can do it in a separate PR.
```python
for row in s.rows():
    self.replay_buffer.add(
        row["obs"], row["actions"], row["rewards"], row["new_obs"],
        row["dones"])

if no_replay:
    return samples

# Then return a batch sampled from the buffer
if self.config["prioritized_replay"]:
```
I hope that later on, all the replay-buffer casework can get pushed into the `ReplayBuffer` itself.
It's difficult. The prioritization requires extra computation that has to be part of the loss.
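For context, the casework in question presumably looks something like this (a sketch in the style of the baselines replay buffer API; the `batch_size` and `beta` variable names are assumptions):

```python
if self.config["prioritized_replay"]:
    # Prioritized sampling also returns importance weights and the
    # sampled indices; the indices are needed to update priorities
    # from TD errors computed as part of the loss.
    (obses_t, actions, rewards, obses_tp1, dones, weights,
     batch_idxes) = self.replay_buffer.sample(batch_size, beta)
else:
    obses_t, actions, rewards, obses_tp1, dones = \
        self.replay_buffer.sample(batch_size)
```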
```python
self._update_priorities_if_needed()
self.samples_to_prioritize = (
    obses_t, actions, rewards, obses_tp1, dones, batch_idxes)
batch = SampleBatch({
```
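The construction is truncated in the diff view; it plausibly continues along these lines (the `batch_indexes` field name is an assumption inferred from the tuple above):

```python
batch = SampleBatch({
    "obs": obses_t, "actions": actions, "rewards": rewards,
    "new_obs": obses_tp1, "dones": dones,
    "batch_indexes": batch_idxes})
```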
You could just have the `replay_buffer` return a `SampleBatch`, right?
Yes, but I'm less concerned about that for now -- I'd rather keep replay_buffer unchanged as long as possible since it's still basically similar to the baselines code.
```python
[ph for _, ph in self.loss_inputs],
self.per_device_batch_size,
lambda *ph: self.local_evaluator.build_tf_loss(ph),
self.config.get("logdir", "/tmp/ray"))
```
Why this directory? Didn't the logging directory change per #1297?
Changed it to `os.getcwd()`.
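That is, the default presumably became:

```python
# Assumes `import os` at module top.
self.config.get("logdir", os.getcwd())
```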
Jenkins is failing, but otherwise the changes are fine.
What do these changes do?
First cut at pulling out the multi-GPU optimizer, by adding it for DQN. New classes:
Pong results with 42x42 and multi_gpu on and off. Perf with multi_gpu is a bit lower, but that's to be expected since we have to do a separate pass for prioritization.
Related issue number
#1291