[refactor] Run Trainers in separate threads #3690
Conversation
docs/Training-SAC.md
Outdated
Typical Range: `1`
Typical Range: `10` - `20`
What's the reasoning behind this range? It's not atypical to perform an update on every step of an agent. Granted, that doesn't scale if there are ten agents and every single one takes a step per timestep, in which case the minimum is 10. Maybe this should be written as a function of the number of agents/envs?
This was to keep in line with our previous SAC behavior, which was one update per env step (~10 agent steps). In empirical trials I noticed a small sample-efficiency improvement from pushing this down to 1, but a huge wall-clock time degradation, as the updates take too long. This was even with Snoopy Pop, the slowest env we have.
This number should be independent of the number of agents/envs, unlike num_updates. Maybe we should lower the range to 1-20?
How did you test the GhostTrainer?
Ran it for half an hour and the ELO went up a bit. I'm running a full cloud run of all envs now.
batch_update_stats: Dict[str, list] = defaultdict(list)
for _ in range(num_updates):
while self.step / self.update_steps > self.steps_per_update:
Is the new hyperparameter enforced as "every time steps_per_update steps elapse, we perform an update" or "after x steps elapse, we perform x/steps_per_update updates"? Judging from this line, is it the latter?
Aren't these the same thing?
@andrewcoh @ervteng did you work this out?
@ervteng I don't think they are the same thing. If (1) is enforced then (2) is true, but the inverse is not necessarily true. Not sure it's a big deal.
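To make the distinction concrete, here is a minimal sketch of the "catch-up" behavior implied by the `while` condition in the diff above. The class, its attribute defaults, and the `advance` signature are hypothetical; only the loop condition mirrors the actual change.

```python
class UpdateScheduler:
    """Toy model of the steps_per_update bookkeeping, not the real SAC trainer."""

    def __init__(self, steps_per_update: float) -> None:
        self.steps_per_update = steps_per_update
        self.step = 0          # total agent steps received so far
        self.update_steps = 1  # updates performed; starts at 1 to avoid dividing by zero

    def advance(self, new_steps: int, do_update) -> None:
        self.step += new_steps
        # When a large trajectory arrives, several updates run back-to-back,
        # i.e. "after x steps elapse, perform roughly x / steps_per_update updates".
        while self.step / self.update_steps > self.steps_per_update:
            do_update()
            self.update_steps += 1
```

With steps_per_update=10, a single trajectory of 100 steps triggers roughly ten back-to-back updates rather than one update at each 10-step boundary, which matches interpretation (2).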
""" | ||
Advances the trainer. Typically, this means grabbing trajectories | ||
from all subscribed trajectory queues (self.trajectory_queues), and updating | ||
a policy using the steps in them, and if needed pushing a new policy onto the right | ||
policy queues (self.policy_queues). | ||
:param empty_queue: Whether or not to empty the queue when called. For synchronous |
:param empty_queue: Whether or not to empty the queue when called. For synchronous
:param empty_queue: Whether or not to empty the trajectory queue when called. For synchronous
Removed empty_queue and am using qsize to empty the queue every time.
@@ -142,16 +146,18 @@ def advance(self) -> None:
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):
Could this be traj_queue.qsize?
👍 changed.
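For context, a minimal sketch of what draining by qsize looks like on a standard thread-safe queue; the helper name and the process callback are illustrative, not the actual Trainer.advance() code.

```python
import queue

def drain(traj_queue: "queue.Queue", process) -> None:
    # Snapshot the queue length once; anything enqueued while we are draining
    # is left for the next advance() call.
    for _ in range(traj_queue.qsize()):
        try:
            t = traj_queue.get_nowait()
        except queue.Empty:
            # Another consumer (or a race) emptied the queue early.
            break
        process(t)
```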
@@ -142,16 +146,18 @@ def advance(self) -> None:
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):
try:
t = traj_queue.get_nowait()
t = traj_queue.get(block=not empty_queue, timeout=0.05)
self._process_trajectory(t)
Should we still process the trajectory if we have exceeded our buffer size?
That's a possibility - I guess the question is whether it is better to dump trajectories or exceed the buffer size. I'm leaning towards exceeding the buffer size. @andrewcoh do you have any preferences?
Ultimately we should change this loop so that we stop getting from the queue if we need to update.
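A sketch of the variant being discussed, where the loop stops pulling from the queue once the replay buffer is full and an update is due. The buffer_len callback and max_buffer_size parameter are hypothetical stand-ins, not the actual trainer API.

```python
import queue

def drain_until_buffer_full(traj_queue: "queue.Queue", process,
                            buffer_len, max_buffer_size: int) -> None:
    # Stop getting trajectories as soon as the buffer is at capacity, instead
    # of processing everything currently queued and overshooting the limit.
    while buffer_len() < max_buffer_size:
        try:
            t = traj_queue.get(block=True, timeout=0.05)
        except queue.Empty:
            break
        process(t)
```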
… develop-sac-apex
LGTM
docs/Training-PPO.md
Outdated
### (Optional) Advanced: Disable Threading

By default, PPO model updates can happen while the environment is being stepped. To disable this behavior, for instance to maintain strict
The point of this sentence could be a bit clearer: "To disable this, do this." Then, "One may want to do this to maintain the strict on-policyness of PPO..." etc.
It should maybe also be clearer that you're violating an assumption of PPO in exchange for a training speed-up. As it's written, it feels like there's only gain to enabling threading.
Updated the docs - hopefully the new one is clearer
Fix typo Co-Authored-By: andrewcoh <54679309+andrewcoh@users.noreply.github.com>
Co-Authored-By: andrewcoh <54679309+andrewcoh@users.noreply.github.com>
…gents into develop-sac-apex
… develop-sac-apex
commit 3fed09d Author: Ervin T <ervin@unity3d.com> Date: Mon Apr 20 13:21:28 2020 -0700
    [bug-fix] Increase buffer size for SAC tests (#3813)
commit 99ed28e Author: Ervin T <ervin@unity3d.com> Date: Mon Apr 20 13:06:39 2020 -0700
    [refactor] Run Trainers in separate threads (#3690)
commit 52b7d2e Author: Chris Elion <chris.elion@unity3d.com> Date: Mon Apr 20 12:20:45 2020 -0700
    update upm-ci-utils source (#3811)
commit 89e4804 Author: Vincent-Pierre BERGES <vincentpierre@unity3d.com> Date: Mon Apr 20 12:06:59 2020 -0700
    Removing done from the llapi doc (#3810)
Proposed change(s)
This began as a fix to SAC's num_updates functionality and evolved into a bit more than that.
Changes to SAC and why they were necessary
Currently, SAC updates each time advance() is called, unless train_interval is specified, in which case it updates every train_interval steps. num_updates refers to the number of batches sampled and used per update. With these two parameters you can control how often updates happen relative to steps.
The problem arises when you scale the number of areas or envs: all of a sudden there is a much greater number of steps per update, and the parameter doesn't scale.
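A rough back-of-the-envelope comparison of the two schemes. All numbers and helper names are hypothetical; the old scheme is modeled as "one round of num_updates updates per train_interval env steps", per the description above.

```python
# Hypothetical illustration of why num_updates does not scale with the number
# of envs/agents while steps_per_update does. Not ml-agents code.

def agent_steps_per_update_old(n_envs: int, agents_per_env: int,
                               train_interval: int, num_updates: int) -> float:
    # Old scheme: every train_interval env steps, run num_updates updates.
    # One env step advances every agent in every env by one step.
    return (n_envs * agents_per_env * train_interval) / num_updates

def agent_steps_per_update_new(steps_per_update: float) -> float:
    # New scheme: one update per steps_per_update agent steps, no matter how
    # many envs/agents produced them.
    return steps_per_update

for n_envs in (1, 8, 16):
    old = agent_steps_per_update_old(n_envs, agents_per_env=10,
                                     train_interval=1, num_updates=1)
    new = agent_steps_per_update_new(10)
    print(f"{n_envs} envs: old ratio = {old}, new ratio = {new}")
# The old steps-per-update ratio grows linearly with the number of envs;
# the new one stays fixed at steps_per_update.
```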
This PR introduces a steps_per_update parameter to SAC. SAC keeps track of the number of steps it has received when a trajectory comes in, and performs the appropriate number of updates. HOWEVER, this produced a bad user experience, as the game would randomly "freeze" to do training.
Further changes
To get around this problem, this PR also puts the Trainers' advance() calls in a separate thread. We do this by making AgentManagerQueue a thread-safe blocking Queue and doing semi-blocking gets (get with a timeout) on the queues in advance().
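A minimal sketch of that scheme: the trainer's advance loop runs in a daemon thread and does semi-blocking gets on a standard thread-safe queue. The function and variable names here are illustrative, not the actual ml-agents classes.

```python
import queue
import threading

def run_trainer(traj_queue: "queue.Queue", stop_event: threading.Event,
                process_trajectory) -> None:
    # Stand-in for the trainer's advance() loop running on its own thread.
    while not stop_event.is_set():
        try:
            # Semi-blocking get: wait briefly so the thread can notice the
            # stop signal without spinning at 100% CPU.
            traj = traj_queue.get(block=True, timeout=0.05)
        except queue.Empty:
            continue
        process_trajectory(traj)

traj_queue: "queue.Queue" = queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=run_trainer, args=(traj_queue, stop, print),
                          daemon=True)
worker.start()
# ... the environment-stepping loop pushes trajectories with traj_queue.put(...) ...
stop.set()
worker.join()
```

The main thread keeps stepping the environment while the worker consumes trajectories, which is what removes the "freeze" during updates.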
Putting the trainers in a separate thread has the added benefit of (as measured on 3DBall, single environment):
Note that this slightly breaks the on-policyness of PPO; steps taken during a buffer update will use the in-between parameterizations (e.g. a step taken after the first batch will use that parameterization, one taken after the 2nd batch will use the next, and so on). I've made threading easy to disable with a flag in TrainerController, so we can turn it on/off if we see significant degradation in performance.
TODO:
- Test Ghost Trainer, change documentation.
- Test with larger number of envs.
Future: Switch to full multiprocess (requires a rethink of the StatsReporter shared between the trainer and the AgentProcessor).
Types of change(s)
Checklist
Other comments