[refactor] Run Trainers in separate threads #3690

Merged: 49 commits into master on Apr 20, 2020

Conversation

@ervteng (Contributor) commented Mar 25, 2020

Proposed change(s)

This began as a fix to SAC's num_updates functionality and evolved into a bit more than that.

Changes to SAC and why they were necessary

Currently, SAC updates each time advance() is called, unless train_interval is specified, in which case it updates every train_interval steps. num_updates refers to the number of batches sampled and used per update. With these two parameters, you can control how often updates happen relative to steps.

The problem arises when you scale the number of areas or envs: all of a sudden there is a much greater number of steps per update, and the parameter doesn't scale.

This PR introduces a steps_per_update parameter to SAC. SAC keeps track of the number of steps it has received as trajectories come in, and performs the appropriate number of updates (see the sketch below). HOWEVER, on its own this produced a bad user experience, as the game would randomly "freeze" to do training.
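
Schematically, the steps_per_update accounting might look like this (a minimal sketch with hypothetical class and method names; the while-condition mirrors the one in this PR's diff):

```python
# Sketch of steps_per_update accounting (hypothetical names, not the
# exact ML-Agents trainer code).
class SACTrainerSketch:
    def __init__(self, steps_per_update: float):
        self.steps_per_update = steps_per_update
        self.step = 0          # total env steps received so far
        self.update_steps = 1  # updates performed so far; starts at 1
                               # here only to avoid division by zero

    def on_trajectory(self, trajectory_length: int) -> None:
        self.step += trajectory_length
        # Catch up: update until the steps-per-update ratio is back
        # at or below the target.
        while self.step / self.update_steps > self.steps_per_update:
            self._update()
            self.update_steps += 1

    def _update(self) -> None:
        pass  # sample batches from the replay buffer and train
```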

Further changes

To get around this problem, this PR also puts the Trainers' advance() calls in a separate thread. We do this by making AgentManagerQueue a thread-safe blocking Queue, and semi-blocking (get with a timeout) on the queues in advance(), as in the sketch below.
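
A minimal sketch of that pattern, assuming AgentManagerQueue wraps the standard library's queue.Queue (the real class differs in detail):

```python
import queue
from typing import Any, Optional

class AgentManagerQueueSketch:
    """Thread-safe blocking queue sketch; not the real AgentManagerQueue."""

    class Empty(Exception):
        """Raised when a non-blocking or timed get finds nothing."""

    def __init__(self, maxlen: int = 0):
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=maxlen)

    def qsize(self) -> int:
        return self._queue.qsize()

    def put(self, item: Any) -> None:
        self._queue.put(item)

    def get(self, block: bool = True, timeout: Optional[float] = None) -> Any:
        try:
            return self._queue.get(block=block, timeout=timeout)
        except queue.Empty:
            raise self.Empty from None

# Semi-blocking consumption, as in the trainer's advance(): wait briefly
# for a trajectory, then give up so the loop can do other work.
q = AgentManagerQueueSketch()
try:
    trajectory = q.get(block=True, timeout=0.05)
except AgentManagerQueueSketch.Empty:
    trajectory = None
```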

This has the added benefit (as measured on 3DBall, single environment) of:

  • 15% step throughput improvement on PPO
  • 13% step throughput improvement on SAC

Note that this slightly breaks the on-policyness of PPO: steps taken during a buffer update will use the in-between parameterizations (e.g. a step taken after the first batch will use that parameterization, one taken after the 2nd batch will use the next, and so on). I've made threading easy to disable with a flag in TrainerController (sketched below), so we can turn it off if we see significant performance degradation.
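
The toggle itself can be pictured roughly like this (hypothetical function and parameter names, not the actual TrainerController API):

```python
import threading

def _trainer_loop(trainer, stop: threading.Event) -> None:
    # Runs in a background thread; advance() semi-blocks on its queues,
    # so the loop checks the stop event at regular intervals.
    while not stop.is_set():
        trainer.advance()

def run_training(trainers, env_step, num_steps: int, threaded: bool) -> None:
    stop = threading.Event()
    if threaded:
        for trainer in trainers:
            threading.Thread(
                target=_trainer_loop, args=(trainer, stop), daemon=True
            ).start()
    for _ in range(num_steps):
        env_step()  # advance the environment
        if not threaded:
            # Synchronous mode: trainers advance between env steps,
            # preserving PPO's strict on-policy assumption.
            for trainer in trainers:
                trainer.advance()
    stop.set()
```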

TODO: test the Ghost Trainer, update the documentation, and test with a larger number of envs.
Future: switch to full multiprocessing (requires rethinking the StatsReporter shared between the trainer and the AgentProcessor).

Types of change(s)

  • Bug fix
  • New feature
  • Code refactor
  • Breaking change
  • Documentation update
  • Other (please describe)

Checklist

  • Added tests that prove my fix is effective or that my feature works
  • Updated the changelog (if applicable)
  • Updated the documentation (if applicable)
  • Updated the migration guide (if applicable)

Other comments


- Typical Range: `1`
+ Typical Range: `10` - `20`

Contributor:

What's the reasoning behind this range? It's not atypical to perform an update on every step of an agent. Granted, that doesn't scale if there are ten agents and every single one takes a step per timestep, in which case the minimum is 10. Maybe this should be written as a function of the number of agents/envs?

@ervteng (Contributor Author) Apr 3, 2020:

This was to keep in line with our previous SAC behavior, which was one update per env step (~10 agent steps). From empirical trials, I noticed a small sample-efficiency improvement pushing this down to 1, but a huge wall-clock time degradation, as the updates take too long. This was even with Snoopy Pop, the slowest env we have.

Unlike num_updates, this number should be independent of the # of agents/envs. Maybe we should lower the range to 1-20?

@andrewcoh (Contributor):

How did you test the GhostTrainer?

@ervteng ervteng requested review from harperj and removed request for harperj April 3, 2020 20:51

@ervteng (Contributor Author) commented Apr 3, 2020:

> How did you test the GhostTrainer?

Ran it for half an hour and the ELO went up a bit. Am running a full cloud run of all envs now.

batch_update_stats: Dict[str, list] = defaultdict(list)
- for _ in range(num_updates):
+ while self.step / self.update_steps > self.steps_per_update:

Contributor:

Is the new hyperparameter enforced as "every time steps_per_update steps elapse, we perform an update", or "after x steps elapse, we perform x/steps_per_update updates"? Judging from this line, is it the latter?

Contributor Author:

Aren't these the same thing?

Contributor:

@andrewcoh @ervteng did you work this out?

Contributor:

@ervteng I don't think they are the same thing. If (1) is enforced then (2) is true, but the inverse is not necessarily true. Not sure it's a big deal.
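
For what it's worth, a quick simulation of the while-condition quoted above suggests the behavior is reading (2), with updates arriving in catch-up bursts rather than at exact intervals (update_steps starts at 1 here only to avoid division by zero):

```python
steps_per_update = 10
step, update_steps = 0, 1

for incoming in [3, 25, 7, 40]:  # arbitrary trajectory lengths
    step += incoming
    while step / update_steps > steps_per_update:
        update_steps += 1  # one (simulated) update
    print(step, update_steps)

# Prints 3 1, 28 3, 35 4, 75 8: after 75 steps we've done 7 updates
# (roughly 75 / 10), but in bursts, so "one update every
# steps_per_update steps" holds only on average.
```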

@ervteng ervteng requested review from harperj and andrewcoh April 3, 2020 21:46
@ervteng ervteng changed the title [WIP] Put Trainers in separate threads [refactor] Put Trainers in separate threads Apr 3, 2020
@ervteng ervteng changed the title [refactor] Put Trainers in separate threads [refactor] Run Trainers in separate threads Apr 3, 2020
@ervteng ervteng marked this pull request as ready for review April 3, 2020 21:47
"""
Advances the trainer. Typically, this means grabbing trajectories
from all subscribed trajectory queues (self.trajectory_queues), and updating
a policy using the steps in them, and if needed pushing a new policy onto the right
policy queues (self.policy_queues).
:param empty_queue: Whether or not to empty the queue when called. For synchronous

Contributor:

Suggested change:
- :param empty_queue: Whether or not to empty the queue when called. For synchronous
+ :param empty_queue: Whether or not to empty the trajectory queue when called. For synchronous

Contributor Author:

Removed empty_queue and am now using qsize to empty the queue every time (sketched below).
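
Schematically, the resulting drain loop might look like this (a sketch using the identifiers from the diff below, not the exact merged code):

```python
# Drain however many trajectories are present when advance() is
# called; qsize() replaces the old empty_queue flag. The Empty
# handler is kept defensively in case the queue drains under us.
for _ in range(traj_queue.qsize()):
    try:
        t = traj_queue.get_nowait()
    except AgentManagerQueue.Empty:
        break
    self._process_trajectory(t)
```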

@@ -142,16 +146,18 @@ def advance(self) -> None:
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):

Contributor:

Could this be traj_queue.qsize?

Contributor Author:

👍 changed.

@@ -142,16 +146,18 @@ def advance(self) -> None:
# being emptied, the trajectories in the queue are on-policy.
for _ in range(traj_queue.maxlen):
try:
-    t = traj_queue.get_nowait()
+    t = traj_queue.get(block=not empty_queue, timeout=0.05)
     self._process_trajectory(t)

Contributor:

should we still process the trajectory if we have exceeded our buffer size?

Contributor Author:

That's a possibility - I guess the question is whether it is better to dump trajectories or exceed the buffer size. I'm leaning towards exceeding the buffer size. @andrewcoh do you have any preferences?

Ultimately we should change this loop so that we stop getting from the queue if we need to update.
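
That future change might look something like this sketch (the _is_ready_update readiness check is illustrative, not a reference to specific merged code):

```python
# Stop draining once enough experience has accumulated for an update,
# instead of processing everything currently queued.
for _ in range(traj_queue.qsize()):
    if self._is_ready_update():  # illustrative readiness check
        break
    try:
        t = traj_queue.get_nowait()
    except AgentManagerQueue.Empty:
        break
    self._process_trajectory(t)
```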

@harperj (Contributor) left a comment:

LGTM :shipit:

### (Optional) Advanced: Disable Threading

By default, PPO model updates can happen while the environment is being stepped. To disable this
behavior, for instance to maintain strict

@andrewcoh (Contributor) Apr 17, 2020:

The point of this sentence could be a bit clearer: "To disable this, do this." Then, "One may want to do this to maintain the strict on-policyness of PPO..." etc.

It should maybe also be clearer that you're trading away an assumption of PPO in exchange for a training speed-up. As written, it feels like there's only gain to enabling threading.

Contributor Author:

Updated the docs - hopefully the new version is clearer.

Ervin T and others added 5 commits April 17, 2020 15:56
Fix typo

Co-Authored-By: andrewcoh <54679309+andrewcoh@users.noreply.github.com>
Co-Authored-By: andrewcoh <54679309+andrewcoh@users.noreply.github.com>

@andrewcoh (Contributor) left a comment:

:shipit:

@ervteng ervteng merged commit 99ed28e into master Apr 20, 2020
mmattar pushed a commit that referenced this pull request Apr 20, 2020
commit 3fed09d
Author: Ervin T <ervin@unity3d.com>
Date:   Mon Apr 20 13:21:28 2020 -0700

    [bug-fix] Increase buffer size for SAC tests (#3813)

commit 99ed28e
Author: Ervin T <ervin@unity3d.com>
Date:   Mon Apr 20 13:06:39 2020 -0700

    [refactor] Run Trainers in separate threads (#3690)

commit 52b7d2e
Author: Chris Elion <chris.elion@unity3d.com>
Date:   Mon Apr 20 12:20:45 2020 -0700

    update upm-ci-utils source (#3811)

commit 89e4804
Author: Vincent-Pierre BERGES <vincentpierre@unity3d.com>
Date:   Mon Apr 20 12:06:59 2020 -0700

    Removing done from the llapi doc (#3810)
@github-actions bot locked as resolved and limited conversation to collaborators May 15, 2021