Move add_experiences out of trainer, add Trajectories #3067
Conversation
self.episode_steps: Counter = Counter()
self.episode_rewards: Dict[str, float] = defaultdict(float)
self.stats_reporter = stats_reporter
if max_trajectory_length:
Maybe make this default to a huge number instead (like sys.maxsize)? That might get rid of the need for ignore_max_length.
Redid the logic to use sys.maxsize as the default; trainer_controller.py now feeds that in when no time_horizon is specified.
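A minimal sketch of that default, assuming the constructor keeps the max_trajectory_length name from the diff above (the exact signature is an assumption):

import sys
from collections import Counter, defaultdict
from typing import Dict

class AgentProcessor:
    def __init__(self, stats_reporter, max_trajectory_length: int = sys.maxsize):
        self.episode_steps: Counter = Counter()
        self.episode_rewards: Dict[str, float] = defaultdict(float)
        self.stats_reporter = stats_reporter
        # Defaulting to sys.maxsize means trajectories are effectively never cut
        # by length, so a separate ignore_max_length flag is no longer needed.
        self.max_trajectory_length = max_trajectory_length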
if stored_info is not None:
    stored_take_action_outputs = self.last_take_action_outputs[agent_id]
    idx = stored_info.agents.index(agent_id)
    next_idx = next_info.agents.index(agent_id)
You shouldn't need to look this up since you're already iterating over next_info.agents, right? Just change
for agent_id in next_info.agents:
to
for next_idx, agent_id in enumerate(next_info.agents):
Good call - changed.
stored_info = self.last_brain_info.get(agent_id, None)
if stored_info is not None:
    stored_take_action_outputs = self.last_take_action_outputs[agent_id]
    idx = stored_info.agents.index(agent_id)
Little worried about the O(N) lookup here since we're doing it N times.
You might want to do something like
agent_id_to_index = {agent_id: i for i, agent_id in enumerate(stored_info.agents)}
outside the loop.
The tricky bit here is that the stored_info might be different per iteration of the loop (some agents in next_info might not have been in the previous info and vice-versa), so the index might change as well. To make matters worse, we do this indexing twice (once here, and once in the LL-Python API to convert BatchedState -> BrainInfo).
Long-term we will be removing BrainInfo (today: BatchedState -> BrainInfo -> AgentExperience; end goal: BatchedState -> AgentExperience), so I think we will be able to get away with simply adding to trajectories agent-by-agent. We won't have to store the stored_info anymore, and in that case we will only have to do the indexing once.
@@ -43,230 +30,39 @@ def __init__(self, *args, **kwargs):
# collected_rewards is a dictionary from name of reward signal to a dictionary of agent_id to cumulative reward
# used for reporting only. We always want to report the environment reward to Tensorboard, regardless
# of what reward signals are actually present.
self.collected_rewards = {"environment": {}}
self.processing_buffer = ProcessingBuffer()
self.collected_rewards = {"environment": defaultdict(lambda: 0)}
nit: type hint
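One possible annotation for this nit, assuming agent IDs are strings as elsewhere in the diff (the nested value type is inferred from the comment above, not taken from the code):

# Inside the trainer's __init__:
# reward-signal name -> (agent_id -> cumulative reward)
self.collected_rewards: Dict[str, Dict[str, float]] = {
    "environment": defaultdict(lambda: 0)
}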
ml-agents/mlagents/trainers/stats.py
filewriter_dir = "{basedir}/{category}".format(
    basedir=self.base_dir, category=category
)
if not os.path.exists(filewriter_dir):
I might have missed this before - you can just do os.makedirs(filewriter_dir, exist_ok=True) instead of checking os.path.exists(). See https://docs.python.org/3/library/os.html#os.makedirs
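A minimal sketch of the suggested simplification, using the names from the diff above:

# Inside TensorboardWriter, replacing the exists() check:
import os

filewriter_dir = "{basedir}/{category}".format(
    basedir=self.base_dir, category=category
)
# Creates intermediate directories as needed and is a no-op if the path already exists.
os.makedirs(filewriter_dir, exist_ok=True)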
assert statssummary2.num == 10
assert statssummary1.mean == 4.5
assert statssummary2.mean == 4.5
assert round(statssummary1.std, 1) == 2.9
I think you can also do assert statssummary1.std == pytest.approx(2.9) - maybe a little cleaner.
pytest.approx(2.9, abs=0.1) does the same thing, I think.
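For reference, a sketch of the two assertions being compared (statssummary1 is the object from the test snippet above):

import pytest

# The existing assertion rounds to one decimal place:
assert round(statssummary1.std, 1) == 2.9
# The suggested alternative uses an explicit absolute tolerance:
assert statssummary1.std == pytest.approx(2.9, abs=0.1)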
# Test write_stats
base_dir = "base_dir"
category = "category1"
tb_writer = TensorboardWriter(base_dir)
Will this actually make a directory somewhere if I run tests locally? If so, you should do something like
import tempfile
...
with tempfile.TemporaryDirectory(prefix="unittest-") as base_dir:
# rest of test here
Then the directory will get automatically cleaned up when the context manager closes.
Confirmed that this creates ./base_dir and ./base_dir/category1 locally. Please use a tempfile.
Test now uses tempfile.
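A rough sketch of the tempfile-based version of that test; TensorboardWriter and the category name come from the diff, while the test name and the elided write_stats call are placeholders:

import tempfile

from mlagents.trainers.stats import TensorboardWriter

def test_tensorboard_writer_uses_temp_dir():
    # The temporary directory is removed automatically when the block exits,
    # so running the tests locally no longer leaves a ./base_dir behind.
    with tempfile.TemporaryDirectory(prefix="unittest-") as base_dir:
        category = "category1"
        tb_writer = TensorboardWriter(base_dir)
        # ... rest of the original write_stats test, unchanged, goes here ...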
def write_tensorboard_text(self, key: str, input_dict: Dict[str, Any]) -> None:
    """
    Saves text to Tensorboard.
    Note: Only works on tensorflow r1.2 or above.
nit: we only support tf >= 1.7, so we can drop this comment
trainer,
trainer.policy,
trainer.parameters["time_horizon"]
if "time_horizon" in trainer.parameters
replace with trainer.parameters.get("time_horizon")
Good call - I now use .get() with a default value of sys.maxsize.
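A minimal sketch of the resulting call site (the surrounding constructor arguments are as shown in the diff; passing the value straight through is an assumption):

import sys

# In trainer_controller.py, when constructing the AgentProcessor:
# fall back to sys.maxsize when no time_horizon is specified in the config,
# so trajectories are effectively never truncated by length.
max_trajectory_length = trainer.parameters.get("time_horizon", sys.maxsize)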
""" | ||
agent_buffer_trajectory = AgentBuffer() | ||
for step, exp in enumerate(self.steps): | ||
vec_vis_obs = SplitObservations.from_observations(exp.obs) |
Note that you're calling SplitObservations.from_observations twice on each element (except maybe the start and end). Might be worth changing the loop to something like
vec_vis_obs = SplitObservations.from_observations(self.steps[0].obs)
for step, exp in enumerate(self.steps):
    # get next_vec_vis_obs same as now
    # rest of loop
    vec_vis_obs = next_vec_vis_obs
but then again, from_observations looks pretty fast unless you're concatenating large vector observations.
Yep, fixed
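A hedged sketch of the restructured loop; AgentBuffer, SplitObservations, and self.steps come from the diff, while how the final step's "next" observation is obtained is an assumption:

# Compute the split observations for step 0 once, then carry each step's
# result forward so from_observations runs only once per element.
vec_vis_obs = SplitObservations.from_observations(self.steps[0].obs)
for step, exp in enumerate(self.steps):
    if step < len(self.steps) - 1:
        next_vec_vis_obs = SplitObservations.from_observations(self.steps[step + 1].obs)
    else:
        # Hypothetical: the last step's "next" observation would come from
        # wherever the trajectory stores it (e.g. a next_obs field).
        next_vec_vis_obs = vec_vis_obs
    # ... rest of the loop body, appending to agent_buffer_trajectory ...
    vec_vis_obs = next_vec_vis_obs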
:param obs: List of numpy arrays (observation)
:returns: A SplitObservations object.
"""
vis_obs_indices = []
I think you should avoid creating the indices arrays here. You can form vis_obs and the input to np.concatenate() directly in the loop through obs.
Good call - I now just build the list directly. I think we use the same logic in the brain_conversion_utils.py file as well.
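A sketch of building the lists directly while looping over obs (the rank-based split between visual and vector observations is an assumption for illustration, not necessarily the exact rule used in SplitObservations):

import numpy as np

vis_obs = []
vec_obs_parts = []
for observation in obs:
    # Assumption: visual observations are rank-3 (H x W x C), vector
    # observations are rank-1, per the docstring above.
    if len(observation.shape) == 3:
        vis_obs.append(observation)
    else:
        vec_obs_parts.append(observation)
vec_obs = (
    np.concatenate(vec_obs_parts, axis=0)
    if vec_obs_parts
    else np.array([], dtype=np.float32)
)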
agent_buffer_trajectory["prev_action"].append(exp.prev_action)

# Add the value outputs if needed
nit: not sure this comment matches the code.
Nope - removed.
if not os.path.exists(self.summary_path):
    os.makedirs(self.summary_path)
self.stats_reporter = StatsReporter(self.summary_path)
# if not os.path.exists(self.summary_path):
nit: dead code.
Thanks - removed
self.last_take_action_outputs[agent_id] = take_action_outputs

# Store the environment reward
tmp_environment_reward = np.array(next_info.rewards, dtype=np.float32)
Do you need to convert this to an np.array? Looks like you only use it twice and always look it up by index.
At some point this was necessary but not anymore - fixed.
mean_current_observation - self.running_mean
# Based on Welford's algorithm for running mean and standard deviation, for batch updates. Discussion here:
# https://stackoverflow.com/questions/56402955/whats-the-formula-for-welfords-algorithm-for-variance-std-with-batch-updates
steps_increment = tf.shape(vector_input)[0]
Is it possible to unit test this with some synthetic data?
Added a test for normalization in test_ppo
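Not the exact test that was added, but a small self-contained sketch of how the batched Welford update can be checked against synthetic data. The TF variables in the diff track an analogous running mean and variance; the helper below is a NumPy stand-in, not the ml-agents API:

import numpy as np

def update_normalization(batch, mean, var_sum, steps):
    """One batched Welford update. var_sum is the running sum of squared
    deviations (M2); variance = var_sum / steps."""
    batch_size = batch.shape[0]
    batch_mean = batch.mean(axis=0)
    batch_m2 = ((batch - batch_mean) ** 2).sum(axis=0)
    new_steps = steps + batch_size
    delta = batch_mean - mean
    new_mean = mean + delta * batch_size / new_steps
    new_var_sum = var_sum + batch_m2 + delta ** 2 * steps * batch_size / new_steps
    return new_mean, new_var_sum, new_steps

# Synthetic check: feed batches of random observations and compare against
# the mean/std computed over all of the data at once.
rng = np.random.RandomState(0)
data = rng.randn(1000, 3) * 5.0 + 2.0
mean, var_sum, steps = np.zeros(3), np.zeros(3), 0
for batch in np.split(data, 10):
    mean, var_sum, steps = update_normalization(batch, mean, var_sum, steps)
assert np.allclose(mean, data.mean(axis=0))
assert np.allclose(np.sqrt(var_sum / steps), data.std(axis=0))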
Looks good overall, some minor feedback. Feel free to punt some of it to a followup PR
Moves add_experiences out of the trainer class and into a separate AgentProcessor class. Also introduces the Trajectory abstraction.
In this new design, the AgentProcessor is given the BrainInfos through an add_experiences method. This then assembles Trajectories (lists of AgentExperiences) that are passed to a Trainer. The trainer can assume all AgentExperiences within a Trajectory belong to the same agent, and ingest it into the update buffer as appropriate.
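A rough sketch of the data structures in this flow; obs, prev_action, and steps appear in the diff, while the remaining fields and the agent_id attribute are assumptions for illustration:

from typing import List, NamedTuple

import numpy as np

class AgentExperience(NamedTuple):
    # One step for a single agent, as assembled by the AgentProcessor.
    obs: List[np.ndarray]
    reward: float
    done: bool
    action: np.ndarray
    prev_action: np.ndarray

class Trajectory(NamedTuple):
    # All steps belong to the same agent, so the trainer can ingest the whole
    # trajectory into its update buffer without re-grouping by agent.
    steps: List[AgentExperience]
    agent_id: str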
NOTE: The new normalization changes cause a slight degradation in some environments (Crawler, Reacher) and a boost in others (Walker). They also seem to improve performance with SAC (makes sense). The new normalization should be more correct: before, we were taking the running mean and std of the mean obs across all agents, while now we process every experience separately.
ToDo for this PR:
- Add TensorBoard logger that spans both inference stats (entropy, value estimates, rewards, etc.) and training stats.
- Verify training performance (reward) is the same. Currently seeing a small degradation in Crawler, Walker, and Reacher.

ToDo in subsequent PRs:
- AgentManager NamedTuple.
- Make process_trajectory a private method in the Trainer, and rather give the trainer an advance() method that will ingest a trajectory from the queue and call process_trajectory.