I rewrote the PCL agent to avoid the memory issues caused by saving Variables inside a list / replay buffer. I didn't compare the training curve with the old one, but it seems to learn (the average_value increases and R gets bigger) on CartPole under the new parameters, and there is no memory issue when run with a large network / reasonably long trajectories.
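For context, here is a minimal sketch of the storage change, assuming the list-of-dicts episode described below (the function name and dict keys are illustrative, not the actual agent code):

```python
import numpy as np

# Illustrative sketch only: keeping Chainer Variables (e.g. the old
# action_distrib) in a buffer retains each step's computation graph,
# so memory grows with trajectory length.  Storing only plain
# (s, a, r) data and recomputing differentiable quantities at update
# time avoids this.
def append_transition(episode, state, action, reward):
    episode.append({
        'state': np.asarray(state, dtype=np.float32),
        'action': int(action),
        'reward': float(reward),
        # no action_distrib stored; it is recomputed from 'state' when needed
    })
```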
The main methods are the following (a rough sketch of this flow is given after the list):

- `update`: takes a loss (as an array), logs the result as usual, and calls the optimizer (the backprop is done before this function is called)
- `update_on_policy` and `update_from_replay`: sample a list of trajectories (from the replay buffer or the current episode), clear the grads, and compute the loss
- `compute_loss`: takes a list of trajectories and performs a batch computation (the batch size is the number of episodes, which may not be efficient when there is a single episode for an on-policy update). This function calls backward immediately and only returns an array for logging
- `_compute_path_consistency`: computes path consistency; this part of the code is almost unchanged

The new underlying data structure is a list of dicts to store the current episode, plus a replay buffer that only stores (s, a, r) pairs. The old mu (`action_distrib`) is removed since it can be recomputed from the other items.
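To make the flow concrete, here is a rough sketch of how `compute_loss` and `_compute_path_consistency` could fit together, under assumptions that are mine, not the PR's: the model maps a batch of states to `(action_distribution, v)`, the distribution exposes a `log_prob` method, each stored episode is treated as one finished sub-trajectory with terminal value 0, and grads have already been cleared by the caller.

```python
import numpy as np
import chainer.functions as F

def _compute_path_consistency(v_first, v_last, log_prob, rewards, gamma, tau):
    # Soft path consistency over a length-d (sub-)trajectory:
    #   C = -V(s_0) + gamma^d V(s_d)
    #       + sum_{j<d} gamma^j * (r_j - tau * log pi(a_j | s_j))
    d = len(rewards)
    discount = np.power(gamma, np.arange(d)).astype(np.float32)
    r_term = float(np.sum(discount * rewards))
    ent_term = F.sum(discount * log_prob)
    return -v_first + (gamma ** d) * v_last + r_term - tau * ent_term

def compute_loss(model, trajectories, gamma=0.99, tau=1e-2):
    # One consistency term per trajectory; the batch size is the
    # number of episodes, as noted above.
    losses = []
    for traj in trajectories:
        states = np.stack([step['state'] for step in traj])
        actions = np.asarray([step['action'] for step in traj], dtype=np.int32)
        rewards = np.asarray([step['reward'] for step in traj], dtype=np.float32)
        distrib, v = model(states)            # pi and V recomputed from raw data
        log_prob = distrib.log_prob(actions)  # assumed distribution API
        c = _compute_path_consistency(
            v[0], 0.0, log_prob, rewards, gamma, tau)  # terminal V taken as 0
        losses.append(F.square(c))
    loss = F.mean(F.stack(losses))
    loss.backward()           # backward is called here, inside compute_loss
    return float(loss.array)  # plain scalar, returned only for logging
```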
I also added a unified model in the example script and changed a couple of parameters.
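For reference, "unified" here presumably means PCL's unified parameterization, where the value function is derived from the same network as the policy. Below is a minimal Chainer sketch of that idea; it is illustrative only, not the actual example-script code, and `UnifiedModel`, the layer sizes, and `tau` are all made up.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class UnifiedModel(chainer.Chain):
    """Single network producing both pi and V, as in unified PCL."""

    def __init__(self, obs_size, n_actions, n_hidden=64, tau=1e-2):
        super().__init__()
        self.tau = tau
        with self.init_scope():
            self.l1 = L.Linear(obs_size, n_hidden)
            self.l2 = L.Linear(n_hidden, n_actions)  # Q-like logits

    def __call__(self, x):
        h = F.relu(self.l1(x))
        q = self.l2(h)
        # V(s) = tau * logsumexp(Q(s, .) / tau); pi is a softmax of the
        # same logits, so policy and value share all parameters.
        v = self.tau * F.logsumexp(q / self.tau, axis=1)
        log_pi = F.log_softmax(q / self.tau)
        return log_pi, v
```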
Issues addressed: #109 #236 #240
I am not sure whether the parameters are used correctly, but if they are, this PR also addresses #238.