
Conversation

@lyx-x lyx-x commented Feb 25, 2018

I rewrote the PCL agent to avoid the memory issues caused by saving Variables inside the episode list / replay buffer. I haven't compared the training curve with the old implementation, but it does seem to learn (the average_value increases and R gets bigger) on CartPole under the new parameters, and there is no memory issue when running with a large network / reasonably long trajectories.

The main methods are the following:

- `update`: takes a loss (as an array), logs the result as usual, and calls the optimizer (backprop is done before this function is called).
- `update_on_policy` and `update_from_replay`: sample a list of trajectories (from the replay buffer or the current episode), clear the gradients, and compute the loss.
- `compute_loss`: takes a list of trajectories and performs a batched computation (the batch size is the number of episodes, which may not be efficient when there is only a single episode for an on-policy update). This function calls `backward` immediately and only returns an array for logging; a rough sketch of the idea is shown after this list.
- `_compute_path_consistency`: computes the path consistency; this part of the code is almost unchanged.
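
To make the batched loss concrete, here is a minimal numpy sketch of the idea. It is not the actual ChainerRL code: the array and field names, the rollout length `d`, and the half-squared consistency error are my assumptions, loosely following the PCL formulation.

```python
import numpy as np

def path_consistency_loss(rewards, values, log_pis, gamma=0.99, tau=1e-2, d=10):
    """Squared soft-consistency error summed over sub-paths of length <= d.

    rewards, log_pis: length-T arrays for one episode, log_pis[t] = log pi(a_t | s_t).
    values: length-(T+1) array of state values, including the last state.
    """
    T = len(rewards)
    loss = 0.0
    for i in range(T):
        k = min(d, T - i)
        discounts = gamma ** np.arange(k)
        # C = -V(s_i) + gamma^k V(s_{i+k}) + sum_j gamma^j (r_{i+j} - tau * log pi(a_{i+j}|s_{i+j}))
        c = (-values[i]
             + gamma ** k * values[i + k]
             + np.sum(discounts * (rewards[i:i + k] - tau * log_pis[i:i + k])))
        loss += 0.5 * c ** 2
    return loss

def compute_loss(trajectories):
    # Batch over episodes: the batch size is the number of episodes.
    # The real agent would call backward() here and return a plain array
    # for logging; this sketch just returns the scalar.
    losses = [path_consistency_loss(t['rewards'], t['values'], t['log_pis'])
              for t in trajectories]
    return np.mean(losses)
```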

The new underlying data structure is a list of dicts that stores the current episode, plus a replay buffer that only stores (s, a, r) tuples. The old mu (action_distrib) is no longer stored, since it can be recomputed from the other items.
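
For illustration, a minimal sketch of the data layout described above (the field names and the CartPole-sized observations are assumptions, not the actual names used in the PR):

```python
import numpy as np

# Current episode: a list of dicts, one entry per time step.
current_episode = [
    {'state': np.zeros(4, dtype=np.float32), 'action': 1, 'reward': 1.0},
    {'state': np.ones(4, dtype=np.float32), 'action': 0, 'reward': 1.0},
]

# Replay buffer: only (s, a, r) tuples are kept per episode; log pi(a|s)
# is recomputed from the current model when the episode is sampled again.
replay_buffer = []
replay_buffer.append([(step['state'], step['action'], step['reward'])
                      for step in current_episode])
```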

I also added a unified model in the example script and changed a couple of parameters.
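
For reference, a unified model is often just a shared body with separate policy and value heads; the Chainer sketch below is an assumption about the general shape of such a model, not the actual architecture used in the example script.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class UnifiedModel(chainer.Chain):
    """Hypothetical unified model: shared hidden layers, policy and value heads."""

    def __init__(self, obs_size, n_actions, n_hidden=64):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(obs_size, n_hidden)
            self.l2 = L.Linear(n_hidden, n_hidden)
            self.pi_head = L.Linear(n_hidden, n_actions)  # policy logits
            self.v_head = L.Linear(n_hidden, 1)           # state value

    def __call__(self, obs):
        h = F.relu(self.l1(obs))
        h = F.relu(self.l2(h))
        return F.softmax(self.pi_head(h)), self.v_head(h)
```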

Issues addressed: #109 #236 #240

I am not sure whether the parameters are used correctly, but if they are, this PR also addresses #238.

muupan commented Mar 15, 2018

Thank you for the improvements to PCL. I haven't checked the implementation details yet, but I think solving the memory issue is great as long as it doesn't make training slower.

Can you show the training curves and computation speeds before and after this PR?
