TD3 Code review #245
Conversation
…in the TD3 Actor.
* Add learning rate schedule example
* Update docs/guide/examples.rst
  Co-authored-by: Adam Gleave <adam@gleave.me>
* Address comments
  Co-authored-by: Adam Gleave <adam@gleave.me>
* Add supported action spaces checks
* Address comment
… Actor since it always is deterministic anyways.
_get_data was too generic and could have meant anything.
@ernestum are you done with the edits?
No, I still wanted to implement the more convenient way to specify the training frequency based on either steps or episodes. I am on it right now. I noticed that SAC does not support updating after n episodes (only after n steps). Was it supposed to be that way? After the patch, that will be possible.
Another question regarding this line (and all other lines with an `if gradient_step % self.policy_delay == 0: ... polyak_update(...)` pattern): note that this triggers a polyak update in each first iteration of the outer loop, since `gradient_step` starts at 0.
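A minimal sketch of the delayed-update pattern being discussed, using the names from the snippet above; the concrete loop bounds are made up for illustration:

```python
policy_delay = 2    # update actor/targets every 2 gradient steps
gradient_steps = 6  # illustrative value

for gradient_step in range(gradient_steps):
    # the critic update would run on every iteration here
    if gradient_step % policy_delay == 0:
        # the actor update and polyak_update(...) fire here, including at
        # gradient_step == 0, i.e. already in the first iteration
        print(f"delayed update at gradient_step={gradient_step}")
```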
Jumping in here to comment on this. Is this something people do on the robotics side? I have personally not seen "update every N episodes" setups, but I mainly focus on games and non-robotics envs. It sounds very sensitive to problems (e.g. the agent gets better -> episodes last longer -> fewer updates), and support for this could be confusing. If support for this is added, it should be blatantly clear how "update every N steps" and "update every N episodes" function together (e.g. do they override each other? In which order? Do they both apply?)
why not? (I've been using that feature on a real robot...)
oh... nice catch ;)
yes, mostly because everything happens in real time: you cannot afford an inconsistent and slow control frequency because your policy is updating (whereas in simulation, you can pause the simulation until you are ready again).
they are already implemented, and there is a big warning when one tries to use both at the same time. EDIT: we need to update the version too
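To make the two specification styles concrete, here is a rough usage sketch (the `(value, unit)` tuple form is what this PR converges on later in the thread, and the env id is only an example, so treat the details as assumptions):

```python
from stable_baselines3 import SAC

# Old style: run a training phase every 100 environment steps
model = SAC("MlpPolicy", "Pendulum-v1", train_freq=100)

# New style: run a training phase once per collected episode
model = SAC("MlpPolicy", "Pendulum-v1", train_freq=(1, "episode"))
```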
@@ -250,11 +239,17 @@ def learn(
        callback.on_training_start(locals(), globals())

        while self.num_timesteps < total_timesteps:
            if isinstance(self.train_freq, int):
missing comment
but I would do that transform directly in the constructor, no?
You mean I should add a comment describing the translation from the old style to the new style?
I deliberately kept the translation here since otherwise we would 1. need to store `n_steps` and `n_episodes` in the `OffPolicyAlgorithm`, and 2. the user could not change the `train_freq` in hindsight or during training. It's unlikely that this is needed, but when you need it, it is really annoying.
> You mean I should add a comment describing the translation from the old style to the new style?

yes

> 1. need to store n_steps and n_episodes in the OffPolicyAlgorithm

this is already the case (`self.n_episodes_rollout` and `self.train_freq`)

> 2. the user could not change the train_freq in hindsight or during training.

it can if we store it. Also, if needed, the user can directly call `collect_rollout` alone.

Last thing, I think the check can be simplified if we have an assert on `self.train_freq[1]` before (see comment above ;))
Nope, `self.n_episodes_rollout` is gone now. Only `self.train_freq` is left, and it is split up into the old-style `n_steps` and `n_episodes` before being passed to `collect_rollouts`, since I was too lazy to also change that implementation (but we should do that in the future).

I did not quite get what you mean with the user calling `collect_rollout` directly.

Good point about normalizing `self.train_freq` to be a tuple in the constructor. I will do that.
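A minimal sketch of what that constructor-side normalization could look like, assuming a plain `(value, unit)` tuple representation (the helper name is hypothetical, not the actual method in this PR):

```python
def _setup_train_freq(self) -> None:
    """Hypothetical helper: normalize ``train_freq`` to a ``(value, unit)`` tuple."""
    # Old-style integer means "every N steps"
    if isinstance(self.train_freq, int):
        self.train_freq = (self.train_freq, "step")

    value, unit = self.train_freq
    assert isinstance(value, int) and value > 0, "train_freq value must be a positive int"
    assert unit in ("step", "episode"), f"Unknown train_freq unit: {unit!r}"
```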
What I meant was keeping `self.n_episodes_rollout` (that would be nice to load old models too), but computing the values of `self.train_freq` and `self.n_episodes_rollout` in the constructor.

> I did not quite get what you mean with the user calling collect_rollout directly.

The same way the user can call `model.train()` (just updating the model), the user can also directly call `model.collect_rollout()` with the parameters the user wants.
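To illustrate the kind of manual usage being referred to, a rough sketch (the `train()` keyword arguments reflect the off-policy API as I understand it at the time of this discussion; treat them as assumptions):

```python
from stable_baselines3 import TD3

model = TD3("MlpPolicy", "Pendulum-v1")

# Fill the replay buffer via the normal learn() loop first
model.learn(total_timesteps=1_000)

# Then run extra gradient updates manually, without collecting new data
model.train(gradient_steps=100, batch_size=256)

# collect_rollouts() can likewise be called directly with whatever rollout
# parameters the user wants; its exact signature is what this PR changes,
# so it is not spelled out here.
```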
I would be reluctant to add the public `self.n_episodes_rollout` back. This way we basically build an API with two ways to specify the number of episodes between two training steps and reintroduce the issue that the value of `self.n_episodes_rollout` can conflict with `self.train_freq`. If you like, I can come up with a separate solution to ensure backwards compatibility with old stored models.

Since, as you pointed out, `model.collect_rollout()` is part of the public API and it still has the old-style `n_episodes` and `n_steps` interface, that one should actually be updated to the new tuple format as well, and I should stop being lazy!
> I would be reluctant to add the public self.n_episodes_rollout back

Fair enough ;)

> interface that one should actually be updated to the new tuple format as well and I should stop being lazy!

if you do so, you may also need to remove the `replay_buffer` parameter ;) (which was here for historic reasons)
noted ...
…ning into an assert.
                "`number of episodes` > `n_episodes_rollout`"
            )
        if isinstance(self.train_freq, int):
            self.train_freq = (self.train_freq, "step")
what do you think of using a NamedTuple? so later you can do `train_freq.n` and `train_freq.unit` instead of `train_freq[1]`?
Sure, but can I have `train_freq.quantity` and `train_freq.unit`?
`quantity` sounds a bit weird to me... I think you wrote `frequency` in the docstring, no?
You are right. I was borrowing terminology from here and messed it up. In our case the `train_freq` is the quantity, your `n` is the magnitude, and the unit of measurement must be either `"step"` or `"episode"` (but this could be extended in the future). So I propose `train_freq.magnitude` and `train_freq.unit`.
I would really go for `train_freq.frequency` or `train_freq.interval` (so discrete values, that's why I proposed the `n` at first, from the maths side), as `magnitude` sounds like a continuous value to me. And yes for the `train_freq.unit` ;)
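A minimal sketch of the kind of NamedTuple being discussed, using the `frequency`/`unit` names the thread converges on (the class name is only illustrative):

```python
from typing import NamedTuple


class TrainFreq(NamedTuple):
    frequency: int  # how many units between training phases
    unit: str       # either "step" or "episode" (could be extended later)


train_freq = TrainFreq(1, "episode")
assert train_freq.frequency == 1 and train_freq.unit == "episode"
```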
The current failure on the CI is due to a newer version of numpy; we should replace …
Thanks for the hint, I will look into fixing it tomorrow.
…or episodes in the collect_rollouts of the off-policy algorithm.
…or episodes in the collect_rollouts of HER.
… Also add some type annotations and documentation.
@AdamGleave my last commit (d83cdc9) might affect your imitation learning lib ;) (it's about having the right terminals)
LGTM, thanks for the fixes and the discussion =)
Thanks for flagging! Tagging @shwang FYI: we have previously been seeing unnormalized terminal observations; this PR will fix that.
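For context on the terminal-observation point: when a vectorized env auto-resets at the end of an episode, the true final observation is exposed through the step `info` dict. A rough sketch of where it lives (the env id is only an example; the `"terminal_observation"` key is the SB3 VecEnv convention as I understand it):

```python
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("Pendulum-v1", n_envs=1)
obs = env.reset()

for _ in range(500):
    obs, rewards, dones, infos = env.step([env.action_space.sample()])
    if dones[0]:
        # VecEnv auto-resets, so `obs` already belongs to the next episode;
        # the real final observation of the finished episode is stored here:
        terminal_obs = infos[0]["terminal_observation"]
        break
```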
Description
Discussion PR for the TD3 implementation
Motivation and Context
This code review is part of the review of existing code #17.
Also fixes issues with terminal observation.
closes #17
closes #331
Types of changes
Checklist:

- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass. (required)
- `make doc` (required)

Note: You can run most of the checks using `make commit-checks`.
Note: we are using a maximum length of 127 characters per line.