TD3 Code review #245

Merged — 38 commits into master from td3_review on Feb 27, 2021

Conversation

@ernestum (Collaborator) commented Nov 28, 2020

Description

Discussion MR for the TD3 implementation

Motivation and Context

This code review is part of the review of existing code #17.

Also fixes issues with terminal observation.

closes #17
closes #331

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)
  • I have checked that the documentation builds using make doc (required)

Note: You can run most of the checks using make commit-checks.

Note: we are using a maximum length of 127 characters per line

araffin and others added 4 commits December 10, 2020 20:07
* Add learning rate schedule example

* Update docs/guide/examples.rst

Co-authored-by: Adam Gleave <adam@gleave.me>

* Address comments

Co-authored-by: Adam Gleave <adam@gleave.me>
* Add supported action spaces checks

* Address comment
… Actor since it always is deterministic anyways.
_get_data was too generic and could have meant anything.
@araffin (Member) commented Dec 13, 2020

@ernestum are you done with the edits?

@ernestum (Collaborator Author)

No, I still want to implement the more convenient way to specify the training frequency based on either steps or episodes; I am on it right now.

I noticed that SAC does not support updating after n episodes (just after n steps). Was it supposed to be that way? Because after the patch that will be possible.

@ernestum (Collaborator Author)

Another question regarding this line (and all other lines where a polyak_update is done):

if gradient_step % self.policy_delay == 0:
    ...
    polyak_update(...)

Note that this triggers a polyak update on the first iteration of the outer for gradient_step in range(gradient_steps) loop, since 0 % x == 0 for any x.
In the case where we set train_freq=1 the policy is not delayed since it will be polyak-updated after each step. I guess this should be more like if self.total_train_steps % self.policy_delay == 0: right?
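
A minimal, self-contained sketch of the counter-based alternative (the names n_updates and policy_delay are illustrative, not necessarily the attribute names used in the library): by counting gradient updates across calls to train() instead of using the per-call loop index, the delayed update no longer fires on the first iteration of every call.

class DelayedUpdateSketch:
    def __init__(self, policy_delay: int = 2):
        self.policy_delay = policy_delay
        self.n_updates = 0  # persists across calls to train()

    def train(self, gradient_steps: int) -> None:
        for _ in range(gradient_steps):
            self.n_updates += 1
            # ... critic update would go here ...
            if self.n_updates % self.policy_delay == 0:
                # ... actor update and polyak_update(...) would go here ...
                print(f"delayed update after {self.n_updates} gradient steps")

algo = DelayedUpdateSketch(policy_delay=2)
for _ in range(4):
    algo.train(gradient_steps=1)  # triggers only 2 delayed updates, not 4

With the original gradient_step % self.policy_delay check, the same four calls with gradient_steps=1 would perform an update every single time, because gradient_step restarts at 0 on each call.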

@Miffyli (Collaborator) commented Jan 4, 2021

I noticed that SAC does not support updating after n episodes (just after n steps). Was it supposed to be that way? Because after the patch that will be possible.

Jumping in here to comment on this. Is this something people do on robotics side? I have personally not seen "update every N episodes" type of setups, but I mainly focus on games and non-robotics envs. It sounds very sensitive to problems (e.g. agent gets better -> episodes last longer -> fewer updates), and support for this could be confusing. If support for this is added it should be blatantly clear how "update every N steps" and "update every N episodes" function together (e.g. do they override each other? In which order? Do they both apply?)

@araffin (Member) commented Jan 6, 2021

I noticed that SAC does not support updating after n episodes (just after n steps). Was it supposed to be that way? Because after the patch that will be possible.

why not? (I've been using that feature on a real robot...)

In the case where we set train_freq=1 the policy is not delayed since it will be polyak-updated after each step. I guess this should be more like if self.total_train_steps % self.policy_delay == 0: right?

oh... nice catch ;)

Is this something people do on robotics side?

yes, mostly because everything happens in real time: you cannot afford an inconsistent or slow control frequency just because your policy is updating (whereas in simulation, you can pause the simulation until you are ready again).

If support for this is added it should be blatantly clear how "update every N steps" and "update every N episodes" function together (e.g. do they override each other? In which order? Do they both apply?)

they are both already implemented, and there is a big warning when one tries to use both at the same time.
@ernestum's proposal was in fact to make things clearer.

EDIT: we need to update the version too
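
As a hedged illustration of the "every N steps vs. every N episodes" discussion above (hypothetical names, not the SB3 implementation), a rollout-collection loop can keep the two criteria from conflicting by making exactly one of them active at a time:

def collect_rollout(env, policy, frequency, unit):
    # Collect experience until either `frequency` steps or `frequency`
    # episodes have elapsed, depending on `unit`.
    assert unit in ("step", "episode")
    n_steps, n_episodes = 0, 0
    obs = env.reset()
    while True:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        n_steps += 1
        if done:
            n_episodes += 1
            obs = env.reset()
        if unit == "step" and n_steps >= frequency:
            break
        if unit == "episode" and n_episodes >= frequency:
            break

Because the unit is explicit, there is no longer any ambiguity about which of the two settings wins when both are given.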

@@ -250,11 +239,17 @@ def learn(
callback.on_training_start(locals(), globals())

while self.num_timesteps < total_timesteps:
if isinstance(self.train_freq, int):
Member:

missing comment

Member:

but I would do that transform directly in the constructor, no?

Collaborator Author:

You mean I should add a comment describing the translation from the old style to the new style?

I deliberately kept the conversion here since otherwise we would 1. need to store n_steps and n_episodes in the OffPolicyAlgorithm and 2. the user could not change the train_freq in hindsight or during training. It's unlikely to be needed, but when you need it, it is really annoying.

Member:

You mean I should add a comment describing the translation from the old style to the new style?

yes

  1. need to store n_steps and n_episodes in the OffPolicyAlgorithm

this is already the case (self.n_episodes_rollout and self.train_freq)

  2. the user could not change the train_freq in hindsight or during training.

it can if we store it.
Also, if needed, the user can directly call collect_rollout on its own.

Last thing, I think the check can be simplified if we have an assert on self.train_freq[1] beforehand (see comment above ;))

Collaborator Author:

Nope, self.n_episodes_rollout is gone now. Only self.train_freq is left, and it is split up into the old-style n_steps and n_episodes before being passed to collect_rollouts, since I was too lazy to change that implementation as well (but we should do that in the future).

I did not quite get what you mean with the user calling collect_rollout directly.

Good point about normalizing self.train_freq to be a tuple in the constructor. I will do that.

Member:

What I meant was keeping self.n_episodes_rollout (that would also be nice for loading old models) but computing the values of self.train_freq and self.n_episodes_rollout in the constructor.

I did not quite get what you mean with the user calling collect_rollout directly.

The same way the user can call model.train() (just updating the model), the user can also directly call model.collect_rollout() with the parameters they want.

Collaborator Author:

I would be reluctant to add the public self.n_episodes_rollout back. This way we would basically build an API with two ways to specify the interval between two training phases and reintroduce the issue that the value of self.n_episodes_rollout can conflict with self.train_freq. If you like, I can come up with a separate solution to ensure backwards compatibility with old stored models.

Since, as you pointed out, model.collect_rollout() is part of the public API and still has the old-style n_episodes and n_steps interface, it should actually be updated to the new tuple format as well, and I should stop being lazy!

Member:

I would be reluctant to add the public self.n_episodes_rollout back

Fair enough ;)

interface, it should actually be updated to the new tuple format as well, and I should stop being lazy!

if you do so, you may also need to remove the replay_buffer parameter ;) (which was there for historical reasons)

Collaborator Author:

noted ...

"`number of episodes` > `n_episodes_rollout`"
)
if isinstance(self.train_freq, int):
self.train_freq = (self.train_freq, "step")
@araffin (Member), Jan 30, 2021:

What do you think of using a NamedTuple? Then later you can do train_freq.n and train_freq.unit instead of train_freq[1].

Collaborator Author:

Sure but can I have train_freq.quantity and train_freq.unit?

Member:

quantity sounds a bit weird to me... I think you wrote frequency in the docstring, no?

@ernestum (Collaborator Author), Jan 30, 2021:

You are right. I was borrowing terminology from here and messed it up. In our case train_freq is the quantity, your n is the magnitude, and the unit of measurement must be either "step" or "episode" (but this could be extended in the future). So I propose train_freq.magnitude and train_freq.unit.

Member:

I would really go for train_freq.frequency or train_freq.interval (discrete values, which is why I proposed n at first, from the maths side), as magnitude sounds like a continuous value to me.
And yes to train_freq.unit ;)
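
A small sketch of the NamedTuple idea with the field names settled on above (the actual class and field names in the library may differ):

from typing import NamedTuple

class TrainFreq(NamedTuple):
    frequency: int
    unit: str  # "step" or "episode"

# Normalizing a plain int passed by the user, e.g. in the constructor
# of the off-policy algorithm:
train_freq = 4
if isinstance(train_freq, int):
    train_freq = TrainFreq(train_freq, "step")

assert train_freq.frequency == 4 and train_freq.unit == "step"

This also covers the earlier point about doing the int-to-tuple conversion once in the constructor instead of inside learn().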

@araffin (Member) commented Jan 30, 2021

The current failure on the CI is due to a newer version of numpy, we should replace np.bool by bool and that should solve it ;)

@ernestum (Collaborator Author)

The current failure on the CI is due to a newer version of numpy, we should replace np.bool by bool and that should solve it ;)

Thanks for the hint, I will look into fixing it tomorrow.
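
For reference, the fix amounts to swapping the deprecated alias for the built-in type (np.bool was deprecated in NumPy 1.20), e.g.:

import numpy as np

dones = np.zeros(10, dtype=bool)                     # instead of dtype=np.bool
timeouts = np.array([False, True], dtype=np.bool_)   # the NumPy scalar type also works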

@araffin (Member) commented Feb 27, 2021

@AdamGleave my last commit (d83cdc9) might affect your imitation learning lib ;) (it's about having the right terminals)

@araffin (Member) left a review:

LGTM, thanks for the fixes and the discussion =)

@AdamGleave (Collaborator)

@AdamGleave my last commit (d83cdc9) might affect your imitation learning lib ;) (it's about having the right terminals)

Thanks for flagging! Tagging @shwang FYI: we have previously been seeing unnormalized terminal observations; this PR will fix that.
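
A minimal sketch of where the terminal observation lives in SB3's vectorized envs (illustrative only; the env id and exact setup are placeholders): the env auto-resets when an episode ends, and the true last observation is exposed through the info dict, which is what downstream consumers such as imitation read.

import gym
from stable_baselines3.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
obs = env.reset()
for _ in range(1000):
    obs, rewards, dones, infos = env.step([env.action_space.sample()])
    if dones[0]:
        # `obs` already holds the reset observation of the new episode;
        # the episode's final observation is stored here instead:
        terminal_obs = infos[0]["terminal_observation"]
        break

After this PR, that stored observation should be consistent with the rest of the collected data (e.g. handled correctly when the env is wrapped with VecNormalize).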

@araffin merged commit 0c50d75 into master on Feb 27, 2021
@araffin deleted the td3_review branch on February 27, 2021 at 16:33