TD3 Code review #245

Merged — 38 commits into master from td3_review on Feb 27, 2021

Conversation

@ernestum (Collaborator) commented Nov 28, 2020

Description

Discussion MR for the TD3 implementation

Motivation and Context

This code review is part of the review of existing code #17.

Also fixes issues with terminal observation.

closes #17
closes #331

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)
  • I have checked that the documentation builds using make doc (required)

Note: You can run most of the checks using make commit-checks.

Note: we are using a maximum length of 127 characters per line

araffin and others added 4 commits December 10, 2020 20:07
* Add learning rate schedule example

* Update docs/guide/examples.rst

Co-authored-by: Adam Gleave <adam@gleave.me>

* Address comments

Co-authored-by: Adam Gleave <adam@gleave.me>
* Add supported action spaces checks

* Address comment
… Actor since it always is deterministic anyways.
_get_data was too generic and could have meant anything.
@araffin (Member) commented Dec 13, 2020

@ernestum are you done with the edits?

@ernestum (Collaborator Author)

No, I still want to implement the more convenient way to specify the training frequency based on either steps or episodes; I am on it right now.

I noticed that SAC does not support updating after n episodes (just after n steps). Was it supposed to be that way? Because after the patch that will be possible.

@ernestum (Collaborator Author)

Another question regarding this line (and all other lines where a polyak_update is done):

if gradient_step % self.policy_delay == 0:
    ...
    polyak_update(...)

Note that this triggers a polyak update on the first iteration of the outer for gradient_step in range(gradient_steps) loop, since 0 % x == 0 for any x.
In the case where we set train_freq=1 the policy is not delayed since it will be polyak-updated after each step. I guess this should be more like if self.total_train_steps % self.policy_delay == 0: right?
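
A minimal, self-contained sketch of the counter-based alternative (the names n_updates and policy_delay are illustrative, not necessarily the attribute names used in the library): by counting gradient updates across calls to train() instead of using the per-call loop index, the delayed update no longer fires on the first iteration of every call.

class DelayedUpdateSketch:
    def __init__(self, policy_delay: int = 2):
        self.policy_delay = policy_delay
        self.n_updates = 0  # persists across calls to train()

    def train(self, gradient_steps: int) -> None:
        for _ in range(gradient_steps):
            self.n_updates += 1
            # ... critic update would go here ...
            if self.n_updates % self.policy_delay == 0:
                # ... actor update and polyak_update(...) would go here ...
                print(f"delayed update after {self.n_updates} gradient steps")

algo = DelayedUpdateSketch(policy_delay=2)
for _ in range(4):
    algo.train(gradient_steps=1)  # triggers only 2 delayed updates, not 4

With the original gradient_step % self.policy_delay check, the same four calls with gradient_steps=1 would perform an update every single time, because gradient_step restarts at 0 on each call.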

@Miffyli (Collaborator) commented Jan 4, 2021

I noticed that SAC does not support updating after n episodes (just after n steps). Was it supposed to be that way? Because after the patch that will be possible.

Jumping in here to comment on this. Is this something people do on robotics side? I have personally not seen "update every N episodes" type of setups, but I mainly focus on games and non-robotics envs. It sounds very sensitive to problems (e.g. agent gets better -> episodes last longer -> fewer updates), and support for this could be confusing. If support for this is added it should be blatantly clear how "update every N steps" and "update every N episodes" function together (e.g. do they override each other? In which order? Do they both apply?)

@araffin (Member) commented Jan 6, 2021

I noticed that SAC does not support updating after n episodes (just after n steps). Was it supposed to be that way? Because after the patch that will be possible.

why not? (I've been using that feature on a real robot...)

In the case where we set train_freq=1 the policy is not delayed since it will be polyak-updated after each step. I guess this should be more like if self.total_train_steps % self.policy_delay == 0: right?

oh... nice catch ;)

Is this something people do on robotics side?

yes, mostly because everything happens in real time: you cannot afford an inconsistent or slow control frequency just because your policy is updating (whereas in simulation, you can pause the simulation until you are ready again).

If support for this is added it should be blatantly clear how "update every N steps" and "update every N episodes" function together (e.g. do they override each other? In which order? Do they both apply?)

they are both already implemented, and there is a big warning when one tries to use both at the same time.
@ernestum's proposal was in fact to make things clearer.

EDIT: we need to update the version too
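
As a hedged illustration of the "every N steps vs. every N episodes" discussion above (hypothetical names, not the SB3 implementation), a rollout-collection loop can keep the two criteria from conflicting by making exactly one of them active at a time:

def collect_rollout(env, policy, frequency, unit):
    # Collect experience until either `frequency` steps or `frequency`
    # episodes have elapsed, depending on `unit`.
    assert unit in ("step", "episode")
    n_steps, n_episodes = 0, 0
    obs = env.reset()
    while True:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        n_steps += 1
        if done:
            n_episodes += 1
            obs = env.reset()
        if unit == "step" and n_steps >= frequency:
            break
        if unit == "episode" and n_episodes >= frequency:
            break

Because the unit is explicit, there is no longer any ambiguity about which of the two settings wins when both are given.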

@@ -250,11 +239,17 @@ def learn(
callback.on_training_start(locals(), globals())

while self.num_timesteps < total_timesteps:
if isinstance(self.train_freq, int):
Member:

missing comment

Member:

but I would do that transform directly in the constructor, no?

Collaborator Author:

You mean I should add a comment describing the translation from the old style to the new style?

I deliberately kept the conversion here since otherwise we would 1. need to store n_steps and n_episodes in the OffPolicyAlgorithm and 2. the user could not change the train_freq in hindsight or during training. It's unlikely to be needed, but when you need it, it is really annoying.

Member:

You mean I should add a comment describing the translation from the old style to the new style?

yes

  1. need to store n_steps and n_episodes in the OffPolicyAlgorithm

this is already the case (self.n_episodes_rollout and self.train_freq)

  2. the user could not change the train_freq in hindsight or during training.

it can if we store it.
Also, if needed, the user can directly call collect_rollout on its own.

Last thing, I think the check can be simplified if we have an assert on self.train_freq[1] beforehand (see comment above ;))

Collaborator Author:

Nope, self.n_episodes_rollout is gone now. Only self.train_freq is left, and it is split up into the old-style n_steps and n_episodes before being passed to collect_rollouts, since I was too lazy to change that implementation as well (but we should do that in the future).

I did not quite get what you mean with the user calling collect_rollout directly.

Good point about normalizing self.train_freq to be a tuple in the constructor. I will do that.

Member:

What I meant was keeping self.n_episodes_rollout (that would also be nice for loading old models) but computing the values of self.train_freq and self.n_episodes_rollout in the constructor.

I did not quite get what you mean with the user calling collect_rollout directly.

The same way the user can call model.train() (just updating the model), the user can also directly call model.collect_rollout() with the parameters they want.

Collaborator Author:

I would be reluctant to add the public self.n_episodes_rollout back. This way we would basically build an API with two ways to specify the interval between two training phases and reintroduce the issue that the value of self.n_episodes_rollout can conflict with self.train_freq. If you like, I can come up with a separate solution to ensure backwards compatibility with old stored models.

Since, as you pointed out, model.collect_rollout() is part of the public API and still has the old-style n_episodes and n_steps interface, it should actually be updated to the new tuple format as well, and I should stop being lazy!

Member:

I would be reluctant to add the public self.n_episodes_rollout back

Fair enough ;)

interface, it should actually be updated to the new tuple format as well, and I should stop being lazy!

if you do so, you may also need to remove the replay_buffer parameter ;) (which was there for historical reasons)

Collaborator Author:

noted ...

"`number of episodes` > `n_episodes_rollout`"
)
if isinstance(self.train_freq, int):
self.train_freq = (self.train_freq, "step")
@araffin (Member), Jan 30, 2021:

What do you think of using a NamedTuple? Then later you can do train_freq.n and train_freq.unit instead of train_freq[1].

Collaborator Author:

Sure but can I have train_freq.quantity and train_freq.unit?

Member:

quantity sounds a bit weird to me... I think you wrote frequency in the docstring, no?

@ernestum (Collaborator Author), Jan 30, 2021:

You are right. I was borrowing terminology from here and messed it up. In our case train_freq is the quantity, your n is the magnitude, and the unit of measurement must be either "step" or "episode" (but this could be extended in the future). So I propose train_freq.magnitude and train_freq.unit.

Member:

I would really go for train_freq.frequency or train_freq.interval (discrete values, which is why I proposed n at first, from the maths side), as magnitude sounds like a continuous value to me.
And yes to train_freq.unit ;)
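
A small sketch of the NamedTuple idea with the field names settled on above (the actual class and field names in the library may differ):

from typing import NamedTuple

class TrainFreq(NamedTuple):
    frequency: int
    unit: str  # "step" or "episode"

# Normalizing a plain int passed by the user, e.g. in the constructor
# of the off-policy algorithm:
train_freq = 4
if isinstance(train_freq, int):
    train_freq = TrainFreq(train_freq, "step")

assert train_freq.frequency == 4 and train_freq.unit == "step"

This also covers the earlier point about doing the int-to-tuple conversion once in the constructor instead of inside learn().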

@araffin (Member) commented Jan 30, 2021

The current failure on the CI is due to a newer version of numpy, we should replace np.bool by bool and that should solve it ;)

@ernestum (Collaborator Author)

The current failure on the CI is due to a newer version of numpy, we should replace np.bool by bool and that should solve it ;)

Thanks for the hint, I will look into fixing it tomorrow.
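
For reference, the fix amounts to swapping the deprecated alias for the built-in type (np.bool was deprecated in NumPy 1.20), e.g.:

import numpy as np

dones = np.zeros(10, dtype=bool)                     # instead of dtype=np.bool
timeouts = np.array([False, True], dtype=np.bool_)   # the NumPy scalar type also works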

@araffin (Member) commented Feb 27, 2021

@AdamGleave my last commit (d83cdc9) might affect your imitation learning lib ;) (it's about having the right terminals)

@araffin (Member) left a review:

LGTM, thanks for the fixes and the discussion =)

@AdamGleave (Collaborator)

@AdamGleave my last commit (d83cdc9) might affect your imitation learning lib ;) (it's about having the right terminals)

Thanks for flagging! Tagging @shwang FYI: we have previously been seeing unnormalized terminal observations; this PR will fix that.
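
A minimal sketch of where the terminal observation lives in SB3's vectorized envs (illustrative only; the env id and exact setup are placeholders): the env auto-resets when an episode ends, and the true last observation is exposed through the info dict, which is what downstream consumers such as imitation read.

import gym
from stable_baselines3.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
obs = env.reset()
for _ in range(1000):
    obs, rewards, dones, infos = env.step([env.action_space.sample()])
    if dones[0]:
        # `obs` already holds the reset observation of the new episode;
        # the episode's final observation is stored here instead:
        terminal_obs = infos[0]["terminal_observation"]
        break

After this PR, that stored observation should be consistent with the rest of the collected data (e.g. handled correctly when the env is wrapped with VecNormalize).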

@araffin merged commit 0c50d75 into master on Feb 27, 2021
@araffin deleted the td3_review branch on February 27, 2021 at 16:33