Port in torchified PPO from sb3_contrib #5
Conversation
Can you add some lineage of what was copied from where? This is a big PR, so it's going to take me a while to review, and understanding what is new code vs copied code would be helpful.
Added some quick comments to the smaller files. Waiting to hear back on the sources of the larger files before reviewing them.
Thanks for the refactor in the tests to treat OnPolicy/OffPolicy algorithms separately rather than using sets; it seems much cleaner now!
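For illustration, a minimal sketch of the kind of refactor this comment refers to: parametrizing on-policy and off-policy algorithms as separate groups instead of checking membership in a shared set inside one test. The class lists, test names, and hyperparameters below are assumptions for the example, not code from this PR.

```python
# Hypothetical sketch: separate parametrized smoke tests per algorithm family.
import pytest
from stable_baselines3 import A2C, PPO, DQN

ON_POLICY_ALGOS = [A2C, PPO]
OFF_POLICY_ALGOS = [DQN]


@pytest.mark.parametrize("algo_cls", ON_POLICY_ALGOS)
def test_on_policy_smoke(algo_cls):
    model = algo_cls("MlpPolicy", "CartPole-v1", n_steps=64)
    model.learn(total_timesteps=128)


@pytest.mark.parametrize("algo_cls", OFF_POLICY_ALGOS)
def test_off_policy_smoke(algo_cls):
    model = algo_cls("MlpPolicy", "CartPole-v1", learning_starts=32)
    model.learn(total_timesteps=128)
```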
stable_baselines3/common/buffers.py
Outdated
@@ -457,7 +457,7 @@ def compute_returns_and_advantage(self, last_values: th.Tensor, dones: th.Tensor
                next_non_terminal = ~dones
                next_values = last_values
            else:
-               next_non_terminal = 1.0 - self.episode_starts[step + 1]
+               next_non_terminal = ~self.episode_starts[step + 1]
next_non_terminal seems like a confusing name here. Thoughts on is_next_non_terminal? next_non_terminal reads to me as referring to the next non-terminal value, but this variable seems to refer to whether the following value to be iterated through is non-terminal.
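For reference, here is a minimal sketch of the backward GAE pass in which this flag appears, following the shape of stable-baselines3's compute_returns_and_advantage. Variable names come from the diff above; the function wrapper is paraphrased for the example and is not copied from this PR.

```python
import torch as th


def compute_gae(rewards, values, episode_starts, last_values, dones,
                gamma=0.99, gae_lambda=0.95):
    """Backward GAE pass; boolean masks play the role of next_non_terminal."""
    buffer_size = rewards.shape[0]
    advantages = th.zeros_like(rewards)
    last_gae_lam = 0.0
    for step in reversed(range(buffer_size)):
        if step == buffer_size - 1:
            # Whether the step *after* the buffer is non-terminal -- hence the naming discussion.
            next_non_terminal = ~dones
            next_values = last_values
        else:
            next_non_terminal = ~episode_starts[step + 1]
            next_values = values[step + 1]
        # Multiplying by the boolean mask zeroes out bootstrapping across episode boundaries.
        delta = rewards[step] + gamma * next_values * next_non_terminal - values[step]
        last_gae_lam = delta + gamma * gae_lambda * next_non_terminal * last_gae_lam
        advantages[step] = last_gae_lam
    return advantages
```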
LGTM, thanks for splitting out the other parts!
}[replay_buffer_cls]
env = make_vec_env(env)

buffer = replay_buffer_cls(100, env.observation_space, env.action_space, device=device)
if replay_buffer_cls == RecurrentRolloutBuffer:
    buffer = RecurrentRolloutBuffer(
Optional: here and below, 100 comes off as a bit of a magic number; it's hard to see why it was picked (and what it does). Passing all of these in as kwargs would make it a little more readable. If this 100 is the "same" 100 that is used in line 153, it would be good to refactor them to use a shared constant.
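One possible shape for this suggestion is sketched below. The constant name is illustrative, and the keyword names match the stable-baselines3 base buffer signature; the PR's RecurrentRolloutBuffer may take additional arguments not shown here.

```python
# Hypothetical refactor: name the size once and pass arguments as keywords
# so the test reads without magic numbers.
BUFFER_SIZE = 100  # shared with the other use of 100 mentioned in the comment

buffer = replay_buffer_cls(
    buffer_size=BUFFER_SIZE,
    observation_space=env.observation_space,
    action_space=env.action_space,
    device=device,
)
```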
I started by copying over the numpy code and then fixing whatever broke. This led to working code, unlike previous iterations. I also added the PPORecurrent algorithm to a bunch of existing tests, which is the same thing that sb3_contrib does.

This PR contains lots of Mypy type errors. I want to fix them when I make things generic over the hidden state, as in #4, and not now. If you check the CircleCI job you'll see that the tests themselves pass.
Other changes: