
Hotfix behavior descriptor #13

Merged: 5 commits merged into main on Jun 8, 2022
Conversation

@felixchalumeau (Collaborator) commented Jun 2, 2022

This pull request solves #12

Observation

We observed that when training MAP-Elites and PGAME on ant-omni with an episode length of 1000, we could only fill a single behavioral niche, which is surprising: with other episode lengths (like 250), many more niches are found.

Explanation

This is due to in-place replacement in Brax, which makes the state descriptor and the next state descriptor equal in our transitions. At the end of the episode, the state descriptor becomes equal to the following one, which corresponds to the initial state of the next episode. As a result, a robot that was able to travel far in the environment is evaluated as if it had not moved at all.
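A minimal sketch of the aliasing problem (hypothetical names, not the actual QDax/Brax API): if the environment mutates its info dictionary in place, the descriptor captured "before" and "after" the step can end up being the same object.

```python
# Toy reproduction of the in-place-replacement bug. FakeEnvState and
# fake_env_step are hypothetical stand-ins for Brax's state and step.

class FakeEnvState:
    def __init__(self):
        # Info dict reused across steps, mutated in place as Brax does.
        self.info = {"state_descriptor": [0.0, 0.0]}

def fake_env_step(state):
    # In-place update: the list object is mutated, not replaced.
    state.info["state_descriptor"][0] += 1.0
    return state

state = FakeEnvState()
desc_before = state.info["state_descriptor"]   # an alias, not a copy
next_state = fake_env_step(state)
desc_after = next_state.info["state_descriptor"]

# Both references now see the post-step value: the transition's
# state_descriptor and next_state_descriptor are equal.
assert desc_before is desc_after
assert desc_before == [1.0, 0.0]
```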

This issue was almost impossible to observe for the "uni" tasks, because there the behavior descriptor is the mean of the state descriptors, so one fake state descriptor does not have a big impact on the behavioral descriptor.
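A quick numerical sketch (with illustrative numbers, not real task values) of why the mean hides the bug: with an episode of length T, one corrupted state descriptor shifts the behavior descriptor by only about 1/T.

```python
# One corrupted entry among 1000 barely moves the mean.
T = 1000
true_descs = [0.5] * T               # per-step descriptor values, say
corrupted = true_descs[:-1] + [0.0]  # last one replaced by the reset state

bd_true = sum(true_descs) / T
bd_corrupted = sum(corrupted) / T
print(bd_true, bd_corrupted)  # 0.5 vs 0.4995: a ~0.1% error, invisible in practice
```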

Why was this issue so hard to notice?

When creating the environments in the notebooks of the repo, we did not pass any episode length to the create function, so it defaulted to 1000. This did not seem to be an issue, because the scoring function uses the episode_length parameter to roll out the desired number of steps in the environment. At the end of such a rollout, the environment does not consider the simulation done, hence does not reset, so the next state descriptor is not the initial state; it is just a nearby state descriptor. The problem is that after 1000 accumulated steps in the environment, the environment considers itself done, stops, and automatically resets. The next state descriptor then corresponds to the state descriptor of the initial state, so the bug has a big impact and appears clearly in the results.

Example: on ant-omni, with an episode length of 200, the bug (a bad behavior descriptor evaluation) is not visible on the first four evaluations (technically there is a bad evaluation, but it is very small), but on the fifth evaluation the environment reaches 1000 steps and applies the auto-reset, making the bug visible. Hence, 20% of the evaluations are strongly incorrect.
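The schedule above can be sketched as a small calculation (the internal limit of 1000 steps is the default episode length from the create function):

```python
# Which evaluations land exactly on Brax's internal 1000-step limit
# and therefore trigger the auto-reset?
episode_length = 200
internal_limit = 1000
n_evals = 10

affected = [
    i for i in range(1, n_evals + 1)
    if (i * episode_length) % internal_limit == 0  # rollout ends on the limit
]
print(affected)                 # [5, 10]: every fifth evaluation
print(len(affected) / n_evals)  # 0.2 -> 20% of evaluations strongly incorrect
```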

Fix proposed

  • first, we explicitly give the episode length when creating the environment
  • second, in the play step function, we store the state descriptor before applying the env.step function, so there is no issue with the in-place replacement made in Brax
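The second fix can be sketched as follows (hypothetical play_step and fake environment, not the actual QDax code; the real function carries more fields). The key point is to read the state descriptor from the state before calling env.step, so the in-place update cannot overwrite what goes into the transition.

```python
import copy

class FakeState:
    def __init__(self, obs, info):
        self.obs = obs
        self.info = info

class FakeEnv:
    # Mimics Brax's in-place behavior: the info dict is shared between
    # the current state and the next state.
    def step(self, state, action):
        state.info["state_descriptor"] = [state.info["state_descriptor"][0] + action]
        return FakeState(state.obs + action, state.info)

def play_step(state, action, env):
    # Capture the descriptor BEFORE env.step: afterwards, state.info
    # aliases next_state.info and the pre-step value would be lost.
    state_desc = copy.deepcopy(state.info["state_descriptor"])
    next_state = env.step(state, action)
    transition = {
        "state_desc": state_desc,
        "next_state_desc": next_state.info["state_descriptor"],
    }
    return next_state, transition

env = FakeEnv()
s0 = FakeState(0.0, {"state_descriptor": [0.0]})
_, transition = play_step(s0, 1.0, env)
print(transition["state_desc"], transition["next_state_desc"])  # [0.0] [1.0]
```

With the capture done before the step, the two descriptors in the transition are distinct, as expected.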

What's next?
We are thinking about ways to avoid any potential mistake with the state descriptor in-place replacement made by Brax. One solution currently under discussion is to create a QDState that would have state_descriptor as an attribute, so it would be stored as an attribute rather than in a dictionary.
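A rough sketch of the QDState idea (hypothetical, not an existing QDax class): wrap the Brax state and carry the descriptor as an explicit, immutable attribute instead of a mutable info-dict entry.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class QDState:
    env_state: Any         # the underlying Brax state
    state_descriptor: Any  # e.g. an array of behavior features

    def with_descriptor(self, desc):
        # Frozen dataclass: returns a NEW QDState, no in-place mutation,
        # so old transitions can never be silently overwritten.
        return QDState(self.env_state, desc)

s = QDState(env_state=None, state_descriptor=(0.0, 0.0))
s2 = s.with_descriptor((1.0, 0.0))
print(s.state_descriptor, s2.state_descriptor)  # (0.0, 0.0) (1.0, 0.0)
```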

@felixchalumeau felixchalumeau marked this pull request as ready for review June 7, 2022 16:55
@felixchalumeau felixchalumeau merged commit ffebaa1 into main Jun 8, 2022
@limbryan limbryan deleted the hotfix-behavior-descriptor branch July 4, 2022 12:57