This pull request solves #12
Observation
We observed that when training MAP-Elites and PGAME on ant-omni with an episode length of 1000, we could only find a single behavioral niche, which is surprising given that with other values (like 250) we can find many more.
Explanation
This is due to in-place replacement in Brax, which makes the state descriptor and the next state descriptor equal in our transitions. At the end of the episode, the state descriptor is set equal to the following one, which corresponds to the initial state of the next episode. Hence, a robot that managed to travel far in the environment is evaluated as if it had not moved at all.

This issue was almost impossible to observe on the "uni" tasks, because there the behavior descriptor is the mean of the state descriptors, so one corrupted state descriptor has little impact on the behavioral descriptor.
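To make the failure mode concrete, here is a minimal sketch of a step function that exhibits the bug; the function name and the transition layout are illustrative, not the exact code of the repo:

```python
def play_step_buggy(env_state, action, env):
    # Brax keeps the state descriptor in env_state.info["state_descriptor"]
    # and replaces that entry in place during env.step.
    next_state = env.step(env_state, action)
    # Reading the descriptor only *after* the step: because of the in-place
    # replacement, this is already the descriptor of the next state...
    state_desc = env_state.info["state_descriptor"]
    next_state_desc = next_state.info["state_descriptor"]
    # ...so both fields of the transition end up equal, and at an episode
    # boundary they both hold the descriptor of the freshly reset state.
    return next_state, (state_desc, next_state_desc)
```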
Why was this issue hard to notice?
When creating the environments in the notebooks of the repo, we did not pass any episode length to the `create` function, so `create` silently used its default value of 1000. This did not look like an issue because the scoring function uses its own `episode_length` parameter to perform the desired number of steps in the environment. At the end of those steps, the environment does not consider the simulation done, hence does not reset, so the next state descriptor is not the initial state, just a nearby state descriptor. The problem is that after 1000 accumulated steps, the environment considers that it is done, stops, and automatically resets. The next state descriptor then corresponds to the state descriptor of the initial state, so the bug has a big impact and shows up clearly in the results.

Example: on ant-omni with an episode length of 200, the bug (a bad behavior descriptor evaluation) is not visible on the first 4 evaluations (technically there is a bad evaluation, but the error is tiny), but on the fifth evaluation the environment reaches the 1000-step limit, applies the auto reset, and the bug becomes visible. Hence, 20% of the evaluations are strongly incorrect.
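A quick sanity check of this arithmetic in plain Python, using the values from the example above:

```python
# Brax's internal episode length defaults to 1000 when none is passed to
# `create`; the scoring function performs 200 steps per evaluation, so the
# auto reset fires whenever the accumulated step count is a multiple of 1000.
brax_episode_length = 1000
rollout_length = 200

corrupted = [
    evaluation
    for evaluation in range(1, 11)
    if (evaluation * rollout_length) % brax_episode_length == 0
]
print(corrupted)                     # [5, 10]
print(f"{len(corrupted) / 10:.0%}")  # 20% of evaluations strongly incorrect
```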
Fix proposed
We now retrieve the state descriptor before calling the `env.step` function, so there is no issue with the in-place replacement made in Brax.
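A minimal sketch of the corrected pattern, mirroring the buggy version shown above (names are illustrative):

```python
def play_step_fixed(env_state, action, env):
    # Retrieve the state descriptor *before* stepping, so the local binding
    # still points to the old value when Brax replaces the dict entry.
    state_desc = env_state.info["state_descriptor"]
    next_state = env.step(env_state, action)
    next_state_desc = next_state.info["state_descriptor"]
    # The two descriptors now genuinely describe consecutive states, even
    # across the auto-reset boundary at the end of an episode.
    return next_state, (state_desc, next_state_desc)
```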
What's next?

We are thinking about ways to avoid any potential mistake with the in-place replacement of the state descriptor made by Brax. One solution currently under discussion is to create a QDState that would have `state_descriptor` as an attribute; it would then be stored as an attribute rather than in a dictionary.
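A rough sketch of what such a QDState could look like; this is purely hypothetical, nothing below exists in the codebase yet:

```python
from typing import Any, Dict

import jax.numpy as jnp
from flax import struct


@struct.dataclass
class QDState:
    """Hypothetical QDState: a Brax-like state that promotes the state
    descriptor from the mutable info dictionary to a pytree attribute."""

    qp: Any  # physics state, as in a Brax State
    obs: jnp.ndarray
    reward: jnp.ndarray
    done: jnp.ndarray
    state_descriptor: jnp.ndarray
    metrics: Dict[str, jnp.ndarray] = struct.field(default_factory=dict)
    info: Dict[str, Any] = struct.field(default_factory=dict)


# Being a frozen dataclass, updating the descriptor requires
# state.replace(state_descriptor=new_desc), which returns a new object
# instead of silently mutating a dictionary shared between states.
```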