This pull request solves #12
Observation
We observed that when training MAP-Elites and PGAME on ant-omni with an episode length of 1000, we could only find a single behavioral niche, which is surprising given that with other values (like 250) we can find many more.
Explanation
This is due to in-place replacement in Brax, which makes the state descriptor and the next state descriptor equal in our transitions. At the end of the episode, the state descriptor is set equal to the following one, which corresponds to the initial state of the next episode. Hence, a robot that managed to travel far in the environment is evaluated as if it had not moved at all.

This issue was almost impossible to observe on the "uni" tasks, because there the behavior descriptor is the mean of the state descriptors, so one corrupted state descriptor has little impact on the behavioral descriptor.
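To make the failure mode concrete, here is a minimal sketch of a step function that exhibits the bug; the function name and the transition layout are illustrative, not the exact code of the repo:

```python
def play_step_buggy(env_state, action, env):
    # Brax keeps the state descriptor in env_state.info["state_descriptor"]
    # and replaces that entry in place during env.step.
    next_state = env.step(env_state, action)
    # Reading the descriptor only *after* the step: because of the in-place
    # replacement, this is already the descriptor of the next state...
    state_desc = env_state.info["state_descriptor"]
    next_state_desc = next_state.info["state_descriptor"]
    # ...so both fields of the transition end up equal, and at an episode
    # boundary they both hold the descriptor of the freshly reset state.
    return next_state, (state_desc, next_state_desc)
```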
Why was this issue hard to notice?
When creating the environments in the notebooks of the repo, we did not pass any episode length to the `create` function, so `create` silently used its default value of 1000. This did not look like an issue because the scoring function uses its own `episode_length` parameter to perform the desired number of steps in the environment. At the end of those steps, the environment does not consider the simulation done, hence does not reset, so the next state descriptor is not the initial state, just a nearby state descriptor. The problem is that after 1000 accumulated steps, the environment considers that it is done, stops, and automatically resets. The next state descriptor then corresponds to the state descriptor of the initial state, so the bug has a big impact and shows up clearly in the results.

Example: on ant-omni with an episode length of 200, the bug (a bad behavior descriptor evaluation) is not visible on the first 4 evaluations (technically there is a bad evaluation, but the error is tiny), but on the fifth evaluation the environment reaches the 1000-step limit, applies the auto reset, and the bug becomes visible. Hence, 20% of the evaluations are strongly incorrect.
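A quick sanity check of this arithmetic in plain Python, using the values from the example above:

```python
# Brax's internal episode length defaults to 1000 when none is passed to
# `create`; the scoring function performs 200 steps per evaluation, so the
# auto reset fires whenever the accumulated step count is a multiple of 1000.
brax_episode_length = 1000
rollout_length = 200

corrupted = [
    evaluation
    for evaluation in range(1, 11)
    if (evaluation * rollout_length) % brax_episode_length == 0
]
print(corrupted)                     # [5, 10]
print(f"{len(corrupted) / 10:.0%}")  # 20% of evaluations strongly incorrect
```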
Fix proposed
We now retrieve the state descriptor before calling the `env.step` function, so there is no issue with the in-place replacement made in Brax.
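A minimal sketch of the corrected pattern, mirroring the buggy version shown above (names are illustrative):

```python
def play_step_fixed(env_state, action, env):
    # Retrieve the state descriptor *before* stepping, so the local binding
    # still points to the old value when Brax replaces the dict entry.
    state_desc = env_state.info["state_descriptor"]
    next_state = env.step(env_state, action)
    next_state_desc = next_state.info["state_descriptor"]
    # The two descriptors now genuinely describe consecutive states, even
    # across the auto-reset boundary at the end of an episode.
    return next_state, (state_desc, next_state_desc)
```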
What's next?

We are thinking about ways to avoid any potential mistake with the in-place replacement of the state descriptor made by Brax. One solution currently under discussion is to create a QDState that would have `state_descriptor` as an attribute; it would then be stored as an attribute rather than in a dictionary.
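A rough sketch of what such a QDState could look like; this is purely hypothetical, nothing below exists in the codebase yet:

```python
from typing import Any, Dict

import jax.numpy as jnp
from flax import struct


@struct.dataclass
class QDState:
    """Hypothetical QDState: a Brax-like state that promotes the state
    descriptor from the mutable info dictionary to a pytree attribute."""

    qp: Any  # physics state, as in a Brax State
    obs: jnp.ndarray
    reward: jnp.ndarray
    done: jnp.ndarray
    state_descriptor: jnp.ndarray
    metrics: Dict[str, jnp.ndarray] = struct.field(default_factory=dict)
    info: Dict[str, Any] = struct.field(default_factory=dict)


# Being a frozen dataclass, updating the descriptor requires
# state.replace(state_descriptor=new_desc), which returns a new object
# instead of silently mutating a dictionary shared between states.
```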