GSoC 2021: Stable Baselines and Ray in Deepbots (DeepbotsZero project).
Mentors: @ManosMagnus, @tsampazk, @passalis
My objectives for this proposal were to implement two humanoid environments for the deepbots framework, using KONDO's KHR-3HV (17 degrees of freedom) and Nao (25 degrees of freedom) as agents. In both environments, I used the Robot-Supervisor scheme provided by the deepbots framework, because of the high-dimensional observation space of both robots. The goal of each agent is to maximize a simple objective: walking the largest possible distance.
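The sketch below shows, in gym-style pseudinterface form, how such a walking environment can be laid out: a Box action space with one entry per motor, a high-dimensional observation vector, and a reward equal to the forward distance covered since the previous step. The class name, observation layout, and stubbed simulator calls are illustrative assumptions, not the actual deepbots implementation, which runs inside Webots through the Robot-Supervisor scheme.

```python
import gym
import numpy as np
from gym import spaces


class HumanoidWalkEnv(gym.Env):
    """Illustrative gym-style sketch of the humanoid walking task.

    The real environments follow deepbots' Robot-Supervisor scheme inside
    Webots; here the simulator calls are stubbed out to show the interface.
    """

    def __init__(self, num_motors=17):  # 17 DoF for KHR-3HV, 25 for Nao
        super().__init__()
        # One target position per motor, normalized to [-1, 1].
        self.action_space = spaces.Box(low=-1.0, high=1.0,
                                       shape=(num_motors,), dtype=np.float32)
        # High-dimensional observation: joint positions/velocities plus torso pose.
        obs_dim = 2 * num_motors + 6
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(obs_dim,), dtype=np.float32)
        self._previous_x = 0.0

    def reset(self):
        # In the real environment this resets the Webots simulation.
        self._previous_x = 0.0
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        # In the real environment the action is sent to the robot's motors
        # and the next observation is read back from the simulator.
        observation = np.zeros(self.observation_space.shape, dtype=np.float32)
        current_x = 0.0  # forward position of the torso (stubbed)
        # Reward the distance walked since the previous step.
        reward = current_x - self._previous_x
        self._previous_x = current_x
        done = False  # e.g., the robot fell or the episode timed out
        return observation, reward, done, {}
```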
I planned to use Proximal Policy Optimization (PPO) [done] and Twin Delayed DDPG (TD3) [future work]. Both are provided by the Stable Baselines and RLlib codebases. The choice of algorithms is not arbitrary: the two belong to different families (on-policy policy gradient and off-policy actor-critic), which exploit different environment properties. Moreover, PPO does not require much time for hyperparameter tuning and behaves better than other on-policy gradient-based algorithms (REINFORCE, TRPO), while TD3 is an improved implementation of DDPG that uses Clipped Double Q-learning to reduce the overestimation bias in the maximization of the Q value in the loss function.
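As a rough illustration of how both algorithms can be driven from the same environment, here is a minimal training sketch assuming Stable-Baselines3 and the hypothetical `HumanoidWalkEnv` from the sketch above; the timestep budget and save paths are placeholders, not the values used in the project.

```python
from stable_baselines3 import PPO, TD3

# Hypothetical environment from the sketch above.
env = HumanoidWalkEnv(num_motors=17)

# PPO: on-policy, needs relatively little hyperparameter tuning.
ppo_model = PPO("MlpPolicy", env, verbose=1)
ppo_model.learn(total_timesteps=1_000_000)
ppo_model.save("khr3hv_ppo")

# TD3: off-policy, uses a replay buffer and Clipped Double Q-learning.
td3_model = TD3("MlpPolicy", env, verbose=1)
td3_model.learn(total_timesteps=1_000_000)
td3_model.save("khr3hv_td3")
```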
Finally, since the purpose is to have a reusable environment, I dedicated a considerable amount of effort to documenting usage examples and the parameters that can be taken into account. In addition, I created a Docker image, which makes configuration easier and enforces consistency across different platforms.
Future work:
- Keep the same action for a time frame of 10 simulation steps; since the simulator is quite fast, this might help the policy (see the wrapper sketch after this list).
- Use TD3, which has a replay buffer, and compare the results with PPO.
- Extend the current API (i.e., `gym.make("CartPole-v1")` from OpenAI Gym) to allow other users to create new environments easily.
- Work on accumulative learning, e.g., train a policy using lots of robots and evaluate it on all of them.
- Work on zero-shot learning, e.g., train a policy on one robot and evaluate it on another.
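For the first item, a generic gym wrapper is one way to repeat each policy action over a fixed time frame. The sketch below is an assumption about how this could be done, not code from the project; the repeat count of 10 matches the time frame mentioned in the list but is a tunable parameter.

```python
import gym


class ActionRepeat(gym.Wrapper):
    """Repeat each policy action for a fixed number of simulator steps."""

    def __init__(self, env, repeat=10):
        super().__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.repeat):
            observation, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return observation, total_reward, done, info


# Illustrative usage with the hypothetical environment sketched earlier:
# wrapped_env = ActionRepeat(HumanoidWalkEnv(num_motors=17), repeat=10)
```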
Thank you all for so willingly giving me your time and guidance throughout our conversations.