
GSoC 2021: Stable Baselines and Ray in Deepbots (DeepbotsZero project)


Table of Contents

  1. DeepbotsZero
    1. Project
    2. Completed Work
    3. Enhancements / To be done / Optional
  2. Weekly Update
  3. Many, many thanks to my mentors!

DeepbotsZero

Mentors: @ManosMagnus, @tsampazk, @passalis

Project

My objectives for this proposal were to implement two humanoid environments for the deepbots framework, using KONDO's KHR-3HV (17 degrees of freedom) and Nao (25 degrees of freedom) as agents. In both environments I used the Robot-Supervisor scheme provided by the deepbots framework, because of the high-dimensional observation space of both robots. The goal of each agent is to maximize a simple objective: walking the largest possible distance.
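To make the scheme concrete, here is a minimal sketch of what a KHR-3HV Robot-Supervisor environment can look like. The class name `KHR3HVRobotSupervisor`, the motor names, the observation layout, and the fall threshold are illustrative assumptions rather than the project's actual implementation, and the deepbots base-class name and import path vary between releases (`RobotSupervisor` vs. `RobotSupervisorEnv`).

```python
import numpy as np
from gym.spaces import Box

# Import path follows older deepbots releases; newer versions expose RobotSupervisorEnv instead.
from deepbots.supervisor.controllers.robot_supervisor import RobotSupervisor

# Hypothetical device names -- the real KHR-3HV PROTO defines its own 17 motor names.
MOTOR_NAMES = ["LeftHipPitch", "RightHipPitch", "LeftKnee", "RightKnee"]


class KHR3HVRobotSupervisor(RobotSupervisor):
    """Single-controller (Robot-Supervisor) environment for the KHR-3HV walker."""

    def __init__(self):
        super().__init__()
        # Observation: joint positions plus the torso's world position (layout is illustrative).
        self.observation_space = Box(low=-np.inf, high=np.inf,
                                     shape=(len(MOTOR_NAMES) + 3,), dtype=np.float64)
        # Action: one normalized position target per motor.
        self.action_space = Box(low=-1.0, high=1.0, shape=(len(MOTOR_NAMES),), dtype=np.float64)

        self.robot_node = self.getSelf()  # supervisor handle to this robot's node
        self.motors = [self.getDevice(name) for name in MOTOR_NAMES]
        self.previous_x = self.robot_node.getPosition()[0]

    def get_observations(self):
        # Joint target positions plus the torso's world position.
        joint_positions = [motor.getTargetPosition() for motor in self.motors]
        return np.array(joint_positions + self.robot_node.getPosition())

    def get_reward(self, action):
        # Reward forward progress along the x axis since the previous step.
        x = self.robot_node.getPosition()[0]
        reward = x - self.previous_x
        self.previous_x = x
        return reward

    def is_done(self):
        # Episode ends when the torso drops below a height threshold, i.e. the robot fell
        # (vertical axis and threshold are assumptions for this sketch).
        return self.robot_node.getPosition()[1] < 0.3

    def apply_action(self, action):
        # Map each normalized action in [-1, 1] to the motor's position limits
        # (assumes the PROTO defines finite minPosition/maxPosition for every joint).
        for motor, a in zip(self.motors, action):
            low, high = motor.getMinPosition(), motor.getMaxPosition()
            motor.setPosition(low + (float(a) + 1.0) * 0.5 * (high - low))

    # Remaining deepbots callbacks (default observation, info, solved, etc.) omitted for brevity.
```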

I planned to use Proximal Policy Optimization (PPO) [Done] and Twin Delayed DDPG (TD3) [Future work]. Both are provided by the Stable Baselines and RLlib codebases. The choice of algorithms is not arbitrary: the two belong to different families (on-policy policy gradient and off-policy actor-critic) and exploit different environment properties. Moreover, PPO does not require much time spent on hyperparameter tuning and behaves better than other on-policy gradient-based algorithms (REINFORCE, TRPO), while TD3 is an improved implementation of DDPG that uses Clipped Double Q-learning to reduce the overestimation bias caused by maximizing the Q value in the loss function.
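For reference, training PPO on such an environment takes only a few lines. The snippet below assumes the Stable Baselines3 interface, uses the hypothetical `KHR3HVRobotSupervisor` class sketched above, and shows illustrative hyperparameters rather than the values tuned in this project.

```python
from stable_baselines3 import PPO

# KHR3HVRobotSupervisor is the hypothetical environment from the sketch above;
# in Webots this script runs as the robot's controller.
env = KHR3HVRobotSupervisor()

# Illustrative hyperparameters -- not the project's tuned values.
model = PPO("MlpPolicy", env, n_steps=2048, batch_size=64, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("khr3hv_ppo")
```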

Finally, since the purpose is to have a reusable environment, I dedicated a considerable amount of effort to documenting examples of usage and the parameters that can be taken into account. In addition, I created a Docker image, which makes configuration easier for the user and enforces consistency across different platforms.

Completed Work

Enhancements / To be done / Optional

  • Keep the same action for a time frame of 10 simulator steps. Since the simulator is quite fast, this might help the policy (see the sketch after this list).
  • Use TD3, which has a replay buffer, and compare the results with PPO.
  • Extend the current API (i.e., gym.make("CartPole-v1") from OpenAI Gym) to allow other users to create new environments easily.
  • Work on accumulative learning, such as training a policy using many robots and evaluating it on all of them.
  • Work on zero-shot learning, such as training a policy on one robot and evaluating it on another.
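As a starting point for the first item, an action-repeat wrapper around any Gym-compatible environment could look like the sketch below. The repeat count of 10 mirrors the bullet above; the wrapper itself is a generic illustration, not code from the repository, and it uses the 4-tuple `step` signature of the OpenAI Gym versions current in 2021.

```python
import gym


class ActionRepeatWrapper(gym.Wrapper):
    """Repeat each policy action for a fixed number of simulator steps."""

    def __init__(self, env, repeat=10):
        super().__init__(env)
        self.repeat = repeat

    def step(self, action):
        # Apply the same action `repeat` times, accumulating the reward.
        total_reward = 0.0
        for _ in range(self.repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info


# Usage (with the hypothetical environment sketched earlier):
# env = ActionRepeatWrapper(KHR3HVRobotSupervisor(), repeat=10)
```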

Weekly Update

Many, many thanks to my mentors!

Thank you all for so willingly giving me your time and guidance throughout our conversations.