
GSoC 2021: Stable Baselines and Ray in Deepbots (DeepbotsZero project).


Table of Contents

  1. DeepbotsZero
    1. Project
    2. Completed Work
    3. Enhancements / To be done / Optional
  2. Weekly Update
  3. Many, many thanks to my mentors!

DeepbotsZero

Mentors: @ManosMagnus, @tsampazk, @passalis

Project

My objective for this proposal was to implement two humanoid environments for the deepbots framework, using KONDO's KHR-3HV (17 degrees of freedom) and Nao (25 degrees of freedom) as agents. In both environments I used the Robot-Supervisor scheme provided by the deepbots framework, because of the high-dimensional observation space of both robots. The goal of each agent is to maximize a simple objective: walking the largest possible distance.
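
To make the structure concrete, below is a minimal, hypothetical sketch of such an environment written against the standard Gym interface. The class name, observation size, and placeholder dynamics are illustrative assumptions only; in the real project the observations and the walked distance come from Webots through the deepbots Robot-Supervisor controller.

    # Hypothetical sketch: walking task as a Gym environment.
    # Simulator calls are replaced by placeholders so the overall structure
    # (continuous spaces, step, distance-based reward) is easy to see.
    import gym
    import numpy as np
    from gym import spaces


    class HumanoidWalkEnvSketch(gym.Env):
        """Motor positions as actions, distance walked as reward (illustrative)."""

        def __init__(self, num_motors=17, obs_size=40):  # obs_size is an arbitrary placeholder
            super().__init__()
            # Continuous, high-dimensional spaces, as in the KHR-3HV / Nao setups.
            self.action_space = spaces.Box(-2.35, 2.35, shape=(num_motors,), dtype=np.float32)
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_size,), dtype=np.float32)
            self.prev_pos = 0.0

        def reset(self):
            self.prev_pos = 0.0
            return np.zeros(self.observation_space.shape, dtype=np.float32)

        def step(self, action):
            # Placeholder for "apply motor commands and advance the Webots simulation".
            current_pos = self.prev_pos + float(np.random.uniform(0.0, 0.01))
            # Reward the distance covered since the previous step.
            reward = current_pos - self.prev_pos
            self.prev_pos = current_pos
            obs = np.zeros(self.observation_space.shape, dtype=np.float32)
            return obs, reward, False, {}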

I planned to use Proximal Policy Optimization (PPO) [Done] and Twin Delayed DDPG (TD3) [Future work]. Both are provided by the Stable Baselines and RLlib codebases. The choice of algorithms is not arbitrary: the two algorithms belong to different families (policy gradient and actor-critic), which exploit different environment properties. Moreover, PPO does not require much hyperparameter tuning and performs better than other on-policy gradient-based algorithms (REINFORCE, TRPO), while TD3 is an improved implementation of DDPG that uses Clipped Double Q-learning to reduce the overestimation bias caused by maximizing the Q value in the loss function.
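
As a rough illustration of the training loop, the sketch below uses the Stable Baselines3 PPO API on a stock Gym environment (CartPole as a stand-in); the real project trains on the custom KHR-3HV/Nao environments with their own hyperparameters, and TD3 can later be swapped in the same way.

    # Minimal Stable Baselines3 sketch: PPO now, TD3 as a drop-in later.
    # CartPole is only a stand-in for the custom humanoid environments.
    import gym
    from stable_baselines3 import PPO  # TD3 is imported the same way for future work

    env = gym.make("CartPole-v1")
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100_000)
    model.save("ppo_cartpole")

    # Quick rollout with the trained policy.
    obs = env.reset()
    for _ in range(1000):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()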

Finally, since the purpose is to have a reusable environment, I dedicated a considerable amount of effort to documenting usage examples and the parameters that can be taken into account. In addition, I created a Docker image, which makes configuration easier for the user and enforces consistency across different platforms.

Completed Work

Enhancements / To be done / Optional

  • Keep the same action for a time frame of 10 steps; since the simulator is quite fast, this might help the policy (see the sketch after this list).
  • Use TD3, which has a replay buffer, and compare the results with PPO.
  • Extend the current API (i.e., gym.make("CartPole-v1") from OpenAI Gym) to allow other users to create new environments easily.
  • Work on cumulative learning, e.g., train a policy using many robots and evaluate it on all of them.
  • Work on zero-shot learning, e.g., train a policy on one robot and evaluate it on another.
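
A possible shape for the action-repeat item above, written as a plain Gym wrapper: the repeat count of 10 matches the bullet, while the class name and the summed-reward choice are illustrative assumptions rather than the project's implementation.

    # Sketch of action repeat: apply the same action for a fixed number of
    # simulator steps (10 here, as in the bullet above) and sum the rewards.
    import gym


    class ActionRepeat(gym.Wrapper):
        def __init__(self, env, repeat=10):
            super().__init__(env)
            self.repeat = repeat

        def step(self, action):
            total_reward = 0.0
            for _ in range(self.repeat):
                obs, reward, done, info = self.env.step(action)
                total_reward += reward
                if done:
                    break
            return obs, total_reward, done, info


    # Usage: wrapped_env = ActionRepeat(gym.make("CartPole-v1"), repeat=10)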

Weekly Update

  • 19/05/21 - Initial Meeting

    • Replicate panda example.
    • Initialize the KHR-3HV environment.
  • 26/05/21 - Catch up meeting

    • Make the cartpole example usable with Stable Baselines PPO #43 - Merged.
    • Verify the cartpole example is compatible with Stable Baselines #43 - Merged.
    • Create a small script that moves the KHR-3HV hands - Done.
    • Get the list of all device names (e.g., head, left arm) in RobotUtil.
    • Add a normalize/denormalize utility function to RobotUtil.
  • 2/06/21 - Catch up meeting

    • Keep track of observations and reward functions in a file.
    • (Optional) Check webots-gym repo for vectorization and multiple workers.
  • 9/06/21 - Catch up meeting

    • Add a scaling weight on the battery term of the reward; experiments on file.
    • Add a penalty on the x-axis to the reward function; experiments on file.
  • 16/06/21 - Catch up meeting

    • Run Webots in Docker #91 - Merged.

    • Run experiments with different reward functions:

       # Variant 1: capped z-axis velocity, minus penalties on squared velocities
       # and the x-axis position, plus a small constant.
       reward = min(7, self.robot.getVelocity()[2]) - \
                0.005 * (np.power(self.robot.getVelocity()[2], 2) + np.power(self.robot.getVelocity()[0], 2)) - \
                0.05 * self.robot.getPosition()[0] - \
                0.02 * np.power(np.linalg.norm(self.robot.getVelocity()), 2) + 0.02

       # Variant 2: reward proportional to the z-axis position only.
       r = 1.26 * self.robot.getPosition()[2]

       # Variant 3: as variant 2, minus penalties on actuator effort and the squared x-axis position.
       reward = 1.26 * self.robot.getPosition()[2] - \
                0.02 * np.sum(np.power(self.actuators, 2)) - \
                0.05 * math.pow(self.robot.getPosition()[0], 2)

       # Variant 4: z-axis velocity plus a constant, minus penalties on the squared
       # y-axis position, actuator effort, and the squared x-axis position.
       reward = self.robot.getVelocity()[2] + 0.0625 - \
                50 * math.pow(self.robot.getPosition()[1], 2) - \
                0.02 * np.sum(np.power(self.actuators, 2)) - \
                3 * math.pow(self.robot.getPosition()[0], 2)
    • Use Ray with Wandb for better logging of experiments #49.

  • 07/07/21 - Catch up meeting

    • Check the actions output (whether they are on the scale of -2.35 to 2.35). Note: both Ray (tanh or ReLU) and SB output actions in the range of -2.35 to 2.35.
    • Use keyboard wrapper for verification.
    • Change tanh to ReLU on the Ray side.
    • Check Ray with CUDA. Note: it is possible, since the master thread uses the GPU for SGD and the worker thread uses 1 CPU for the environment interactions. With more CPUs or GPUs there is an error from the environment.
  • 14/07/21 - Catch up meeting

    • Change the MLP to an RNN. (Not possible: Stable Baselines3 PPO does not support an RNN policy instead of the MLP policy.)
    • Frame window on states #49 (see the sketch after this list).
    • Add the velocity of the robot as a state; velocity on the z-axis only #49.
    • Add a reward term: current_pos - prev_pos, wandb link
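
For the frame-window item from the 14/07/21 meeting, a possible implementation is sketched below as a Gym wrapper that stacks the last few observations; the window size, class name, and 1-D observation assumption are illustrative, not the exact code behind #49.

    # Sketch of a frame window on states: keep the last `window` observations
    # in a deque and present their concatenation as the state.
    # Assumes a 1-D Box observation space.
    from collections import deque

    import gym
    import numpy as np


    class FrameWindow(gym.Wrapper):
        def __init__(self, env, window=4):
            super().__init__(env)
            self.window = window
            self.frames = deque(maxlen=window)
            low = np.tile(env.observation_space.low, window)
            high = np.tile(env.observation_space.high, window)
            self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

        def reset(self, **kwargs):
            obs = self.env.reset(**kwargs)
            for _ in range(self.window):
                self.frames.append(obs)
            return np.concatenate(list(self.frames)).astype(np.float32)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.frames.append(obs)
            return np.concatenate(list(self.frames)).astype(np.float32), reward, done, info


    # Usage: windowed_env = FrameWindow(gym.make("CartPole-v1"), window=4)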

Many, many thanks to my mentors!

Thank you all for so willingly giving me your time and guidance throughout our conversations.