
GSoC 2021: Stable Baselines and Ray in Deepbots (DeepbotsZero project).

Nikolaos Kokkinis-Ntrenis edited this page Aug 23, 2021

Table of Contents

  1. DeepbotsZero
    1. Project
    2. Completed Work
    3. Enhancements / To be done / Optional
  2. Weekly Update
  3. Many, many thanks to my mentors!

DeepbotsZero

Mentors: @ManosMagnus, @tsampazk, @passalis

Project

My objectives for this proposal were to implement two humanoid environments for the deepbots framework, using KONDO's KHR-3HV (17 degrees of freedom) and Nao (25 degrees of freedom) as agents. In both environments I used the Robot-Supervisor scheme provided by the deepbots framework, because of the high-dimensional observation space of both robots. The goal of each agent is to maximize a simple objective: walk the largest possible distance.

I planned to use Proximal Policy Optimization (PPO) [Done] and Twin Delayed DDPG (TD3) [Future work]. Both are provided by the Stable-Baselines and RLlib codebases. The choice of algorithms is not arbitrary: the two algorithms belong to different families (policy gradient and actor-critic), which exploit different environment properties. Moreover, PPO does not need much time spent on hyperparameter tuning and behaves better than other on-policy gradient-based algorithms (REINFORCE, TRPO), while TD3 is an improved implementation of DDPG that uses Clipped Double Q-learning, which reduces the overestimation bias in the maximization of the Q value in the loss function.

Finally, since the purpose is to have a reusable environment, I dedicated a considerable amount of effort to documenting usage examples and the parameters that can be taken into account. In addition, I created a Docker image, which makes configuration easier for users and enforces consistency across different platforms.

Note:

  1. Stable Baselines is an open-source library that contains a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.
  2. RLlib is an open-source library for reinforcement learning, based on Ray, that offers both high scalability and a unified API for a variety of applications.

We can think of both of them as Python libraries that we install on our system; we can then use the implemented RL algorithms simply by calling them as functions.

  3. Weights & Biases (wandb) takes care of tracking and visualizing ML model metrics.

Completed Work

Make the cartpole example to be used with Stable-Baselines PPO #43 - Merged.

I worked on an easy, already-implemented environment in order to better understand Webots and the Robot-Supervisor scheme from deepbots. I used the PPO algorithm from Stable-Baselines to train the cartpole agent.
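
As a rough illustration of how little glue code this takes, here is a minimal sketch using Stable-Baselines3 PPO on the standard Gym CartPole-v1; the deepbots cartpole environment is used the same way once it is exposed through the Gym interface, so the environment name below is only a stand-in:

    import gym
    from stable_baselines3 import PPO

    # Standard Gym cartpole as a stand-in; the deepbots cartpole
    # environment plugs in the same way through the Gym API.
    env = gym.make("CartPole-v1")

    # MlpPolicy: a small fully-connected actor-critic network.
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100_000)

    # Run the trained policy for one episode.
    obs = env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)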

Create a Docker image for the deepbots framework, which combines the Stable-Baselines and Ray frameworks #91 - Merged.

I created a Docker image to make configuration easier for future users. I also needed the Docker image to run experiments on my university cluster: installing programs on the cluster is not allowed, so I have to provide a Docker image for each experiment that I want to run.

Finalize the khr-3hv environment #49

(Figure: the KHR-3HV robot)

To use an RL algorithm we need a descriptive observation space, which is the input of the neural network; the network then decides the actions based on the current state of the agent. We searched for the observation and action spaces that best help the khr-3hv robot learn to walk further away. The latest observation space is defined in the table below. We also use a time window of 5 on the observation space, i.e., we keep the last 5 observations of each episode, so the total observation space is 5 * 10 = 50 (a small sketch of this stacking follows the table).

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Robot position x-axis | -Inf | +Inf |
| 1 | Robot position y-axis | -Inf | +Inf |
| 2 | Robot position z-axis | -Inf | +Inf |
| 3 | Robot velocity z-axis | 0 | +7 |
| 4 | LeftAnkle | -Inf | +Inf |
| 5 | LeftCrus | -Inf | +Inf |
| 6 | LeftFemur | -Inf | +Inf |
| 7 | RightAnkle | -Inf | +Inf |
| 8 | RightCrus | -Inf | +Inf |
| 9 | RightFemur | -Inf | +Inf |
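
A minimal sketch of how such a 5-step time window can be kept; the names below are illustrative and not the actual deepbots robot API:

    from collections import deque

    import numpy as np

    OBS_SIZE = 10  # single-step observation (see the table above)
    WINDOW = 5     # number of past steps kept

    # Rolling buffer of the last WINDOW observations; on reset it is
    # refilled with zeros so the stacked vector always has length 50.
    obs_window = deque([np.zeros(OBS_SIZE)] * WINDOW, maxlen=WINDOW)

    def stacked_observation(new_obs):
        """Append the newest 10-dim observation and return the 5 * 10 = 50 vector."""
        obs_window.append(np.asarray(new_obs, dtype=np.float32))
        return np.concatenate(list(obs_window))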

The action space of the agent is defined in the table below:

| Num | Actuator | Min | Max |
|-----|----------|-----|-----|
| 0 | LeftAnkle | -2.35 | +2.35 |
| 1 | LeftCrus | -2.35 | +2.35 |
| 2 | LeftFemur | -2.35 | +2.35 |
| 3 | RightAnkle | -2.35 | +2.35 |
| 4 | RightCrus | -2.35 | +2.35 |
| 5 | RightFemur | -2.35 | +2.35 |

The reward function that we are currently using is:

reward = 2.5 * self.robot.getPosition()[2] + self.robot.getPosition()[2] - self.prev_pos 

(Figure: reward plot)

which encourages the agent to walk further along the z-axis (forward movement) and, at each timestep, to move further away from its previous position.

This environment:

  • Works with both Ray (RLlib) and Stable-Baselines PPO (see the sketch below),
  • Logs metrics to the Wandb website,
  • Comes with step-by-step documentation.
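
For the Ray side, the environment is registered under a name that RLlib can look up and then trained with its PPO trainer. The snippet below is only a sketch against the Ray 1.x API that was current at the time (newer Ray versions rename these classes), and a Gym cartpole stands in for the actual khr-3hv robot-supervisor environment:

    import gym
    import ray
    from ray.tune.registry import register_env
    from ray.rllib.agents.ppo import PPOTrainer  # Ray 1.x API

    # The factory would construct the deepbots khr-3hv environment;
    # a Gym cartpole is used here so the sketch stays self-contained.
    register_env("khr3hv-walk", lambda env_config: gym.make("CartPole-v1"))

    ray.init()
    trainer = PPOTrainer(env="khr3hv-walk",
                         config={"framework": "torch", "num_workers": 1})

    for i in range(5):
        result = trainer.train()  # one RLlib training iteration
        print(i, result["episode_reward_mean"])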

Finalize the Nao environment #49

(Figure: the Nao robot)

The observation space for the Nao environment is defined in the table below. It also uses a time window of 5 on the observation space, i.e., we keep the last 5 observations of each episode, so the total observation space is 5 * 10 = 50.

| Num | Observation | Min | Max |
|-----|-------------|-----|-----|
| 0 | Robot position x-axis | -Inf | +Inf |
| 1 | Robot position y-axis | -Inf | +Inf |
| 2 | Robot position z-axis | -Inf | +Inf |
| 3 | Robot velocity z-axis | 0 | +6.4 |
| 4 | LAnklePitch | -Inf | +Inf |
| 5 | LHipPitch | -Inf | +Inf |
| 6 | LKneePitch | -Inf | +Inf |
| 7 | RAnklePitch | -Inf | +Inf |
| 8 | RHipPitch | -Inf | +Inf |
| 9 | RKneePitch | -Inf | +Inf |

The action space of the agent is defined in the table below:

| Num | Actuator | Min | Max |
|-----|----------|-----|-----|
| 0 | LAnklePitch | -1.18 | +0.91 |
| 1 | LHipPitch | -1.76 | +0.47 |
| 2 | LKneePitch | -0.08 | +2.10 |
| 3 | RAnklePitch | -1.18 | +0.91 |
| 4 | RHipPitch | -1.76 | +0.47 |
| 5 | RKneePitch | -0.08 | +2.10 |

The reward function that we are currently using is:

reward = 2.5 * self.robot.getPosition()[2] + self.robot.getPosition()[2] - self.prev_pos 

(Figure: reward plot)

which encourages the agent to walk further along the z-axis (forward movement) and, at each timestep, to move further away from its previous position.

This environment:

  • Works with both Ray (RLlib) and Stable-Baselines PPO,
  • Logs metrics to the Wandb website (see the sketch below),
  • Comes with step-by-step documentation.
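
The Wandb logging mentioned above boils down to a handful of calls; this is a minimal sketch where the project name, configuration, and logged keys are placeholders rather than the ones used in the repository:

    import wandb

    # Placeholder project name and config; the actual values live in the
    # Nao/khr-3hv training scripts.
    wandb.init(project="deepbots-humanoid",
               config={"algo": "PPO", "total_timesteps": 1_000_000})

    # Inside the training loop, metrics are pushed per episode.
    for episode in range(10):
        episode_reward = 0.0  # accumulate the real episode reward here
        wandb.log({"episode": episode, "episode_reward": episode_reward})

    wandb.finish()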

Enhancements / To be done / Optional

  • On the Nao environment, better tuning of the hyperparameters is needed.
  • Keep the same action for a time frame of 10 steps; since the simulator is quite fast, this might help the policy.
  • Use TD3, which has a replay (memory) buffer, and compare the results with PPO.
  • Extend the current API (i.e., gym.make("CartPole-v1") from OpenAI Gym) to allow other users to create new environments easily (a rough sketch of such a registration follows this list).
  • Work on accumulative learning, e.g., train a policy using many robots and evaluate it on all of them.
  • Work on zero-shot learning, e.g., train a policy on one robot and evaluate it on another.
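
For the API-extension item above, the idea would follow OpenAI Gym's own registration mechanism: register an id once, then build the environment with gym.make. The sketch below points the entry point at Gym's cartpole class purely so the example runs; a real implementation would point it at a deepbots environment class instead (the id shown here is hypothetical):

    import gym
    from gym.envs.registration import register

    # Register a custom id. A deepbots environment class would replace the
    # entry_point below; Gym's own cartpole class is used as a stand-in.
    register(
        id="DeepbotsCartPole-v0",
        entry_point="gym.envs.classic_control:CartPoleEnv",
        max_episode_steps=1000,
    )

    env = gym.make("DeepbotsCartPole-v0")
    obs = env.reset()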

Weekly Update

  • 19/05/21 - Initial Meeting

    • Replicate panda example.
    • Initialize the KHR-3HV environment.
  • 26/05/21 - Catch up meeting

    • Make the cartpole example to be used with Stable-Baselines PPO #43 - Merged.
    • Verify the cartpole example is compatible with Stable-Baselines #43 - Merged.
    • Create a small script that moves the KHR-3HV hands. Done.
    • Get the list of all the device names (e.g., head, left arm) in RobotUtil.
    • Have a normalize/denormalize util function in RobotUtil.
  • 2/06/21 - Catch up meeting

    • Keep track of the observations and reward functions in a file.
    • (Optional) Check webots-gym repo for vectorization and multiple workers.
  • 9/06/21 - Catch up meeting

    • Add a scale weight on the reward for the battery; experiments on file.
    • Add a penalty on the x-axis in the reward function; experiments on file.
  • 16/06/21 - Catch up meeting

    • Run Webots on Docker #91 - Merged.

    • Run experiments with different reward functions

        # Candidate reward functions tried during these experiments:

        # 1) Clipped forward velocity, penalized by velocity magnitude and
        #    by drifting away on the x-axis, plus a small constant.
        reward = min(7, self.robot.getVelocity()[2]) - \
                 0.005 * (np.power(self.robot.getVelocity()[2], 2) + np.power(self.robot.getVelocity()[0], 2)) - \
                 0.05 * self.robot.getPosition()[0] - \
                 0.02 * np.power(np.linalg.norm(self.robot.getVelocity()), 2) + 0.02

        # 2) Forward position only.
        reward = 1.26 * self.robot.getPosition()[2]

        # 3) Forward position, penalized by actuator effort and x-axis drift.
        reward = 1.26 * self.robot.getPosition()[2] - \
                 0.02 * np.sum(np.power(self.actuators, 2)) - \
                 0.05 * math.pow(self.robot.getPosition()[0], 2)

        # 4) Forward velocity plus a constant offset, penalized by y-axis
        #    deviation, actuator effort, and x-axis drift.
        reward = self.robot.getVelocity()[2] + 0.0625 - \
                 50 * math.pow(self.robot.getPosition()[1], 2) - \
                 0.02 * np.sum(np.power(self.actuators, 2)) - \
                 3 * math.pow(self.robot.getPosition()[0], 2)
    • Use Ray with Wandb for better logging of experiments #49.

  • 07/07/21 - Catch up meeting

    • Check the action outputs (whether they are in the range of -2.35 to 2.35 on the khr-3hv robot). Note: both Ray (tanh or ReLU) and SB output actions in the range of -2.35 to 2.35.
    • Use the keyboard wrapper for verification.
    • Change tanh to ReLU on Ray.
    • Check Ray with CUDA. Note: it is possible, since the master thread uses the GPU for SGD and a worker thread uses 1 CPU for the environment interactions. With more CPUs or GPUs there is an error from the environment.
  • 14/07/21 - Catch up meeting

    • Initialize Nao
    • Change the MLP to an RNN. (Not possible: Stable-Baselines3 PPO does not support an RNN policy in place of the MLP policy.)
    • Frame window on states #49.
    • Add the velocity of the robot as a state (velocity on the z-axis only) #49.
    • Add the reward term current_pos - prev_pos: khr-3hv, Nao.

Many, many thanks to my mentors!

Thank you all for so willingly giving me your time and guidance throughout this project.