
GSoC 2021: Stable Baselines and Ray in Deepbots (DeepbotsZero project).


Table of Contents

  1. DeepbotsZero
    1. Project
    2. Completed Work
    3. Enhancements / To be done / Optional
  2. Weekly Update
  3. Many, many thanks to my mentors!

DeepbotsZero

Mentors: @ManosMagnus, @tsampazk, @passalis

Project

My objective for this proposal was to implement two humanoid environments for the deepbots framework, using KONDO's KHR-3HV (17 degrees of freedom) and Nao (25 degrees of freedom) as agents. In both environments I used the Robot-Supervisor scheme provided by the deepbots framework, because of the high-dimensional observation space of both robots. The goal of each agent is to maximize a simple objective: walking the largest possible distance.
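
To make the structure concrete, below is a minimal, hypothetical sketch of such an environment written against the standard Gym interface. The class name, observation size, and placeholder dynamics are illustrative assumptions only; in the real project the observations and the walked distance come from Webots through the deepbots Robot-Supervisor controller.

    # Hypothetical sketch: walking task as a Gym environment.
    # Simulator calls are replaced by placeholders so the overall structure
    # (continuous spaces, step, distance-based reward) is easy to see.
    import gym
    import numpy as np
    from gym import spaces


    class HumanoidWalkEnvSketch(gym.Env):
        """Motor positions as actions, distance walked as reward (illustrative)."""

        def __init__(self, num_motors=17, obs_size=40):  # obs_size is an arbitrary placeholder
            super().__init__()
            # Continuous, high-dimensional spaces, as in the KHR-3HV / Nao setups.
            self.action_space = spaces.Box(-2.35, 2.35, shape=(num_motors,), dtype=np.float32)
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_size,), dtype=np.float32)
            self.prev_pos = 0.0

        def reset(self):
            self.prev_pos = 0.0
            return np.zeros(self.observation_space.shape, dtype=np.float32)

        def step(self, action):
            # Placeholder for "apply motor commands and advance the Webots simulation".
            current_pos = self.prev_pos + float(np.random.uniform(0.0, 0.01))
            # Reward the distance covered since the previous step.
            reward = current_pos - self.prev_pos
            self.prev_pos = current_pos
            obs = np.zeros(self.observation_space.shape, dtype=np.float32)
            return obs, reward, False, {}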

I planned to use Proximal Policy Optimization (PPO) [Done] and Twin Delayed DDPG (TD3) [Future work]. Both are provided by the Stable Baselines and RLlib codebases. The choice of algorithms is not arbitrary: the two algorithms belong to different families (policy gradient and actor-critic), which exploit different environment properties. Moreover, PPO does not require much hyperparameter tuning and performs better than other on-policy gradient-based algorithms (REINFORCE, TRPO), while TD3 is an improved implementation of DDPG that uses Clipped Double Q-learning to reduce the overestimation bias caused by maximizing the Q value in the loss function.
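
As a rough illustration of the training loop, the sketch below uses the Stable Baselines3 PPO API on a stock Gym environment (CartPole as a stand-in); the real project trains on the custom KHR-3HV/Nao environments with their own hyperparameters, and TD3 can later be swapped in the same way.

    # Minimal Stable Baselines3 sketch: PPO now, TD3 as a drop-in later.
    # CartPole is only a stand-in for the custom humanoid environments.
    import gym
    from stable_baselines3 import PPO  # TD3 is imported the same way for future work

    env = gym.make("CartPole-v1")
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100_000)
    model.save("ppo_cartpole")

    # Quick rollout with the trained policy.
    obs = env.reset()
    for _ in range(1000):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        if done:
            obs = env.reset()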

Finally, since the purpose is to have a reusable environment, I dedicated a considerable amount of effort to documenting usage examples and the parameters that can be taken into account. In addition, I created a Docker image, which makes configuration easier for the user and enforces consistency across different platforms.

Completed Work

Enhancements / To be done / Optional

  • Keep the same action for a time frame of 10 steps; since the simulator is quite fast, this might help the policy (see the sketch after this list).
  • Use TD3, which has a replay buffer, and compare the results with PPO.
  • Extend the current API (i.e., gym.make("CartPole-v1") from OpenAI Gym) to allow other users to create new environments easily.
  • Work on cumulative learning, e.g., train a policy using many robots and evaluate it on all of them.
  • Work on zero-shot learning, e.g., train a policy on one robot and evaluate it on another.
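
A possible shape for the action-repeat item above, written as a plain Gym wrapper: the repeat count of 10 matches the bullet, while the class name and the summed-reward choice are illustrative assumptions rather than the project's implementation.

    # Sketch of action repeat: apply the same action for a fixed number of
    # simulator steps (10 here, as in the bullet above) and sum the rewards.
    import gym


    class ActionRepeat(gym.Wrapper):
        def __init__(self, env, repeat=10):
            super().__init__(env)
            self.repeat = repeat

        def step(self, action):
            total_reward = 0.0
            for _ in range(self.repeat):
                obs, reward, done, info = self.env.step(action)
                total_reward += reward
                if done:
                    break
            return obs, total_reward, done, info


    # Usage: wrapped_env = ActionRepeat(gym.make("CartPole-v1"), repeat=10)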

Weekly Update

  • 19/05/21 - Initial Meeting

    • Replicate panda example.
    • Initialize the KHR-3HV environment.
  • 26/05/21 - Catch up meeting

    • Make the cartpole example usable with Stable Baselines PPO #43 - Merged.
    • Verify the cartpole example is compatible with Stable Baselines #43 - Merged.
    • Create a small script that moves the KHR-3HV hands - Done.
    • Get the list of all device names (e.g., head, left arm) in RobotUtil.
    • Add a normalize/denormalize utility function to RobotUtil.
  • 2/06/21 - Catch up meeting

    • Keep track of observations and reward functions in a file.
    • (Optional) Check webots-gym repo for vectorization and multiple workers.
  • 9/06/21 - Catch up meeting

    • Add a scaling weight on the battery term of the reward; experiments on file.
    • Add a penalty on the x-axis to the reward function; experiments on file.
  • 16/06/21 - Catch up meeting

    • Run Webots in Docker #91 - Merged.

    • Run experiments with different reward functions:

       # Variant 1: capped z-axis velocity, minus penalties on squared velocities
       # and the x-axis position, plus a small constant.
       reward = min(7, self.robot.getVelocity()[2]) - \
                0.005 * (np.power(self.robot.getVelocity()[2], 2) + np.power(self.robot.getVelocity()[0], 2)) - \
                0.05 * self.robot.getPosition()[0] - \
                0.02 * np.power(np.linalg.norm(self.robot.getVelocity()), 2) + 0.02

       # Variant 2: reward proportional to the z-axis position only.
       r = 1.26 * self.robot.getPosition()[2]

       # Variant 3: as variant 2, minus penalties on actuator effort and the squared x-axis position.
       reward = 1.26 * self.robot.getPosition()[2] - \
                0.02 * np.sum(np.power(self.actuators, 2)) - \
                0.05 * math.pow(self.robot.getPosition()[0], 2)

       # Variant 4: z-axis velocity plus a constant, minus penalties on the squared
       # y-axis position, actuator effort, and the squared x-axis position.
       reward = self.robot.getVelocity()[2] + 0.0625 - \
                50 * math.pow(self.robot.getPosition()[1], 2) - \
                0.02 * np.sum(np.power(self.actuators, 2)) - \
                3 * math.pow(self.robot.getPosition()[0], 2)
    • Use Ray with Wandb for better logging of experiments #49.

  • 07/07/21 - Catch up meeting

    • Check the actions output (whether they are on the scale of -2.35 to 2.35). Note: both Ray (tanh or ReLU) and SB output actions in the range of -2.35 to 2.35.
    • Use keyboard wrapper for verification.
    • Change tanh to ReLU on the Ray side.
    • Check Ray with CUDA. Note: it is possible, since the master thread uses the GPU for SGD and the worker thread uses 1 CPU for the environment interactions. With more CPUs or GPUs there is an error from the environment.
  • 14/07/21 - Catch up meeting

    • Change the MLP to an RNN. (Not possible: Stable Baselines3 PPO does not support an RNN policy instead of the MLP policy.)
    • Frame window on states #49 (see the sketch after this list).
    • Add the velocity of the robot as a state; velocity on the z-axis only #49.
    • Add a reward term: current_pos - prev_pos, wandb link
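
For the frame-window item from the 14/07/21 meeting, a possible implementation is sketched below as a Gym wrapper that stacks the last few observations; the window size, class name, and 1-D observation assumption are illustrative, not the exact code behind #49.

    # Sketch of a frame window on states: keep the last `window` observations
    # in a deque and present their concatenation as the state.
    # Assumes a 1-D Box observation space.
    from collections import deque

    import gym
    import numpy as np


    class FrameWindow(gym.Wrapper):
        def __init__(self, env, window=4):
            super().__init__(env)
            self.window = window
            self.frames = deque(maxlen=window)
            low = np.tile(env.observation_space.low, window)
            high = np.tile(env.observation_space.high, window)
            self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

        def reset(self, **kwargs):
            obs = self.env.reset(**kwargs)
            for _ in range(self.window):
                self.frames.append(obs)
            return np.concatenate(list(self.frames)).astype(np.float32)

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.frames.append(obs)
            return np.concatenate(list(self.frames)).astype(np.float32), reward, done, info


    # Usage: windowed_env = FrameWindow(gym.make("CartPole-v1"), window=4)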

Many, many thanks to my mentors!

Thank you all for so willingly giving me your time and guidance throughout our conversations.