GSoC 2021: Stable Baselines and Ray in Deepbots (DeepbotsZero project).
Mentors: @ManosMagnus, @tsampazk, @passalis
My objectives for this proposal were to implement two humanoid environments for the deepbots framework, using KONDO's KHR-3HV (17 degrees of freedom) and Nao (25 degrees of freedom) as agents. In both environments I used the Robot-Supervisor scheme provided by the deepbots framework, because of the high-dimensional observation space of both robots. The goal of each agent is to maximize a simple objective: walking the largest possible distance.
I planned to use Proximal Policy Optimization (PPO) [Done] and Twin Delayed DDPG (TD3) [Future work]. Both are provided by the Stable Baselines and RLlib codebases. The choice of algorithms is not arbitrary: the two belong to different families (on-policy policy gradient and off-policy actor-critic) which exploit different environment properties. Moreover, PPO requires little time for hyperparameter tuning and behaves better than other on-policy gradient-based algorithms (REINFORCE, TRPO), while TD3 is an improved implementation of DDPG that uses Clipped Double Q-learning to reduce the overestimation bias caused by maximizing the Q value in the loss function.
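As a point of reference, here is a minimal sketch of what training such an environment with Stable Baselines3 PPO looks like; the `KHR3HVRobotSupervisor` class and module name are illustrative placeholders for the actual environment from the khr-3hv example.

```python
from stable_baselines3 import PPO

# Hypothetical import: the real environment class lives in the khr-3hv example (#49).
from khr3hv_env import KHR3HVRobotSupervisor

env = KHR3HVRobotSupervisor()            # Robot-Supervisor env exposing a Gym-style interface
model = PPO("MlpPolicy", env, verbose=1) # PPO with a plain MLP policy
model.learn(total_timesteps=1_000_000)   # maximize the walked distance
model.save("khr3hv_ppo")                 # keep the trained weights for later evaluation
```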
Finally, since the purpose is to have a reusable environment, I dedicated a considerable amount of effort to documenting usage examples and the parameters that can be taken into account. In addition, I created a Docker image, which makes configuration easier and enforces consistency across different platforms.
- Make the cartpole example usable with Stable Baselines PPO #43 - Merged.
- Create a Docker image for the deepbots framework which combines the Stable Baselines and Ray frameworks #91 - Merged.
- Finalize the khr-3hv environment #49, which:
  - works with Ray and Stable Baselines PPO,
  - logs to the Wandb website,
  - has step-by-step documentation.
- Finalize the Nao environment, which:
  - works with Ray and Stable Baselines PPO,
  - logs to the Wandb website,
  - has step-by-step documentation.
- Keep the same action over a time frame of 10 steps; since the simulator is quite fast, this might help the policy.
- Use TD3, which has a replay buffer, and compare the results with PPO.
- Extend the current API (i.e., `gym.make("CartPole-v1")` from OpenAI Gym) to allow other users to create new environments easily; a rough sketch is given after this list.
- Work on accumulative learning, e.g., train a policy on many robots and evaluate it on all of them.
- Work on zero-shot learning, e.g., train a policy on one robot and evaluate it on another.
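The gym.make-style extension mentioned above could mirror Gym's registration pattern; the sketch below is purely illustrative, and every name in it (`deepbots_register`, `deepbots_make`, `"KHR3HV-v0"`) is a hypothetical placeholder, not an existing deepbots API.

```python
# Hypothetical registry sketch mirroring gym.make(); none of these names exist in deepbots yet.
_REGISTRY = {}

def deepbots_register(env_id, entry_point):
    """Associate an environment id with the callable that builds it."""
    _REGISTRY[env_id] = entry_point

def deepbots_make(env_id, **kwargs):
    """Instantiate a registered environment by id, analogous to gym.make()."""
    return _REGISTRY[env_id](**kwargs)

# Intended usage once an environment class exists:
# deepbots_register("KHR3HV-v0", KHR3HVRobotSupervisor)
# env = deepbots_make("KHR3HV-v0")
```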
19/05/21 - Initial Meeting
26/05/21 - Catch up meeting
- Make the cartpole example usable with Stable Baselines PPO #43 - Merged.
- Verify that the cartpole example is compatible with Stable Baselines #43 - Merged.
- Create a small script that moves the KHR-3HV hands - Done (a minimal sketch is given after this list).
- Get the list of all device names (e.g., head, left arm) in RobotUtil.
- Add a normalize/denormalize util function to RobotUtil.
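A minimal sketch of these three items using the Webots controller API; the motor name "LeftArm" and the normalization helper are assumptions, and the real device names come from the listing itself.

```python
from controller import Robot

robot = Robot()
timestep = int(robot.getBasicTimeStep())

# List every device name on the robot (head, arms, legs, ...), as needed for RobotUtil.
device_names = [robot.getDeviceByIndex(i).getName()
                for i in range(robot.getNumberOfDevices())]
print(device_names)

# "LeftArm" is an assumed motor name; pick a real one from the list above.
left_arm = robot.getDevice("LeftArm")
left_arm.setPosition(0.5)  # target joint angle in radians

def normalize_to_range(value, min_val, max_val, new_min, new_max):
    """Map value from [min_val, max_val] to [new_min, new_max]; swap the ranges to denormalize."""
    return (value - min_val) / (max_val - min_val) * (new_max - new_min) + new_min

while robot.step(timestep) != -1:
    pass  # keep stepping so the motor command is actually executed
```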
2/06/21 - Catch up meeting
- Keep a file that tracks the observations and reward functions.
- (Optional) Check the webots-gym repo for vectorization and multiple workers; a rough sketch of what vectorization could look like follows.
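A hedged sketch of vectorized training with Stable Baselines3, assuming each worker can start its own Webots instance (which is exactly what this item was meant to verify); the environment class name is again a placeholder.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def make_env():
    # Each subprocess builds its own environment (and would need its own Webots instance).
    from khr3hv_env import KHR3HVRobotSupervisor  # placeholder module/class
    return KHR3HVRobotSupervisor()

if __name__ == "__main__":
    vec_env = SubprocVecEnv([make_env for _ in range(4)])  # 4 parallel workers
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=1_000_000)
```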
9/06/21 - Catch up meeting
16/06/21 - Catch up meeting
- Run Webots on Docker #91 - Merged.
- Run experiments with different reward functions:
```python
reward = min(7, self.robot.getVelocity()[2]) - \
         0.005 * (np.power(self.robot.getVelocity()[2], 2) + np.power(self.robot.getVelocity()[0], 2)) - \
         0.05 * self.robot.getPosition()[0] - \
         0.02 * np.power(np.linalg.norm(self.robot.getVelocity()), 2) + 0.02
```

```python
r = 1.26 * self.robot.getPosition()[2]
```

```python
reward = 1.26 * self.robot.getPosition()[2] - \
         0.02 * np.sum(np.power(self.actuators, 2)) - \
         0.05 * math.pow(self.robot.getPosition()[0], 2)
```
- Video reward: the video that illustrates this reward.
```python
reward = self.robot.getVelocity()[2] + 0.0625 - \
         50 * math.pow(self.robot.getPosition()[1], 2) - \
         0.02 * np.sum(np.power(self.actuators, 2)) - \
         3 * math.pow(self.robot.getPosition()[0], 2)
```
- Use Ray with Wandb for better logging of experiments #49.
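A hedged sketch of how the Ray + Wandb logging can be wired through Ray Tune's Wandb integration (Ray 1.x import paths); the environment registration and project name are placeholders, and a Wandb API key is assumed to be configured, e.g., via `wandb login`.

```python
import ray
from ray import tune
from ray.tune.registry import register_env
from ray.tune.integration.wandb import WandbLoggerCallback

def env_creator(env_config):
    from khr3hv_env import KHR3HVRobotSupervisor  # placeholder module/class
    return KHR3HVRobotSupervisor()

ray.init()
register_env("khr3hv", env_creator)

tune.run(
    "PPO",
    config={"env": "khr3hv", "framework": "torch", "num_workers": 1},
    stop={"timesteps_total": 1_000_000},
    # Streams metrics to the Wandb website; the project name is illustrative.
    callbacks=[WandbLoggerCallback(project="deepbots-khr3hv")],
)
```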
07/07/21 - Catch up meeting
- Check the actions output (whether they are on the scale of -2.35 to 2.35). Note: both Ray (tanh or relu) and SB output actions in the range of -2.35 to 2.35.
- Use the keyboard wrapper for verification.
- Change tanh to ReLU on the Ray side.
- Check Ray with CUDA. Note: it is possible, since the master thread uses the GPU for SGD and one slave thread uses 1 CPU for the environment interactions; requesting more CPUs or GPUs raises an error from the environment. (A sketch of this configuration is given after this list.)
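A sketch of the Ray configuration behind these notes, using the Ray 1.x RLlib API: one GPU for the learner, a single CPU rollout worker, the hidden-layer activation switched from the default tanh to ReLU, and a quick check of the sampled action scale. It assumes the "khr3hv" environment was registered as in the Wandb sketch above.

```python
import numpy as np
from ray.rllib.agents.ppo import PPOTrainer

config = {
    "env": "khr3hv",      # assumes register_env("khr3hv", ...) was called beforehand
    "framework": "torch",
    "num_gpus": 1,        # master thread runs SGD on the GPU
    "num_workers": 1,     # one slave thread / CPU for environment interactions
    "model": {"fcnet_activation": "relu"},  # default is tanh
}
trainer = PPOTrainer(config=config)

# Rough sanity check that actions stay within the expected -2.35..2.35 motor range.
obs = np.zeros(trainer.get_policy().observation_space.shape)
action = trainer.compute_action(obs)
print(action.min(), action.max())
```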
14/07/21 - Catch up meeting
- Change the MLP to an RNN. (Not possible: Stable Baselines3 PPO does not support an RNN policy in place of the MLP policy.)
- Frame window on states #49 (a sketch is given after this list).
- Add the robot's velocity as a state. Only the velocity on the z axis #49.
- Add a reward term: current_pos - prev_pos (wandb link).
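A minimal sketch of the frame-window observation and the position-delta reward term from this meeting; the window size of 5 and the observation size are assumptions, and z is taken as the walking axis as in the reward functions above.

```python
from collections import deque
import numpy as np

class FrameWindow:
    """Keep the last k observations and expose them as one flat state vector."""
    def __init__(self, k=5, obs_size=30):  # window and observation sizes are assumptions
        self.frames = deque([np.zeros(obs_size)] * k, maxlen=k)

    def push(self, obs):
        self.frames.append(np.asarray(obs, dtype=np.float64))
        return np.concatenate(self.frames)

def progress_reward(current_pos, prev_pos):
    """Reward forward progress between consecutive steps along the walking (z) axis."""
    return current_pos[2] - prev_pos[2]
```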
Thank you all for so willingly giving me your time and guidance throughout our conversations.