This repository contains the implementation of a Proximal Policy Optimization (PPO) agent to control a humanoid in the Gymnasium Mujoco environment. The agent is trained to master complex humanoid locomotion using deep reinforcement learning.
Here is a demonstration of the agent's performance after training for 3000 epochs on the Humanoid-v4 environment.
To get started with this project, follow these steps:

- Clone the Repository:

  ```bash
  git clone https://github.com/ProfessorNova/PPO-Humanoid.git
  cd PPO-Humanoid
  ```

- Set Up Python Environment: Make sure you have Python installed (tested with Python 3.10.11).

- Install Dependencies: Run the following command to install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

  For a proper PyTorch installation, visit pytorch.org and follow the instructions for your system configuration.

- Install Gymnasium Mujoco: You need to install the Mujoco environment to simulate the humanoid (a quick sanity check is sketched after these steps):

  ```bash
  pip install gymnasium[mujoco]
  ```

- Train the Model: To start training the model, run:

  ```bash
  python train.py --run-name "my_run"
  ```

  To train using a GPU, add the `--cuda` flag:

  ```bash
  python train.py --run-name "my_run" --cuda
  ```

- Monitor Training Progress: You can monitor the training progress by viewing the videos in the `videos` folder or by looking at the graphs in TensorBoard:

  ```bash
  tensorboard --logdir "logs"
  ```
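If you want to confirm that the Mujoco installation works before starting a long training run, a minimal check along these lines can help. The script below is purely illustrative and is not part of the repository:

```python
# Illustrative sanity check (not part of this repository):
# confirms that the Humanoid-v4 Mujoco environment loads and steps correctly.
import gymnasium as gym

env = gym.make("Humanoid-v4")
obs, info = env.reset(seed=0)
print("observation shape:", obs.shape)          # (376,) for Humanoid-v4
print("action shape:", env.action_space.shape)  # (17,) joint torque commands

# Run a short rollout with random actions.
total_reward = 0.0
for _ in range(100):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print("random-policy reward over 100 steps:", total_reward)
```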
This project implements a reinforcement learning agent using the Proximal Policy Optimization (PPO) algorithm, a popular method for continuous control tasks. The agent is designed to learn how to control a humanoid robot in a simulated environment.
- Agent: The core neural network model that outputs both a policy (action distribution) and a value estimate (an illustrative actor-critic sketch is shown after this list).
- Environment: The Humanoid-v4 environment from the Gymnasium Mujoco suite, which provides a realistic physics simulation for testing control algorithms.
- Buffer: A class for storing the trajectories (observations, actions, rewards, etc.) that the agent collects while interacting with the environment. This data is later used to compute advantages and train the model.
- Training Script: The `train.py` script handles the training loop, including collecting data, updating the model, and logging results.
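The actual network is defined in the repository's source code. As a rough sketch of what an actor-critic agent for Humanoid-v4 can look like, the following PyTorch module uses assumed layer sizes and a diagonal Gaussian policy head; these are illustrative choices, not the repository's exact architecture:

```python
# Illustrative actor-critic sketch; the real model is defined in this repository's source.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int = 376, act_dim: int = 17, hidden: int = 256):
        super().__init__()
        # Shared observation encoder (hidden sizes are assumed for illustration).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Policy head: mean of a diagonal Gaussian over the joint torques.
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        # Value head: scalar state-value estimate.
        self.value = nn.Linear(hidden, 1)

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        dist = torch.distributions.Normal(self.mu(h), self.log_std.exp())
        return dist, self.value(h).squeeze(-1)

# Usage: sample an action and get its log-probability and value estimate.
model = ActorCritic()
obs = torch.randn(1, 376)
dist, value = model(obs)
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)
```

A diagonal Gaussian with a learned, state-independent log standard deviation is a common choice for continuous-control PPO agents.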
You can customize the training by modifying the command-line arguments:

- `--n-envs`: Number of environments to run in parallel (default: 48).
- `--n-epochs`: Number of epochs to train the model (default: 3000).
- `--n-steps`: Number of steps per environment per epoch (default: 1024).
- `--batch-size`: Batch size for training (default: 8192).
- `--train-iters`: Number of training iterations per epoch (default: 20).

For example:

```bash
python train.py --run-name "experiment_1" --n-envs 64 --batch-size 4096 --train-iters 30 --cuda
```

All hyperparameters can be viewed either with `python train.py --help` or by looking at the `parse_args()` function in `train.py`.
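For orientation, an argparse setup similar in spirit to `parse_args()` might look like the sketch below. The first few flag names and defaults come from this README; the remaining flags are inferred from the hyperparameter table further down and may not match the script exactly:

```python
# Illustrative argparse sketch; the actual parse_args() in train.py may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train a PPO agent on Humanoid-v4")
    parser.add_argument("--run-name", type=str, required=True, help="Name used for logs and checkpoints")
    parser.add_argument("--cuda", action="store_true", help="Train on GPU if available")
    parser.add_argument("--env", type=str, default="Humanoid-v4", help="Gymnasium environment id")
    parser.add_argument("--n-envs", type=int, default=48, help="Parallel environments")
    parser.add_argument("--n-epochs", type=int, default=3000, help="Training epochs")
    parser.add_argument("--n-steps", type=int, default=1024, help="Steps per environment per epoch")
    parser.add_argument("--batch-size", type=int, default=8192, help="Minibatch size")
    parser.add_argument("--train-iters", type=int, default=20, help="Training iterations per epoch")
    parser.add_argument("--gamma", type=float, default=0.995, help="Discount factor")
    parser.add_argument("--gae-lambda", type=float, default=0.98, help="GAE lambda")
    parser.add_argument("--clip-ratio", type=float, default=0.1, help="PPO clipping ratio")
    parser.add_argument("--ent-coef", type=float, default=1e-5, help="Entropy bonus coefficient")
    parser.add_argument("--vf-coef", type=float, default=1.0, help="Value loss coefficient")
    parser.add_argument("--learning-rate", type=float, default=3e-4, help="Adam learning rate")
    return parser.parse_args()
```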
Here are the specifications of the system used for training:
- CPU: AMD Ryzen 9 5900X
- GPU: Nvidia RTX 3080 (12GB VRAM)
- RAM: 64GB DDR4
- OS: Windows 11
The training process took about 5 hours to complete 3000 epochs on the Humanoid-v4 environment.
The hyperparameters used for training are as follows:
| param | value |
| --- | --- |
| run_name | baseline |
| cuda | True |
| env | Humanoid-v4 |
| n_envs | 48 |
| n_epochs | 3000 |
| n_steps | 1024 |
| batch_size | 8192 |
| train_iters | 20 |
| gamma | 0.995 |
| gae_lambda | 0.98 |
| clip_ratio | 0.1 |
| ent_coef | 1e-05 |
| vf_coef | 1.0 |
| learning_rate | 0.0003 |
| learning_rate_decay | 0.999 |
| max_grad_norm | 1.0 |
| reward_scale | 0.005 |
| render_epoch | 50 |
| save_epoch | 200 |
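To show where values such as `gamma`, `gae_lambda`, `clip_ratio`, `ent_coef`, and `vf_coef` enter the algorithm, here is a generic sketch of Generalized Advantage Estimation and the clipped PPO objective. It illustrates the standard formulation rather than the exact code in `train.py`:

```python
# Generic PPO/GAE sketch showing where gamma, gae_lambda, clip_ratio, ent_coef and
# vf_coef enter the update; this is not the exact implementation in train.py.
import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.995, gae_lambda=0.98):
    """Generalized Advantage Estimation over a rollout of shape (T,)."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * gae_lambda * next_nonterminal * gae
        advantages[t] = gae
    returns = advantages + values
    return advantages, returns

def ppo_loss(new_log_probs, old_log_probs, advantages, new_values, returns, entropy,
             clip_ratio=0.1, vf_coef=1.0, ent_coef=1e-5):
    """Clipped surrogate objective plus value and entropy terms."""
    ratio = (new_log_probs - old_log_probs).exp()
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (new_values - returns).pow(2).mean()
    return policy_loss + vf_coef * value_loss - ent_coef * entropy.mean()
```

In a typical PPO training loop, advantages are computed per environment after each rollout and the combined loss is minimized with Adam over several minibatch passes per epoch.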
The following charts provide insights into the performance during training:

- Reward chart: the agent's average reward is still increasing after 3000 epochs, indicating that the agent has not yet reached its full potential and could benefit from further training.

- Value loss chart: the value loss first increases and then decreases until it plateaus after 100M steps. This behavior is expected, as the agent first explores the environment and then learns to predict the value of states more accurately.