Adopting reasonable strategies is challenging but crucial for an intelligent agent with limited resources working in hazardous, unstructured, and dynamically changing environments: it improves system utility, decreases overall cost, and increases the probability of mission success. Deep Reinforcement Learning (DRL) organizes an agent's behaviors and actions based on its state and can represent complex strategies (compositions of actions). This project proposes a novel hierarchical strategy decomposition approach based on Bayesian chaining, which separates an intricate policy into several simple sub-policies and organizes their relationships as a Bayesian strategy network (BSN). We integrate this approach into the state-of-the-art DRL method, Soft Actor-Critic (SAC), and build the corresponding Bayesian Soft Actor-Critic (BSAC) model by organizing the sub-policies into a joint policy.
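Stated compactly, the Bayesian chain rule factorizes the joint policy into the sub-policies of the BSN. Below is a sketch of the assumed factorization (the exact conditioning structure depends on the chosen BSN):

```latex
% Sketch of the BSN factorization implied by the description above;
% pa(a_i) denotes the parent action components of a_i in the network.
\pi(a_1, \ldots, a_n \mid s) \;=\; \prod_{i=1}^{n} \pi_i\bigl(a_i \mid s,\ \mathrm{pa}(a_i)\bigr)
```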
This implementation requires Anaconda, OpenAI Gym, MuJoCo, PyTorch, and rl-plotter.
- Install OpenAI Gym:
pip install gym
- Install MuJoCo (version 2.0):
mkdir -p ~/.mujoco && cd ~/.mujoco
wget https://www.roboti.us/download/mujoco200_linux.zip
unzip mujoco200_linux.zip
mv mujoco200_linux mujoco200
- Copy your MuJoCo license key (mjkey.txt) to the following paths:
cp mjkey.txt ~/.mujoco
cp mjkey.txt ~/.mujoco/mujoco200/bin
- Add environment variables (append them to your ~/.bashrc to persist across shells):
export LD_LIBRARY_PATH=~/.mujoco/mujoco200/bin${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export MUJOCO_KEY_PATH=~/.mujoco/mjkey.txt
- Download mujoco-py and create a conda environment:
cd ~
git clone https://github.com/openai/mujoco-py.git
conda create -n myenv python=3.6
source activate myenv
sudo apt-get install build-essential
- Install dependencies and build mujoco-py:
cd ~/mujoco-py
pip install -r requirements.txt
pip install -r requirements.dev.txt
python setup.py install
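After the install, a quick sanity check (a minimal sketch, assuming the MuJoCo 2.0 setup above) confirms that mujoco-py and Gym work before running the experiments:

```python
# Minimal sanity check for the mujoco-py / Gym install; the import fails
# if MuJoCo or the license key is set up incorrectly.
import mujoco_py  # noqa: F401
import gym

env = gym.make("Hopper-v2")          # any MuJoCo -v2 task will do
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print("obs shape:", obs.shape, "reward:", reward)
env.close()
```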
- Install reinforcement learning (RL) plotter -- rl-plotter:
pip install rl_plotter
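rl-plotter records and plots learning curves. A minimal logging sketch follows; the `Logger` arguments below are assumptions based on the rl-plotter README and may differ across versions:

```python
# Hypothetical minimal use of rl-plotter's logger; check the installed
# version's README for the exact signature.
from rl_plotter.logger import Logger

logger = Logger(exp_name="bsac", env_name="Hopper-v2", seed=0)

episode_reward, total_steps = 123.4, 1000   # placeholders for the training loop
logger.update(score=[episode_reward], total_steps=total_steps)
# Saved curves can then be plotted with the `rl_plotter` command-line tool.
```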
- Hopper-v2 with 3-factor BSAC:
cd ~/hopper-v2_3bsac
python3 main_bsac.py
- Walker2d-v2 with 5-factor BSAC:
cd ~/walker2d-v2_5bsac
python3 main_bsac.py
- Humanoid-v2:
- 3-factor BSAC:
cd ~/humanoid-v2_3bsac
python3 main_bsac.py
- 5-factor BSAC:
cd ~/humanoid-v2_5bsac
python3 main_bsac.py
- 9-factor BSAC:
cd ~/humanoid-v2_9bsac
python3 main_bsac.py
Note: Before running the code, set the data directory in main_bsac.py and networks.py so that results are written to the right location.
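This is typically a path constant near the top of those files (the variable name below is hypothetical; match whatever names main_bsac.py and networks.py actually use):

```python
# Hypothetical illustration only -- use the actual variable names from
# main_bsac.py and networks.py.
DATA_DIR = "~/bsac_results"  # directory where training data is written
```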
From the theoretical derivation, we formulate the training process of BSAC and implement it in OpenAI's MuJoCo standard continuous-control benchmark domains, such as Hopper, Walker2d, and Humanoid. The results demonstrate the effectiveness of the proposed architecture in application domains with high-dimensional action spaces, where it achieves higher performance than state-of-the-art RL methods. Furthermore, we believe the potential generality and practicability of BSAC invite further theoretical and empirical investigation. In particular, implementing BSAC on real robots is not only a challenging problem but will also help us develop robust computational models for multi-agent/robot systems, such as robot locomotion control, multi-robot planning and navigation, and robot-aided search and rescue missions.
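For reference, the standard SAC maximum-entropy objective that BSAC optimizes over the factorized joint policy (a sketch of the usual formulation, not the paper's exact derivation):

```latex
% Standard SAC objective, here applied to the factorized joint policy;
% \alpha is the entropy temperature and \rho_\pi the state-action
% marginal induced by \pi.
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Bigl[ r(s_t, a_t) + \alpha\, \mathcal{H}\bigl(\pi(\cdot \mid s_t)\bigr) \Bigr]
```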