
webserver-based-parallel-rl-training

A simplified, scalable distributed RL training framework supporting asynchronous RL training with an arbitrary number of rollout workers and replay memory servers.

Implemented with pytorch, pb, flask, and libshm.

Still under development...

Usage

Start training:

# NOTE: Run the scripts with `bash` instead of `sh`, which is often a link to dash and may cause compatibility issues
bash start.sh

start.sh creates several folders under the current working directory, with names starting with either running_rollout_ or running_worker_, corresponding to mempool server processes and worker processes, respectively.
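For example, once the run directories have been created you can count them to check how many mempool server and worker processes were launched (a minimal sketch; the exact contents of each directory depend on your configuration):

# Count the mempool server and worker directories created by start.sh
ls -d running_rollout_* 2>/dev/null | wc -l
ls -d running_worker_* 2>/dev/null | wc -l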

Terminate training:

bash kill.sh

Clean all temporary files:

bash clean.sh

Warning: this will delete all log files and model checkpoints at once; back up first if necessary.
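If you want to keep the results of a run, one simple option is to archive the run directories before cleaning (a sketch, assuming the logs and checkpoints live inside the running_* folders created by start.sh):

# Archive the run directories before clean.sh removes them
tar czf backup_$(date +%Y%m%d_%H%M%S).tar.gz running_rollout_* running_worker_*
bash clean.sh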

Parameters

Global variables are set in two files:

  • distributed.config: the number of worker processes and the ports of the replay memory servers
  • global_variables.py: all other training-related variables

My preliminary experimental results on a CPU machine show that when the ratio of mempool processes to worker processes reaches 1:4, the writing and reading speeds of the memory pool are roughly balanced.
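As an illustration, a distributed.config reflecting the 1:4 ratio above could look like the sketch below. The variable names here are hypothetical, so check the actual file in the repository for the real keys:

# Hypothetical distributed.config: one mempool server for every four workers
NUM_WORKERS=4
MEMPOOL_PORTS="8900"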

More information (in Chinese)

See Web Server Based Parallel RL Training Framework
