Defending against adversarial policies in YouShallNotPass by running adversarial fine-tuning. Policies are trained in an alternating fashion: after training the adversary for t1 steps, the victim is trained for t2 steps, then the adversary is trained again for t3 time-steps and so on. Training times ti increase exponentially.
Figure: bursts training. Left: opponents ('normal' pre-trained, adversary trained from scratch, victim policy) trained in an alternating way; middle: 'burst' size; right: win rate.

Figure: bursts training. Left: mean reward for the agents; right: value loss for the agents.
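For intuition, the alternating schedule with exponentially growing burst sizes can be sketched as follows; this is a minimal illustration with made-up constants, not the repository's implementation.

```python
# Hypothetical sketch (not the repository's code) of the alternating "bursts"
# schedule: burst lengths grow exponentially, and the policy being trained
# alternates between the adversary and the victim.

def burst_schedule(t0=1000, factor=2.0, n_bursts=8):
    """Yield (policy_to_train, number_of_time_steps) pairs."""
    t = t0
    for i in range(n_bursts):
        policy = "adversary" if i % 2 == 0 else "victim"
        yield policy, int(t)
        t *= factor  # exponentially increasing burst sizes t1, t2, t3, ...

if __name__ == "__main__":
    for policy, steps in burst_schedule():
        print(f"train {policy:9s} for {steps} time-steps")
```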

In this repository:
- The YouShallNotPass environment is exported into rllib as a multi-agent environment
- Training in 'bursts' is implemented: the victim and the adversary are trained against each other, the policy being trained switches every ti time-steps, and the ti increase exponentially
- The victim is trained against multiple adversaries as well as the normal opponent ('population-based training'; see the sketch after this list)
- Stable Baselines is connected to rllib: sampling is done with rllib and optimization with Stable Baselines
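One way to picture the population-based part is an rllib-style policy mapping function that gives the victim its own policy and samples the opponent from a pool of adversaries plus the 'normal' pre-trained opponent. This is a hedged sketch with made-up policy names, not the repository's actual mapping:

```python
import random

# Hedged sketch of population-based opponent sampling in the spirit of an
# rllib policy_mapping_fn; the policy names are illustrative, not the
# identifiers used in this repository.
OPPONENT_POOL = ["adversary_0", "adversary_1", "adversary_2", "normal_pretrained"]

def policy_mapping_fn(agent_id):
    """Map the victim agent to its own policy and the opponent to a random pool member."""
    if agent_id == "victim":
        return "victim"
    return random.choice(OPPONENT_POOL)

if __name__ == "__main__":
    # A new mapping call (e.g. per episode) can pick a different opponent.
    print([policy_mapping_fn("opponent") for _ in range(5)])
```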
The simplest setup: pull a Docker image
- First, pull the image:
$ docker pull humancompatibleai/better-adversarial-defenses
- To run tests (this will ask for a MuJoCo license):
$ docker run -it humancompatibleai/better-adversarial-defenses
- To open a shell:
$ docker run -it humancompatibleai/better-adversarial-defenses /bin/bash
Or build the image yourself:
- Install Docker and git
- Clone the repository:
$ git clone https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
- Build the Docker image:
$ docker build -t ap_rllib better-adversarial-defenses
- Run tests:
$ docker container run -it ap_rllib
- Run shell:
$ docker container run -it ap_rllib /bin/bash
Native installation (without Docker):
These steps assume Ubuntu Linux or a compatible distribution; they were tested on Ubuntu 18.04.5 LTS and on WSL. A GPU is not required for the project.
The full installation is also described in the Dockerfile.
- Install miniconda
- Clone the repository with submodules:
$ git clone --recursive https://github.com/HumanCompatibleAI/better-adversarial-defenses.git
- Create conda environments from the files adv-tf1.yml and adv-tf2.yml (tf1 is used for Stable Baselines, tf2 is used for rllib):
$ conda env create -f adv-tf1.yml
$ conda env create -f adv-tf2.yml
- Install MuJoCo 1.13. On headless setups, also install Xvfb
- Install MongoDB and create a database named chai
- Install gym_compete and aprl via setup.py (both are included in the repository as submodules):
$ pip install -e multiagent-competition
$ pip install -e adversarial-policies
- With ray 0.8.6 installed, patch your ray installation:
$ python ray/python/ray/setup-dev.py
- Install fonts for rendering:
$ conda install -c conda-forge mscorefonts; mkdir ~/.fonts; cp $CONDA_PREFIX/fonts/*.ttf ~/.fonts; fc-cache -f -v
- Install the project:
$ pip install -e .
- To test the setup with the rllib PPO trainer, run:
(adv-tf2) $ python -m ap_rllib.train --tune test
- The script will automatically log results to Sacred and Tune
- By default, the script asks which configuration to run, but it can be set manually with the --tune argument
- Log files will appear in ~/ray_results/run_type/run_name. Use TensorBoard in this folder
- Checkpoints will be in ~/ray_results/xxx/checkpoint_n/, where xxx and n are stored in the log files, one entry for every iteration. See the example notebook or the script for obtaining the last checkpoint for details
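If you only need the most recent checkpoint programmatically, a simple glob over the layout above is enough. The helper below is a hypothetical sketch, separate from the repository's own script:

```python
import glob
import os
import re

# Hypothetical helper (the repository ships its own script for this): find the
# newest checkpoint under a run directory with the
# ~/ray_results/<run>/checkpoint_<n>/checkpoint-<n> layout.
def latest_checkpoint(run_dir):
    dirs = glob.glob(os.path.join(run_dir, "checkpoint_*"))
    if not dirs:
        return None
    def index(path):
        match = re.search(r"checkpoint_(\d+)$", path)
        return int(match.group(1)) if match else -1
    best = max(dirs, key=index)
    return os.path.join(best, "checkpoint-%d" % index(best))

if __name__ == "__main__":
    print(latest_checkpoint(os.path.expanduser("~/ray_results/example_run")))
```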
Some specific configurations:
- --tune external_cartpole runs training in InvertedPendulum, using the Stable Baselines PPO implementation
- Before running, launch the Stable Baselines server:
(adv-tf1) $ python -m frankenstein.stable_baselines_server
- By default, each policy is trained in a separate thread, so that environment data collection resumes as soon as possible
- However, this increases the number of threads significantly with PBT and many parallel tune trials
- If the number of threads is too high, the --serial option disables multi-threaded training in the Stable Baselines server. The overhead is not significant, as training finishes very quickly compared to data collection
- --tune bursts_exp_withnormal_pbt_sb will run training with Stable Baselines + Bursts + Normal opponent included + PBT (multiple adversaries). Before running, launch the Stable Baselines server as above
- --verbose enables some additional output
- --show_config only shows the configuration and exits
- --resume will restart trials if there are already trials with this name in the results directory
- The notebook tune_pre_restart.ipynb allows converting ray 0.8.6 checkpoints to ray 1.0.1 checkpoints
- If you want to quickly iterate with your config (smaller batch size and no remote workers), pass an option to the trainer (see the sketch after these notes):
--config_override='{"train_batch_size": 1000, "sgd_minibatch_size": 1000, "num_workers": 0, "_run_inline": 1}'
- A large number of processes might run into the open files limit. This might help:
ulimit -n 999999
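The --config_override value is a JSON dictionary that is, conceptually, parsed and merged on top of the selected configuration before training starts. The sketch below uses illustrative default values; the actual handling lives in ap_rllib:

```python
import json

# Simplified sketch: the --config_override JSON string is merged on top of a base
# trainer configuration. The base values here are illustrative, not the project's
# actual defaults; the real handling lives in ap_rllib.
base_config = {"train_batch_size": 65536, "sgd_minibatch_size": 4096, "num_workers": 8}

override = '{"train_batch_size": 1000, "sgd_minibatch_size": 1000, "num_workers": 0, "_run_inline": 1}'
base_config.update(json.loads(override))

print(base_config)  # smaller batches and no remote workers for quick iteration
```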
To make a video:
- (only on headless setups) start a virtual display:
$ Xvfb -screen 0 1024x768x24 & export DISPLAY=:0
- Run:
(adv-tf2) $ python -m ap_rllib.make_video --checkpoint path/to/checkpoint/checkpoint-xxx --config your-config-at-training --display $DISPLAY
- --steps n sets the number of steps to make (1 is 256 steps, which is approximately 1 episode)
- --load_normal True evaluates against the normal opponent instead of the trained one
- --no_video True disables the video. Use this to evaluate the performance with more episodes faster
- We use ray because of its multi-agent support, and thus we have to use TensorFlow 2.0
- We use stable baselines for training because we were unable to replicate results with rllib, even with an independent search for hyperparameters.
- We checkpoint the ray trainer and restore it, and run the whole thing in a separate process to circumvent the ray memory leak issue
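The last point can be illustrated with a stub: each chunk of training runs in a fresh subprocess that picks up the previous checkpoint, trains, saves a new checkpoint, and exits, so any memory leaked inside the process is reclaimed when it dies. This is a conceptual sketch, not the repository's code:

```python
import multiprocessing as mp

# Conceptual sketch of the checkpoint/restore-in-a-subprocess workaround for the
# ray memory leak; train_chunk is a stand-in for restoring an rllib trainer,
# running a few iterations, and saving a checkpoint.
def train_chunk(checkpoint_in, queue):
    # ... restore the trainer from checkpoint_in (if any), train, save a checkpoint ...
    checkpoint_out = (checkpoint_in or 0) + 1  # stub: pretend we saved checkpoint N+1
    queue.put(checkpoint_out)

if __name__ == "__main__":
    checkpoint = None
    for _ in range(3):
        queue = mp.Queue()
        worker = mp.Process(target=train_chunk, args=(checkpoint, queue))
        worker.start()
        checkpoint = queue.get()  # latest checkpoint produced by the subprocess
        worker.join()             # leaked memory is released when the process exits
    print("last checkpoint:", checkpoint)
```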
Repository structure:
Files:
- ap_rllib/train.py: the main train script
- ap_rllib/config.py: configurations for the train script
- ap_rllib/helpers.py: helper functions for the whole project
- ap_rllib/make_video.py: creates videos of the policies
- frankenstein/remote_trainer.py: implements an RLLib trainer that pickles data and sends the filename via HTTP
- frankenstein/stable_baselines_server.py: implements an HTTP server that waits for weights and samples, then trains the policy and returns the updated weights (a simplified sketch of this exchange follows after this list)
- frankenstein/stable_baselines_external_data.py: implements the 'fake' Runner that allows training with the Stable Baselines ppo2 algorithm on existing data
- gym_compete_rllib/gym_compete_to_rllib.py: implements the adapter from the multicomp environments to rllib, and the rllib policy that loads pre-trained weights from multicomp
- gym_compete_rllib/load_gym_compete_policy.py: loads the multicomp weights into a keras policy
- gym_compete_rllib/layers.py: implements the observation/value function normalization code from MlpPolicyValue (multiagent-competition/gym_compete/policy.py)
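A heavily simplified, self-contained sketch of the data flow between frankenstein/remote_trainer.py and frankenstein/stable_baselines_server.py described above: rollout samples are pickled to a file, the filename is sent over HTTP, and the server replies with updated weights. The port, payload format, and stub 'training' below are illustrative assumptions, not the repository's actual protocol:

```python
import json
import pickle
import tempfile
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Heavily simplified sketch of the frankenstein data flow: the trainer pickles
# sampled data to a file and sends the filename over HTTP; the server loads the
# samples, "trains" (a stub here), and replies with updated weights.

class TrainHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        samples_file = json.loads(self.rfile.read(length))["samples_file"]
        with open(samples_file, "rb") as f:
            samples = pickle.load(f)
        new_weights = {"mean_sample": sum(samples) / len(samples)}  # stub "training"
        payload = pickle.dumps(new_weights)
        self.send_response(200)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the example output clean
        pass


def send_samples(samples, url):
    """Client side: pickle the samples to a file and send the filename via HTTP."""
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
        pickle.dump(samples, f)
        samples_file = f.name
    request = urllib.request.Request(
        url, data=json.dumps({"samples_file": samples_file}).encode())
    with urllib.request.urlopen(request) as response:
        return pickle.loads(response.read())


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 8099), TrainHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(send_samples([1.0, 2.0, 3.0], "http://127.0.0.1:8099"))
    server.shutdown()
```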
Folders:
- ap_rllib_experiment_analysis/notebooks: notebooks that analyze runs
- ap_rllib_experiment_analysis: scripts that help with analyzing runs
- frankenstein: the code for integrating Stable Baselines and RLLib
- gym_compete_rllib: connects rllib to the multicomp environment
Submodules:
- adversarial-policies: the original project by Adam Gleave
- multiagent-competition: the environments used in the original project, as well as saved weights
- ray: a copy of the ray repository with patches that make the project work
Other files and folders:
- memory_profile, oom_dummy: files and data to analyze the memory leak
- rock_paper_scissors: sketch implementations of ideas on the Rock-Paper-Scissors game
- tf_agents_ysp.py: implements training in YouShallNotPass with tf-agents
- rlpyt_run.py: implements training in YouShallNotPass with rlpyt
- rs.ipynb: implements random search with a constant output policy in YouShallNotPass
- evolve.ipynb and evolve.py: implement training in YouShallNotPass with neat-python