Skip to content

Latest commit

 

History

History
184 lines (117 loc) · 14.1 KB

README.md

File metadata and controls

184 lines (117 loc) · 14.1 KB

Build Status

SuperSuit introduces a collection of small functions which can wrap reinforcement learning environments to do preprocessing ('microwrappers'). We support Gym for single agent environments and PettingZoo for multi-agent environments (both AECEnv and ParallelEnv environments). Using it to convert space invaders to have a grey scale observation space and stack the last 4 frames looks like:

import gym
from supersuit import color_reduction_v0, frame_stack_v1

env = gym.make('SpaceInvaders-v0')

env = frame_stack_v1(color_reduction_v0(env, 'full'), 4)

Similarly, using SuperSuit with PettingZoo environments looks like

from pettingzoo.butterfly import pistonball_v0
env = pistonball_v0.env()

env = frame_stack_v1(color_reduction_v0(env, 'full'), 4)

You can install SuperSuit via pip install supersuit

Included Functions

clip_reward_v0(env, lower_bound=-1, upper_bound=1) clips rewards to between lower_bound and upper_bound. This is a popular way of handling rewards with significant variance of magnitude, especially in Atari environments.

clip_actions_v0(env) clips Box actions to be within the high and low bounds of the action space. This is a standard transformation applied to environments with continuous action spaces to keep the action passed to the environment within the specified bounds.

color_reduction_v0(env, mode='full') simplifies color information in graphical ((x,y,3) shaped) environments. mode='full' fully greyscales of the observation. This can be computationally intensive. Arguments of 'R', 'G' or 'B' just take the corresponding R, G or B color channel from observation. This is much faster and is generally sufficient.

dtype_v0(env, dtype) recasts your observation as a certain dtype. Many graphical games return uint8 observations, while neural networks generally want float16 or float32. dtype can be anything NumPy would except as a dtype argument (e.g. np.dtype classes or strings).

flatten_v0(env) flattens observations into a 1D array.

frame_skip_v0(env, num_frames) skips num_frames number of frames by reapplying old actions over and over. Observations skipped over are ignored. Rewards skipped over are accumulated. Like Gym Atari's frameskip parameter, num_frames can also be a tuple (min_skip, max_skip), which indicates a range of possible skip lengths which are randomly chosen from (in single agent environments only).

delay_observations_v0(env, delay) Delays observation by delay frames. Before delay frames have been executed, the observation is all zeros. Along with frame_skip, this is the preferred way to implement reaction time for high FPS games.

sticky_actions_v0(env, repeat_action_probability) assigns a probability of an old action "sticking" to the environment and not updating as requested. This is to prevent agents from learning predefined action patterns in highly deterministic games like Atari. Note that the stickiness is cumulative, so an action has a repeat_action_probability^2 chance of an action sticking for two turns in a row, etc. This is the recommended way of adding randomness to Atari by "Machado et al. (2018), "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents"

frame_stack_v1(env, num_frames=4) stacks the most recent frames. For vector games observed via plain vectors (1D arrays), the output is just concatenated to a longer 1D array. 2D or 3D arrays are stacked to be taller 3D arrays. At the start of the game, frames that don't yet exist are filled with 0s. num_frames=1 is analogous to not using this function.

max_observation_v0(env, memory) the resulting observation becomes the max over memory number of prior frames. This is important for Atari environments, as many games have elements that are intermitently flashed on the instead of being constant, due to the peculiarities of the console and CRT TVs. The OpenAI baselines MaxAndSkip Atari wrapper is equivalent to doing memory=2 and then a frame_skip of 4.

normalize_obs_v0(env, env_min=0, env_max=1) linearly scales observations to the range env_min (default 1) to env_max (default 0), given the known minimum and maximum observation values defined in the observation space. Only works on Box observations with float32 or float64 dtypes and finite bounds. If you wish to normalize another type, you can first apply the dtype wrapper to convert your type to float32 or float64.

reshape_v0(env, shape) reshapes observations into given shape.

resize_v0(env, x_size, y_size, linear_interp=False) Performs interpolation to up-size or down-size observation image using area interpolation by default. Linear interpolation is also available by setting linear_interp=True (it's faster and better for up-sizing). This wrapper is only available for 2D or 3D observations, and only makes sense if the observation is an image.

Included Multi-Agent Only Functions

agent_indicator_v0(env, type_only=False) Adds an indicator of the agent ID to the observation, only supports discrete and 1D, 2D, and 3D box. For 1d spaces, the agent ID is converted to a 1-hot vector and appended to the observation (increasing the size of the observation space as necessary). 2d and 3d spaces are treated as images (with channels last) and the ID is converted to n additional channels with the channel that represents the ID as all 1s and the other channel as all 0s (a sort of one hot encoding). This allows MADRL methods like parameter sharing to learn policies for heterogeneous agents since the policy can tell what agent it's acting on. Set the type_only parameter to parse the name of the agent as <type>_<n> and have the appended 1-hot vector only identify the type, rather than the specific agent name. This is useful for games where there are many agents in an environment but few types of agents. Agent indication for MADRL was first introduced in Cooperative Multi-Agent Control Using Deep Reinforcement Learning.

black_death_v1(env) Instead of removing dead actions, observations and rewards are 0 and actions are ignored. This can simplify handling agent death mechanics. The name "black death" does not come from the plague, but from the fact that you see a black image (an image filled with zeros) when you die.

pad_action_space_v0(env) pads the action spaces of all agents to be be the same as the biggest, per the algorithm posed in Parameter Sharing is Surprisingly Useful for Deep Reinforcement Learning. This enables MARL methods that require homogeneous action spaces for all agents to work with environments with heterogeneous action spaces. Discrete actions inside the padded region will be set to zero, and Box actions will be cropped down to the original space.

pad_observations_v0(env) pads observations to be of the shape of the largest observation of any agent with 0s, per the algorithm posed in Parameter Sharing is Surprisingly Useful for Deep Reinforcement Learning. This enables MARL methods that require homogeneous observations from all agents to work in environments with heterogeneous observations. This currently supports Discrete and Box observation spaces.

Environment Vectorization

These functions turn plain Gym environments into vectorized environments, for every common vector environment spec.

gym_vec_env_v0(env, num_envs, multiprocessing=False) creates a Gym vector environment with num_envs copies of the environment. If multiprocessing is True, AsyncVectorEnv is used instead of SyncVectorEnv.

stable_baselines_vec_env_v0(env, num_envs, multiprocessing=False) creates a stable_baselines vector environment with num_envs copies of the environment. If multiprocessing is True, SubprocVecEnv is used instead of DummyVecEnv. Needs stable_baselines to be installed to work.

stable_baselines3_vec_env_v0(env, num_envs, multiprocessing=False) creates a stable_baselines vector environment with num_envs copies of the environment. If multiprocessing is True, SubprocVecEnv is used instead of DummyVecEnv. Needs stable_baselines3 to be installed to work.

concat_vec_envs_v0(vec_env, num_vec_envs, num_cpus=0, base_class='gym') takes in an vec_env which is vector environment (should not have multithreading enabled). Creates a new vector environment with num_vec_envs copies of that vector environment concatenated together and runs them on num_cpus cpus as balanced as possible between cpus. num_cpus=0 means to create 0 new threads, i.e. run the process in an efficient single threaded manner. A use case for this function is given below. If the base class of the resulting vector environment matters as it does for stable baselines, you can use the base_class parameter to switch between "gym" base class and "stable_baselines3"'s base class. Note that both have identical functionality.

Parallel Environment vectorization

Note that a multi-agent environment has a similar interface to a vector environment. Give each possible agent an index in the vector and the vector of agents can be interpreted as a vector of "environments":

agent_1
agent_2
agent_3
...

Where each agent's observation, reward, done, and info will be that environment's data.

The following function performs this conversion.

pettingzoo_env_to_vec_env_v0(env): Takes a PettingZoo ParallelEnv with the following assumptions: no agent death or generation, homogeneous action and observation spaces. Returns a gym vector environment where each "environment" in the vector represents one agent. An arbitrary PettingZoo parallel environment can be enforced to have these assumptions by wrapping it with the pad_action_space, pad_observations, and the black_death wrapper). This conversion to a vector environment can be used to train appropriate pettingzoo environments with standard single agent RL methods such as stable baselines's A2C out of box (example below).

You can also use the concat_vec_envs_v0 functionality to train on several vector environments in parallel, forming a vector which looks like

env_1_agent_1
env_1_agent_2
env_1_agent_3
env_2_agent_1
env_2_agent_2
env_2_agent_3
...

So you can for example train 4 copies of pettingzoo's pistonball environment in parallel with some code like:

from stable_baselines3 import PPO
from pettingzoo.butterfly import pistonball_v3
from supersuit import concat_vec_envs_v0, pettingzoo_env_to_vec_env_v0

env = pistonball_v3.parallel_env()
env = pettingzoo_env_to_vec_env_v0(env)
env = concat_vec_envs_v0(env, 4, base_class='stable_baselines3')

model = PPO('CnnPolicy', env, verbose=3, n_steps=16)
model.learn(total_timesteps=1000000)

vectorize_aec_env_v0(aec_env, num_envs, num_cpus=0) creates an AEC Vector env (API documented in source here). num_cpus=0 indicates that the process will run in a single thread. Values of 1 or more will spawn at most that number of processes.

Note on multiprocessing

Turning on multiprocessing runs each environment in it's own process. Turning this on is typically much slower for fast environments (like card games), but much faster for slow environments (like robotics simulations). Determining which case you are will require testing.

Lambda Functions

If none of the included in micro-wrappers are suitable for your needs, you can use a lambda function (or submit a PR).

action_lambda_v0(env, change_action_fn, change_space_fn) allows you to define arbitrary changes to the actions via change_action_fn(action, space) : action and to the action spaces with change_space_fn(action_space) : action_space. Remember that you are transforming the actions received by the wrapper to the actions expected by the base environment.

observation_lambda_v0(env, observation_fn, observation_space_fn=None) allows you to define arbitrary changes to the via observation_fn(observation) : observation, and observation_space_fn(obs_space) : obs_space. For Box-Box transformations the space transformation will be inferred from change_observation_fn if change_obs_space_fn=None by passing the high and low bounds through the observation_space_fn.

reward_lambda_v0(env, change_reward_fn) allows you to make arbitrary changes to rewards by passing in a change_reward_fn(reward) : reward function. For Gym environments this is called every step to transform the returned reward. For AECEnv, this function is used to change each element in the rewards dictionary every step.

Lambda Function Examples

Adding noise to a Box observation looks like:

env = observation_lambda_v0(env, lambda x : x + np.random.normal(size=x.shape))

Adding noise to a box observation and increasing the high and low bounds to accommodate this extra noise looks like:

env = observation_lambda_v0(env,
    lambda x : x + np.random.normal(size=x.shape),
    lambda obs_space : gym.spaces.Box(obs_space.low-5,obs_space.high+5))

Changing 1d box action space to a Discrete space by mapping the discrete actions to one-hot vectors looks like:

def one_hot(x,n):
    v = np.zeros(n)
    v[x] = 1
    return v

env = action_lambda_v0(env,
    lambda action, act_space : one_hot(action, act_space.shape[0]),
    lambda act_space : gym.spaces.Discrete(act_space.shape[0]))

Citation

If you use this in your research, please cite:

@article{SuperSuit,
  Title = {SuperSuit: Simple Microwrappers for Reinforcement Learning Environments},
  Author = {Terry, Justin K and Black, Benjamin and Hari, Ananth},
  journal={arXiv preprint arXiv:2008.08932},
  year={2020}
}

Reward Program

We have a sort bug/documentation error bounty program, inspired by Donald Knuth's reward checks. People who make mergable PRs which properly address meaningful problems in the code, or which make meaningful improvements to the documentation, can recieve a negotiable check for "hexadecimal dollar" ($2.56) mailed to them, or sent to them via PayPal. To redeem this, just send an email to justinkterry@gmail.com with your mailing adress or PayPal adress. We also pay out 32 cents for small fixes.