Justin Fu edited this page Apr 18, 2020 · 19 revisions

Maze2D

Task name
maze2d-open-v0
maze2d-umaze-v0
maze2d-medium-v0
maze2d-large-v0
maze2d-open-dense-v0
maze2d-umaze-dense-v0
maze2d-medium-dense-v0
maze2d-large-dense-v0

The Maze2D domain involves moving a force-actuated ball (along the X and Y axes) to a fixed target location. The observation consists of the (x, y) location and the corresponding velocities.

The four maze layouts are shown below (from left to right: open, umaze, medium, large):

The four environments maze2d-open-v0, maze2d-umaze-v0, maze2d-medium-v0, and maze2d-large-v0 use a sparse reward, which has a value of 1.0 when the agent (light green ball) is within a 0.5 unit radius of the target (light red ball).

Each environment has a dense reward version, which instead uses the negative exponentiated distance as the reward.
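The two reward schemes can be sketched as follows (a minimal illustration assuming Euclidean distance to the target; the target location used here is hypothetical, not taken from the environments):

```python
import math

TARGET = (1.0, 1.0)   # hypothetical target location for illustration
RADIUS = 0.5          # sparse-reward threshold from the description above

def sparse_reward(x, y):
    """1.0 when the agent is within 0.5 units of the target, else 0.0."""
    dist = math.hypot(x - TARGET[0], y - TARGET[1])
    return 1.0 if dist <= RADIUS else 0.0

def dense_reward(x, y):
    """Negative exponentiated distance to the target: exp(-distance)."""
    dist = math.hypot(x - TARGET[0], y - TARGET[1])
    return math.exp(-dist)
```

The dense reward is maximized (at 1.0) exactly at the target and decays smoothly with distance, which gives gradient signal everywhere, whereas the sparse reward is only informative inside the 0.5-unit radius.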

AntMaze

Task name
antmaze-umaze-v0
antmaze-umaze-diverse-v0
antmaze-medium-diverse-v0
antmaze-medium-play-v0
antmaze-large-diverse-v0
antmaze-large-play-v0

The AntMaze domain uses the same umaze, medium, and large mazes from the Maze2D domain, but replaces the agent with the "Ant" robot from the OpenAI Gym MuJoCo benchmark.

The dataset in 'antmaze-umaze-v0' is generated by commanding a fixed goal location from a fixed starting location (these lie on opposite sides of the wall in the umaze).

For harder tasks, the "diverse" dataset is generated by commanding random goal locations in the maze and navigating the ant to them. The "play" dataset is generated by commanding specific hand-picked goal locations from hand-picked initial positions.

Adroit

Task name
pen-demos-v0
pen-cloned-v0
pen-expert-v0
hammer-demos-v0
hammer-cloned-v0
hammer-expert-v0
door-demos-v0
door-cloned-v0
door-expert-v0
relocate-demos-v0
relocate-cloned-v0
relocate-expert-v0

The Adroit domain involves controlling a 24-DoF robotic hand. There are 4 tasks, from the hand_dapg repository. Clockwise from the top left, they are pen (aligning a pen with a target orientation), door (opening a door), relocate (moving a ball to a target position), and hammer (hammering a nail into a board).

There are 3 datasets for each environment.

  • Demos uses the 25 human demonstrations provided in the DAPG repository.
  • Cloned uses a 50-50 split between demonstration data and 2500 trajectories sampled from a behavior-cloned policy trained on the demonstrations. The demonstration trajectories are copied to match the number of behavior-cloned trajectories.
  • Expert uses 5000 trajectories sampled from an expert that solves the task, provided in the DAPG repository.
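The 50-50 balancing used for the cloned datasets can be sketched as follows (a minimal illustration with placeholder trajectory lists, not the actual D4RL generation code):

```python
def balance_cloned(demo_trajs, bc_trajs):
    """Copy the demonstration trajectories until they match the number of
    behavior-cloned trajectories, yielding a 50-50 split overall."""
    reps = -(-len(bc_trajs) // len(demo_trajs))   # ceiling division
    demos = (demo_trajs * reps)[:len(bc_trajs)]   # tile, then trim to size
    return demos + bc_trajs

# e.g. 25 demonstrations copied out to match 2500 behavior-cloned trajectories
dataset = balance_cloned([f"demo_{i}" for i in range(25)],
                         [f"bc_{i}" for i in range(2500)])
```

With 25 demonstrations and 2500 behavior-cloned trajectories, each demonstration appears 100 times, so both sources contribute 2500 trajectories to the combined dataset.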

Gym

Task name
halfcheetah-random-v0
halfcheetah-medium-v0
halfcheetah-expert-v0
halfcheetah-mixed-v0
halfcheetah-medium-expert-v0
walker2d-random-v0
walker2d-medium-v0
walker2d-expert-v0
walker2d-mixed-v0
walker2d-medium-expert-v0
hopper-random-v0
hopper-medium-v0
hopper-expert-v0
hopper-mixed-v0
hopper-medium-expert-v0
  • Random uses 1M samples from a randomly initialized policy.
  • Expert uses 1M samples from a policy trained to completion with SAC.
  • Medium uses 1M samples from a policy trained to approximately 1/3 the performance of the expert.
  • Mixed uses the replay buffer of a policy trained up to the performance of the medium agent.
  • Medium-Expert uses a 50-50 split of medium and expert data.
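The medium-expert mixture can be sketched as a simple concatenation of equal amounts of data from the two policies (a minimal illustration with placeholder sample lists, not the actual D4RL generation code):

```python
def medium_expert(medium_samples, expert_samples):
    """Concatenate equal numbers of medium and expert samples
    to form a 50-50 mixed dataset."""
    n = min(len(medium_samples), len(expert_samples))
    return medium_samples[:n] + expert_samples[:n]

# e.g. mixing placeholder medium and expert transitions
mixed = medium_expert([("medium", i) for i in range(10)],
                      [("expert", i) for i in range(10)])
```

In the released datasets each half contributes roughly 1M samples, so the mixed dataset spans both the medium policy's behavior and the expert's.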