Tasks
| Task name |
| --- |
| maze2d-open-v0 |
| maze2d-umaze-v0 |
| maze2d-medium-v0 |
| maze2d-large-v0 |
| maze2d-open-dense-v0 |
| maze2d-umaze-dense-v0 |
| maze2d-medium-dense-v0 |
| maze2d-large-dense-v0 |
The Maze2D domain involves moving a force-actuated ball (along the X and Y axes) to a fixed target location. The observation consists of the (x, y) location and the corresponding velocities.
The four maze layouts are shown below (from left to right: open, umaze, medium, large):
The four environments maze2d-open-v0, maze2d-umaze-v0, maze2d-medium-v0, and maze2d-large-v0 use a sparse reward, which has a value of 1.0 when the agent (light green ball) is within a 0.5 unit radius of the target (light red ball).
Each environment has a dense reward version, which instead uses the negative exponentiated distance as the reward.
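For illustration, the two reward variants can be written as follows. This is a minimal sketch based on the description above; the 0.5-unit threshold and the exponentiated distance come from the text, while the function names and the target-position argument are just placeholders, not part of the D4RL API:

```python
import numpy as np

def sparse_reward(obs, target_xy):
    # obs = (x, y, x_velocity, y_velocity); only the position is used for the reward.
    dist = np.linalg.norm(np.asarray(obs)[:2] - np.asarray(target_xy))
    return 1.0 if dist <= 0.5 else 0.0  # 1.0 within a 0.5 unit radius of the target

def dense_reward(obs, target_xy):
    # "Negative exponentiated distance": exp(-distance), approaching 1.0 at the target.
    dist = np.linalg.norm(np.asarray(obs)[:2] - np.asarray(target_xy))
    return np.exp(-dist)
```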
| Task name |
| --- |
| antmaze-umaze-v0 |
| antmaze-umaze-diverse-v0 |
| antmaze-medium-diverse-v0 |
| antmaze-medium-play-v0 |
| antmaze-large-diverse-v0 |
| antmaze-large-play-v0 |
The AntMaze domain uses the same umaze, medium, and large mazes from the Maze2D domain, but replaces the agent with the "Ant" robot from the OpenAI Gym MuJoCo benchmark.
The dataset in 'antmaze-umaze-v0' is generated by commanding a fixed goal location from a fixed starting location (these lie on opposite sides of the wall in the umaze).
For harder tasks, the "diverse" dataset is generated by commanding random goal locations in the maze and navigating the ant to them. The "play" dataset is generated by commanding specific hand-picked goal locations from hand-picked initial positions.
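These datasets are accessed through the standard D4RL interface; a minimal sketch, assuming the d4rl package is installed:

```python
import gym
import d4rl  # importing d4rl registers the AntMaze (and other) environments with Gym

env = gym.make('antmaze-umaze-v0')
dataset = env.get_dataset()  # downloads the dataset on first use and returns a dict of arrays

# Transitions are stored as flat numpy arrays keyed by name.
print(dataset['observations'].shape)  # (N, observation_dim)
print(dataset['actions'].shape)       # (N, action_dim)
print(dataset['rewards'].shape)       # (N,)
print(dataset['terminals'].shape)     # (N,)
```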
| Task name |
| --- |
| pen-demos-v0 |
| pen-cloned-v0 |
| pen-expert-v0 |
| hammer-demos-v0 |
| hammer-cloned-v0 |
| hammer-expert-v0 |
| door-demos-v0 |
| door-cloned-v0 |
| door-expert-v0 |
| relocate-demos-v0 |
| relocate-cloned-v0 |
| relocate-expert-v0 |
The Adroit domain involves controlling a 24-DoF robotic hand. There are 4 tasks, taken from the hand_dapg repository. Clockwise from the top left, they are pen (aligning a pen with a target orientation), door (opening a door), relocate (moving a ball to a target position), and hammer (hammering a nail into a board).
There are 3 datasets for each environment.
- Demos uses the 25 human demonstrations provided in the DAPG repository.
- Cloned uses a 50-50 split between demonstration data and 2500 trajectories sampled from a behavior-cloned policy trained on the demonstrations. The demonstration trajectories are copied to match the number of behavior-cloned trajectories (see the sketch after this list).
- Expert uses 5000 trajectories sampled from an expert that solves the task, provided in the DAPG repository.
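A rough sketch of how the "cloned" split described above can be assembled, assuming demo_trajs and bc_trajs are plain Python lists of trajectories (placeholder names for illustration, not D4RL functions):

```python
import random

def build_cloned_split(demo_trajs, bc_trajs):
    # Tile the 25 human demonstrations until they match the number of
    # behavior-cloned trajectories (2500), giving a 50-50 split by trajectory count.
    reps = -(-len(bc_trajs) // len(demo_trajs))        # ceiling division
    copied_demos = (demo_trajs * reps)[:len(bc_trajs)]
    mixed = copied_demos + bc_trajs
    random.shuffle(mixed)  # interleave the two sources
    return mixed
```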
| Task name |
| --- |
| halfcheetah-random-v0 |
| halfcheetah-medium-v0 |
| halfcheetah-expert-v0 |
| halfcheetah-mixed-v0 |
| halfcheetah-medium-expert-v0 |
| walker2d-random-v0 |
| walker2d-medium-v0 |
| walker2d-expert-v0 |
| walker2d-mixed-v0 |
| walker2d-medium-expert-v0 |
| hopper-random-v0 |
| hopper-medium-v0 |
| hopper-expert-v0 |
| hopper-mixed-v0 |
| hopper-medium-expert-v0 |
- Random uses 1M samples from a randomly initialized policy.
- Expert uses 1M samples from a policy trained to completion with SAC.
- Medium uses 1M samples from a policy trained to approximately 1/3 the performance of the expert.
- Mixed uses the replay buffer of a policy trained up to the performance of the medium agent.
- Medium-Expert uses a 50-50 split of medium and expert data.
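As with the other domains, these datasets are loaded by making the environment and reading its dataset. A minimal sketch, assuming d4rl is installed; d4rl.qlearning_dataset is a convenience wrapper that additionally provides next observations for off-policy algorithms:

```python
import gym
import d4rl

env = gym.make('halfcheetah-medium-v0')

# Raw dataset: flat arrays of observations, actions, rewards, and terminals.
data = env.get_dataset()

# Wrapper that also aligns next_observations with each transition,
# which is convenient for Q-learning style (off-policy) algorithms.
qdata = d4rl.qlearning_dataset(env)
print(qdata['observations'].shape, qdata['next_observations'].shape)
```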