Authors: Divyam Goel, Yufei Wang, Tiancheng Wu, Guixiu Qiao, Pavel Piliptchak, David Held†, Zackory Erickson†
Venue: Accepted as an Oral at CoRL 2025 (selection rate 5.7%).
Abstract:
Standard evaluation protocols in robotic manipulation typically assess policy performance over curated, in‑distribution test sets, offering limited insight into how systems fail under plausible variation. We introduce a red‑teaming framework that probes robustness through object‑centric geometric perturbations, automatically generating CrashShapes—structurally valid, user‑constrained mesh deformations that trigger catastrophic failures in pre‑trained manipulation policies. The method integrates a Jacobian field–based deformation model with a gradient‑free, simulator‑in‑the‑loop optimization strategy. Across insertion, articulation, and grasping tasks, our approach consistently discovers deformations that collapse policy performance, revealing brittle failure modes missed by static benchmarks. By combining task‑level policy rollouts with constraint‑aware shape exploration, we aim to build a general purpose framework for structured, object‑centric robustness evaluation in robotic manipulation. We additionally show that fine‑tuning on individual CrashShapes, a process we refer to as blue‑teaming, improves task success by up to 60 percentage points on those shapes, while preserving performance on the original object, demonstrating the utility of red‑teamed geometries for targeted policy refinement. Finally, we validate both red‑teaming and blue‑teaming results with a real robotic arm, observing that simulated CrashShapes reduce task success from 90% to as low as 22.5%, and that blue‑teaming recovers performance to up to 90% on the corresponding real‑world geometry—closely matching simulation outcomes.
This repository does not ship a ready‑made environment.yml or requirements.txt. Instead, the project relies on the environments defined by its supporting submodules. The parent conda environment for all red‑teaming experiments should be the isaacgym environment created when installing the simulation back‑end. You will also need an apap environment for mesh deformation. Consult the submodule READMEs for their respective dependencies.
Two submodules contain critical components and must be set up before running any experiments:
- `src/apap` – provides the As‑Plausible‑As‑Possible (APAP) deformation code used to generate CrashShapes. Follow the README inside the submodule for setup and tests. Create its own environment (often named `apap`) and run the basic tests described there.
- `src/isaacgym` – wraps the IsaacGym simulation environments. Follow the README inside the submodule to set up the task environments and train the baseline policies before red‑teaming. The trained policies (checkpoints) produced here will be referenced by the red‑teaming pipeline.
The code in this repository implements a population‑based optimisation loop to discover CrashShapes—confounding mesh geometries that cause a trained manipulation policy to fail. Each red‑teaming run samples deformations of a nominal object (using the APAP submodule), assembles the deformed object in the simulator, evaluates the policy’s success on it, and updates the distribution of deformations over multiple iterations.
High‑level flow:
- Nominal asset: A nominal object mesh (e.g. `io.source_mesh_file` in the per-task YAML, such as `conf/cem_drawer.yaml`) and associated keypoint files (`io.anchor_points_file`, `io.handle_points_file`, `io.target_mesh`, `io.alignment_reference`) are prepared in `apap_output_dir`.
- Sampling and deformation: Each iteration, the CEM sampler (standard, clustering, TOPDM, etc.) proposes handle displacements. These displacements are passed to APAP via the `./apap` submodule (see `DrawerOptimizer._run_apap_deformation` for an example). APAP produces deformed meshes in `${io.apap_output_dir}/outputs` and writes them to `output_dir` as `.obj` files.
- Simulation: Deformed meshes are converted to simulator assets (via COACD and URDF generation) and evaluated using the IndustReal/IsaacGym environment (see the `src/core/evaluation/isaac_*` classes). The policy and task are specified under `sim.policy` in the task-specific YAML file; the checkpoint should point to a trained policy from the `isaacgym` submodule.
- Logging: Each trial’s result is logged to a Parquet dataset under `io.parquet_dataset`. The dataset captures the displacement vectors, success rate, and any per‑environment metrics. You can later analyse the data using the provided scripts.
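The flow above can be sketched as a cross‑entropy‑method (CEM) search. The following is a minimal, self‑contained illustration, not the repository's API: the function and argument names are hypothetical, and `evaluate` stands in for the full APAP‑deform, asset‑assembly, and policy‑rollout step, returning the policy's success rate for one candidate displacement vector.

```python
import numpy as np

def red_team_cem(evaluate, n_handles, max_iters=10, population_size=10,
                 elite_frac=0.4, seed=42):
    """Minimal CEM sketch of the red-teaming loop (illustrative names only).

    `evaluate` maps a flat vector of per-handle 3-D displacements to the
    policy's success rate in [0, 1]; lower success = better CrashShape.
    """
    rng = np.random.default_rng(seed)
    dim = 3 * n_handles                      # one 3-D displacement per handle
    mean, std = np.zeros(dim), np.full(dim, 0.05)
    n_elite = max(1, int(elite_frac * population_size))
    best_x, best_score = None, float("inf")
    for _ in range(max_iters):
        pop = rng.normal(mean, std, size=(population_size, dim))
        scores = np.array([evaluate(x) for x in pop])
        order = np.argsort(scores)           # ascending: lowest success first
        elite = pop[order[:n_elite]]
        # Refit the sampling distribution to the elite set.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        if scores[order[0]] < best_score:
            best_x, best_score = pop[order[0]], scores[order[0]]
    return best_x, best_score
```

In the actual pipeline each `evaluate` call is a batch of simulator rollouts on the deformed mesh, and the sampler variants (clustering, TOPDM) change how the population is proposed.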
Outputs from a run include:
- Deformed mesh files in `${output_dir}` and `${io.apap_out_root}`.
- Parquet logs in `${io.parquet_dataset}` documenting per‑iteration successes and failures.
- Optional simulator logs in `${output_dir}/sim_logs` (cleaned up automatically after runs).
Before running red‑teaming, you must prepare assets by extracting anchor and handle points for each object. We provide a two‑stage Visual–Language Model (VLM) inference pipeline based on ChatGPT‑4o to identify meaningful keypoints from both the point cloud and rendered images of each mesh. This pipeline is implemented in src/scripts/prepare_dataset.py.
To run the preparation pipeline, supply an input directory of OBJ meshes, a metadata file describing which objects to process, an output directory and an API key for ChatGPT. The script writes a subdirectory for each processed object containing:
- `mesh.obj` – the nominal mesh used by APAP.
- `alignment_ref.obj` and `target.obj` – centred reference meshes used for evaluation.
- `handle_points.txt` and/or `anchor_points.txt` – lists of keypoint indices and coordinates returned by the VLM. The presence of these files is governed by the `top_down_anchors` field in the metadata.
Run the pipeline as follows:
conda activate apap
export API_KEY=<your_chatgpt_api_key>
python -m src.scripts.prepare_dataset \
--input_dir <preparation_dir> \
--output_dir <output_dir> \
--api_key $API_KEY \
--metadata metadata.json

The `preparation_dir` should contain the OBJ meshes to be processed. The `output_dir` will be populated with one subdirectory per mesh, each holding the APAP input files mentioned above. After preparation, set `apap_output_dir` in your task-specific YAML file to point to this `output_dir` so that the optimiser reads the correct anchor and handle files.
The metadata.json file is a list of objects with flags controlling the VLM prompts. For example:
[
{
"object_name": "obj1",
"red_team_asset": true,
"top_down_anchors": true
},
{
"object_name": "obj2",
"red_team_asset": true,
"top_down_anchors": false
}
]

The `red_team_asset` flag selects which objects to process. When `top_down_anchors` is true, the anchor points written to `anchor_points.txt` are chosen so that the base of the object remains stable. For tasks other than grasping (e.g. insertion), you may need to adjust the prompts used by the VLM; edit the `"task"` field in `src/utils/vlm_utils.py` accordingly.
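For illustration, the selection logic these flags imply can be sketched as follows. This is a hypothetical helper, not the repository's `prepare_dataset` implementation:

```python
import json

def select_red_team_objects(metadata_path):
    """Return (object_name, top_down_anchors) pairs for objects flagged
    with red_team_asset (an illustrative sketch, not the repo's code)."""
    with open(metadata_path) as f:
        metadata = json.load(f)
    return [(e["object_name"], e.get("top_down_anchors", False))
            for e in metadata
            if e.get("red_team_asset", False)]
```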
The repository defines several entry points under `src/`, each paired with its own Hydra configuration file under `conf/` (e.g. `cem_drawer.yaml`, `cem_manipulation.yaml`, `cem_grasping.yaml`). Before running any of these commands, train or load the corresponding policy using the `src/isaacgym` submodule, and update the task-specific YAML with the correct checkpoint path. Some tasks require additional setup; those instructions are collated into the task-specific documentation in `src/isaacgym/docs`.
conda activate isaacgym
python -m src.cem_drawer

This launches the TOPDM optimiser for the drawer-opening environment. It reads the Hydra configuration from `conf/cem_drawer.yaml` and writes deformed meshes to `${output_dir}` (derived from `io.apap_output_dir`). Parquet logs are stored in `${io.parquet_dataset}`. Before running, set `sim.policy.task: FrankaCabinet` and point `sim.policy.checkpoint` to the state-based drawer policy trained in `src/isaacgym`. The objectives for this task should reflect opening distance; set `sim.objectives: ["Opening Distance"]` and adjust weights and directions as appropriate. Follow the drawer task instructions in the isaacgym submodule to obtain the policy checkpoint.
conda activate isaacgym
python -m src.cem_manipulation

This runs the red-teaming pipeline for the peg insertion task using a state-based policy. The default `conf/cem_manipulation.yaml` sets `sim.policy.task: IndustRealTaskPegsInsert` and `sim.objectives: ["Insertion Success"]`, and expects a checkpoint. Update `sim.policy.checkpoint` to the correct insertion policy file from `src/isaacgym` before running. Outputs are logged to the same directories as above.
conda activate isaacgym
python -m src.cem_manipulation +sim.policy.pcd_policy=True

This command enables the point-cloud observation mode for the insertion policy by setting `sim.policy.pcd_policy=True` via Hydra. Ensure that the insertion policy in `src/isaacgym` has been trained in point-cloud mode and update `sim.policy.checkpoint` accordingly. Other configuration keys remain the same as the state-based insertion run.
conda activate isaacgym
python -m src.cem_grasping

This entry point red-teams a grasp-and-pick policy. The optimisation loop samples deformations, calls APAP, assembles the objects, and evaluates them using the grasp simulation. Set `sim.policy.task: GraspTaskPick`. A checkpoint is not required for this task because the weights for Contact Graspnet are automatically loaded by the task script. The `sim.objectives` list should include `"Grasp Success"` with corresponding weights and directions. Outputs mirror the previous tasks.
Each entry point has a dedicated configuration file under conf/, named to match the Python entry point:
- `conf/cem_drawer.yaml` – for drawer opening
- `conf/cem_manipulation.yaml` – for insertion (state-based or point-cloud via `+sim.policy.pcd_policy=True`)
- `conf/cem_grasping.yaml` – for grasping
All three share the same schema for io, sim, and hparams. You will typically only need to edit:
- Paths in `io`: point to your prepared assets and output directories.
- `sim.policy.checkpoint`: set to the trained policy checkpoint (if required for the task).
- Objectives in `sim.objectives`: use the defaults provided in each YAML unless you have custom evaluation goals.
Example (conf/cem_manipulation.yaml):
io:
source_mesh_file: "mesh.obj"
apap_output_dir: "~/Documents/data/apap_data/usb"
apap_out_root: "~/Documents/data/apap_outputs" # output directories for mesh deformations
parquet_dataset: "~/Documents/data/parquet_dataset"
sim:
objectives: ["Insertion Success"]
objective_weights: [1.0]
objective_directions: ["min"]
policy:
task: IndustRealTaskPegsInsert
checkpoint: "/path/to/trained_policy.pth"
hparams:
approach: topdm
seed: 42
max_iters: 10
population_size: 10
elite_frac: 0.4
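As a quick sanity check of the search budget these hyperparameters imply (back-of-the-envelope arithmetic, not repository code):

```python
population_size = 10   # candidates sampled per iteration (hparams.population_size)
elite_frac = 0.4       # fraction kept to refit the sampler (hparams.elite_frac)
max_iters = 10         # optimisation iterations (hparams.max_iters)

n_elite = int(elite_frac * population_size)   # elites kept each iteration
total_evals = population_size * max_iters     # candidate shapes evaluated per run
print(n_elite, total_evals)  # → 4 100
```

So a run at these defaults evaluates 100 deformed shapes in simulation, refitting the sampling distribution to the best 4 candidates after each batch of 10.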
After running red‑teaming experiments, you can aggregate and analyse the results using the scripts in src/scripts/. Replace paths as appropriate for your environment.
Aggregate per‑object statistics:
python analyze_data.py \
--parquet-dir ~/Documents/data/parquet_dataset \
--baseline-file ../../results.json \
--output-file ../../data_stats.jsonl

This script reads all Parquet files in `--parquet-dir` (produced during red-teaming), compares them against a JSON of baseline success rates (`--baseline-file`, one object per entry), and writes aggregated statistics to `--output-file`. The baseline JSON should contain nominal success rates and shape complexities for each object; consult our evaluation protocol for its exact format.
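The aggregation at the heart of this step can be illustrated with a toy frame. The column names below are assumptions about the log schema, not the repository's guaranteed format; on real logs you would start from `pd.read_parquet` instead of the hand-built frame:

```python
import pandas as pd

# Toy stand-in for the Parquet logs; columns ("object", "iteration",
# "success") are assumed for illustration.
logs = pd.DataFrame({
    "object":    ["usb", "usb", "usb", "plug", "plug", "plug"],
    "iteration": [0, 1, 2, 0, 1, 2],
    "success":   [0.9, 0.5, 0.2, 0.8, 0.6, 0.3],
})

# Per-object success rate at each iteration, then the overall drop from
# the first to the last iteration - the kind of per-object statistic the
# analysis script compares against the baseline file.
per_iter = logs.groupby(["object", "iteration"])["success"].mean().unstack()
final_drop = per_iter.iloc[:, 0] - per_iter.iloc[:, -1]
```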
Compute high‑level metrics:
python metrics.py ../../data_stats.jsonl

This prints the mean area under the degradation curve, the final drop in success rate, the number of iterations until the first catastrophic failure, and the median Δ shape complexity across all objects in the JSONL file.
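One way these summaries can be computed, assuming the degradation curve is the success rate per optimisation iteration and treating zero success as a catastrophic failure (our assumptions for illustration, not the script's exact definitions):

```python
import numpy as np

def degradation_metrics(success_per_iter):
    """Sketch of the per-object summary metrics under the assumptions above."""
    s = np.asarray(success_per_iter, dtype=float)
    # Trapezoidal area under the success-vs-iteration curve (unit step).
    auc = float(np.sum((s[:-1] + s[1:]) / 2.0))
    final_drop = float(s[0] - s[-1])      # initial minus final success rate
    fails = np.flatnonzero(s == 0.0)      # iterations with zero success
    first_fail = int(fails[0]) if fails.size else None
    return auc, final_drop, first_fail

degradation_metrics([1.0, 0.5, 0.0])  # → (1.0, 1.0, 2)
```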
Render mesh GIFs:
python render_all.py <path_to_mesh_dir> <output_dir>

This renders turntable GIFs for all `.obj` files in `<path_to_mesh_dir>` using Open3D and writes them to `<output_dir>`. Use this for qualitative inspection of CrashShapes.
The repository is organised as follows:
- `conf/` – Hydra configuration files for all entry points.
- `src/` – Python source code. Key entry points live in `src/cem_drawer.py`, `src/cem_manipulation.py` and `src/cem_grasping.py`. The core optimisation logic is in `src/core/optimization/` and `src/core/orchestration/`. Evaluators for different environments are under `src/core/evaluation/`.
- `src/scripts/` – analysis and utility scripts. Notable ones include `analyze_data.py`, `metrics.py` and `render_all.py` for post-processing; `mesh_utils.py` and `shape_complexity.py` for mesh manipulation; and `prepare_dataset.py` for asset preparation using the VLM pipeline.
- `src/apap/` and `src/isaacgym/` – git submodules providing the deformation and simulation back-ends. These are external repositories; follow their documentation for installation and testing.
Short answer: a modern NVIDIA GPU with at least 24 GB of VRAM is recommended.
Details:
- Reference setup used for the paper: four NVIDIA RTX A6000 GPUs.
- Single-GPU configurations verified: RTX 4090 and RTX 3090.
- Architectural compatibility: tested on modern NVIDIA architectures with SM 86+.
Guidance:
- A single 24 GB+ GPU is sufficient to run the code and reproduce our experiments at the recommended scales.
- Multi-GPU setups improve throughput and wall-clock time but are not required for correctness.
- If you intend to match the scale of our reported runs, plan for resources comparable to the reference setup above.
We thank the authors of the following repositories for open-sourcing their code, which our codebase is built upon:
- APAP: https://as-plausible-as-possible.github.io/
- IsaacGym: https://github.com/isaac-sim/IsaacGymEnvs/tree/automate
If you build on this work, please cite the associated CoRL 2025 paper.
@misc{goel2025geometricredteamingroboticmanipulation,
title={Geometric Red-Teaming for Robotic Manipulation},
author={Divyam Goel and Yufei Wang and Tiancheng Wu and Guixiu Qiao and Pavel Piliptchak and David Held and Zackory Erickson},
year={2025},
eprint={2509.12379},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2509.12379},
}