
Commit a7f541f

upload robomonkey
1 parent 99a01b1 commit a7f541f

5 files changed: +64 −12 lines

_blogs/robomonkey.md

Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@
+---
+title: "RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models"
+authors:
+  - key: jackykwok
+    affiliation: Stanford University
+  - name: Christopher Agia
+    affiliation: Stanford University
+  - name: Rohan Sinha
+    affiliation: Stanford University
+  - name: Matt Foutter
+    affiliation: Stanford University
+  - name: Shulu Li
+    affiliation: UC Berkeley
+  - key: ionstoica
+    affiliation: UC Berkeley
+  - key: azaliamirhoseini
+    affiliation: Stanford University
+  - key: marcopavone
+    affiliation: Stanford, NVIDIA
+tags:
+  - robotics
+  - machine learning
+  - generative ai
+  - scaling laws
+venue: preprint
+year: 2025
+date: 2025-06-21
+teaser: A monkey with robotic arms
+redirect: https://robomonkey-vla.github.io/
+materials:
+  - name: Paper
+    url: https://arxiv.org/abs/2506.17811
+    type: file-pdf
+  - name: Codebase
+    url: https://github.com/robomonkey-vla/RoboMonkey
+    type: code
+  - name: Datasets and Models
+    url: https://huggingface.co/robomonkey-vla
+    type: database
+  - name: Serving Engine
+    url: https://github.com/robomonkey-vla/sglang-vla
+    type: code
+---

_data/people.yml

Lines changed: 10 additions & 11 deletions

@@ -19,6 +19,16 @@ shayantalaei:
   url: https://www.linkedin.com/in/shayan-talaei-6b65a0229/
   title: PhD Student

+anneouyang:
+  name: Anne Ouyang
+  url: https://anneouyang.com/
+  title: PhD Student
+
+simonguo:
+  name: Simon Guo
+  url: https://simonguo.tech/
+  title: PhD Student
+
 jackykwok:
   name: Jacky Kwok
   url: https://www.linkedin.com/in/jackykwok02/
@@ -86,17 +96,6 @@ lukelee:
   affiliation: University College London
   not_current: True

-# Rotating
-anneouyang:
-  name: Anne Ouyang
-  url: https://anneouyang.com/
-  title: Rotating PhD Student
-
-simonguo:
-  name: Simon Guo
-  url: https://simonguo.tech/
-  title: Rotating PhD Student
-
 # Collaborating professors

 percyliang:

_pubs/robomonkey.md

Lines changed: 11 additions & 1 deletion

@@ -26,11 +26,21 @@ tags:
   - robotics
   - machine learning
   - generative ai
+  - scaling laws
 teaser: RoboMonkey is a test-time scaling framework that improves the robustness and generalization of Vision-Language-Action (VLA) models. RoboMonkey achieves significant performance improvements across both in-distribution and out-of-distribution tasks, as well as on new robot setups. Our findings show that scaling test-time compute through a generate-then-verify paradigm provides a practical and effective path towards building general-purpose robotics foundation models.
 materials:
   - name: Paper
     url: https://arxiv.org/abs/2506.17811
     type: file-pdf
+  - name: Codebase
+    url: https://github.com/robomonkey-vla/RoboMonkey
+    type: code
+  - name: Datasets and Models
+    url: https://huggingface.co/robomonkey-vla
+    type: database
+  - name: Serving Engine
+    url: https://github.com/robomonkey-vla/sglang-vla
+    type: code
 ---

-Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based action verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 8% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
+Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based action verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 9% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.

imgs/people/jackykwok.jpg

-136 KB

pubs/robomonkey.pdf

9.25 MB
Binary file not shown.

0 commit comments
