
Commit 909d688

Merge pull request #34 from robomonkey-vla/main
RoboMonkey Paper
2 parents da27db6 + a7f541f commit 909d688

File tree

9 files changed: +115 −11 lines


_blogs/robomonkey.md

Lines changed: 43 additions & 0 deletions

@@ -0,0 +1,43 @@
+---
+title: "RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models"
+authors:
+  - key: jackykwok
+    affiliation: Stanford University
+  - name: Christopher Agia
+    affiliation: Stanford University
+  - name: Rohan Sinha
+    affiliation: Stanford University
+  - name: Matt Foutter
+    affiliation: Stanford University
+  - name: Shulu Li
+    affiliation: UC Berkeley
+  - key: ionstoica
+    affiliation: UC Berkeley
+  - key: azaliamirhoseini
+    affiliation: Stanford University
+  - key: marcopavone
+    affiliation: Stanford, NVIDIA
+tags:
+  - robotics
+  - machine learning
+  - generative ai
+  - scaling laws
+venue: preprint
+year: 2025
+date: 2025-06-21
+teaser: A monkey with robotic arms
+redirect: https://robomonkey-vla.github.io/
+materials:
+  - name: Paper
+    url: https://arxiv.org/abs/2506.17811
+    type: file-pdf
+  - name: Codebase
+    url: https://github.com/robomonkey-vla/RoboMonkey
+    type: code
+  - name: Datasets and Models
+    url: https://huggingface.co/robomonkey-vla
+    type: database
+  - name: Serving Engine
+    url: https://github.com/robomonkey-vla/sglang-vla
+    type: code
+---

_data/people.yml

Lines changed: 26 additions & 11 deletions

@@ -19,6 +19,21 @@ shayantalaei:
   url: https://www.linkedin.com/in/shayan-talaei-6b65a0229/
   title: PhD Student
 
+anneouyang:
+  name: Anne Ouyang
+  url: https://anneouyang.com/
+  title: PhD Student
+
+simonguo:
+  name: Simon Guo
+  url: https://simonguo.tech/
+  title: PhD Student
+
+jackykwok:
+  name: Jacky Kwok
+  url: https://www.linkedin.com/in/jackykwok02/
+  title: PhD Student
+
 # Visiting
 
 bradleybrown:
@@ -81,17 +96,6 @@ lukelee:
   affiliation: University College London
   not_current: True
 
-# Rotating
-anneouyang:
-  name: Anne Ouyang
-  url: https://anneouyang.com/
-  title: Rotating PhD Student
-
-simonguo:
-  name: Simon Guo
-  url: https://simonguo.tech/
-  title: Rotating PhD Student
-
 # Collaborating professors
 
 percyliang:
@@ -100,6 +104,17 @@ percyliang:
   title: Professor
   not_current: True
 
+marcopavone:
+  name: Marco Pavone
+  url: https://research.nvidia.com/person/marco-pavone
+  title: Professor
+  not_current: True
+
+ionstoica:
+  name: Ion Stoica
+  url: https://people.eecs.berkeley.edu/~istoica/
+  title: Professor
+  not_current: True
 # Alumni
 
 #example:

_pubs/robomonkey.md

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
+---
+title: "RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models"
+authors:
+  - key: jackykwok
+    affiliation: Stanford University
+  - name: Christopher Agia
+    affiliation: Stanford University
+  - name: Rohan Sinha
+    affiliation: Stanford University
+  - name: Matt Foutter
+    affiliation: Stanford University
+  - name: Shulu Li
+    affiliation: UC Berkeley
+  - key: ionstoica
+    affiliation: UC Berkeley
+  - key: azaliamirhoseini
+    affiliation: Stanford University
+  - key: marcopavone
+    affiliation: Stanford, NVIDIA
+venue: preprint
+year: 2025
+date: 2025-06-21
+has_pdf: true
+doi: 10.48550/arXiv.2506.17811
+tags:
+  - robotics
+  - machine learning
+  - generative ai
+  - scaling laws
+teaser: RoboMonkey is a test-time scaling framework that improves the robustness and generalization of Vision-Language-Action (VLA) models. RoboMonkey achieves significant performance improvements across both in-distribution and out-of-distribution tasks, as well as on new robot setups. Our findings show that scaling test-time compute through a generate-then-verify paradigm provides a practical and effective path towards building general-purpose robotics foundation models.
+materials:
+  - name: Paper
+    url: https://arxiv.org/abs/2506.17811
+    type: file-pdf
+  - name: Codebase
+    url: https://github.com/robomonkey-vla/RoboMonkey
+    type: code
+  - name: Datasets and Models
+    url: https://huggingface.co/robomonkey-vla
+    type: database
+  - name: Serving Engine
+    url: https://github.com/robomonkey-vla/sglang-vla
+    type: code
+---
+
+Vision-Language-Action (VLA) models have demonstrated remarkable capabilities in visuomotor control, yet ensuring their robustness in unstructured real-world environments remains a persistent challenge. In this paper, we investigate test-time scaling through the lens of sampling and verification as a means to enhance the robustness and generalization of VLAs. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on these insights, we introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbation and majority voting to construct an action proposal distribution, and then uses a Vision Language Model (VLM)-based verifier to select the optimal action. We propose a synthetic data generation pipeline for training such VLM-based action verifiers, and demonstrate that scaling the synthetic dataset consistently improves verification and downstream accuracy. Through extensive simulated and hardware experiments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and 9% on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
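The generate-then-verify loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `policy` and `verifier` are hypothetical stand-ins for the VLA and the VLM-based action verifier, and the consensus step (a component-wise median here) is a simplification of the paper's Gaussian perturbation and majority-voting procedure, not its exact implementation.

```python
import random
import statistics


def sample_actions(n, policy, rng):
    # Draw n candidate actions from the (stubbed) VLA policy.
    return [policy(rng) for _ in range(n)]


def gaussian_perturb(actions, sigma, rng):
    # Jitter each action component with zero-mean Gaussian noise.
    return [[a_i + rng.gauss(0.0, sigma) for a_i in a] for a in actions]


def consensus_center(actions):
    # Component-wise median as a simple majority-vote proposal.
    return [statistics.median(dim) for dim in zip(*actions)]


def generate_then_verify(policy, verifier, n_samples=8, sigma=0.01, seed=0):
    # 1) Sample a small set of actions from the policy.
    # 2) Build an action proposal distribution (perturbed samples + consensus).
    # 3) Let the verifier score proposals and return the best one.
    rng = random.Random(seed)
    raw = sample_actions(n_samples, policy, rng)
    proposals = gaussian_perturb(raw, sigma, rng)
    proposals.append(consensus_center(raw))
    return max(proposals, key=verifier)
```

For example, with a toy 2-D policy that samples around a target action and a verifier that scores negative squared distance to that target, the selected action lands close to the target because the consensus proposal averages out the sampling noise.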

imgs/people/ionstoica.jpg (1.31 MB)

imgs/people/jackykwok.jpg (128 KB)

imgs/people/marcopavone.jpg (240 KB)

imgs/teasers/robomonkey.png (640 KB)

imgs/thumbs/robomonkey.png (797 KB)

pubs/robomonkey.pdf (9.25 MB, binary file not shown)

0 commit comments
