Commit 4214411

Weaver Paper Upload
1 parent 03118b9 commit 4214411

8 files changed, 63 additions and 5 deletions

_data/people.yml

Lines changed: 10 additions & 5 deletions
@@ -28,12 +28,17 @@ ryanehrlich:
   url: https://www.linkedin.com/in/ryan-ehrlich-68a60b11a/
   title: Researcher
 
+# Masters
+
 shloknatarajan:
   name: Shlok Natarajan
   url: https://www.linkedin.com/in/shloknatarajan/
   title: Master's Researcher
 
-# Masters
+brendanmclaughlin:
+  name: Brendan McLaughlin
+  url: https://www.brendanmclaughlin.me/
+  title: Master's Researcher
 
 # Undergrads
 
@@ -42,10 +47,10 @@ tanvirbhathal:
   url: https://www.linkedin.com/in/tanvir-bhathal/
   title: Undergraduate Student
 
-robbymanihani:
-  name: Robby Manihani
-  url: https://www.linkedin.com/in/grmanihani/
-  title: Undergraduate Student
+# robbymanihani:
+# name: Robby Manihani
+# url: https://www.linkedin.com/in/grmanihani/
+# title: Undergraduate Student
 
 caiacostello:
   name: Caia Costello

_pubs/weaver.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
+---
+title: 'Weaver: Shrinking the Generation-Verification Gap with Weak Verifiers'
+authors:
+  - key: jonsaadfalcon
+    equal: true
+    affiliation: Stanford
+  - name: E. Kelly Buchanan
+    equal: true
+    affiliation: Stanford University
+  - name: Mayee F. Chen
+    equal: true
+    affiliation: Stanford University
+  - name: Tzu-Heng Huang
+    affiliation: University of Wisconsin-Madison
+  - name: Brendan McLaughlin
+    affiliation: Stanford University
+  - key: tanvirbhathal
+  - name: Shang Zhu
+    affiliation: Together AI
+  - name: Ben Athiwaratkun
+    affiliation: Together AI
+  - name: Frederic Sala
+    affiliation: University of Wisconsin-Madison
+  - name: Scott Linderman
+    affiliation: Stanford University
+  - key: azaliamirhoseini
+    affiliation: Stanford University
+  - name: Christopher Ré
+    affiliation: Stanford University
+venue: preprint
+year: 2025
+date: 2025-06-24
+has_pdf: true
+doi: 10.48550/arXiv.2506.18203
+tags:
+  - machine learning
+  - generative ai
+teaser: Weaver boosts language model performance by intelligently combining weak verifiers using weak supervision, achieving near-oracle accuracy with drastically reduced compute.
+materials:
+  - name: Paper
+    url: https://arxiv.org/abs/2506.18203
+    type: file-pdf
+  - name: Codebase
+    url: https://github.com/HazyResearch/scaling-verification
+    type: code
+  - name: Blog post
+    url: https://hazyresearch.stanford.edu/blog/2025-06-18-weaver
+    type: link
+  - name: Datasets and Models
+    url: https://huggingface.co/collections/hazyresearch/weaver-683798010b39c9653ddb9bd8
+    type: database
+---
+Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1 (performance when selecting the first candidate) across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.
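Reviewer note: the abstract describes normalizing verifier outputs, filtering low-quality verifiers, and combining the rest into one weighted score used to pick a candidate. The following is only a minimal sketch of that general idea, not the code from the linked repository. It assumes per-verifier accuracy estimates are already available (Weaver estimates them with weak supervision, which is not shown here), and it swaps in a simple log-odds weighting. The names `normalize`, `select_response`, and `min_accuracy` are hypothetical.

```python
import math
from typing import Dict, List


def normalize(scores: List[float]) -> List[float]:
    """Min-max scale one verifier's raw scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant verifier carries no ranking information.
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def select_response(
    raw_scores: Dict[str, List[float]],
    accuracy: Dict[str, float],
    min_accuracy: float = 0.55,
) -> int:
    """Return the index of the candidate with the highest combined score.

    raw_scores maps verifier name -> one score per candidate response.
    accuracy maps verifier name -> estimated accuracy (assumed given).
    """
    # Filter out verifiers whose estimated accuracy is near chance.
    kept = {v: a for v, a in accuracy.items() if a >= min_accuracy}
    if not kept:
        raise ValueError("no verifier passed the accuracy filter")

    n = len(next(iter(raw_scores.values())))
    combined = [0.0] * n
    for name, acc in kept.items():
        acc = min(acc, 1 - 1e-6)  # avoid division by zero at acc == 1
        weight = math.log(acc / (1.0 - acc))  # log-odds weighting (assumption)
        for i, s in enumerate(normalize(raw_scores[name])):
            combined[i] += weight * s
    return max(range(n), key=combined.__getitem__)


if __name__ == "__main__":
    scores = {"strong": [1.0, 0.0], "weak": [0.0, 1.0]}
    est = {"strong": 0.9, "weak": 0.6}
    print(select_response(scores, est))
```

In the usage example the two verifiers disagree, and the candidate preferred by the more accurate verifier is selected, which is the behavior an unweighted average would not give.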

imgs/people/brendanmclaughlin.jpg

1.31 MB

imgs/people/tanvirbhathal.jpg

562 KB

imgs/people/tanvirbhathal_2.jpg

162 KB

imgs/teasers/weaver.png

264 KB

imgs/thumbs/weaver.png

2.25 MB

pubs/weaver.pdf

7.85 MB
Binary file not shown.
