
Commit a57048d: "tok blog changes"
Parent: b6f9c34

File tree

8 files changed (+47, -13 lines)


_blogs/monkeys.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: "Monkey Business: a dataset of large LLM sample collections for math and code tasks"
+title: "Monkey Business: a Dataset of Large LLM Sample Collections for Math and Code Tasks"
 authors:
 - key: bradleybrown
   affiliation: University of Oxford

_blogs/tokasaurus.md

Lines changed: 9 additions & 12 deletions
@@ -1,5 +1,5 @@
 ---
-title: "Tokasaurus: An inference engine for high-throughput workloads"
+title: "Tokasaurus: An Inference Engine for High-Throughput Workloads"
 authors:
 - key: jordanjuravsky
 tags:
@@ -8,7 +8,7 @@ tags:
 venue: none
 year: 2025
 date: 2025-06-05
-teaser:
+teaser: The saurus with a toke
 materials:
 - name: Codebase
   url: https://github.com/ScalingIntelligence/tokasaurus
@@ -100,13 +100,13 @@ Currently, we support models from the Llama-3 and Qwen-2 families and support an
 
 Tokasaurus is written in pure Python with no custom kernels (although we do use attention and sampling ops from the excellent FlashInfer [link] package). We hope that this makes the engine easier to fork and hack on, a la GPT-fast [link].
 
-## Benchmarking Details:
+## Benchmarking Details
 
 The commands for reproducing our benchmarks are available here [link]. For each benchmark, we configure all engines with the same KV cache size and maximum number of running requests. We’ve made a best effort to tune each engine’s remaining parameters. We report the average throughput across runs after completing a warmup run. For each benchmark, all engines are run on the same machine.
 
 We use this script from SGLang [link] for our ShareGPT benchmarks and this custom script [link] for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through the OpenAI API. We also experimented with vLLM’s Python API (i.e. `LLM.generate()`) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase.
 
-## Acknowledgements:
+## Acknowledgements
 
 Huge thanks to Prime Intellect and Together AI for providing us with compute for this project.
 
@@ -117,13 +117,10 @@ Also, we’re grateful to Dan Biderman, Simon Guo, Manat Kaur, and Avanika Naray
 <p>How to cite? If you use our dataset or code, please cite the following paper:</p>
 
 ```bibtex
-@misc{brown2024largelanguagemonkeysscaling,
-  title={Large Language Monkeys: Scaling Inference Compute with Repeated Sampling},
-  author={Bradley Brown and Jordan Juravsky and Ryan Ehrlich and Ronald Clark and Quoc V. Le and Christopher Ré and Azalia Mirhoseini},
-  year={2024},
-  eprint={2407.21787},
-  archivePrefix={arXiv},
-  primaryClass={cs.LG},
-  url={https://arxiv.org/abs/2407.21787},
+@misc{juravsky2025tokasaurus,
+  author = {},
+  title = {Tokasaurus: An Inference Engine for High-Throughput Workloads},
+  year = {2025},
+  howpublished = {\url{https://scalingintelligence.stanford.edu/blogs/tokasaurus/}}
 }
 ```
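As a hedged illustration of the benchmarking setup described in the diff above (all experiments send requests through the OpenAI API to the engine under test), here is a minimal Python sketch of one such request. The base URL, port, model name, prompt, and sampling parameters are illustrative assumptions, not values taken from this commit or the Tokasaurus benchmark scripts.

```python
# Minimal sketch: querying a locally served engine through its
# OpenAI-compatible completions endpoint. All concrete values below
# (URL, port, model id, prompt, sampling settings) are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local inference server
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",  # hypothetical model id
    prompt="Write a short proof that the sum of two even numbers is even.",
    max_tokens=256,
    temperature=0.8,
)
print(response.choices[0].text)
```

Routing every engine through the same OpenAI-style HTTP interface keeps the benchmark client identical across systems, at the cost of a small serving overhead relative to an in-process API such as vLLM's `LLM.generate()`.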

_pubs/codemonkeys.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+---
+title: 'CodeMonkeys: Scaling Test-Time Compute for Software Engineering'
+authors:
+- key: ryanehrlich
+  equal: true
+- key: bradleybrown
+  affiliation: University of Oxford
+  equal: true
+- key: jordanjuravsky
+  equal: true
+- name: Ronald Clark
+  affiliation: University of Oxford
+- name: Christopher Ré
+  affiliation: Stanford
+- key: azaliamirhoseini
+venue: preprint
+year: 2025
+day: 212
+has_pdf: true
+doi: 10.48550/arXiv.2407.21787
+tags:
+- machine learning
+- generative AI
+teaser: CodeMonkeys is a system designed to leverage inference time compute in order to solve Github issues in the SWE-Bench Verified dataset.
+materials:
+- name: Paper (TODO)
+  url:
+  type: file-pdf
+- name: Codebase (TODO)
+  url: https://github.com/ScalingIntelligence/codemonkeys
+  type: code
+- name: Trajectory Data (TODO)
+  url:
+  type: code
+  display_name: Trajectory Data
+---
+Scaling test-time compute represents a new axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we present CodeMonkeys, a system designed to leverage test-time compute in order to solve real-world GitHub issues from the SWE-bench dataset. Our approach scales both serial and parallel test-time compute by sampling independent multi-turn trajectories that each iterate in response to execution feedback. Leveraging the ability to amortize up-front costs across multiple downstream samples, we identify relevant codebase context by simply letting a model scan every file. In order to decide among multiple candidate edits, we introduce a selection mechanism combining model-written unit tests with a final multi-turn loop dedicated to selection. Overall, CodeMonkeys solves 57.4\% of issues from SWE-bench Verified with a budget of approximately 2500 USD. When testing our selection method on an ensemble of edits from existing top submissions from the SWE-bench leaderboard, our score increases to 66.2\%.
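As a rough structural illustration of the approach the abstract above describes (parallel, independent multi-turn trajectories that each revise a candidate edit in response to execution feedback), here is a hedged Python sketch. The helper functions, budgets, and data layout are hypothetical placeholders, not the CodeMonkeys implementation.

```python
# Hedged sketch of combining parallel and serial test-time compute:
# several independent trajectories, each iterating on execution feedback.
# All helpers and constants below are placeholders, not CodeMonkeys code.
from dataclasses import dataclass
from typing import List, Optional, Tuple

NUM_TRAJECTORIES = 8  # parallel axis: independent samples (assumed value)
MAX_TURNS = 4         # serial axis: feedback-driven revisions (assumed value)

@dataclass
class Trajectory:
    edit: str
    passed: bool

def generate_edit(issue: str, context: str, feedback: Optional[str]) -> str:
    """Placeholder for an LLM call that drafts or revises a code edit."""
    raise NotImplementedError

def run_tests(edit: str) -> Tuple[bool, str]:
    """Placeholder that applies the edit and returns (passed, execution feedback)."""
    raise NotImplementedError

def solve(issue: str, context: str) -> List[Trajectory]:
    trajectories = []
    for _ in range(NUM_TRAJECTORIES):      # parallel sampling
        feedback: Optional[str] = None
        edit, passed = "", False
        for _ in range(MAX_TURNS):         # serial iteration on feedback
            edit = generate_edit(issue, context, feedback)
            passed, feedback = run_tests(edit)
            if passed:
                break
        trajectories.append(Trajectory(edit, passed))
    return trajectories  # candidates then go to a separate selection step
```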

imgs/blog/tokasaurus/saurus.png (579 KB)

imgs/teasers/codemonkeys.png (387 KB)

imgs/teasers/tokasaurus.png (3.09 MB)

imgs/thumbs/codemonkeys.png (705 KB)

imgs/thumbs/tokasaurus.png (3.09 MB)
