Tokasaurus is written in pure Python with no custom kernels (although we do use attention and sampling ops from the excellent FlashInfer [link] package). We hope that this makes the engine easier to fork and hack on, a la GPT-fast [link].
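
To give a concrete sense of what "pure Python on top of FlashInfer" looks like, here is a minimal standalone sketch that calls one of FlashInfer's attention ops directly from PyTorch. This is illustrative only: it is not Tokasaurus's actual code path (the engine works with batched, paged KV caches), and FlashInfer's exact API may vary between versions.

```python
# Minimal sketch: calling a FlashInfer attention op from plain Python/PyTorch.
# Illustrative only -- not Tokasaurus's actual code path; FlashInfer's API may
# differ between versions. Requires a CUDA GPU.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128  # grouped-query attention
q_len, kv_len = 16, 1024

q = torch.randn(q_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Causal prefill attention over the KV cache, computed by FlashInfer's kernel.
out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
print(out.shape)  # (q_len, num_qo_heads, head_dim)
```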
## Benchmarking Details
The commands for reproducing our benchmarks are available here [link]. For each benchmark, we configure all engines with the same KV cache size and maximum number of running requests. We’ve made a best effort to tune each engine’s remaining parameters. We report the average throughput across runs after completing a warmup run. For each benchmark, all engines are run on the same machine.
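
As a small illustration of the reporting methodology (not the actual benchmark harness), the sketch below averages throughput over several timed runs after discarding a warmup run; `send_all_requests` is a hypothetical callable standing in for the linked benchmark scripts, and the number of timed runs is a placeholder.

```python
import time
from typing import Callable

def run_once(send_all_requests: Callable[[], int]) -> float:
    """Fire the full request set at the engine once; return tokens/sec.

    `send_all_requests` is a hypothetical callable that sends every request
    and returns the total number of generated tokens.
    """
    start = time.perf_counter()
    total_tokens = send_all_requests()
    return total_tokens / (time.perf_counter() - start)

def average_throughput(send_all_requests: Callable[[], int], num_runs: int = 3) -> float:
    run_once(send_all_requests)  # warmup run; result discarded
    runs = [run_once(send_all_requests) for _ in range(num_runs)]
    return sum(runs) / len(runs)  # mean throughput across timed runs
```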
We use this script from SGLang [link] for our ShareGPT benchmarks and this custom script [link] for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through the OpenAI API. We also experimented with vLLM’s Python API (i.e. `LLM.generate()`) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase.
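
For illustration, here is a minimal sketch of sending requests to any of the engines through their OpenAI-compatible endpoints with the `openai` Python client; the port, model name, and sampling parameters below are placeholders rather than the settings used in our benchmarks.

```python
from openai import OpenAI

# Any engine exposing an OpenAI-compatible server can be benchmarked through
# the same interface; only base_url changes. Port and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Large Language Monkeys-style request: many independent samples per prompt.
response = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="Write a Python function that checks whether a number is prime.",
    max_tokens=256,
    temperature=0.6,
    n=8,  # number of samples drawn for this prompt
)
print(len(response.choices))  # 8 completions
```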
## Acknowledgements
Huge thanks to Prime Intellect and Together AI for providing us with compute for this project.

How to cite? If you use our dataset or code, please cite the following paper:
```bibtex
@misc{juravsky2025tokasaurus,
  author = {},
  title = {Tokasaurus: An Inference Engine for High-Throughput Workloads},