Tokasaurus is written in pure Python with no custom kernels (although we do use attention and sampling ops from the excellent FlashInfer [link] package). We hope that this makes the engine easier to fork and hack on, a la GPT-fast [link].
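
To give a concrete sense of what "pure Python on top of FlashInfer" looks like, here is a minimal standalone sketch that calls one of FlashInfer's attention ops directly from PyTorch. This is illustrative only: it is not Tokasaurus's actual code path (the engine works with batched, paged KV caches), and FlashInfer's exact API may vary between versions.

```python
# Minimal sketch: calling a FlashInfer attention op from plain Python/PyTorch.
# Illustrative only -- not Tokasaurus's actual code path; FlashInfer's API may
# differ between versions. Requires a CUDA GPU.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128  # grouped-query attention
q_len, kv_len = 16, 1024

q = torch.randn(q_len, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Causal prefill attention over the KV cache, computed by FlashInfer's kernel.
out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
print(out.shape)  # (q_len, num_qo_heads, head_dim)
```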
## Benchmarking Details
The commands for reproducing our benchmarks are available here [link]. For each benchmark, we configure all engines with the same KV cache size and maximum number of running requests. We’ve made a best effort to tune each engine’s remaining parameters. We report the average throughput across runs after completing a warmup run. For each benchmark, all engines are run on the same machine.
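
As a small illustration of the reporting methodology (not the actual benchmark harness), the sketch below averages throughput over several timed runs after discarding a warmup run; `send_all_requests` is a hypothetical callable standing in for the linked benchmark scripts, and the number of timed runs is a placeholder.

```python
import time
from typing import Callable

def run_once(send_all_requests: Callable[[], int]) -> float:
    """Fire the full request set at the engine once; return tokens/sec.

    `send_all_requests` is a hypothetical callable that sends every request
    and returns the total number of generated tokens.
    """
    start = time.perf_counter()
    total_tokens = send_all_requests()
    return total_tokens / (time.perf_counter() - start)

def average_throughput(send_all_requests: Callable[[], int], num_runs: int = 3) -> float:
    run_once(send_all_requests)  # warmup run; result discarded
    runs = [run_once(send_all_requests) for _ in range(num_runs)]
    return sum(runs) / len(runs)  # mean throughput across timed runs
```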
We use this script from SGLang [link] for our ShareGPT benchmarks and this custom script [link] for the Large Language Monkeys benchmark. To standardize our benchmarking scripts and interface, all experiments send requests through the OpenAI API. We also experimented with vLLM’s Python API (i.e. `LLM.generate()`) on the Large Language Monkeys benchmark with Llama-1B and measured roughly a 5% throughput increase.
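
For illustration, here is a minimal sketch of sending requests to any of the engines through their OpenAI-compatible endpoints with the `openai` Python client; the port, model name, and sampling parameters below are placeholders rather than the settings used in our benchmarks.

```python
from openai import OpenAI

# Any engine exposing an OpenAI-compatible server can be benchmarked through
# the same interface; only base_url changes. Port and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Large Language Monkeys-style request: many independent samples per prompt.
response = client.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    prompt="Write a Python function that checks whether a number is prime.",
    max_tokens=256,
    temperature=0.6,
    n=8,  # number of samples drawn for this prompt
)
print(len(response.choices))  # 8 completions
```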
## Acknowledgements
Huge thanks to Prime Intellect and Together AI for providing us with compute for this project.

How to cite? If you use our dataset or code, please cite the following paper:
```bibtex
@misc{juravsky2025tokasaurus,
  author = {},
  title = {Tokasaurus: An Inference Engine for High-Throughput Workloads},