
Commit 8c518bf

Merge branch 'scicode-bench:main' into main
2 parents: 01472b0 + d90ef17

File tree

2 files changed: +15 -3 lines changed

- .pre-commit-config.yaml
- README.md

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ repos:
       # - id: trailing-whitespace
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.5.5
+    rev: v0.6.4
     hooks:
       # Run the linter.
       - id: ruff
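
For context, the ruff hook section of `.pre-commit-config.yaml` after this bump would read roughly as follows. This is a sketch reconstructed from the context lines of the diff above; the file's other hooks (such as the commented-out trailing-whitespace check) are elided.

```yaml
repos:
  # ... other hooks in the file are elided here ...
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.4   # bumped from v0.5.5 by this commit
    hooks:
      # Run the linter.
      - id: ruff
```

Bumps like this are typically produced with `pre-commit autoupdate`, and `pre-commit run --all-files` re-runs the updated hook across the repository.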

README.md

Lines changed: 14 additions & 2 deletions
@@ -6,10 +6,15 @@
 This repo contains the evaluation code for the paper "[SciCode: A Research Coding Benchmark Curated by Scientists](https://arxiv.org/abs/2407.13168)"
 
 ## 🔔News
+
+**[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**
+
+**[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.**
+
 **[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.**
 
 ## Introduction
-SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only **4.6%** of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.
+SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only **7.7%** of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.

@@ -20,9 +25,11 @@ SciCode sources challenging and realistic research-level coding problems across
 
 | Model                 | Subproblem | Main Problem |
 |-----------------------|------------|--------------|
-| Claude3.5-Sonnet      | **26**     | **4.6**      |
+| **OpenAI o1-preview** | **28.5**   | **7.7**      |
+| Claude3.5-Sonnet      | 26         | 4.6          |
 | GPT-4o                | 25         | 1.5          |
 | GPT-4-Turbo           | 22.9       | 1.5          |
+| OpenAI o1-mini        | 22.2       | 1.5          |
 | Gemini 1.5 Pro        | 21.9       | 1.5          |
 | Claude3-Opus          | 21.5       | 1.5          |
 | Deepseek-Coder-v2     | 21.2       | 3.1          |
@@ -40,6 +47,11 @@ SciCode sources challenging and realistic research-level coding problems across
 4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/)) for more information
 5. Run `eval/scripts/test_generated_code.py` to evaluate the unittests
 
+## More information and FAQ
+
+More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
+If you have trouble reaching the website, please find the markdown source in its [github repository](https://github.com/scicode-bench/scicode-bench.github.io/tree/main/docs).
+
 ## Contact
 - Minyang Tian: mtian8@illinois.edu
 - Eliu Huerta: elihu@anl.gov
