Commit 7945c08

Merge branch 'main' into main

2 parents 64bb3c8 + 7268a4a

4 files changed: +95 -47 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,7 @@ repos:
       # - id: trailing-whitespace

   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.5.5
+    rev: v0.6.4
     hooks:
       # Run the linter.
       - id: ruff

README.md

Lines changed: 34 additions & 16 deletions
@@ -6,11 +6,19 @@
 This repo contains the evaluation code for the paper "[SciCode: A Research Coding Benchmark Curated by Scientists](https://arxiv.org/abs/2407.13168)"

 ## 🔔News
-- **[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**
-- **[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.**
+
+**[2024-11-04]: Leaderboard is on! Check [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude Sonnet 3.5 (new) results.**
+
+**[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**
+
+**[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**
+
+**[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.**
+
+**[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.**

 ## Introduction
-SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only **4.6%** of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.
+SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only **7.7%** of the problems in the most realistic setting. Broadly, SciCode reflects the realistic everyday workflow of scientists: identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards becoming helpful assistants for scientists but also helps shed light on future building and evaluation of scientific AI.


@@ -19,19 +27,24 @@ SciCode sources challenging and realistic research-level coding problems across

 ## 🏆 Leaderboard

-| Model                  | Subproblem | Main Problem |
-|------------------------|------------|--------------|
-| Claude3.5-Sonnet       | **26**     | **4.6**      |
-| GPT-4o                 | 25         | 1.5          |
-| GPT-4-Turbo            | 22.9       | 1.5          |
-| Gemini 1.5 Pro         | 21.9       | 1.5          |
-| Claude3-Opus           | 21.5       | 1.5          |
-| Deepseek-Coder-v2      | 21.2       | 3.1          |
-| Claude3-Sonnet         | 17         | 1.5          |
-| Qwen2-72B-Instruct     | 17         | 1.5          |
-| Llama-3.1-70B-Instruct | 16.3       | 1.5          |
-| Mixtral-8x22B-Instruct | 16.3       | 0            |
-| Llama-3-70B-Chat       | 14.6       | 0            |
+| Models                    | Main Problem Resolve Rate | <span style="color:grey">Subproblem</span> |
+|---------------------------|---------------------------|--------------------------------------------|
+| 🥇 OpenAI o1-preview      | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
+| 🥈 Claude3.5-Sonnet       | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
+| 🥉 Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
+| Deepseek-Coder-v2         | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
+| GPT-4o                    | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
+| GPT-4-Turbo               | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
+| OpenAI o1-mini            | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.2</div> |
+| Gemini 1.5 Pro            | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.9</div> |
+| Claude3-Opus              | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.5</div> |
+| Llama-3.1-405B-Chat       | <div align="center">**1.5**</div> | <div align="center" style="color:grey">19.8</div> |
+| Claude3-Sonnet            | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
+| Qwen2-72B-Instruct        | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
+| Llama-3.1-70B-Chat        | <div align="center">**0.0**</div> | <div align="center" style="color:grey">17.0</div> |
+| Mixtral-8x22B-Instruct    | <div align="center">**0.0**</div> | <div align="center" style="color:grey">16.3</div> |
+| Llama-3-70B-Chat          | <div align="center">**0.0**</div> | <div align="center" style="color:grey">14.6</div> |
+

 ## Instructions to evaluate a new model

@@ -41,6 +54,11 @@ SciCode sources challenging and realistic research-level coding problems across
 4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
 5. Run `eval/scripts/test_generated_code.py` to evaluate the unittests

+## More information and FAQ
+
+More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
+If you have trouble reaching the website, please find the markdown source in its [GitHub repository](https://github.com/scicode-bench/scicode-bench.github.io/tree/main/docs).
+
 ## Contact
 - Minyang Tian: mtian8@illinois.edu
 - Eliu Huerta: elihu@anl.gov

eval/scripts/README.md

Lines changed: 47 additions & 17 deletions
@@ -1,42 +1,72 @@
-## **Generate LLM code**
-
-Your first need to set up your API keys. For this, create a `keys.cfg` file at the root of the repository
-and add keys as follows:
+## **Generating Code with LLMs**
+
+### 1. Set Up Your API Keys
+
+First, create a `keys.cfg` file at the root of the repository and add your API keys for the different providers as follows:

 ```
 OPENAI_KEY = 'your_api_key'
 ANTHROPIC_KEY = 'your_api_key'
-GOOGLE_KEY = 'your_api_key' 
+GOOGLE_KEY = 'your_api_key'
+```
+
+If you're using **litellm**, which supports a variety of providers including **vllm**, **Hugging Face**, and **Together AI**, make sure to include the relevant API key in the `keys.cfg` file. Please refer to the docs [here](https://docs.litellm.ai/docs/providers). Then, use `litellm/*` as the model name when running the command.
+
+For example, to use **Together AI**'s models, you'll need to add the following to your `keys.cfg`:
+
+```
+TOGETHERAI_API_KEY = 'your_api_key'
 ```
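
The `keys.cfg` entries above are flat `KEY = 'value'` lines. For illustration only, here is a minimal Python sketch of how such a file could be parsed into a dictionary; the helper name and the parsing rules are assumptions made for this example, not the loader the eval scripts actually use:

```python
from pathlib import Path


def read_keys_cfg(path: str = "keys.cfg") -> dict:
    """Parse flat KEY = 'value' lines into a dict (illustrative sketch only)."""
    keys = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and anything malformed
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip().strip("'\"")
    return keys


# Usage: read_keys_cfg().get("OPENAI_KEY")
```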

-For example, to create model results with `gpt-4o` and the default settings, go to the root of this repo and run
+### 2. Generating Code
+
+To generate code using the **Together AI** model (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`), go to the root of this repo and run:
+
+```bash
+python eval/scripts/gencode_json.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
+```
+
+To generate code using **GPT-4o** (with default settings), go to the root of this repo and run:

 ```bash
 python eval/scripts/gencode_json.py --model gpt-4o
 ```

-For results with scientist-annotated background, run
+If you want to include **scientist-annotated background** in the prompts, use the `--with-background` flag:

 ```bash
 python eval/scripts/gencode_json.py --model gpt-4o --with-background
 ```

+Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
+
+In this repository, **we only support evaluating with previously generated code for each step.**
 ### Command-Line Arguments

-- `--model` - Specifies the model name used for generating responses.
-- `--output-dir` - Directory to store the generated code outputs (Default: `eval_results/generated_code`).
-- `--input-path` - Directory containing the JSON files describing the problems (Default: `eval/data/problems_all.jsonl`).
-- `--prompt-dir` - Directory where prompt files are saved (Default: `eval_results/prompt`).
-- `--with-background` - Include problem background if enabled.
-- `--temperature` - Controls the randomness of the generation (Default: 0).
-
-## **Evaluate generated code**
+When running the `gencode_json.py` script, you can use the following options:
+
+- `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
+- `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
+- `--input-path`: Path to the JSONL file describing the problems. Default is `eval/data/problems_all.jsonl`.
+- `--prompt-dir`: Directory where prompt files are saved. Default is `eval_results/prompt`.
+- `--with-background`: If enabled, includes the scientist-annotated background in the prompts.
+- `--temperature`: Controls the randomness of the output. Default is 0.
+
+---
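
These flags compose freely on the command line. As a convenience, a small driver can sweep several models with the same settings by shelling out to `gencode_json.py`; the sketch below is a hypothetical helper (not part of the repository) that uses only the flags documented above:

```python
import subprocess

# Hypothetical batch driver; run from the repository root.
MODELS = [
    "gpt-4o",
    "litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
]

for model in MODELS:
    cmd = [
        "python", "eval/scripts/gencode_json.py",
        "--model", model,
        "--output-dir", "eval_results/generated_code",  # default shown above
        "--temperature", "0",                           # default shown above
        "--with-background",                            # drop this flag for the w/o-background setup
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```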

-Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
+## **Evaluating the Generated Code**

-To run the script, go to the root of this repo and use the following command:
+### 1. Download Numeric Test Data
+
+Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save the file as `eval/data/test_data.h5`.
+
+### 2. Run the Evaluation
+
+To evaluate the generated code using a specific model, go to the root of this repo and use the following command:

 ```bash
 python eval/scripts/test_generated_code.py --model "model_name"
 ```
+
+Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code was generated with **scientist-annotated background**.
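
Before launching a long evaluation run, it can be worth confirming that the downloaded HDF5 file is readable. A minimal sanity check, assuming the `h5py` package is available (the file's internal layout is not documented here, so this only lists the top-level entries):

```python
import h5py

# Verify the numeric test data landed at the expected path and opens cleanly.
with h5py.File("eval/data/test_data.h5", "r") as f:
    names = list(f.keys())
    print(f"{len(names)} top-level entries, e.g. {names[:5]}")
```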

eval/scripts/gencode_json.py

Lines changed: 13 additions & 13 deletions
@@ -8,10 +8,10 @@
 )
 from scicode.gen.models import extract_python_script, get_model_function

-
 DEFAULT_PROMPT_TEMPLATE = Path("eval", "data", "background_comment_template.txt").read_text()
 BACKGOUND_PROMPT_TEMPLATE = Path("eval", "data", "multistep_template.txt").read_text()

+
 class Gencode:
     def __init__(self, model: str, output_dir: Path,
                  prompt_dir: Path, with_background: bool, temperature: float):
@@ -57,6 +57,10 @@ def generate_response_with_steps(
             save (bool, optional): Save propmt and model response. Defaults to True.
         """
         prob_id = prob_data["problem_id"]
+        output_file_path = (
+            self.output_dir / Path(self.model).parts[-1] / self._get_background_dir()
+            / f"{prob_id}.{num_steps}.py"
+        )
         if num_steps == 1:
             self.previous_llm_code = [None] * tot_steps
         else:
@@ -69,8 +73,7 @@ def generate_response_with_steps(
                     prev_file_path = Path("eval", "data", f"{prob_id}.{prev_step+1}.txt")
                 else:
                     prev_file_path = (
-                        self.output_dir
-                        / model
+                        self.output_dir / Path(self.model).parts[-1] / self._get_background_dir()
                         / f"{prob_id}.{prev_step + 1}.py"
                     )
                 if prev_file_path.is_file():
@@ -80,6 +83,9 @@ def generate_response_with_steps(
                     self.previous_llm_code[prev_step] = function_code
                 else:
                     raise Exception(f'Generating {prob_id} step {num_steps} ahead of step {prev_step + 1}.')
+
+        if output_file_path.exists():
+            return
         prompt, previous_code = self.generate_prompt_with_steps(prob_data, num_steps, prompt_template)
         if save:
             self.save_prompt_with_steps(prob_data, prompt, num_steps)
@@ -89,16 +95,10 @@ def generate_response_with_steps(
         model_kwargs["max_tokens"] = 4096
         model_kwargs["temperature"] = self.temperature
         # write the response to a file if it doesn't exist
-        output_file_path = (
-            self.output_dir
-            / model
-            / f"{prob_id}.{num_steps}.py"
-        )
-        if not output_file_path.exists():
-            model_fct = get_model_function(model, **model_kwargs)
-            response_from_llm = model_fct(prompt)
-            self.previous_llm_code[num_steps - 1] = extract_python_script(response_from_llm)
-            self.save_response_with_steps(prob_data, response_from_llm, previous_code, num_steps)
+        model_fct = get_model_function(model, **model_kwargs)
+        response_from_llm = model_fct(prompt)
+        self.previous_llm_code[num_steps - 1] = extract_python_script(response_from_llm)
+        self.save_response_with_steps(prob_data, response_from_llm, previous_code, num_steps)

     @staticmethod
     def process_problem_code(prob_data: dict, num_steps: int) -> str:
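
Taken together, these hunks move the output-path computation to the top of `generate_response_with_steps`: the path is now built from the last component of the model name plus the background subdirectory, and the whole generation step is skipped early when that file already exists, rather than checking only around the final write. A simplified, standalone sketch of that resume-friendly pattern (the function names and the placeholder response are illustrative, not the repository's actual code):

```python
from pathlib import Path


def output_path(output_dir: Path, model: str, background_dir: str,
                prob_id: str, num_steps: int) -> Path:
    # Mirrors the layout in the diff: <output_dir>/<model last part>/<background dir>/<id>.<step>.py
    return output_dir / Path(model).parts[-1] / background_dir / f"{prob_id}.{num_steps}.py"


def generate_step(output_dir: Path, model: str, background_dir: str,
                  prob_id: str, num_steps: int) -> None:
    out = output_path(output_dir, model, background_dir, prob_id, num_steps)
    if out.exists():
        return  # already generated on a previous run; skip the expensive LLM call
    out.parent.mkdir(parents=True, exist_ok=True)
    response = "# ... model response would go here ..."  # placeholder; no LLM call in this sketch
    out.write_text(response)
```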
