This repo contains the evaluation code for the paper "[SciCode: A Research Coding Benchmark Curated by Scientists](https://arxiv.org/abs/2407.13168)"
## 🔔News
**[2024-11-04]: Leaderboard is on! Check [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude Sonnet 3.5 (new) results.**

**[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**

**[2024-09-26]: SciCode has been accepted at the NeurIPS D&B Track 2024.**

**[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**

**[2024-07-24]: We have added the scientist-annotated background and now support the with-background evaluation setup.**
## Introduction
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It covers **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information as well as scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only **7.7%** of the problems in the most realistic setting. Broadly, SciCode mirrors scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only demonstrates contemporary LLMs' progress toward becoming helpful assistants for scientists but also sheds light on the future building and evaluation of scientific AI.
4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
5. Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests
## More information and FAQ
More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
If you have trouble reaching the website, please find the Markdown source in its [GitHub repository](https://github.com/scicode-bench/scicode-bench.github.io/tree/main/docs).
## **Generating Code with LLMs**
### 1. Set Up Your API Keys
First, create a `keys.cfg` file at the root of the repository and add your API keys for the different providers as follows:
```
OPENAI_KEY = 'your_api_key'
ANTHROPIC_KEY = 'your_api_key'
GOOGLE_KEY = 'your_api_key'
```
If you're using **litellm**, which supports a variety of providers including **vllm**, **Hugging Face**, and **Together AI**, make sure to include the relevant API key in the `keys.cfg` file. Please refer to the docs [here](https://docs.litellm.ai/docs/providers). Then, use `litellm/*` as the model name when running the command.
For example, to use **Together AI**'s models, you'll need to add the following to your `keys.cfg`:
```
TOGETHERAI_API_KEY = 'your_api_key'
```
### 2. Generating Code
To generate code using the **Together AI** model (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`), go to the root of this repo and run:
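A sketch of the command, assuming the `gencode_json.py` script and the `--model` option documented under Command-Line Arguments below:

```
python eval/scripts/gencode_json.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
```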
Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
In this repository, **we only support evaluating with previously generated code for each step.**
### Command-Line Arguments
When running the `gencode_json.py` script, you can use the following options:
- `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
- `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
- `--input-path`: Path to the JSONL file describing the problems. Default is `eval/data/problems_all.jsonl`.
- `--prompt-dir`: Directory where prompt files are saved. Default is `eval_results/prompt`.
- `--with-background`: If enabled, includes the scientist-annotated problem background in the prompt.
- `--temperature`: Controls the randomness of the output. Default is 0.
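For instance, a sketch that combines several of these options (the model name and paths shown are illustrative):

```
python eval/scripts/gencode_json.py \
    --model gpt-4o \
    --output-dir eval_results/generated_code \
    --input-path eval/data/problems_all.jsonl \
    --with-background
```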
---
## **Evaluating the Generated Code**
### 1. Download Numeric Test Data
Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save the file as `eval/data/test_data.h5`.
### 2. Run the Evaluation
To evaluate the generated code using a specific model, go to the root of this repo and use the following command:
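A sketch of that command, assuming the `eval/scripts/test_generated_code.py` script from the setup steps and the same `--model` and `--with-background` options described above:

```
python eval/scripts/test_generated_code.py --model "model_name"
```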
Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code is generated with **scientist-annotated background**.