
Commit 8dec269

Merge pull request #22 from Zilinghan/main
Integrate `inspect_ai` for evaluation
2 parents 0aae5ce + 919f19a commit 8dec269

File tree

7 files changed: +480, -6 lines changed


.github/workflows/tests.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -20,7 +20,7 @@ jobs:
       uses: actions/checkout@v2
     - uses: actions/setup-python@v5
       with:
-        python-version: '3.9'
+        python-version: '3.10'
     - name: Create dummy keys.cfg
       run: touch keys.cfg
     - name: Install uv
```

.gitignore

Lines changed: 5 additions & 2 deletions
```diff
@@ -2,10 +2,13 @@
 keys.cfg
 **/test_result/**
 **/output/**
-**/eval_results/**
+**/eval_results*/**
 eval/logs/**
 *.h5
-
+logs/**
+**/logs/**
+**/tmp/**
+integration/**
 
 # -------
 
```

README.md

Lines changed: 15 additions & 0 deletions
````diff
@@ -7,6 +7,8 @@ This repo contains the evaluation code for the paper "[SciCode: A Research Codin
 
 ## 🔔News
 
+**[2025-01-24]: SciCode has been integrated with [`inspect_ai`](https://inspect.ai-safety-institute.org.uk/) for easier and faster model evaluations.**
+
 **[2024-11-04]: Leaderboard is on! Check [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude Sonnet 3.5 (new) results.**
 
 **[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**
@@ -54,6 +56,19 @@ SciCode sources challenging and realistic research-level coding problems across
 4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
 5. Run `eval/scripts/test_generated_code.py` to evaluate the unittests
 
+
+## Instructions to evaluate a new model using `inspect_ai` (recommended)
+
+SciCode has been integrated with `inspect_ai` for easier and faster model evaluation than the methods above. Run the first three steps in the [above section](#instructions-to-evaluate-a-new-model), then go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following command:
+
+```bash
+cd eval/inspect_ai
+export OPENAI_API_KEY=your-openai-api-key
+inspect eval scicode.py --model openai/gpt-4o --temperature 0
+```
+
+For more detailed information on using `inspect_ai`, see the [`eval/inspect_ai` readme](eval/inspect_ai/).
+
 ## More information and FAQ
 
 More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
````
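An editor's aside, not part of this commit: the same pattern extends to other providers supported by `inspect_ai` by swapping the API key and the model string. A hedged sketch for an Anthropic model, where the exact model identifier is an assumption:

```bash
# Illustrative variant of the command added in the README above.
cd eval/inspect_ai
export ANTHROPIC_API_KEY=your-anthropic-api-key
inspect eval scicode.py --model anthropic/claude-3-5-sonnet-latest --temperature 0
```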

eval/inspect_ai/README.md

Lines changed: 40 additions & 0 deletions
## **SciCode Evaluation using `inspect_ai`**

### 1. Set Up Your API Keys

Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to set up the corresponding API keys, depending on the types of models they would like to evaluate.
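As an editor's illustration (not part of the committed file), key setup is typically just a matter of exporting the provider's environment variable before running `inspect eval`; the variable names below follow `inspect_ai`'s provider conventions, and the placeholder values are assumptions:

```bash
# Illustrative only: export the variable for whichever provider you plan to evaluate.
export OPENAI_API_KEY=your-openai-api-key          # for models like openai/gpt-4o
export ANTHROPIC_API_KEY=your-anthropic-api-key    # for models like anthropic/claude-3-5-sonnet-latest
# Other providers use analogous variables; see the inspect_ai documentation linked above.
```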
### 2. Set Up Command Line Arguments if Needed

In most cases, once the key is set up, users can start the SciCode evaluation directly with the following command:

```bash
inspect eval scicode.py --model <your_model> --temperature 0
```
However, there are some additional command line arguments that can be useful as well.

- `--max_connections`: Maximum number of API connections to the evaluated model.
- `--limit`: Limit on the number of samples to evaluate from the SciCode dataset.
- `-T input_path=<another_input_json_file>`: Useful when the user wants to switch to another JSON dataset (e.g., the dev set).
- `-T output_dir=<your_output_dir>`: Changes the default output directory (`./tmp`).
- `-T with_background=True/False`: Whether to include the problem background.
- `-T mode=normal/gold/dummy`: Selects the run mode; `gold` and `dummy` are additional modes for sanity checks.
    - `normal` mode is the standard mode for evaluating a model.
    - `gold` mode can only be used on the dev set and loads the gold answers.
    - `dummy` mode does not call any real LLMs and generates dummy outputs.
For example, the user can run five samples on the dev set with background as follows:

```bash
inspect eval scicode.py \
    --model openai/gpt-4o \
    --temperature 0 \
    --limit 5 \
    -T input_path=../data/problems_dev.jsonl \
    -T output_dir=./tmp/dev \
    -T with_background=True \
    -T mode=gold
```
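A hedged aside, not in the committed file: because `dummy` mode is described above as not calling any real LLMs, it can double as a cheap smoke test of the local setup. A model string is still passed to satisfy the usual `inspect eval` interface, although it should not actually be invoked in this mode:

```bash
# Assumed smoke test: per the mode description above, no real LLM calls are made.
inspect eval scicode.py --model openai/gpt-4o --limit 1 -T mode=dummy
```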

For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).
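One final editor's note, also not part of the commit: `inspect_ai` ships a built-in log viewer, so logs written by the `inspect eval` runs above can usually be browsed afterwards with:

```bash
# Opens the inspect_ai log viewer over logs from previous `inspect eval` runs.
inspect view
```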
