Integrate inspect_ai for evaluation #22

Merged · 12 commits · Jan 25, 2025
2 changes: 1 addition & 1 deletion .github/workflows/tests.yaml
@@ -20,7 +20,7 @@ jobs:
uses: actions/checkout@v2
- uses: actions/setup-python@v5
with:
python-version: '3.9'
python-version: '3.10'
- name: Create dummy keys.cfg
run: touch keys.cfg
- name: Install uv
7 changes: 5 additions & 2 deletions .gitignore
@@ -2,10 +2,13 @@
keys.cfg
**/test_result/**
**/output/**
**/eval_results/**
**/eval_results*/**
eval/logs/**
*.h5

logs/**
**/logs/**
**/tmp/**
integration/**

# -------

15 changes: 15 additions & 0 deletions README.md
@@ -7,6 +7,8 @@ This repo contains the evaluation code for the paper "[SciCode: A Research Codin

## 🔔News

**[2025-01-24]: SciCode has been integrated with [`inspect_ai`](https://inspect.ai-safety-institute.org.uk/) for easier and faster model evaluations.**

**[2024-11-04]: Leaderboard is on! Check [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude Sonnet 3.5 (new) results.**

**[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**
@@ -54,6 +56,19 @@ SciCode sources challenging and realistic research-level coding problems across
4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
5. Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests (a minimal sketch of both steps is shown below)
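A minimal sketch of steps 4 and 5 (invocations shown without arguments; the scripts' options, such as which model to generate with, are documented in the [`eval/scripts` readme](eval/scripts/)):

```bash
# Step 4: generate model outputs for the benchmark problems
# (add the options described in eval/scripts/, e.g. which model to use)
python eval/scripts/gencode_json.py

# Step 5: evaluate the generated code against the unit tests
python eval/scripts/test_generated_code.py
```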


## Instructions to evaluate a new model using `inspect_ai` (recommended)

SciCode has been integrated with `inspect_ai` for easier and faster model evaluation than the method above. Complete the first three steps in the [above section](#instructions-to-evaluate-a-new-model), then go to the `eval/inspect_ai` directory, set up the corresponding API key, and run the following command:

```bash
cd eval/inspect_ai
export OPENAI_API_KEY=your-openai-api-key
inspect eval scicode.py --model openai/gpt-4o --temperature 0
```

For more detailed information on using `inspect_ai`, see the [`eval/inspect_ai` readme](eval/inspect_ai/).

## More information and FAQ

More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
40 changes: 40 additions & 0 deletions eval/inspect_ai/README.md
@@ -0,0 +1,40 @@
## **SciCode Evaluation using `inspect_ai`**

### 1. Set Up Your API Keys

Users can follow [`inspect_ai`'s official documentation](https://inspect.ai-safety-institute.org.uk/#getting-started) to set up the corresponding API keys, depending on the types of models they would like to evaluate.
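
For example, for OpenAI models this is just an environment variable (a minimal sketch; other provider keys such as `ANTHROPIC_API_KEY` follow the same pattern, as listed in the `inspect_ai` documentation):

```bash
# Key for OpenAI models such as openai/gpt-4o
export OPENAI_API_KEY=your-openai-api-key

# Keys for other providers follow the same pattern, e.g. Anthropic models
export ANTHROPIC_API_KEY=your-anthropic-api-key
```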

### 2. Set Up Command-Line Arguments if Needed

In most cases, after setting up the keys, users can directly start the SciCode evaluation with the following command.

```bash
inspect eval scicode.py --model <your_model> --temperature 0
```

However, there are some additional command-line arguments that may also be useful:

- `--max_connections`: Maximum number of concurrent API connections to the evaluated model.
- `--limit`: Limit on the number of samples to evaluate from the SciCode dataset.
- `-T input_path=<another_input_json_file>`: Useful when the user wants to switch to another JSON dataset (e.g., the dev set).
- `-T output_dir=<your_output_dir>`: Changes the default output directory (`./tmp`).
- `-T with_background=True/False`: Whether to include the problem background.
- `-T mode=normal/gold/dummy`: Provides two additional modes for sanity checks besides the standard one:
  - `normal` mode is the standard mode for evaluating a model
  - `gold` mode loads the gold answers and can only be used on the dev set
  - `dummy` mode does not call any real LLMs and generates dummy outputs

For example, a user can run five samples from the dev set with background in `gold` mode:

```bash
inspect eval scicode.py \
--model openai/gpt-4o \
--temperature 0 \
--limit 5 \
-T input_path=../data/problems_dev.jsonl \
-T output_dir=./tmp/dev \
-T with_background=True \
-T mode=gold
```
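
A quick end-to-end sanity check that makes no real LLM calls can be run in `dummy` mode (a sketch; it assumes `inspect eval` still expects a `--model` value even though, per the mode description above, the model is not actually queried):

```bash
# Dummy-mode sanity check on two samples: no real LLM calls are made
inspect eval scicode.py \
    --model openai/gpt-4o \
    --limit 2 \
    -T mode=dummy
```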

For more information regarding `inspect_ai`, we refer users to its [official documentation](https://inspect.ai-safety-institute.org.uk/).