Commit c67df43

migrate inspect_ai to the HF dataset
1 parent 019de40 commit c67df43

2 files changed: 13 additions, 13 deletions

eval/inspect_ai/README.md

Lines changed: 6 additions & 6 deletions
@@ -16,26 +16,26 @@ However, there are some additional command line arguments that could be useful a
 
 - `--max-connections`: Maximum amount of API connections to the evaluated model.
 - `--limit`: Limit of the number of samples to evaluate in the SciCode dataset.
-- `-T input_path=<another_input_json_file>`: This is useful when user wants to change to another json dataset (e.g., the dev set).
+- `-T split=validation/test`: Whether the user wants to run on the small `validation` set (15 samples) or the large `test` set (65 samples).
 - `-T output_dir=<your_output_dir>`: This changes the default output directory (`./tmp`).
 - `-T h5py_file=<your_h5py_file>`: This is used if your h5py file is not downloaded in the recommended directory.
 - `-T with_background=True/False`: Whether to include problem background.
 - `-T mode=normal/gold/dummy`: This provides two additional modes for sanity checks.
     - `normal` mode is the standard mode to evaluate a model
-    - `gold` mode can only be used on the dev set which loads the gold answer
+    - `gold` mode can only be used on the validation set which loads the gold answer
     - `dummy` mode does not call any real LLMs and generates some dummy outputs
 
-For example, user can run five sames on the dev set with background as
+For example, user can run five samples on the validation set with background as
 
 ```bash
 inspect eval scicode.py \
     --model openai/gpt-4o \
     --temperature 0 \
     --limit 5 \
-    -T input_path=../data/problems_dev.jsonl \
-    -T output_dir=./tmp/dev \
+    -T split=validation \
+    -T output_dir=./tmp/val \
     -T with_background=True \
-    -T mode=gold
+    -T mode=normal
 ```
 
 User can run the evaluation on `Deepseek-v3` using together ai via the following command:
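
The README change above states that `split` takes `validation` or `test`, that `mode` takes `normal`, `gold`, or `dummy`, and that `gold` mode only works on the validation set. A minimal sketch of checking those combinations before launching a run is given below. The helper `check_task_args` is hypothetical and is not part of this commit; the allowed values are taken from the README text only.

```python
# Allowed values as listed in the README hunk above.
VALID_SPLITS = {"validation", "test"}
VALID_MODES = {"normal", "gold", "dummy"}


def check_task_args(split: str, mode: str) -> None:
    """Raise ValueError for option combinations the README rules out.

    Hypothetical helper for illustration; scicode.py in this commit
    does not define it.
    """
    if split not in VALID_SPLITS:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(VALID_SPLITS)}"
        )
    if mode not in VALID_MODES:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(VALID_MODES)}"
        )
    # Per the README, gold mode can only be used on the validation set.
    if mode == "gold" and split != "validation":
        raise ValueError("gold mode can only be used with split=validation")


if __name__ == "__main__":
    check_task_args("validation", "gold")  # accepted
    check_task_args("test", "normal")      # accepted
```

Calling `check_task_args("test", "gold")` raises `ValueError`, which matches the restriction in the README.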

eval/inspect_ai/scicode.py

Lines changed: 7 additions & 7 deletions
@@ -5,7 +5,7 @@
 from typing import Any
 from pathlib import Path
 from inspect_ai import Task, task
-from inspect_ai.dataset import json_dataset, Sample
+from inspect_ai.dataset import Sample, hf_dataset
 from inspect_ai.solver import solver, TaskState, Generate
 from inspect_ai.scorer import scorer, mean, metric, Metric, Score, Target
 from scicode.parse.parse import extract_function_name, get_function_from_code
@@ -392,26 +392,26 @@ async def score(state: TaskState, target: Target):
 
 @task
 def scicode(
-    input_path: str = '../data/problems_all.jsonl',
+    split: str = 'test',
     output_dir: str = './tmp',
     with_background: bool = False,
     h5py_file: str = '../data/test_data.h5',
     mode: str = 'normal',
 ):
-    dataset = json_dataset(
-        input_path,
-        record_to_sample
+
+    dataset = hf_dataset(
+        'Zilinghan/scicode',
+        split=split,
+        sample_fields=record_to_sample,
     )
     return Task(
         dataset=dataset,
         solver=scicode_solver(
-            input_path=input_path,
             output_dir=output_dir,
             with_background=with_background,
             mode=mode,
         ),
         scorer=scicode_scorer(
-            input_path=input_path,
             output_dir=output_dir,
             with_background=with_background,
             h5py_file=h5py_file,
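
The hunk above passes `record_to_sample` to `hf_dataset` through `sample_fields`, so each raw row is converted to a `Sample` by a callable. A rough sketch of that pattern follows. The body of `record_to_sample` is not shown in this diff, so `record_to_fields`, `load_scicode`, and the `problem_id` field name are assumptions made for illustration, not the code in scicode.py. Loading from the Hugging Face Hub is expected to need the `datasets` package and network access, which is why the inspect_ai import is deferred.

```python
from typing import Any


def record_to_fields(record: dict[str, Any]) -> dict[str, Any]:
    """Map one raw dataset row to keyword arguments for inspect_ai's Sample.

    The "problem_id" field name is an assumption; the real schema of the
    dataset is not shown in this commit.
    """
    return {
        "input": str(record["problem_id"]),
        "id": str(record["problem_id"]),
        # Keep the whole row so solver and scorer can read any field later.
        "metadata": dict(record),
    }


def load_scicode(split: str = "test"):
    """Hypothetical wrapper mirroring the hf_dataset call in the diff."""
    # Imported lazily so record_to_fields can be used without inspect_ai.
    from inspect_ai.dataset import Sample, hf_dataset

    return hf_dataset(
        "Zilinghan/scicode",
        split=split,
        sample_fields=lambda record: Sample(**record_to_fields(record)),
    )
```

Keeping the row-to-field mapping in a plain function makes it possible to unit-test the conversion on a hand-written dictionary without downloading the dataset.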
