Goal
Let the rule report / benchmarks run on more datasets out of the box. The generic path already exists — this issue is only about adding built-in loaders.
Current state
benchmarks/rule_report.py::load_gold (L37) already accepts any data via --data <file.jsonl>:
    if args.data:
        rows = [json.loads(l) for l in Path(args.data).read_text().splitlines() if l.strip()]
        types = sorted({e["type"] for r in rows for e in r.get("entities", [])})
        return rows, types
Each line is one doc: {"text": "...", "entities": [{"text","start","end","type"}, ...]}.
The only built-in --dataset option is tab.
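As an illustration of the schema, here is a minimal sketch that builds two rows and round-trips them through the same parsing expression as the --data branch. The helper name make_doc and the sample sentences are hypothetical, not part of the repo.

```python
import json


def make_doc(text, mentions):
    """Build one row from (surface, type) pairs, locating each surface left to right."""
    entities, cursor = [], 0
    for surface, ent_type in mentions:
        start = text.index(surface, cursor)  # raises ValueError if the surface is absent
        end = start + len(surface)
        entities.append({"text": surface, "start": start, "end": end, "type": ent_type})
        cursor = end
    return {"text": text, "entities": entities}


docs = [
    make_doc("Ada Lovelace worked in London.", [("Ada Lovelace", "PER"), ("London", "LOC")]),
    make_doc("No entities here.", []),
]

# One JSON object per line; writing this string to a file gives a valid --data input.
payload = "\n".join(json.dumps(d) for d in docs) + "\n"

# Parse it back the same way the --data branch does.
rows = [json.loads(l) for l in payload.splitlines() if l.strip()]
types = sorted({e["type"] for r in rows for e in r.get("entities", [])})
```

Note that end is exclusive, so text[start:end] reproduces the entity text.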
What to do
- Add 1–2 loaders under --dataset (suggested: conll2003, ontonotes, or ai4privacy). Each returns (rows, types) in the exact shape above: convert the source dataset's token/IOB or char-span annotations into {text, entities:[{text,start,end,type}]} with correct character offsets.
- Add each new name to the --dataset argparse choices.
- Document the --data JSONL schema in benchmarks/INSPECTING_RULES.md.
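For token/IOB sources, the conversion could look roughly like the sketch below. It assumes the tags are already decoded to strings such as B-ORG / I-ORG / O (some datasets store integer ids instead) and that tokens are joined with single spaces. The function name iob_to_doc is hypothetical; the real loader should follow load_tab_ds where it differs.

```python
def iob_to_doc(tokens, tags):
    """Join tokens with single spaces and turn IOB tags into character spans."""
    text = " ".join(tokens)
    entities = []
    open_span = None  # (start, end, type) of the entity currently being built
    pos = 0

    def flush(span):
        start, end, ent_type = span
        entities.append({"text": text[start:end], "start": start, "end": end, "type": ent_type})

    for token, tag in zip(tokens, tags):
        start, end = pos, pos + len(token)
        pos = end + 1  # account for the single joining space
        prefix, _, ent_type = tag.partition("-")
        if prefix == "I" and open_span is not None and open_span[2] == ent_type:
            open_span = (open_span[0], end, ent_type)  # extend the open entity
            continue
        if open_span is not None:
            flush(open_span)
            open_span = None
        if prefix in ("B", "I"):  # a stray I- tag is treated as the start of an entity
            open_span = (start, end, ent_type)
    if open_span is not None:
        flush(open_span)
    return {"text": text, "entities": entities}


doc = iob_to_doc(
    ["EU", "rejects", "German", "call", "in", "New", "York"],
    ["B-ORG", "O", "B-MISC", "O", "O", "B-LOC", "I-LOC"],
)
```

Joining with single spaces keeps the offset arithmetic trivial, at the cost of text that does not match the original detokenized sentence (for example, a space before punctuation).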
Gotchas
- Get character offsets right. Many NER datasets are token/IOB, so you must reconstruct start/end into the joined text, or the report's overlap matching will be wrong.
- Keep it dependency-light: load via datasets (already used by the TAB loader).
Acceptance
python benchmarks/rule_report.py --rules benchmarks/results/results_extract_tab.ckpt_rulechef.json --dataset conll2003 --out r.html
produces a report (rules won't match CoNLL types, which is fine; the loader is what's being tested). Also include a tiny offset sanity check (text[e["start"]:e["end"]] == e["text"]) for the new loader.
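One possible shape for that sanity check, as a sketch: the helper name offset_errors and the sample rows are hypothetical, and in practice rows would be the output of the new loader.

```python
def offset_errors(rows):
    """Return (row_index, entity) pairs whose span does not reproduce the entity text."""
    bad = []
    for i, row in enumerate(rows):
        text = row["text"]
        for e in row.get("entities", []):
            if text[e["start"]:e["end"]] != e["text"]:
                bad.append((i, e))
    return bad


rows = [
    {"text": "Paris is nice", "entities": [{"text": "Paris", "start": 0, "end": 5, "type": "LOC"}]},
    # Deliberately off by one: the slice [5:10] is " Pari", not "Paris".
    {"text": "Visit Paris", "entities": [{"text": "Paris", "start": 5, "end": 10, "type": "LOC"}]},
]

errors = offset_errors(rows)
```

A test for the loader can then simply assert that offset_errors(rows) is empty.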
Pointers
benchmarks/rule_report.py:37 (load_gold), benchmarks/benchmark_extract.py (load_tab_ds as the reference loader). Good first issue.