More dataset loaders for the rule report / benchmarks #5

Description

@adaamko

Goal

Let the rule report / benchmarks run on more datasets out of the box. The generic path already exists — this issue is only about adding built-in loaders.

Current state

benchmarks/rule_report.py::load_gold (L37) already accepts any data via --data <file.jsonl>:

if args.data:
    rows = [json.loads(l) for l in Path(args.data).read_text().splitlines() if l.strip()]
    types = sorted({e["type"] for r in rows for e in r.get("entities", [])})
    return rows, types

Each line is one doc: {"text": "...", "entities": [{"text","start","end","type"}, ...]}.
The only built-in --dataset option is tab.
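For illustration, here is a minimal sketch of one row in that shape, parsed the same way as the snippet above. The sentence and the PER/LOC labels are made up for this example, and it assumes `end` is exclusive (which is what the `text[start:end]` check under Acceptance implies).

```python
import json

# Hypothetical example document; the sentence and labels are invented for illustration.
text = "Alice moved to Paris."
row = {
    "text": text,
    "entities": [
        # Assumption: "end" is exclusive, so text[start:end] gives the entity text.
        {"text": "Alice", "start": 0, "end": 5, "type": "PER"},
        {"text": "Paris", "start": 15, "end": 20, "type": "LOC"},
    ],
}

# One JSON object per line, as a --data file would contain.
jsonl_line = json.dumps(row)

# Same parsing logic as the load_gold snippet, applied to an in-memory line.
rows = [json.loads(l) for l in [jsonl_line] if l.strip()]
types = sorted({e["type"] for r in rows for e in r.get("entities", [])})
print(types)
```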

What to do

  1. Add 1–2 loaders under --dataset (suggested: conll2003, ontonotes, or ai4privacy). Each returns (rows, types) in the exact shape above — convert the source dataset's token/IOB or char-span annotations into {text, entities:[{text,start,end,type}]} with correct character offsets.
  2. Add the choice to the --dataset argparse choices.
  3. Document the --data JSONL schema in benchmarks/INSPECTING_RULES.md.
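The conversion in step 1 could be sketched roughly as follows for a token/IOB source. This is not the repo's code: `iob_to_row` is a hypothetical helper, and it assumes IOB2-style string tags (`B-X`, `I-X`, `O`) and that tokens are joined with single spaces. A dataset that stores integer tag ids would need them mapped to label strings first.

```python
def iob_to_row(tokens, tags):
    """Hypothetical helper: join tokens with single spaces and turn IOB2 tags
    into character-span entities in the {text, entities: [...]} shape."""
    spans = []      # collected (start, end, type) triples
    current = None  # the entity currently being extended, as [start, end, type]
    pos = 0         # character position in the joined text

    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if i > 0:
            pos += 1  # account for the single space inserted before this token
        start, end = pos, pos + len(tok)
        prefix, _, label = tag.partition("-")

        if prefix == "I" and current is not None and current[2] == label:
            current[1] = end  # continue the open entity
        else:
            if current is not None:
                spans.append(tuple(current))
                current = None
            if prefix in ("B", "I"):
                # A stray I- tag with no matching open entity starts a new one.
                current = [start, end, label]
        pos = end

    if current is not None:
        spans.append(tuple(current))

    text = " ".join(tokens)
    entities = [
        {"text": text[s:e], "start": s, "end": e, "type": t} for s, e, t in spans
    ]
    return {"text": text, "entities": entities}


# Usage on a made-up sentence.
example = iob_to_row(
    ["John", "Smith", "lives", "in", "New", "York"],
    ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"],
)
print(example)
```

Because the entity text is sliced from the joined string, the offsets and the `text` field cannot drift apart. A loader would call something like this per document and then derive `types` from the collected rows.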

Gotchas

  • Get character offsets right: many NER datasets are token/IOB-annotated, so you must reconstruct start/end against the joined text, or the report's overlap matching will be wrong.
  • Keep it dependency-light: load via datasets (already used by the TAB loader).

Acceptance

python benchmarks/rule_report.py --rules benchmarks/results/results_extract_tab.ckpt_rulechef.json --dataset conll2003 --out r.html produces a report (the rules won't match CoNLL types; that's fine, since the loader is what's being tested). Also include a tiny offset sanity check (text[e["start"]:e["end"]] == e["text"]) for the new loader.
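The offset sanity check might look something like the sketch below. `find_bad_offsets` is a hypothetical name, not something in the repo. It takes rows in the shape described under Current state and returns the entities whose offsets do not reproduce their text.

```python
def find_bad_offsets(rows):
    """Hypothetical helper: return (row_index, entity) pairs where
    text[start:end] does not equal the entity's own text."""
    bad = []
    for i, row in enumerate(rows):
        text = row["text"]
        for e in row.get("entities", []):
            if text[e["start"]:e["end"]] != e["text"]:
                bad.append((i, e))
    return bad


# Made-up rows: the first is consistent, the second is off by one character.
good_row = {
    "text": "Visit Berlin soon",
    "entities": [{"text": "Berlin", "start": 6, "end": 12, "type": "LOC"}],
}
bad_row = {
    "text": "Visit Berlin soon",
    "entities": [{"text": "Berlin", "start": 5, "end": 11, "type": "LOC"}],
}
problems = find_bad_offsets([good_row, bad_row])
print(problems)
```

In a test for a new loader, asserting that this returns an empty list over the loaded rows would cover the check described above.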

Pointers

benchmarks/rule_report.py:37 (load_gold), benchmarks/benchmark_extract.py (load_tab_ds as the reference loader). Good first issue.
