From 5afd01e77e34433697fa26bdf964a9944df2b914 Mon Sep 17 00:00:00 2001
From: Yaru Wang
Date: Thu, 11 Jun 2026 09:49:00 +0200
Subject: [PATCH] Add example: polymorphic (multi-type) fields

A teaching notebook for fields whose value type varies across records.
Shows declaring the field with a list-valued `type` (e.g.
["string","number","object"]), the coarse default (exact) behavior, and a
custom comparator that understands every shape, plus a strict_types
validation note and the limits (#83/#93).
---
 examples/08_example_polymorphic.ipynb | 193 ++++++++++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 examples/08_example_polymorphic.ipynb

diff --git a/examples/08_example_polymorphic.ipynb b/examples/08_example_polymorphic.ipynb
new file mode 100644
index 0000000..a34c6a3
--- /dev/null
+++ b/examples/08_example_polymorphic.ipynb
@@ -0,0 +1,193 @@
+{
+ "cells": [
+  {
+   "metadata": {},
+   "cell_type": "markdown",
+   "source": [
+    "# Polymorphic fields\n",
+    "\n",
+    "A field is **polymorphic** (\"multiform\") when its value is not always the same\n",
+    "JSON type across records -- a `quantity` that is `\"35 nm\"` (string) in one\n",
+    "record, `35` (number) in another, `{\"value\": 35, \"unit\": \"nm\"}` (object) in a\n",
+    "third.\n",
+    "\n",
+    "JSON Schema lets you declare this with a list-valued `type`, and this evaluator\n",
+    "supports it directly:\n",
+    "\n",
+    "- `{\"type\": [\"string\", \"null\"]}` -- nullable; the `null` is dropped, so it is\n",
+    "  just a `string` field.\n",
+    "- `{\"type\": [\"string\", \"number\", \"object\"]}` -- a genuine multi-shape field.\n",
+    "\n",
+    "`anyOf`, `oneOf`, and `if`/`then`/`else` in JSON Schema can also yield fields with more than one type."
+   ],
+   "id": "8ce56dfc96862d04"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The problem: the default comparator is coarse\n",
+    "\n",
+    "If you declare `quantity` honestly as `[\"string\", \"number\", \"object\"]`, the default `exact` comparator treats the same quantity in a different shape as a mismatch, so you need a custom comparator that handles every declared type."
+   ],
+   "id": "6d0263cde7e2cbf3"
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2026-06-11T07:46:09.416323Z",
+     "start_time": "2026-06-11T07:46:09.411293Z"
+    }
+   },
+   "source": [
+    "from struct_extract_eval import evaluate\n",
+    "from struct_extract_eval.core.comparators.comparator import ComparatorResult\n",
+    "from struct_extract_eval.core.comparators.registry import register as register_comparator\n",
+    "from struct_extract_eval.core.xeval import annotate_xeval\n",
+    "from typing_extensions import override\n",
+    "\n",
+    "\n",
+    "def show(title, result):\n",
+    "    print(title)\n",
+    "    for rec in result.records:\n",
+    "        for fr in rec.field_results:\n",
+    "            print(f\" [{rec.record_id}] {fr.path:<8} gold={fr.gold_value!r:<24} extracted={fr.extracted_value!r:<20} -> {fr.status}\")\n",
+    "    print(f\" mean F1={result.mean_f1:.2f}\")\n",
+    "\n",
+    "\n",
+    "GOLD = [\n",
+    "    {\"quantity\": {\"value\": 35, \"unit\": \"nm\"}},  # object form\n",
+    "    {\"quantity\": 50},  # number form\n",
+    "    {\"quantity\": \"60 nm\"},  # string form\n",
+    "]\n",
+    "EXTRACTED = [\n",
+    "    {\"quantity\": \"35 nm\"},  # same meaning as gold, different shape\n",
+    "    {\"quantity\": \"50\"},  # same meaning as gold, different shape\n",
+    "    {\"quantity\": \"65 nm\"},  # genuinely wrong (60 != 65)\n",
+    "]\n",
+    "\n",
+    "# Honest multi-type declaration. No comparator -> default `exact`.\n",
+    "# annotate_xeval assigns that default (evaluate does not annotate for you).\n",
+    "schema_default = {\n",
+    "    \"type\": \"object\",\n",
+    "    \"properties\": {\"quantity\": {\"type\": [\"string\", \"number\", \"object\"]}},\n",
+    "}\n",
+    "annotate_xeval(schema_default)"
+   ],
+   "id": "ee75499317734249",
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'type': 'object',\n",
+       " 'properties': {'quantity': {'type': ['string', 'number', 'object'],\n",
+       "   'x-eval-compare': 'exact'}}}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "execution_count": 6
+  },
+  {
+   "cell_type": "code",
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2026-06-11T07:46:42.710197Z",
+     "start_time": "2026-06-11T07:46:42.706882Z"
+    }
+   },
+   "source": [
+    "def _to_value_unit(v):\n",
+    "    if isinstance(v, dict):\n",
+    "        return v.get(\"value\"), v.get(\"unit\")\n",
+    "    if isinstance(v, (int, float)):\n",
+    "        return float(v), None\n",
+    "    if isinstance(v, str):\n",
+    "        parts = v.split()\n",
+    "        value = float(parts[0]) if parts else None  # ValueError if the first token is not numeric\n",
+    "        unit = parts[1] if len(parts) > 1 else None\n",
+    "        return value, unit\n",
+    "    return None, None\n",
+    "\n",
+    "\n",
+    "def compare_quantity(gold, extracted, params):\n",
+    "    if _to_value_unit(gold) == _to_value_unit(extracted):\n",
+    "        return ComparatorResult(score=1.0, comparator=\"quantity\")\n",
+    "    return ComparatorResult(score=0.0, comparator=\"quantity\", reason=f\"{gold!r} != {extracted!r}\")\n",
+    "\n",
+    "\n",
+    "register_comparator(\"quantity\", compare_quantity, overwrite=True)\n",
+    "\n",
+    "schema_poly = {\n",
+    "    \"type\": \"object\",\n",
+    "    \"properties\": {\n",
+    "        \"quantity\": {\"type\": [\"string\", \"number\", \"object\"], \"x-eval-compare\": \"quantity\"},\n",
+    "    },\n",
+    "}\n",
+    "annotate_xeval(schema_poly)  # no-op for `quantity` (it already has a comparator)\n",
+    "\n",
+    "show(\"with quantity comparator\", evaluate(GOLD, EXTRACTED, schema_poly))"
+   ],
+   "id": "67606c32592e2451",
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "with quantity comparator\n",
+      " [0] quantity gold={'value': 35, 'unit': 'nm'} extracted='35 nm'              -> match\n",
+      " [1] quantity gold=50                       extracted='50'                 -> match\n",
+      " [2] quantity gold='60 nm'                  extracted='65 nm'              -> mismatch\n",
+      " mean F1=0.67\n"
+     ]
+    }
+   ],
+   "execution_count": 9
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now the two cross-shape-but-equal records match, and the genuinely wrong one\n",
+    "(`60 nm` vs `65 nm`) is still a mismatch. One comparator, every shape covered."
+   ],
+   "id": "93af4a6e90471675"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Validating polymorphic gold\n",
+    "\n",
+    "By default `validate_gold` treats `type` as a hint (mismatches only warn). For a\n",
+    "multi-type field it accepts gold of any declared shape. If you want to *enforce*\n",
+    "that gold is one of the declared types, pass `strict_types=True`:\n",
+    "\n",
+    "```python\n",
+    "from struct_extract_eval.core.validation import validate_gold\n",
+    "\n",
+    "validate_gold(GOLD, schema_poly)\n",
+    "validate_gold(GOLD, schema_poly, strict_types=True)  # raises an error if a gold value is none of string/number/object (null is exempt)\n",
+    "```"
+   ],
+   "id": "6c89913d5cc35ee6"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}