Goal: Run the full ICD-11 E2E evaluation workflow on a prepared validation/sample dataset and produce a repeatable evaluation of re-ranked scores and AI Assistant outputs. This is a milestone toward preparation for the May WHO/HL7 ICD-11 connectathon. The showcase will run before Sunny's post-vacation Mapper changes land.
Demo Driver: Filipe / Joe
Steps (proposed):
- Load the prepared ICD-11 validation dataset (Tracker 55) — a curated set of CIEL concepts with known ICD-11 ground-truth mappings
- Run the full multi-algorithm matching pipeline:
  - OCL CIEL Bridge
  - WHO ICD-11 automatch (Tracker 51)
  - Filipe's LLM-as-terminologist (Tracker 52)
  - Re-ranker: BAAI/bge-reranker-v2-m3
  - OCL AI Assistant (Claude, with ICD-11-tailored prompts from Tracker 4)
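The pooling-and-re-ranking step above can be sketched in plain Python. Everything here is an assumption for illustration: the function names, data shapes, and the toy scorer are hypothetical, and in the real pipeline the scores would come from BAAI/bge-reranker-v2-m3 applied to (source term, candidate) pairs.

```python
# Sketch: pool candidate mappings from several matchers, then re-rank them.
# The scorer below is a toy stand-in for the bge-reranker-v2-m3 model.

def pool_candidates(per_algorithm):
    """Merge per-algorithm candidate lists, remembering who proposed what."""
    pooled = {}
    for algo, candidates in per_algorithm.items():
        for icd11_code in candidates:
            pooled.setdefault(icd11_code, set()).add(algo)
    return pooled

def rerank(source_term, pooled, score_fn):
    """Order pooled candidates by a cross-encoder-style relevance score."""
    return sorted(pooled, key=lambda c: score_fn(source_term, c), reverse=True)

# Hypothetical example data (ICD-11 codes chosen arbitrarily).
scores = {"CA40.0": 0.92, "CA40.1": 0.61, "1B95": 0.12}
per_algorithm = {
    "ciel_bridge": ["CA40.0", "1B95"],
    "who_automatch": ["CA40.0", "CA40.1"],
    "llm_terminologist": ["CA40.1"],
}
pooled = pool_candidates(per_algorithm)
ranked = rerank("Pneumonia", pooled, lambda term, c: scores[c])
print(ranked)            # candidates ordered by re-rank score
print(pooled[ranked[0]])  # which algorithms proposed the top candidate
```

Keeping the provenance set per candidate makes the later agreement/disagreement analysis cheap, since each re-ranked result already records which algorithms proposed it.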
- Export the candidate results and re-ranked scores
- Compare algorithm outputs against the ground-truth mappings:
  - Which algorithm produced the correct top-1 result?
  - Where do algorithms agree vs. disagree?
  - How does the AI Assistant recommendation compare to automatch?
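The comparison questions above could be answered with a small script over the exported top-1 picks. This is a minimal sketch under assumed data shapes (dicts of concept ID to picked ICD-11 code); it is not the real export format.

```python
# Sketch: score each algorithm's top-1 pick against ground truth, and
# measure pairwise top-1 agreement between algorithms.
from itertools import combinations

def top1_correct(top1_by_algo, truth):
    """Per-algorithm count of concepts whose top-1 pick matches ground truth."""
    return {
        algo: sum(1 for cid, pick in picks.items() if pick == truth[cid])
        for algo, picks in top1_by_algo.items()
    }

def pairwise_agreement(top1_by_algo):
    """For each algorithm pair, fraction of shared concepts where top-1 picks agree."""
    out = {}
    for a, b in combinations(top1_by_algo, 2):
        shared = top1_by_algo[a].keys() & top1_by_algo[b].keys()
        agree = sum(1 for cid in shared if top1_by_algo[a][cid] == top1_by_algo[b][cid])
        out[(a, b)] = agree / len(shared) if shared else 0.0
    return out

# Hypothetical two-concept example.
truth = {"ciel:1": "CA40.0", "ciel:2": "5A11"}
top1 = {
    "who_automatch": {"ciel:1": "CA40.0", "ciel:2": "5A10"},
    "ai_assistant": {"ciel:1": "CA40.0", "ciel:2": "5A11"},
}
print(top1_correct(top1, truth))   # automatch correct on 1 concept, assistant on 2
print(pairwise_agreement(top1))    # the two agree on 1 of 2 shared concepts
```

The same pairwise table also surfaces where algorithms disagree, which is exactly the set of concepts worth inspecting in the failure-case review.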
- Document evaluation metrics (e.g. top-1 and top-3 accuracy per algorithm)
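The proposed metrics reduce to top-k accuracy over each algorithm's ranked candidate list. A minimal sketch, with illustrative concept IDs and codes (not real mappings):

```python
# Sketch: top-k accuracy = share of concepts whose ground-truth ICD-11 code
# appears among the top k ranked candidates.

def top_k_accuracy(ranked_by_concept, truth, k):
    """Fraction of concepts with the ground-truth code in the top k candidates."""
    hits = sum(1 for cid, ranked in ranked_by_concept.items()
               if truth[cid] in ranked[:k])
    return hits / len(ranked_by_concept)

# Hypothetical three-concept example.
truth = {"ciel:1": "CA40.0", "ciel:2": "5A11", "ciel:3": "1B95"}
ranked = {
    "ciel:1": ["CA40.0", "CA40.1"],
    "ciel:2": ["5A10", "5A11", "5A12"],
    "ciel:3": ["XN123", "QA02", "2C25"],
}
print(top_k_accuracy(ranked, truth, 1))  # 1 of 3 concepts correct at top-1
print(top_k_accuracy(ranked, truth, 3))  # 2 of 3 correct within top-3
```

Running this once per algorithm (and once for the re-ranked pooled list) yields the per-algorithm table the step calls for.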
- Identify failure cases and document them for iterative improvement
- Share the evaluation summary with the team
Related issues (OpenConceptLab/ocl_issues, all Open):
- #2385
- #2396
- #2395