This repo defines a Pydantic schema and prompt for extracting structured data from free-text histopathology reports (macroscopy, microscopy, diagnosis, addenda) into a normalized JSON object suitable for indexing and cohort queries.

Contents:
- Pydantic schema for histopathology report extraction
- Simple validation helpers
- Default LLM prompt template

Schema features:
- Multi-specimen reports (parts A/B/C)
- Multiple tumours per specimen
- Multiple IHC results per specimen or addendum
- Addenda, amendments, and corrections
- Normalized enums for key fields (site category, procedure, diagnosis category, margins, invasion)
The root model is HistopathologyReportModel in oncollamaschemav3/oncollamaschemav3.py.
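
To make the nesting concrete, here is a minimal sketch of how such a model hierarchy could be laid out. It is not the actual HistopathologyReportModel: the class names (ReportSketch, SpecimenSketch, DiagnosisSketch) are hypothetical, only a handful of fields from the example output are included, and the enum lists only the one value that appears in that example.

```python
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel, Field


class DiagnosisCategory(str, Enum):
    # Assumption: only the value seen in the example output is listed here;
    # the real enum presumably has more members.
    NON_NEOPLASTIC = "non_neoplastic"


class DiagnosisSketch(BaseModel):
    diagnosis_name_desc: str
    diagnosis_category: DiagnosisCategory
    diagnosis_desc: Optional[str] = None
    is_primary_diagnosis: bool = False


class SpecimenSketch(BaseModel):
    # One entry per specimen part (A/B/C); part_id is null for single-part reports.
    part_id: Optional[str] = None
    site_category: Optional[str] = None
    site_desc: Optional[str] = None
    laterality: Optional[str] = None
    procedure_type: Optional[str] = None
    diagnoses: List[DiagnosisSketch] = Field(default_factory=list)
    # Several tumours or IHC results may hang off one specimen; left untyped here.
    tumours: Optional[List[dict]] = None
    ihc_results: Optional[List[dict]] = None


class ReportSketch(BaseModel):
    document_is_histopathology_report_flag: bool
    report_contains_pathology_diagnosis_flag: bool
    specimens: List[SpecimenSketch] = Field(default_factory=list)
    addenda: Optional[List[dict]] = None


# Nested dicts are coerced into the nested models on construction.
report = ReportSketch(**{
    "document_is_histopathology_report_flag": True,
    "report_contains_pathology_diagnosis_flag": True,
    "specimens": [{
        "site_category": "oral_cavity",
        "site_desc": "left lower gingiva",
        "laterality": "left",
        "procedure_type": "biopsy",
        "diagnoses": [{
            "diagnosis_name_desc": "Features consistent with an amalgam tattoo",
            "diagnosis_category": "non_neoplastic",
            "is_primary_diagnosis": True,
        }],
    }],
})

primary = report.specimens[0].diagnoses[0]
print(primary.diagnosis_category.value)
```

Keeping free-text fields (the *_desc names) next to the enum fields means a cohort query can filter on the normalized value while the original wording stays available for review.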
Input:
Clinical information
Pigmented lesion involving the oral mucosa.
Macroscopic description
Biopsy from the left lower gingiva comprising a cream-to-tan soft tissue fragment measuring 4 x 3 x 2 mm.
Microscopic description
Oral squamous mucosa is present with coarse pigmented material in the subepithelial connective tissue. No dysplasia or malignancy.
Conclusion
Oral mucosal biopsy: Features consistent with an amalgam tattoo.
Output:
{
  "document_is_histopathology_report_flag": true,
  "report_contains_pathology_diagnosis_flag": true,
  "report_metadata": null,
  "sections": {
    "clinical_information_desc": "Pigmented lesion involving the oral mucosa.",
    "macroscopic_description_desc": "Biopsy from the left lower gingiva comprising a cream-to-tan soft tissue fragment measuring 4 x 3 x 2 mm.",
    "microscopic_description_desc": "Oral squamous mucosa is present with coarse pigmented material in the subepithelial connective tissue. No dysplasia or malignancy.",
    "formatted_conclusion_desc": "Oral mucosal biopsy: Features consistent with an amalgam tattoo."
  },
  "specimens": [
    {
      "part_id": null,
      "specimen_label_desc": "left lower gingiva",
      "site_category": "oral_cavity",
      "site_desc": "left lower gingiva",
      "laterality": "left",
      "procedure_type": "biopsy",
      "procedure_desc": "biopsy",
      "specimen_size_desc": "4 x 3 x 2 mm",
      "macroscopic_desc": "Biopsy from the left lower gingiva comprising a cream-to-tan soft tissue fragment measuring 4 x 3 x 2 mm.",
      "microscopic_desc": "Oral squamous mucosa is present with coarse pigmented material in the subepithelial connective tissue. No dysplasia or malignancy.",
      "diagnoses": [
        {
          "diagnosis_name_desc": "Features consistent with an amalgam tattoo",
          "diagnosis_category": "non_neoplastic",
          "diagnosis_desc": "Features consistent with an amalgam tattoo.",
          "is_primary_diagnosis": true
        }
      ],
      "tumours": null,
      "ihc_results": null,
      "lymph_node_findings": null
    }
  ],
  "overall_diagnoses": null,
  "addenda": null
}

Use oncollamaschemav3/validate.py to validate JSON against the schema.
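
The interface of validate.py is not shown here, so the following is only a sketch of what validating model output against a Pydantic schema typically involves. MiniReport, MiniSpecimen and validate_report_json are hypothetical stand-ins, not names from this repo; in practice the root model would take the place of MiniReport.

```python
import json
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class MiniSpecimen(BaseModel):
    # Hypothetical two-field stand-in for the real specimen model.
    site_desc: Optional[str] = None
    procedure_type: Optional[str] = None


class MiniReport(BaseModel):
    # Hypothetical stand-in for the root model.
    document_is_histopathology_report_flag: bool
    specimens: List[MiniSpecimen]


def validate_report_json(raw: str):
    """Return (report, errors): a model and an empty list, or None and messages."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return None, ["top-level JSON value must be an object"]
    try:
        return MiniReport(**data), []
    except ValidationError as exc:
        messages = []
        for err in exc.errors():
            # err["loc"] is a tuple path such as ("specimens", 0, "site_desc").
            path = ".".join(str(part) for part in err["loc"])
            messages.append(f"{path}: {err['msg']}")
        return None, messages


good = '{"document_is_histopathology_report_flag": true, "specimens": [{"site_desc": "left lower gingiva"}]}'
bad = '{"document_is_histopathology_report_flag": true}'

report, errors = validate_report_json(good)
missing_report, missing_errors = validate_report_json(bad)
print(errors)
print(missing_errors)
```

Returning the error paths instead of raising makes it easier to log which field an LLM response got wrong and to decide whether to retry the extraction.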
The default prompt template is in oncollamaschemav3/prompts/infer_prompt.txt. Use create_system_prompt() to inject the schema into the prompt.
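
As an illustration of what injecting a schema into a prompt can mean, here is a minimal sketch using only the standard library. It does not reproduce create_system_prompt(), whose signature is not documented here: render_system_prompt, the {schema} placeholder, the template text and the tiny schema fragment are all assumptions made for the example.

```python
import json

# Hypothetical template text; the real one lives in prompts/infer_prompt.txt.
TEMPLATE = (
    "Extract the report into JSON that conforms to this schema:\n"
    "{schema}\n"
    "Return JSON only."
)

# Hand-written JSON Schema fragment standing in for the full schema.
SCHEMA = {
    "type": "object",
    "properties": {
        "document_is_histopathology_report_flag": {"type": "boolean"},
        "specimens": {"type": ["array", "null"]},
    },
    "required": ["document_is_histopathology_report_flag"],
}


def render_system_prompt(template: str, schema: dict, placeholder: str = "{schema}") -> str:
    """Substitute the pretty-printed schema for the placeholder in the template."""
    if placeholder not in template:
        raise ValueError(f"template has no {placeholder!r} placeholder")
    # str.replace is used instead of str.format so that any other literal
    # braces in the template (for example JSON snippets) are left untouched.
    return template.replace(placeholder, json.dumps(schema, indent=2))


prompt = render_system_prompt(TEMPLATE, SCHEMA)
print(prompt)
```

With Pydantic v2 the schema dict could come from the model's model_json_schema() method rather than being written by hand; whether the repo does it that way is not stated here.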