Merge branch '157-add_minidocs_from_segments' into 'main'
Resolve "Create minidocs from an annotated corpus"

Closes #157

See merge request heka/medkit!174

changelog: Resolve "Create minidocs from an annotated corpus"
ghisvail committed Nov 27, 2023
2 parents daa7eaa + 37f4a45 commit 90bbe43
Showing 10 changed files with 637 additions and 3 deletions.
1 change: 1 addition & 0 deletions docs/_toc.yml
@@ -27,6 +27,7 @@ parts:
sections:
- file: examples/text_segmentation/section
- file: examples/text_segmentation/syntagma
- file: examples/text_segmentation/document
- file: examples/brat_io
- file: examples/spacy_io
- file: examples/custom_text_operation
Expand Down
8 changes: 8 additions & 0 deletions docs/api/text.md
@@ -90,6 +90,8 @@ coder normalizer
- Translation operation relying on [HuggingFace transformers](https://huggingface.co/docs/transformers/) models
* - {mod}`AttributeDuplicator<medkit.text.postprocessing.attribute_duplicator>`
- Propagation of attributes based on annotation spans
* - {mod}`DocumentSplitter<medkit.text.postprocessing.document_splitter>`
- A component to divide a text document into smaller documents, using its segments as references
:::

## Pre-processing modules
@@ -554,6 +556,12 @@ For the moment, you can use this module to:
- duplicate attributes between segments. For example, you can duplicate an attribute from a sentence to its entities.

- filter overlapping entities: useful when creating named entity recognition (NER) datasets
- create mini-documents from a {class}`~.core.text.TextDocument`.


```{admonition} Examples
Creating mini-documents from sections: [document splitter](../examples/text_segmentation/document.md)
```
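The overlap-filtering step listed above can be illustrated with a minimal, hypothetical sketch (plain tuples instead of medkit `Entity` objects; this is the general idea, not medkit's actual implementation): when two entity spans overlap, keep the longer one.

```python
# Hypothetical sketch of overlap filtering on (start, end, label) spans.
# Longer entities win conflicts; medkit's real filter_overlapping_entities
# operates on Entity objects and may break ties differently.

def filter_overlapping(entities):
    # Sort longest-first so longer entities claim their span before
    # shorter overlapping ones are considered
    kept = []
    for start, end, label in sorted(entities, key=lambda e: e[0] - e[1]):
        # Keep this span only if it is disjoint from every kept span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

entities = [(0, 9, "problem"), (4, 12, "treatment"), (20, 27, "treatment")]
print(filter_overlapping(entities))
# → [(0, 9, 'problem'), (20, 27, 'treatment')]
```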

:::{note}
For more details about public API, refer to {mod}`~.text.postprocessing`.
114 changes: 114 additions & 0 deletions docs/examples/text_segmentation/document.md
@@ -0,0 +1,114 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.14.5
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---


# Document splitter

+++

This tutorial shows how to split a document into smaller documents, using its sections as a reference.

```{seealso}
We combine some operations like **section tokenizer**, **regexp matcher** and **custom operation**. Please see the other examples for more information.
```
+++

## Adding annotations in a document

Let's detect the sections and add some annotations using medkit operations.

```{code-cell} ipython3
# You can download the file available in source code
# !wget https://raw.githubusercontent.com/TeamHeka/medkit/main/docs/data/text/1.txt
from pathlib import Path
from medkit.core.text import TextDocument
doc = TextDocument.from_file(Path("../../data/text/1.txt"))
print(doc.text)
```
**Defining the operations**

```{code-cell} ipython3
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.segmentation import SectionTokenizer
# Define a section tokenizer
# The section tokenizer uses a dictionary with keywords to identify sections
section_dict = {
"patient": ["SUBJECTIF"],
"traitement": ["MÉDICAMENTS", "PLAN"],
"allergies": ["ALLERGIES"],
"examen clinique": ["EXAMEN PHYSIQUE"],
"diagnostique": ["EVALUATION"],
}
section_tokenizer = SectionTokenizer(section_dict=section_dict)
# Define a NER operation to create 'problem', and 'treatment' entities
regexp_rules = [
RegexpMatcherRule(regexp=r"\ballergies\b", label="problem"),
RegexpMatcherRule(regexp=r"\basthme\b", label="problem"),
RegexpMatcherRule(regexp=r"\ballegra\b", label="treatment", case_sensitive=False),
RegexpMatcherRule(regexp=r"\bvaporisateurs\b", label="treatment"),
RegexpMatcherRule(regexp=r"\bloratadine\b", label="treatment", case_sensitive=False),
RegexpMatcherRule(regexp=r"\bnasonex\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)
```

We can now annotate the document

```{code-cell} ipython3
# Detect annotations
sections = section_tokenizer.run([doc.raw_segment])
entities = regexp_matcher.run([doc.raw_segment])
# Annotate
for ann in sections + entities:
doc.anns.add(ann)
print(f"The document contains {len(sections)} sections and {len(entities)} entities\n")
```

## Split the document by sections

Once annotated, we can use the medkit operation {class}`~medkit.text.postprocessing.DocumentSplitter` to create smaller versions of the document using the sections.

By default, `entity_labels`, `attr_labels`, and `relation_labels` are set to `None`, so all annotations are included in the resulting documents. You can restrict the selection by passing the labels of the annotations to keep.

```{code-cell} ipython3
from medkit.text.postprocessing import DocumentSplitter
doc_splitter = DocumentSplitter(
    segment_label="section",                 # segments of reference
    entity_labels=["treatment", "problem"],  # entities to include
    attr_labels=[],                          # without attributes
    relation_labels=[],                      # without relations
)
new_docs = doc_splitter.run([doc])
print(f"The document was divided into {len(new_docs)} documents\n")
```

Each new document contains the entities and attributes of its source segment. Below, we visualize the new documents using the displacy utilities.

```{code-cell} ipython3
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy
options_displacy = dict(colors={'treatment': "#85C1E9", "problem": "#ff6961"})
for new_doc in new_docs:
print(f"New document from the section called '{new_doc.metadata['name']}'")
# convert new document to displacy
displacy_data = medkit_doc_to_displacy(new_doc)
displacy.render(displacy_data, manual=True, style="ent", options=options_displacy)
```

2 changes: 2 additions & 0 deletions medkit/text/postprocessing/__init__.py
@@ -1,9 +1,11 @@
__all__ = [
"AttributeDuplicator",
"compute_nested_segments",
"DocumentSplitter",
"filter_overlapping_entities",
]

from .alignment_utils import compute_nested_segments
from .attribute_duplicator import AttributeDuplicator
from .document_splitter import DocumentSplitter
from .overlapping import filter_overlapping_entities
5 changes: 3 additions & 2 deletions medkit/text/postprocessing/alignment_utils.py
@@ -38,6 +38,7 @@ def compute_nested_segments(
source_segments: List[Segment], target_segments: List[Segment]
) -> List[Tuple[Segment, List[Segment]]]:
"""Return source segments aligned with their nested segments.
Only nested segments fully contained in the `source_segments` are returned.
Parameters
----------
@@ -58,8 +59,8 @@

if not normalized_spans:
continue

start, end = normalized_spans[0].start, normalized_spans[-1].end
children = [child.data for child in tree.overlap(start, end)]
# use 'tree.envelop' to get only fully contained children
children = [child.data for child in tree.envelop(start, end)]
nested.append((parent, children))
return nested
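The switch from `tree.overlap` to `tree.envelop` restricts the matched children to those fully contained in the parent span, rather than any that merely intersect it. A minimal stdlib sketch of the difference (hypothetical helper names; the real code relies on an interval tree):

```python
# Hypothetical sketch: overlap vs. full containment for (start, end) spans,
# mirroring the semantics of intervaltree's overlap() vs. envelop().

def overlapping(children, start, end):
    # Any child that intersects [start, end) at all
    return [c for c in children if c[0] < end and c[1] > start]

def enveloped(children, start, end):
    # Only children fully contained in [start, end)
    return [c for c in children if c[0] >= start and c[1] <= end]

children = [(0, 5), (3, 12), (6, 9)]
print(overlapping(children, 4, 10))  # all three spans intersect [4, 10)
print(enveloped(children, 4, 10))    # → [(6, 9)], the only fully nested span
```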
