Merge branch '157-add_minidocs_from_segments' into 'main'
Resolve "Create minidocs from an annotated corpus"

Closes #157

See merge request heka/medkit!174

changelog: Resolve "Create minidocs from an annotated corpus"
ghisvail committed Nov 27, 2023
2 parents daa7eaa + 37f4a45 commit 90bbe43
Showing 10 changed files with 637 additions and 3 deletions.
1 change: 1 addition & 0 deletions docs/_toc.yml
@@ -27,6 +27,7 @@ parts:
sections:
- file: examples/text_segmentation/section
- file: examples/text_segmentation/syntagma
- file: examples/text_segmentation/document
- file: examples/brat_io
- file: examples/spacy_io
- file: examples/custom_text_operation
Expand Down
8 changes: 8 additions & 0 deletions docs/api/text.md
@@ -90,6 +90,8 @@ coder normalizer
- Translation operation relying on [HuggingFace transformers](https://huggingface.co/docs/transformers/) models
* - {mod}`AttributeDuplicator<medkit.text.postprocessing.attribute_duplicator>`
- Propagation of attributes based on annotation spans
* - {mod}`DocumentSplitter<medkit.text.postprocessing.document_splitter>`
- A component to divide a text document into smaller documents, using its segments as references
:::

## Pre-processing modules
@@ -554,6 +556,12 @@ For the moment, you can use this module to:
- duplicate attributes between segments. For example, you can duplicate an attribute from a sentence to its entities.

- filter overlapping entities: useful when creating named entity recognition (NER) datasets
- create mini-documents from a {class}`~.core.text.TextDocument`.


```{admonition} Examples
Creating mini-documents from sections: [document splitter](../examples/text_segmentation/document.md)
```
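The overlap-filtering step listed above can be illustrated with a minimal, hypothetical sketch (plain tuples instead of medkit `Entity` objects; this is the general idea, not medkit's actual implementation): when two entity spans overlap, keep the longer one.

```python
# Hypothetical sketch of overlap filtering on (start, end, label) spans.
# Longer entities win conflicts; medkit's real filter_overlapping_entities
# operates on Entity objects and may break ties differently.

def filter_overlapping(entities):
    # Sort longest-first so longer entities claim their span before
    # shorter overlapping ones are considered
    kept = []
    for start, end, label in sorted(entities, key=lambda e: e[0] - e[1]):
        # Keep this span only if it is disjoint from every kept span
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

entities = [(0, 9, "problem"), (4, 12, "treatment"), (20, 27, "treatment")]
print(filter_overlapping(entities))
# → [(0, 9, 'problem'), (20, 27, 'treatment')]
```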

:::{note}
For more details about public API, refer to {mod}`~.text.postprocessing`.
114 changes: 114 additions & 0 deletions docs/examples/text_segmentation/document.md
@@ -0,0 +1,114 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.14.5
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---


# Document splitter

+++

This tutorial shows how to split a document into smaller documents, using its sections as a reference.

```{seealso}
We combine some operations like **section tokenizer**, **regexp matcher** and **custom operation**. Please see the other examples for more information.
```
+++

## Adding annotations in a document

Let's detect the sections and add some annotations using medkit operations.

```{code-cell} ipython3
# You can download the file available in source code
# !wget https://raw.githubusercontent.com/TeamHeka/medkit/main/docs/data/text/1.txt
from pathlib import Path
from medkit.core.text import TextDocument
doc = TextDocument.from_file(Path("../../data/text/1.txt"))
print(doc.text)
```
**Defining the operations**

```{code-cell} ipython3
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.segmentation import SectionTokenizer
# Define a section tokenizer
# The section tokenizer uses a dictionary with keywords to identify sections
section_dict = {
"patient": ["SUBJECTIF"],
"traitement": ["MÉDICAMENTS", "PLAN"],
"allergies": ["ALLERGIES"],
"examen clinique": ["EXAMEN PHYSIQUE"],
"diagnostique": ["EVALUATION"],
}
section_tokenizer = SectionTokenizer(section_dict=section_dict)
# Define a NER operation to create 'problem', and 'treatment' entities
regexp_rules = [
RegexpMatcherRule(regexp=r"\ballergies\b", label="problem"),
RegexpMatcherRule(regexp=r"\basthme\b", label="problem"),
RegexpMatcherRule(regexp=r"\ballegra\b", label="treatment", case_sensitive=False),
RegexpMatcherRule(regexp=r"\bvaporisateurs\b", label="treatment"),
RegexpMatcherRule(regexp=r"\bloratadine\b", label="treatment", case_sensitive=False),
RegexpMatcherRule(regexp=r"\bnasonex\b", label="treatment", case_sensitive=False),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)
```

We can now annotate the document

```{code-cell} ipython3
# Detect annotations
sections = section_tokenizer.run([doc.raw_segment])
entities = regexp_matcher.run([doc.raw_segment])
# Annotate
for ann in sections + entities:
doc.anns.add(ann)
print(f"The document contains {len(sections)} sections and {len(entities)} entities\n")
```

## Split the document by sections

Once annotated, we can use the medkit operation {class}`~medkit.text.postprocessing.DocumentSplitter` to create smaller versions of the document using the sections.

By default, `entity_labels`, `attr_labels`, and `relation_labels` are set to `None`, so all annotations are included in the resulting documents. You can restrict the selection by passing the labels of the annotations to keep.

```{code-cell} ipython3
from medkit.text.postprocessing import DocumentSplitter
doc_splitter = DocumentSplitter(
    segment_label="section",                 # segments of reference
    entity_labels=["treatment", "problem"],  # entities to include
    attr_labels=[],                          # without attributes
    relation_labels=[],                      # without relations
)
new_docs = doc_splitter.run([doc])
print(f"The document was divided into {len(new_docs)} documents\n")
```

Each new document contains the entities and attributes of its source segment. Below, we visualize the new documents using the displacy utilities.

```{code-cell} ipython3
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy
options_displacy = dict(colors={'treatment': "#85C1E9", "problem": "#ff6961"})
for new_doc in new_docs:
print(f"New document from the section called '{new_doc.metadata['name']}'")
# convert new document to displacy
displacy_data = medkit_doc_to_displacy(new_doc)
displacy.render(displacy_data, manual=True, style="ent", options=options_displacy)
```

2 changes: 2 additions & 0 deletions medkit/text/postprocessing/__init__.py
@@ -1,9 +1,11 @@
__all__ = [
"AttributeDuplicator",
"compute_nested_segments",
"DocumentSplitter",
"filter_overlapping_entities",
]

from .alignment_utils import compute_nested_segments
from .attribute_duplicator import AttributeDuplicator
from .document_splitter import DocumentSplitter
from .overlapping import filter_overlapping_entities
5 changes: 3 additions & 2 deletions medkit/text/postprocessing/alignment_utils.py
@@ -38,6 +38,7 @@ def compute_nested_segments(
source_segments: List[Segment], target_segments: List[Segment]
) -> List[Tuple[Segment, List[Segment]]]:
"""Return source segments aligned with their nested segments.
Only nested segments fully contained in the `source_segments` are returned.
Parameters
----------
@@ -58,8 +59,8 @@

if not normalized_spans:
continue

start, end = normalized_spans[0].start, normalized_spans[-1].end
children = [child.data for child in tree.overlap(start, end)]
# use 'tree.envelop' to get only fully contained children
children = [child.data for child in tree.envelop(start, end)]
nested.append((parent, children))
return nested
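The switch from `tree.overlap` to `tree.envelop` restricts the matched children to those fully contained in the parent span, rather than any that merely intersect it. A minimal stdlib sketch of the difference (hypothetical helper names; the real code relies on an interval tree):

```python
# Hypothetical sketch: overlap vs. full containment for (start, end) spans,
# mirroring the semantics of intervaltree's overlap() vs. envelop().

def overlapping(children, start, end):
    # Any child that intersects [start, end) at all
    return [c for c in children if c[0] < end and c[1] > start]

def enveloped(children, start, end):
    # Only children fully contained in [start, end)
    return [c for c in children if c[0] >= start and c[1] <= end]

children = [(0, 5), (3, 12), (6, 9)]
print(overlapping(children, 4, 10))  # all three spans intersect [4, 10)
print(enveloped(children, 4, 10))    # → [(6, 9)], the only fully nested span
```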
