Commit 0dda67b

Fixing issue #60 to ensure that SpacyQuickUMLS cannot add entity spans which overlap on a token. Also added some documentation to the class and README.
1 parent 96fcf59 commit 0dda67b

2 files changed: +23 -2 lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ If the matcher throws a warning during initialization, read [this page](https://
 
 ## spaCy pipeline component
 
-QuickUMLS can be used for standalone processing but it can also be use as a component in a modular spaCy pipeline. This follows traditional spaCy handling of concepts to be entity objects added to the Document object. These entity objects contain the CUI, similarity score and Semantic Types in the spacy "underscore" object.
+QuickUMLS can be used for standalone processing but it can also be use as a component in a modular spaCy pipeline. This follows traditional spaCy handling of concepts to be entity objects added to the Document object. These entity objects contain the CUI, similarity score and Semantic Types in the spacy "underscore" object. Note that this implementation follows a [known spacy convention](https://github.com/explosion/spaCy/issues/3608) that entity Spans cannot overlap on a single token. To prevent token overlap, matches are ranked according to the `overlapping_criteria` supplied so that overlap of any tokens will be prioritized by this order.
 
 Adding QuickUMLS as a component in a pipeline can be done as follows:
 
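The README note in this hunk describes ranking matches and letting higher-ranked matches claim tokens first. Below is a minimal sketch of that idea in plain Python. It assumes each match is a dict with token offsets `start` (inclusive) and `end` (exclusive) plus a `similarity` score, and that `overlapping_criteria` takes the values `"score"` or `"length"`. The helper name `select_non_overlapping`, the field names, and the tie-breaking rules are illustrative assumptions, not the ranking code QuickUMLS actually runs.

```python
def select_non_overlapping(matches, overlapping_criteria="score"):
    """Greedily keep the highest-ranked matches whose tokens are still free.

    Hypothetical helper: the dict fields and tie-breaking are assumptions
    made for illustration, not the library's internals.
    """
    if overlapping_criteria == "score":
        # prefer higher similarity, break ties with the longer span
        rank = lambda m: (m["similarity"], m["end"] - m["start"])
    elif overlapping_criteria == "length":
        # prefer the longer span, break ties with higher similarity
        rank = lambda m: (m["end"] - m["start"], m["similarity"])
    else:
        raise ValueError(f"unknown criteria: {overlapping_criteria!r}")

    taken = set()  # token indexes already claimed by a kept match
    kept = []
    for match in sorted(matches, key=rank, reverse=True):
        tokens = set(range(match["start"], match["end"]))
        if taken & tokens:
            continue  # shares at least one token with a better-ranked match
        taken |= tokens
        kept.append(match)
    # return the survivors in document order
    return sorted(kept, key=lambda m: m["start"])


candidates = [
    {"term": "heart", "start": 0, "end": 1, "similarity": 1.0},
    {"term": "heart attack", "start": 0, "end": 2, "similarity": 0.9},
    {"term": "aspirin", "start": 4, "end": 5, "similarity": 1.0},
]
by_score = [m["term"] for m in select_non_overlapping(candidates, "score")]
by_length = [m["term"] for m in select_non_overlapping(candidates, "length")]
```

With these toy candidates, ranking by score keeps "heart" and drops "heart attack", while ranking by length does the reverse. In both cases no token ends up in more than one kept match.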

quickumls/spacy_component.py

Lines changed: 22 additions & 1 deletion
@@ -13,12 +13,13 @@ def __init__(self, nlp, quickumls_fp, best_match=True, ignore_syntax=False, **kw
 
         This creates a QuickUMLS spaCy component which can be used in modular pipelines.
         This module adds entity Spans to the document where the entity label is the UMLS CUI and the Span's "underscore" object is extended to contains "similarity" and "semtypes" for matched concepts.
+        Note that this implementation follows and enforces a known spacy convention that entity Spans cannot overlap on a single token.
 
         Args:
             nlp: Existing spaCy pipeline. This is needed to update the vocabulary with UMLS CUI values
             quickumls_fp (str): Path to QuickUMLS data
             best_match (bool, optional): Whether to return only the top match or all overlapping candidates. Defaults to True.
-            ignore_syntax (bool, optional): Wether to use the heuristcs introduced in the paper (Soldaini and Goharian, 2016). TODO: clarify,. Defaults to False
+            ignore_syntax (bool, optional): Whether to use the heuristcs introduced in the paper (Soldaini and Goharian, 2016). TODO: clarify,. Defaults to False
             **kwargs: QuickUMLS keyword arguments (see QuickUMLS in core.py)
         """
 

@@ -43,6 +44,15 @@ def __call__(self, doc):
         # pass in the document which has been parsed to this point in the pipeline for ngrams and matches
         matches = self.quickumls._match(doc, best_match=self.best_match, ignore_syntax=self.ignore_syntax)
 
+        # NOTE: Spacy spans do not allow overlapping tokens, so we prevent the overlap here
+        # For more information, see: https://github.com/explosion/spaCy/issues/3608
+        tokens_in_ents_set = set()
+
+        # let's track any other entities which may have been attached via upstream components
+        for ent in doc.ents:
+            for token_index in range(ent.start, ent.end):
+                tokens_in_ents_set.add(token_index)
+
         # Convert QuickUMLS match objects into Spans
         for match in matches:
             # each match may match multiple ngrams
@@ -59,6 +69,17 @@ def __call__(self, doc):
                 # char_span() creates a Span from these character indices
                 # UMLS CUI should work well as the label here
                 span = doc.char_span(start_char_idx, end_char_idx, label = cui_label_value)
+
+                # before we add this, let's make sure that this entity does not overlap any tokens added thus far
+                candidate_token_indexes = set(range(span.start, span.end))
+
+                # check the intersection and skip this if there is any overlap
+                if len(tokens_in_ents_set.intersection(candidate_token_indexes)) > 0:
+                    continue
+
+                # track this to make sure we do not introduce overlap later
+                tokens_in_ents_set.update(candidate_token_indexes)
+
                 # add some custom metadata to the spans
                 span._.similarity = ngram_match_dict['similarity']
                 span._.semtypes = ngram_match_dict['semtypes']
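
The bookkeeping added in `__call__` can be tried without spaCy or UMLS data. The sketch below stands in for spans with plain `(start, end)` token-index pairs, where `end` is exclusive. It seeds the set from entities assumed to come from upstream components and then filters candidates in the order given. The helper name `add_without_overlap` is hypothetical. The `None` check is an addition that is not in this commit: as far as I know, `doc.char_span` returns `None` when the character offsets do not line up with token boundaries, and reading `span.start` would then fail.

```python
def add_without_overlap(existing_ents, candidates):
    """Return the candidates that share no token with earlier entities.

    Spans are (start, end) token-index pairs with an exclusive end, or None
    when no token-aligned span could be built. Hypothetical helper, written
    only to illustrate the set-based check in the diff above.
    """
    tokens_in_ents = set()
    # seed with tokens already claimed by upstream components
    for start, end in existing_ents:
        tokens_in_ents.update(range(start, end))

    accepted = []
    for span in candidates:
        if span is None:
            continue  # assumption: unaligned spans are simply skipped
        candidate_tokens = set(range(*span))
        # isdisjoint() is True only when no token index is shared
        if not tokens_in_ents.isdisjoint(candidate_tokens):
            continue
        # record the tokens so later candidates cannot reuse them
        tokens_in_ents.update(candidate_tokens)
        accepted.append(span)
    return accepted


upstream = [(2, 4)]  # e.g. an entity set by an earlier pipeline component
proposed = [(0, 2), (1, 3), None, (3, 5), (5, 6)]
accepted = add_without_overlap(upstream, proposed)
```

Because candidates are accepted first come, first served, the order in which matches arrive decides which of two overlapping candidates survives.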
