Fixing issue #60 to ensure that SpacyQuickUMLS cannot add entity spans which overlap on a token. Also added some documentation to the class and README.
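
The overlap rule described above can be sketched without spaCy. This is a simplified illustration, not the component's actual code: `filter_overlapping`, the `(start, end, label)` tuples, and the `CUI_A`-style labels are hypothetical stand-ins, and candidates are assumed to arrive ordered best-first.

```python
def filter_overlapping(candidates, occupied=None):
    """Keep candidates in order, skipping any that share a token with an earlier one.

    candidates: iterable of (start, end, label) token spans, end exclusive,
                assumed (for this sketch) to be ordered best-first.
    occupied:   optional token indexes already claimed by upstream entities.
    """
    taken = set(occupied or ())
    kept = []
    for start, end, label in candidates:
        tokens = set(range(start, end))
        # Skip this candidate if any of its tokens is already claimed.
        if not taken.isdisjoint(tokens):
            continue
        # Claim the tokens so later candidates cannot overlap them.
        taken.update(tokens)
        kept.append((start, end, label))
    return kept


# Hypothetical candidates: the second shares token 2 with the first.
spans = [(0, 3, "CUI_A"), (2, 4, "CUI_B"), (4, 6, "CUI_C")]
result = filter_overlapping(spans)
result_with_upstream = filter_overlapping(spans, occupied={4})
```

Here `result` keeps the first and third candidates, while `result_with_upstream` keeps only the first because token 4 is treated as already belonging to an upstream entity.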

burgersmoke · burgersmoke · commit 0dda67b4cef8 · 2020-09-03T16:13:32.000-06:00
diff --git a/README.md b/README.md
@@ -54,7 +54,7 @@ If the matcher throws a warning during initialization, read [this page](https://
 
 ## spaCy pipeline component
 
-QuickUMLS can be used for standalone processing but it can also be use as a component in a modular spaCy pipeline.  This follows traditional spaCy handling of concepts to be entity objects added to the Document object.  These entity objects contain the CUI, similarity score and Semantic Types in the spacy "underscore" object.
+QuickUMLS can be used for standalone processing, but it can also be used as a component in a modular spaCy pipeline.  This follows the usual spaCy approach of representing concepts as entity objects added to the Document object.  These entity objects contain the CUI, similarity score and Semantic Types in the spaCy "underscore" object.  Note that this implementation follows a [known spaCy convention](https://github.com/explosion/spaCy/issues/3608) that entity Spans cannot overlap on a single token. To prevent token overlap, matches are ranked according to the supplied `overlapping_criteria`, and when two matches share a token, only the higher-ranked match is added as an entity.
 
 Adding QuickUMLS as a component in a pipeline can be done as follows:
 
diff --git a/quickumls/spacy_component.py b/quickumls/spacy_component.py
@@ -13,12 +13,13 @@ def __init__(self, nlp, quickumls_fp, best_match=True, ignore_syntax=False, **kw
 
             This creates a QuickUMLS spaCy component which can be used in modular pipelines.  
             This module adds entity Spans to the document where the entity label is the UMLS CUI and the Span's "underscore" object is extended to contains "similarity" and "semtypes" for matched concepts.
+            Note that this implementation follows and enforces a known spaCy convention that entity Spans cannot overlap on a single token.
 
         Args:
             nlp: Existing spaCy pipeline.  This is needed to update the vocabulary with UMLS CUI values
             quickumls_fp (str): Path to QuickUMLS data
             best_match (bool, optional): Whether to return only the top match or all overlapping candidates. Defaults to True.
-            ignore_syntax (bool, optional): Wether to use the heuristcs introduced in the paper (Soldaini and Goharian, 2016). TODO: clarify,. Defaults to False
+            ignore_syntax (bool, optional): Whether to use the heuristics introduced in the paper (Soldaini and Goharian, 2016). TODO: clarify. Defaults to False.
             **kwargs: QuickUMLS keyword arguments (see QuickUMLS in core.py)
         """
         
@@ -43,6 +44,15 @@ def __call__(self, doc):
         # pass in the document which has been parsed to this point in the pipeline for ngrams and matches
         matches = self.quickumls._match(doc, best_match=self.best_match, ignore_syntax=self.ignore_syntax)
         
+        # NOTE: spaCy entity spans cannot share tokens, so we prevent the overlap here
+        # For more information, see: https://github.com/explosion/spaCy/issues/3608
+        tokens_in_ents_set = set()
+        
+        # let's track any other entities which may have been attached via upstream components
+        for ent in doc.ents:
+            for token_index in range(ent.start, ent.end):
+                tokens_in_ents_set.add(token_index)
+        
         # Convert QuickUMLS match objects into Spans
         for match in matches:
             # each match may match multiple ngrams
@@ -59,6 +69,17 @@ def __call__(self, doc):
                 # char_span() creates a Span from these character indices
                 # UMLS CUI should work well as the label here
                 span = doc.char_span(start_char_idx, end_char_idx, label = cui_label_value)
+                
+                # before we add this, let's make sure that this entity does not overlap any tokens added thus far
+                candidate_token_indexes = set(range(span.start, span.end))
+                
+                # check the intersection and skip this if there is any overlap
+                if not tokens_in_ents_set.isdisjoint(candidate_token_indexes):
+                    continue
+                    
+                # track this to make sure we do not introduce overlap later
+                tokens_in_ents_set.update(candidate_token_indexes)
+                
                 # add some custom metadata to the spans
                 span._.similarity = ngram_match_dict['similarity']
                 span._.semtypes = ngram_match_dict['semtypes']