Description
Describe the bug
When two or more QuickUMLS concept matches cover the same token, spaCy raises a ValueError (E103, conflicting doc.ents). The full traceback is reproduced under Additional context.
To Reproduce
Using QuickUMLS version 1.5 or higher, run the following sample. Note that if the matching threshold is set higher (e.g. 1.0) this exception may not occur.
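import spacy
from quickumls.spacy_component import SpacyQuickUMLS

nlp = spacy.load('en_core_web_sm')
# A low threshold makes overlapping matches (and therefore the error) more likely
quickumls_component = SpacyQuickUMLS(nlp, 'PATH_TO_QUICKUMLS_DATA', threshold=0.25)
nlp.add_pipe(quickumls_component)
doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')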
Environment
- OS: Any (@soldni got this in Linux and @burgersmoke got this in Windows)
- QuickUMLS version 1.5 (upcoming)
- UMLS version 2019AA
- spaCy 2.3
Additional context
@soldni originally reported this in this pull request.
The comments are reproduced here:
I was doing some tests and noticed that, in the latest version of spaCy (2.3), I get an error if two entities overlap:
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')
  File "/home/ubuntu/anaconda3/envs/quickumls/lib/python3.7/site-packages/spacy/language.py", line 449, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/ubuntu/qumls_1.4/QuickUMLS/quickumls/spacy_component.py", line 78, in __call__
    doc.ents = list(doc.ents) + [span]
  File "doc.pyx", line 553, in spacy.tokens.doc.Doc.ents.__set__
ValueError: [E103] Trying to set conflicting doc.ents: '(8, 10, 'C0008031')' and '(8, 10, 'C2926613')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
According to this GitHub issue, this seems to be the expected behavior since at least spaCy 2.1.3.
I am planning to solve this by either (a) adding a custom extension called quickumls to docs, or (b) having each span hold a list of matches. Any preference? In the first case, you'd access matches as follows:
for ent in doc._.quickumls:
    print(ent.text, ent.label, ent._.semtypes, ent._.similarity)
In the latter, you'd use the following syntax:
for ent in doc.ents:
    for match in ent._.quickumls:
        print(ent.text, match.cui, match.semtypes, match.similarity)
I personally prefer the second one, as it makes more sense to me to just have an entity with multiple labels, but please do let me know which one you think would make more sense.
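For discussion, here is a rough sketch of the grouping step that option (b) would need: matches that share the same token range are collected into one list, so each range would be set as an entity only once. This is plain Python and not the QuickUMLS implementation. The Match type, the group_matches_by_span helper, the similarity values, and the placeholder CUI C0000000 are all made up for illustration; only the two CUIs on tokens 8 to 10 come from the traceback above.

```python
from typing import Dict, List, NamedTuple, Tuple


class Match(NamedTuple):
    """Hypothetical stand-in for one QuickUMLS match (not the library's own type)."""
    start: int         # index of the first token covered by the match
    end: int           # index one past the last token covered
    cui: str           # UMLS concept identifier
    similarity: float  # match score


def group_matches_by_span(matches: List[Match]) -> Dict[Tuple[int, int], List[Match]]:
    """Collect matches that share the same (start, end) token range into one list."""
    grouped: Dict[Tuple[int, int], List[Match]] = {}
    for match in matches:
        grouped.setdefault((match.start, match.end), []).append(match)
    # Within each token range, put the highest-scoring match first
    for span_matches in grouped.values():
        span_matches.sort(key=lambda m: m.similarity, reverse=True)
    return grouped


matches = [
    Match(8, 10, "C2926613", 0.7),   # similarity values are invented
    Match(8, 10, "C0008031", 0.9),
    Match(11, 12, "C0000000", 1.0),  # placeholder CUI, not a real lookup
]

grouped = group_matches_by_span(matches)
for (start, end), span_matches in grouped.items():
    print(start, end, [m.cui for m in span_matches])
```

In the spaCy component, each key could then become a single span in doc.ents, with the list stored on a custom span attribute (spaCy's Span.set_extension would be one way to register it). Note that this only merges matches with identical boundaries; partially overlapping ranges would still conflict in doc.ents and would need a separate rule.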