[BUG] spacy component ValueError: "A token can only be part of one entity" #60

Open
@burgersmoke

Describe the bug
When QuickUMLS concept matches overlap on the same token, spaCy raises a ValueError such as "A token can only be part of one entity".

To Reproduce
Using QuickUMLS version 1.5 or higher, run the following sample. Note that if the matching threshold is set higher (e.g. 1.0) this exception may not occur.


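import spacy
from quickumls.spacy_component import SpacyQuickUMLS

nlp = spacy.load('en_core_web_sm')

# A low threshold (0.25) yields overlapping matches; at 1.0 the error may not occur
quickumls_component = SpacyQuickUMLS(nlp, 'PATH_TO_QUICKUMLS_DATA', threshold=0.25)

nlp.add_pipe(quickumls_component)

# Raises ValueError [E103] when two matches cover the same token
doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')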

Environment

  • OS: Any (@soldni got this in Linux and @burgersmoke got this in Windows)
  • QuickUMLS version 1.5 (upcoming)
  • UMLS version 2019AA
  • spaCy 2.3

Additional context
@soldni originally reported this in this pull request.

The comments are reproduced here:

I was doing some tests and noticed that, in the latest version of spaCy (2.3), I get an error if two entities overlap:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')
  File "/home/ubuntu/anaconda3/envs/quickumls/lib/python3.7/site-packages/spacy/language.py", line 449, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/ubuntu/qumls_1.4/QuickUMLS/quickumls/spacy_component.py", line 78, in __call__
    doc.ents = list(doc.ents) + [span]
  File "doc.pyx", line 553, in spacy.tokens.doc.Doc.ents.__set__
ValueError: [E103] Trying to set conflicting doc.ents: '(8, 10, 'C0008031')' and '(8, 10, 'C2926613')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

According to this GitHub issue, this seems to be the expected behavior since at least spaCy 2.1.3.
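
To make the conflict concrete, here is a minimal sketch in plain Python (no spaCy required) of the condition the E103 message describes: two spans conflict when they share at least one token. The function name `find_token_conflicts`, the tuple layout, and the first span in the sample list are hypothetical; only the two `(8, 10, ...)` spans are taken from the traceback. This is an illustration of the rule, not spaCy's actual implementation.

```python
from itertools import combinations
from typing import List, Tuple

# (start_token, end_token_exclusive, label); layout assumed for illustration
Span = Tuple[int, int, str]


def find_token_conflicts(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return every pair of spans that share at least one token."""
    conflicts = []
    for a, b in combinations(spans, 2):
        # Half-open ranges [start, end) overlap when each starts before the other ends
        if a[0] < b[1] and b[0] < a[1]:
            conflicts.append((a, b))
    return conflicts


spans = [
    (3, 6, "HYPOTHETICAL_CUI"),  # made-up span that overlaps nothing
    (8, 10, "C0008031"),         # from the traceback
    (8, 10, "C2926613"),         # from the traceback, same tokens as the previous span
]

conflicts = find_token_conflicts(spans)
print(conflicts)
```

Any non-empty result here corresponds to a set of spans that could not all be assigned to doc.ents at once.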

I am planning to solve this by either (a) adding a custom extension attribute called quickumls to docs, or (b) having each span carry a list of matches. Any preference? In the first case, you'd access matches as follows:

for ent in doc._.quickumls:
    print(ent.text, ent.label_, ent._.semtypes, ent._.similarity)
In the latter, you'd use the following syntax:

for ent in doc.ents:
    for match in ent._.quickumls:
        print(ent.text, match.cui, match.semtypes, match.similarity)
I personally prefer the second one, as it makes more sense to me to just have an entity with multiple labels, but please do let me know which one you think would make more sense.
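
As a rough sketch of what option (b) implies, the grouping step could look like the following in plain Python: matches that cover exactly the same tokens are collected under one span, so each span would be added to doc.ents only once. The `Match` class, the `group_matches_by_span` helper, the similarity values, and the first match are all hypothetical and not part of QuickUMLS; only the two CUIs on tokens 8 to 10 come from the traceback.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Match:
    """Hypothetical stand-in for one QuickUMLS match."""
    start: int          # first token index
    end: int            # token index one past the last token
    cui: str
    similarity: float   # made-up values below, for illustration only


def group_matches_by_span(matches: List[Match]) -> Dict[Tuple[int, int], List[Match]]:
    """Collect matches that cover exactly the same tokens under one key."""
    grouped = defaultdict(list)
    for m in matches:
        grouped[(m.start, m.end)].append(m)
    # Within each span, list the most similar match first
    for span_matches in grouped.values():
        span_matches.sort(key=lambda m: m.similarity, reverse=True)
    return dict(grouped)


matches = [
    Match(3, 6, "HYPOTHETICAL_CUI", 0.9),
    Match(8, 10, "C2926613", 0.8),
    Match(8, 10, "C0008031", 1.0),
]

grouped = group_matches_by_span(matches)
for (start, end), span_matches in grouped.items():
    print(start, end, [m.cui for m in span_matches])
```

Note that this only merges matches with identical boundaries. Spans that overlap partially (for example tokens 8 to 10 and 9 to 11) would still conflict and would need a separate rule.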
