Skip to content

Allow QuickUMLS to be used as component in spacy pipeline #57

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 3, 2020

Conversation

burgersmoke
Copy link
Contributor

This can be used as an entity matcher for UMLS concepts in other modular spacy pipelines. This is already being used in an operational capacity for syndromic surveillance in a healthcare system. More info on this can be provided upon request.

This is in response to this issue:
#40

…spacy pipeline. This can be used as an entity matcher for UMLS concepts in other modular spacy pipelines. This is already being used in an operational capacity for syndromic surveillance.
Copy link
Member

@soldni soldni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @burgersmoke,

Thank you so much for this! It looks awesome; I've left a couple minor comments. Could you also add a section in README.md with information about how to use this? personally, I'd move spacy_example_pipeline.py a section at the bottom of the readme file, but I'll leave it up to you.

Thank you again, and looking forward to merging this changes in! This is a feature big enough to roll it out as v.1.5

@burgersmoke
Copy link
Contributor Author

burgersmoke commented Aug 10, 2020

Great. I appreciate all of this feedback since I know this was a rough PR.

One question: You mention moving spacy_example_pipeline.py. Are you proposing that it goes into /quickumls or into another location? Maybe something like /examples?

Nevermind. I did not understand the previous comment, but understood it later. I've moved the example code to the README instead of the example .py file.

Now that I have your feedback, I'll make these changes tonight or Tuesday and update the PR.

@ljak
Copy link

ljak commented Aug 10, 2020

Went through it quickly, seems nice for first iteration ! Thanks for work.

…n QuickUMLS and SpacyQuickUMLS. Fixed keyword arguments to QuickUMLS when creating a spacy component. Added documentation as well. Removing previous standalone example Python file and instead adding an example of a QuickUMLS spacy pipleline to the README.
@burgersmoke
Copy link
Contributor Author

OK, I think this is ready to go now with b06b882. Besides the items above, I also added documentation, fixed some comments and fixed the hard-coded QuickUMLS arguments to be passed in as **kwargs instead. Now I feel comfortable with this PR. :)

@burgersmoke burgersmoke requested a review from soldni August 11, 2020 21:45
@burgersmoke
Copy link
Contributor Author

Any update on this? I believe I've resolved the code issue and remaining questions. Are there any other questions about this change I can answer? I'm wanting to move this forward since we're planning to leverage this to make medspacy and QuickUMLS better with some mutually beneficial upcoming work.

Copy link
Member

@soldni soldni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in reviewing this! The new changes look good to me.

@soldni soldni merged commit c0b5db0 into Georgetown-IR-Lab:master Sep 3, 2020
@burgersmoke
Copy link
Contributor Author

Fantastic. Thank you @soldni !

@soldni
Copy link
Member

soldni commented Sep 3, 2020

@burgersmoke thank you for your hard work! will push out a proper release (v.1.5) momentarily.

@burgersmoke
Copy link
Contributor Author

My pleasure. There's more to come as QuickUMLS is an important part of our plans for medspacy. I'll continue to follow up with you and anyone else interested.

@soldni
Copy link
Member

soldni commented Sep 3, 2020

I was doing some tests and notice that, in the latest version of spaCy (2.3), I get an error if two entities overlap:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    doc = nlp('Pt c/o shortness of breath, chest pain, nausea, vomiting, diarrrhea')
  File "/home/ubuntu/anaconda3/envs/quickumls/lib/python3.7/site-packages/spacy/language.py", line 449, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/ubuntu/qumls_1.4/QuickUMLS/quickumls/spacy_component.py", line 78, in __call__
    doc.ents = list(doc.ents) + [span]
  File "doc.pyx", line 553, in spacy.tokens.doc.Doc.ents.__set__
ValueError: [E103] Trying to set conflicting doc.ents: '(8, 10, 'C0008031')' and '(8, 10, 'C2926613')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

According to this GitHub issue, this seems to be the expected behavior since at least spaCy (2.1.3)

I am planning to solve this by either (a) adding a custom extension type called quickumls to docs, or (b) have each span be a list of matches. Any preference? In the first case, you'd access matches as follows:

for ent in doc._.quickumls:
    print(ent.text, ent.label_, ent._.semtypes, ent._.similarity)

In the latter, you'd use the following syntax:

for ent in doc.ents:
    for match in ent._.quickumls:
        print(ent.text, match.cui, match.semtypes, match.similarity)

I personally prefer the second one, as it makes more sense to me to just have an entity with multiple labels, but please do let me know which one you think would make more sense.

@burgersmoke
Copy link
Contributor Author

You are absolutely correct and I debated raising it earlier, but I wanted to take steps towards the component being included first so that we could have this conversation. I've opened this as a new issue here. I'll add my initial thoughts here:

#60

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants