Skip to content
This repository was archived by the owner on Jul 28, 2025. It is now read-only.

Conversation

@mart-r
Copy link
Collaborator

@mart-r mart-r commented Sep 3, 2024

The underlying issue presented in models sometimes being unable to recognise a concept where the same model would recognise an incorrectly typed name in the exact same context.

A few more details as to how I came onto this issue Tested with a few different models: - [1] The 2022/2023 GSTT/KCH trained model - [2] The AU model (where I first saw the issue) - [3] The 2024-06 GSTT-trained model I ran with 2 separate "documents":
Patient was diagnosed with diabetes based on previous findings

And

Patient was diagnosed with diabetis based on previous findings

(Note the typo of diabetis instead of diabetes in the 2nd).

Some models ([1] and [3]) were able to correctly identify the 2nd (i.e typo'd) version, but not the 1st (i.e correctly typed version).
Other models ([2]) didn't identify either.

Turned out the issue was as follows:

  • The Vocab based NER was checking against 2 name versions for a token
    • The normalised token
    • The lower case token
  • The first of the name versions that was in cdb.snames (the subnames) was used going forward
    • So the normalised token will be checked first, if it is a subname, it'll be used
    • Only if the normalised token was not a subname would the lower case token be checked

This caused the following issue:

  • When looking at diabetes, the normalised name was diabete
  • And this name was in the CDB's subnames
  • As such, it was used as the name of the concept, rather than diabetes itself

This PR provides the following fix:

  • Checks if either of the name variants are in subnames (cdb.snames) or actual concept names (cdb.name2cuis)
  • If a name is in the concept names, it will be used
  • Otherwise the name that was in subnames will be used (if one of them was in subnames)
  • NOTE:
    • Currently preference is on the normalised name
    • I.e if both names are a concept name or a subname, the normalised name is used
    • But perhaps we should do this the other way around? I don't really know, but that was the preference before so I left it the same.

@tomolopolis
Copy link
Member

Copy link
Member

@tomolopolis tomolopolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - great catch. Makes sense to keep with prior implementation preference to use normalised name in both names and snames, but might also be interesting to compare the difference somehow if there's any performance gain with the latter choice.

I'd imagine its pretty edge case so shouldn't be too apparent.

@mart-r mart-r merged commit 394e17b into master Sep 17, 2024
@mart-r mart-r deleted the CU-8695m5q4x-fix-name-versions branch November 18, 2024 16:23
alhendrickson pushed a commit to CogStack/cogstack-nlp that referenced this pull request Jul 1, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants