CU-8695m5q4x: Fix issues detecting 1-token concepts #485

mart-r · 2024-09-03T11:48:02Z

The underlying issue presented in models sometimes being unable to recognise a concept where the same model would recognise an incorrectly typed name in the exact same context.

A few more details as to how I came onto this issue

Tested with a few different models: - [1] The 2022/2023 GSTT/KCH trained model - [2] The AU model (where I first saw the issue) - [3] The 2024-06 GSTT-trained model I ran with 2 separate "documents":

Patient was diagnosed with diabetes based on previous findings

And

Patient was diagnosed with diabetis based on previous findings

(Note the typo of diabetis instead of diabetes in the 2nd).

Some models ([1] and [3]) were able to correctly identify the 2nd (i.e typo'd) version, but not the 1st (i.e correctly typed version).
Other models ([2]) didn't identify either.

Turned out the issue was as follows:

The Vocab based NER was checking against 2 name versions for a token
- The normalised token
- The lower case token
The first of the name versions that was in cdb.snames (the subnames) was used going forward
- So the normalised token will be checked first, if it is a subname, it'll be used
- Only if the normalised token was not a subname would the lower case token be checked

This caused the following issue:

When looking at diabetes, the normalised name was diabete
And this name was in the CDB's subnames
As such, it was used as the name of the concept, rather than diabetes itself

This PR provides the following fix:

Checks if either of the name variants are in subnames (cdb.snames) or actual concept names (cdb.name2cuis)
If a name is in the concept names, it will be used
Otherwise the name that was in subnames will be used (if one of them was in subnames)
NOTE:
- Currently preference is on the normalised name
- I.e if both names are a concept name or a subname, the normalised name is used
- But perhaps we should do this the other way around? I don't really know, but that was the preference before so I left it the same.

tomolopolis · 2024-09-03T11:48:07Z

Task linked: CU-8695m5q4x Fix issue with models sometimes not being able to detect concepts unless there's a typo

tomolopolis

lgtm - great catch. Makes sense to keep with prior implementation preference to use normalised name in both names and snames, but might also be interesting to compare the difference somehow if there's any performance gain with the latter choice.

I'd imagine its pretty edge case so shouldn't be too apparent.

)

CU-8695m5q4x: Fix issues detecting 1-token concepts

e63f5a1

tomolopolis approved these changes Sep 13, 2024

View reviewed changes

mart-r merged commit 394e17b into master Sep 17, 2024

mart-r mentioned this pull request Nov 14, 2024

RobertaTokenizerFast object has no attribute 'split_special_tokens' #501

Closed

mart-r deleted the CU-8695m5q4x-fix-name-versions branch November 18, 2024 16:23

alhendrickson pushed a commit to CogStack/cogstack-nlp that referenced this pull request Jul 1, 2025

CU-8695m5q4x: Fix issues detecting 1-token concepts (CogStack/MedCAT#485

6bfdf96

)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CU-8695m5q4x: Fix issues detecting 1-token concepts #485

CU-8695m5q4x: Fix issues detecting 1-token concepts #485

Uh oh!

mart-r commented Sep 3, 2024 •

edited

Loading

Uh oh!

tomolopolis commented Sep 3, 2024

Uh oh!

tomolopolis left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CU-8695m5q4x: Fix issues detecting 1-token concepts #485

CU-8695m5q4x: Fix issues detecting 1-token concepts #485

Uh oh!

Conversation

mart-r commented Sep 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tomolopolis commented Sep 3, 2024

Uh oh!

tomolopolis left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mart-r commented Sep 3, 2024 •

edited

Loading