Closed
Description of the problem:
ConveRT and the other language models in our pipeline split words into sub-words during tokenization. DIETClassifier can then assign different entities to the individual sub-words of the same word.
Example:
{
  "text": "Aarhus",
  "entities": [
    {
      "start": 0,
      "end": 6,
      "value": "Aarhus",
      "entity": "city"
    }
  ],
  "predicted_entities": [
    {
      "entity": "iata",
      "start": 0,
      "end": 3,
      "extractor": "DIETClassifier",
      "value": "Aar"
    },
    {
      "entity": "city",
      "start": 3,
      "end": 6,
      "extractor": "DIETClassifier",
      "value": "hus"
    }
  ]
}
Overview of the solution:
It should not be possible to assign two different entities to one word/token. We should add a sanity check that prevents such double assignments. When the sub-words of a word disagree, we might want to keep the assignment with the higher confidence.
We also need to check whether this happens with the CRFEntityExtractor.
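A rough sketch of what such a sanity check could look like, to make the idea concrete. This is not the DIETClassifier code: `resolve_word_entities` and the `SubwordTag` tuple are hypothetical names, and the confidence values are invented, since the example output above does not include confidences. It assumes we know the character span of each original word and have one (start, end, entity, confidence) prediction per tagged sub-word.

```python
from typing import Dict, List, Tuple

# Hypothetical sub-word prediction: (start, end, entity label, confidence).
SubwordTag = Tuple[int, int, str, float]


def resolve_word_entities(
    word_spans: List[Tuple[int, int]], subword_tags: List[SubwordTag]
) -> List[Dict]:
    """Assign at most one entity label to each word span.

    If the sub-words of a word carry different labels, keep the label
    with the highest confidence and stretch it over the whole word.
    """
    resolved = []
    for word_start, word_end in word_spans:
        # Collect the tagged sub-words that lie inside this word.
        inside = [
            tag for tag in subword_tags
            if tag[0] >= word_start and tag[1] <= word_end
        ]
        if not inside:
            continue  # no entity predicted for this word
        best = max(inside, key=lambda tag: tag[3])
        resolved.append({
            "entity": best[2],
            "start": word_start,
            "end": word_end,
            "confidence": best[3],
        })
    return resolved


text = "Aarhus"
# Confidence values are made up for illustration.
tags = [(0, 3, "iata", 0.41), (3, 6, "city", 0.87)]
result = resolve_word_entities([(0, 6)], tags)
for entity in result:
    entity["value"] = text[entity["start"]:entity["end"]]
print(result)
```

With these made-up confidences, "Aarhus" ends up with a single `city` entity spanning characters 0 to 6. Taking the maximum is only one option; averaging the confidence per label across the sub-words, or always using the label of the first sub-word, would be alternatives worth comparing.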