One word two entity labels #5475

Closed
@tabergma

Description of the problem
ConveRT and the other language models in our pipeline split words into sub-words during tokenization. DIETClassifier can then assign different entity labels to the individual sub-words of a single word.

Example:

{
    "text": "Aarhus",
    "entities": [
        {
            "start": 0,
            "end": 6,
            "value": "Aarhus",
            "entity": "city"
        }
    ],
    "predicted_entities": [
        {
            "entity": "iata",
            "start": 0,
            "end": 3,
            "extractor": "DIETClassifier",
            "value": "Aar"
        },
        {
            "entity": "city",
            "start": 3,
            "end": 6,
            "extractor": "DIETClassifier",
            "value": "hus"
        }
    ]
}

Overview of the solution:
It should not be possible to assign two different entities to one word/token. We should add a sanity check that prevents double assignments. We might want to keep the assignment with the higher confidence.

We need to check if this also happens with the CRFEntityExtractor.
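
The sanity check proposed above could be sketched roughly as follows. This is only an illustration, not Rasa's actual implementation: `merge_subword_entities` and `word_spans` are hypothetical helpers, words are assumed to be whitespace-delimited, and each prediction is assumed to carry a `confidence` key (the example output above does not show one). Predictions that fall inside the same word are collapsed into one entity covering the whole word, keeping the label with the highest confidence.

```python
import re
from typing import Dict, List, Tuple


def word_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of whitespace-delimited words."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def merge_subword_entities(text: str, entities: List[Dict]) -> List[Dict]:
    """Collapse predictions inside one word into a single word-level entity.

    The label of the prediction with the highest "confidence" wins.
    "confidence" is an assumed key; it defaults to 0.0 when missing.
    """
    merged: List[Dict] = []
    handled = set()  # indices of predictions folded into a word-level entity

    for start, end in word_spans(text):
        inside = [
            i
            for i, e in enumerate(entities)
            if start <= e["start"] and e["end"] <= end
        ]
        if not inside:
            continue
        best = max(
            (entities[i] for i in inside),
            key=lambda e: e.get("confidence", 0.0),
        )
        # Stretch the winning prediction so it covers the whole word.
        merged.append(
            {**best, "start": start, "end": end, "value": text[start:end]}
        )
        handled.update(inside)

    # Predictions spanning more than one word are passed through unchanged.
    merged.extend(e for i, e in enumerate(entities) if i not in handled)
    return sorted(merged, key=lambda e: e["start"])


# Confidence values below are made up for illustration.
predicted = [
    {"entity": "iata", "start": 0, "end": 3, "value": "Aar", "confidence": 0.4},
    {"entity": "city", "start": 3, "end": 6, "value": "hus", "confidence": 0.9},
]
result = merge_subword_entities("Aarhus", predicted)
```

With these made-up confidences, `result` contains a single `city` entity spanning characters 0 to 6. Note that this sketch also stretches a lone sub-word prediction to the full word, which may or may not be the desired behaviour.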

Labels

area:rasa-oss 🎡 Anything related to the open source Rasa framework
type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.
