One word two entity labels

**Description of the problem**
ConveRT and the other language models in our pipeline split words into sub-words during tokenization. `DIETClassifier` can then assign different entities to the individual sub-words of a single word.

Example:
```
{
    "text": "Aarhus",
    "entities": [
        {
            "start": 0,
            "end": 6,
            "value": "Aarhus",
            "entity": "city"
        }
    ],
    "predicted_entities": [
        {
            "entity": "iata",
            "start": 0,
            "end": 3,
            "extractor": "DIETClassifier",
            "value": "Aar"
        },
        {
            "entity": "city",
            "start": 3,
            "end": 6,
            "extractor": "DIETClassifier",
            "value": "hus"
        }
    ]
}
```

**Overview of the solution**:
It should not be possible to assign two different entities to one word/token. We should add a sanity check that prevents double assignments. When the sub-words of one word receive different entities, we might want to keep the assignment with the higher confidence.
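A minimal sketch of such a sanity check, not the actual `DIETClassifier` code. It assumes each predicted entity carries a `confidence` key (the example output above does not show one, so the key name and the values below are purely illustrative), and that word boundaries are available as character spans. `merge_subword_entities` is a hypothetical helper name.

```
from typing import Any, Dict, List, Tuple

Entity = Dict[str, Any]


def merge_subword_entities(
    entities: List[Entity], word_spans: List[Tuple[int, int]], text: str
) -> List[Entity]:
    """Keep at most one entity label per word, choosing the most confident one."""
    merged: List[Entity] = []
    used = set()
    for word_start, word_end in word_spans:
        # Collect every predicted entity that lies fully inside this word.
        inside = [
            (i, e)
            for i, e in enumerate(entities)
            if word_start <= e["start"] and e["end"] <= word_end
        ]
        if not inside:
            continue
        used.update(i for i, _ in inside)
        # Assumed "confidence" key; a missing value counts as 0.0.
        _, best = max(inside, key=lambda pair: pair[1].get("confidence", 0.0))
        # Stretch the winning label over the whole word.
        merged.append(
            {
                **best,
                "start": word_start,
                "end": word_end,
                "value": text[word_start:word_end],
            }
        )
    # Entities that span more than one word are passed through unchanged.
    merged.extend(e for i, e in enumerate(entities) if i not in used)
    return sorted(merged, key=lambda e: e["start"])


text = "Aarhus"
predicted = [
    # Confidence values are made up for illustration.
    {"entity": "iata", "start": 0, "end": 3, "value": "Aar", "confidence": 0.41},
    {"entity": "city", "start": 3, "end": 6, "value": "hus", "confidence": 0.87},
]
result = merge_subword_entities(predicted, [(0, 6)], text)
```

With these illustrative confidences, the two sub-word predictions collapse into a single `city` entity covering all of "Aarhus". Whether to pick the winner by maximum confidence, by the first sub-word, or by some aggregate is a design choice this sketch does not settle.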

We need to check if this also happens with the `CRFEntityExtractor`.

One word two entity labels #5475

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development