id2lang in tokenization_xlm.py should be int, and removing hardcoding

In https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_xlm.py#L78 we have:
```
        "id2lang": {"0": "de", "1": "en"},
        "lang2id": {"de": 0, "en": 1},
```
and then:
```
        lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):
            Dictionary mapping languages string identifiers to their IDs.
        id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):
```

So it should be:
```
        "id2lang": {0: "de", 1: "en"},
        "lang2id": {"de": 0, "en": 1},
```
All other entries need this change too.

The problem hasn't been detected until now since they were used to only count the number of languages it seems.

I need to pass src/tgt languages to the tokenizer I'm porting from fairseq, so I was looking at how to do that and `id2lang` seems to fit the purpose. But I actually need to look them up by `int` id, that's how I saw the problem.

But I'm also not sure why do we need to hardcode the reversal, when it can be done in 1 line of code? Which would also remove this assertion code:

```
        self.lang2id = lang2id
        self.id2lang = id2lang
        if lang2id is not None and id2lang is not None:
            assert len(lang2id) == len(id2lang)
```

Further we don't even need to hardcode the ids. Replace:
```
       "id2lang": {0: "de", 1: "en"},
```
with:
```
       "id2lang": ["de", "en"]
```
So all we need is one of the two entries, and now generate the 2 lookup dicts on the fly.

And since it's no longer `id2lang` semantically, probably renaming it to just `langs` would be more appropriate.

I think I will use this approach regardless of the outcome of this issue.

Thanks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

id2lang in tokenization_xlm.py should be int, and removing hardcoding #6734

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

id2lang in tokenization_xlm.py should be int, and removing hardcoding #6734

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions