Skip to content

id2lang in tokenization_xlm.py should be int, and removing hardcoding #6734

Closed
@stas00

Description

@stas00

In https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_xlm.py#L78 we have:

        "id2lang": {"0": "de", "1": "en"},
        "lang2id": {"de": 0, "en": 1},

and then:

        lang2id (:obj:`Dict[str, int]`, `optional`, defaults to :obj:`None`):
            Dictionary mapping languages string identifiers to their IDs.
        id2lang (:obj:`Dict[int, str`, `optional`, defaults to :obj:`None`):

So it should be:

        "id2lang": {0: "de", 1: "en"},
        "lang2id": {"de": 0, "en": 1},

All other entries need this change too.

The problem hasn't been detected until now since they were used to only count the number of languages it seems.

I need to pass src/tgt languages to the tokenizer I'm porting from fairseq, so I was looking at how to do that and id2lang seems to fit the purpose. But I actually need to look them up by int id, that's how I saw the problem.

But I'm also not sure why do we need to hardcode the reversal, when it can be done in 1 line of code? Which would also remove this assertion code:

        self.lang2id = lang2id
        self.id2lang = id2lang
        if lang2id is not None and id2lang is not None:
            assert len(lang2id) == len(id2lang)

Further we don't even need to hardcode the ids. Replace:

       "id2lang": {0: "de", 1: "en"},

with:

       "id2lang": ["de", "en"]

So all we need is one of the two entries, and now generate the 2 lookup dicts on the fly.

And since it's no longer id2lang semantically, probably renaming it to just langs would be more appropriate.

I think I will use this approach regardless of the outcome of this issue.

Thanks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions