POS tagging inconsistent with input length for French transformer model #12240
-
Hello! The POS tag predicted for "Alpes" changes depending on how long the input text is. Here are 2 examples with different versions of the model, run in a Linux environment with Python 3.10.

spacy-transformers == 1.2.0
spacy == 3.5.0
fr_dep_news_trf == 3.5.0

> doc = nlp("Je vais skier dans les Alpes de France cet hiver.")
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]
[('Alpes', 'PROPN')]

> doc = nlp("Je vais skier dans les Alpes de France cet hiver. " * 10)
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]
[('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN'), ('alpe', 'NOUN')]

With another version, there are far fewer wrong predictions, but still some.

spacy-transformers == 1.1.9
spacy == 3.4.4
fr_dep_news_trf == 3.4.0

> doc = nlp("Je vais skier dans les Alpes de France cet hiver.")
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]
[('Alpes', 'PROPN')]

> doc = nlp("Je vais skier dans les Alpes de France cet hiver. " * 10)
> [(i.lemma_, i.pos_) for i in doc if i.text == "Alpes"]
[('alpe', 'NOUN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('Alpes', 'PROPN'), ('alpe', 'NOUN')]

I'd like to know whether this is expected from the model or not: is it just because I don't give it enough context, or is it something else? Thank you for your help!
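For completeness, here is a self-contained version of the snippets above; it assumes fr_dep_news_trf has been installed (e.g. with `python -m spacy download fr_dep_news_trf`).

```python
import spacy

# Load the pretrained French transformer pipeline.
nlp = spacy.load("fr_dep_news_trf")

# Same sentence processed once vs. repeated 10 times.
doc_short = nlp("Je vais skier dans les Alpes de France cet hiver.")
doc_long = nlp("Je vais skier dans les Alpes de France cet hiver. " * 10)

# Compare the lemma/POS predicted for "Alpes" in both cases.
print([(t.lemma_, t.pos_) for t in doc_short if t.text == "Alpes"])
print([(t.lemma_, t.pos_) for t in doc_long if t.text == "Alpes"])
```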
-
Hi @ColleterVi, thanks for reporting this! This is likely due to "Alps" not appearing in the training data. Also, we encourage posting incorrect predictions from pretrained models in this master thread 🙂
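If consistent tags are needed for specific tokens in the meantime, one possible workaround is to add a rule to the pipeline's attribute ruler. This is only a sketch, not an official recommendation from the reply above; it assumes the pretrained pipeline contains an `attribute_ruler` component that runs before the lemmatizer.

```python
import spacy

nlp = spacy.load("fr_dep_news_trf")

# Force a fixed POS (and lemma) for the token "Alpes" via the attribute ruler.
# Depending on the lemmatizer configuration, the lemma may still be recomputed
# downstream, but the POS override will stick.
ruler = nlp.get_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "Alpes"}]], attrs={"POS": "PROPN", "LEMMA": "Alpes"})

doc = nlp("Je vais skier dans les Alpes de France cet hiver. " * 10)
print([(t.lemma_, t.pos_) for t in doc if t.text == "Alpes"])
```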