WARNING:stanza:Can not find lemma: ftb from official model list. Ignoring it. [after updating to Stanza 1.5.1] #1284

@mrgransky

Description

I ran pip install stanza --upgrade --quiet and now have 1.5.1. Configuring a MultilingualPipeline now emits the following two warnings:

WARNING:stanza:Can not find lemma: ftb from official model list. Ignoring it.
WARNING:stanza:Can not find pos: ftb from official model list. Ignoring it.

To Reproduce

from stanza.pipeline.multilingual import MultilingualPipeline
from stanza.pipeline.core import DownloadMethod

lang_id_config = {"langid_lang_subset": ["fi", "sv", "de", "ru", "en", "da"]}
lang_configs = {
    "en": {"processors": "tokenize,lemma,pos", "package": "lines", "tokenize_no_ssplit": True},
    "sv": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "da": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "ru": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "fi": {"processors": "tokenize,lemma,pos,mwt", "package": "ftb", "tokenize_no_ssplit": True},
    "de": {"processors": "tokenize,lemma,pos", "package": "hdt", "tokenize_no_ssplit": True},
}
smp = MultilingualPipeline(
    lang_id_config=lang_id_config,
    lang_configs=lang_configs,
    download_method=DownloadMethod.REUSE_RESOURCES,
)
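The warnings suggest that resources_1.5.1.json no longer lists an ftb package for the Finnish lemma and pos processors. One way to sanity-check a package name before building the pipeline is to look it up in the resources file. The sketch below works on a hand-written excerpt that mimics that file's nesting (language -> processor -> package); the package names "tdt" and "default" here are illustrative, not taken from the real file:

```python
import json

# Hypothetical excerpt mimicking the structure of resources_1.5.1.json
# (language -> processor -> available packages). Names are illustrative.
resources = json.loads("""
{
  "fi": {
    "lemma": {"tdt": {}, "default": {}},
    "pos":   {"tdt": {}, "default": {}}
  }
}
""")

def packages_for(lang, processor):
    """Return the package names the resources file lists for a processor."""
    return sorted(resources.get(lang, {}).get(processor, {}).keys())

print(packages_for("fi", "lemma"))            # ['default', 'tdt']
print("ftb" in packages_for("fi", "lemma"))   # False -> the warning fires
```

Running the same lookup against the real downloaded resources.json would show whether "ftb" was dropped between 1.5.0 and 1.5.1.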

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.1.json:
287k/? [00:00<00:00, 13.0MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-multilingual/resolve/v1.5.1/models/langid/ud.pt: 100%
9.07M/9.07M [00:00<00:00, 41.9MB/s]

INFO:stanza:Loading these models for language: multilingual ():
=======================
| Processor | Package |
-----------------------
| langid    | ud      |
=======================

INFO:stanza:Using device: cuda
INFO:stanza:Loading: langid
INFO:stanza:Done loading processors!

Then passing my document d:

d = """
I go to school everyday with majority of my best friends.
"""

into the multilingual pipeline:

all_ = smp(d)

returns None for .lemma, since no lemma annotations are present on the resulting <class 'stanza.models.common.doc.Document'>:

for vsnt in all_.sentences:
    for vw in vsnt.words:
        print(vw.lemma)
None
None
None
None
None
None
None
None
None
None
None
None
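Until the package issue itself is resolved, a defensive loop can at least make the missing annotation explicit instead of silently printing None. This is a minimal sketch using stand-in objects rather than a real stanza Document (real Word objects expose the same .text and .lemma attributes):

```python
from types import SimpleNamespace

# Stand-ins for stanza Word objects (illustrative only).
words = [SimpleNamespace(text="go", lemma=None),
         SimpleNamespace(text="friends", lemma="friend")]

# Fall back to the surface form when the lemmatizer did not run.
lemmas = [w.lemma if w.lemma is not None else w.text for w in words]
print(lemmas)  # ['go', 'friend']
```

A run of all-None lemmas like the one above is a reliable signal that the lemma processor never loaded for that language.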

Here's my all_:

[
  [
    {
      "id": 1,
      "text": "I",
      "start_char": 1,
      "end_char": 2
    },
    {
      "id": 2,
      "text": "go",
      "start_char": 3,
      "end_char": 5
    },
    {
      "id": 3,
      "text": "to",
      "start_char": 6,
      "end_char": 8
    },
    {
      "id": 4,
      "text": "school",
      "start_char": 9,
      "end_char": 15
    },
    {
      "id": 5,
      "text": "everyday",
      "start_char": 16,
      "end_char": 24
    },
    {
      "id": 6,
      "text": "with",
      "start_char": 25,
      "end_char": 29
    },
    {
      "id": 7,
      "text": "majority",
      "start_char": 30,
      "end_char": 38
    },
    {
      "id": 8,
      "text": "of",
      "start_char": 39,
      "end_char": 41
    },
    {
      "id": 9,
      "text": "my",
      "start_char": 42,
      "end_char": 44
    },
    {
      "id": 10,
      "text": "best",
      "start_char": 45,
      "end_char": 49
    },
    {
      "id": 11,
      "text": "friends",
      "start_char": 50,
      "end_char": 57
    },
    {
      "id": 12,
      "text": ".",
      "start_char": 57,
      "end_char": 58
    }
  ]
]

Expected behavior
With Stanza 1.5.0 (pip install stanza==1.5 --quiet), here is the behavior I used to get:

all_ = smp(d)
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/tokenize/lines.pt: 100%
629k/629k [00:00<00:00, 8.91MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/pos/lines.pt: 100%
33.3M/33.3M [00:00<00:00, 60.7MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/lemma/lines.pt: 100%
2.51M/2.51M [00:00<00:00, 21.5MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/pretrain/lines.pt: 100%
107M/107M [00:01<00:00, 63.7MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/forward_charlm/1billion.pt: 100%
22.7M/22.7M [00:00<00:00, 45.8MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/backward_charlm/1billion.pt: 100%
22.7M/22.7M [00:00<00:00, 47.0MB/s]

INFO:stanza:Loading these models for language: en (English):
=======================
| Processor | Package |
-----------------------
| tokenize  | lines   |
| pos       | lines   |
| lemma     | lines   |
=======================

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!

print(all_)
[
  [
    {
      "id": 1,
      "text": "I",
      "lemma": "I",
      "upos": "PRON",
      "xpos": "PERS-P1SG-NOM",
      "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
      "start_char": 1,
      "end_char": 2
    },
    {
      "id": 2,
      "text": "go",
      "lemma": "go",
      "upos": "VERB",
      "xpos": "PRES",
      "feats": "Mood=Ind|Tense=Pres|VerbForm=Fin",
      "start_char": 3,
      "end_char": 5
    },
    {
      "id": 3,
      "text": "to",
      "lemma": "to",
      "upos": "ADP",
      "start_char": 6,
      "end_char": 8
    },
    {
      "id": 4,
      "text": "school",
      "lemma": "school",
      "upos": "NOUN",
      "xpos": "SG-NOM",
      "feats": "Number=Sing",
      "start_char": 9,
      "end_char": 15
    },
    {
      "id": 5,
      "text": "everyday",
      "lemma": "everyday",
      "upos": "ADV",
      "start_char": 16,
      "end_char": 24
    },
    {
      "id": 6,
      "text": "with",
      "lemma": "with",
      "upos": "ADP",
      "start_char": 25,
      "end_char": 29
    },
    {
      "id": 7,
      "text": "majority",
      "lemma": "majority",
      "upos": "NOUN",
      "xpos": "SG-NOM",
      "feats": "Number=Sing",
      "start_char": 30,
      "end_char": 38
    },
    {
      "id": 8,
      "text": "of",
      "lemma": "of",
      "upos": "ADP",
      "start_char": 39,
      "end_char": 41
    },
    {
      "id": 9,
      "text": "my",
      "lemma": "I",
      "upos": "PRON",
      "xpos": "P1SG-GEN",
      "feats": "Number=Sing|Person=1|Poss=Yes|PronType=Prs",
      "start_char": 42,
      "end_char": 44
    },
    {
      "id": 10,
      "text": "best",
      "lemma": "good",
      "upos": "ADJ",
      "xpos": "SPL",
      "feats": "Degree=Sup",
      "start_char": 45,
      "end_char": 49
    },
    {
      "id": 11,
      "text": "friends",
      "lemma": "friend",
      "upos": "NOUN",
      "xpos": "PL-NOM",
      "feats": "Number=Plur",
      "start_char": 50,
      "end_char": 57
    },
    {
      "id": 12,
      "text": ".",
      "lemma": ".",
      "upos": "PUNCT",
      "xpos": "Period",
      "start_char": 57,
      "end_char": 58
    }
  ]
]

for vsnt in all_.sentences:
    for vw in vsnt.words:
        print(vw.lemma)

I
go
to
school
everyday
with
majority
of
I
good
friend
.

Environment (please complete the following information):

  • OS: Ubuntu 16.04
  • Python version: 3.10.12 (Colab, !python --version)
  • Stanza version: 1.5.1 (vs. 1.5.0)

What am I actually doing wrong here?

Cheers,
