I ran `pip install stanza --upgrade --quiet` and now have Stanza 1.5.1. Configuring the multilingual pipeline now emits the following two warnings:

```
WARNING:stanza:Can not find lemma: ftb from official model list. Ignoring it.
WARNING:stanza:Can not find pos: ftb from official model list. Ignoring it.
```
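For context on the warning itself: stanza checks the requested `package` for each processor against the downloaded `resources_1.5.1.json` listing and silently drops any pair it cannot find. The sketch below is plain Python with made-up sample data; the nested language → processor → package schema is my assumption about how that file is organized, not a verified stanza API:

```python
# Hedged sketch: resources_1.5.1.json is assumed to map
# language -> processor -> package; the sample below only mimics that.
sample_resources = {
    "fi": {
        "lemma": {"tdt": {}},                 # hypothetical: 'ftb' not listed
        "pos": {"tdt": {}},
        "tokenize": {"tdt": {}, "ftb": {}},
    }
}

def packages_for(resources, lang, processor):
    """Return the package names advertised for a language/processor pair."""
    return sorted(resources.get(lang, {}).get(processor, {}))

# If 'ftb' is absent from the lemma/pos listings, a request for it would be
# dropped with exactly the "Can not find lemma: ftb" warning shown above.
print(packages_for(sample_resources, "fi", "lemma"))  # ['tdt']
```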
To Reproduce
```python
from stanza.pipeline.core import DownloadMethod
from stanza.pipeline.multilingual import MultilingualPipeline

lang_id_config = {"langid_lang_subset": ["fi", "sv", "de", "ru", "en", "da"]}
lang_configs = {
    "en": {"processors": "tokenize,lemma,pos", "package": "lines", "tokenize_no_ssplit": True},
    "sv": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "da": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "ru": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "fi": {"processors": "tokenize,lemma,pos,mwt", "package": "ftb", "tokenize_no_ssplit": True},
    "de": {"processors": "tokenize,lemma,pos", "package": "hdt", "tokenize_no_ssplit": True},
}
smp = MultilingualPipeline(
    lang_id_config=lang_id_config,
    lang_configs=lang_configs,
    download_method=DownloadMethod.REUSE_RESOURCES,
)
```
INFO:stanza:Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.1.json:
287k/? [00:00<00:00, 13.0MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-multilingual/resolve/v1.5.1/models/langid/ud.pt: 100%
9.07M/9.07M [00:00<00:00, 41.9MB/s]
INFO:stanza:Loading these models for language: multilingual ():
=======================
| Processor | Package |
-----------------------
| langid | ud |
=======================
INFO:stanza:Using device: cuda
INFO:stanza:Loading: langid
INFO:stanza:Done loading processors!
Then I pass my document `d` into the multilingual pipeline:

```python
d = """
I go to school everyday with majority of my best friends.
"""
all_ = smp(d)
```

Now `.lemma` returns `None` for every word, since the lemma annotations no longer exist on my `<class 'stanza.models.common.doc.Document'>`:
```python
for vsnt in all_.sentences:
    for vw in vsnt.words:
        print(vw.lemma)
```
None
None
None
None
None
None
None
None
None
None
None
None
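In the meantime I guard against the missing annotations by falling back to the surface form. This is a self-contained sketch (stanza not required); the `Word` class below is just a stand-in for stanza's word objects, which expose `.text` and `.lemma` attributes:

```python
# Stand-in for stanza's Word: only .text and .lemma matter here.
class Word:
    def __init__(self, text, lemma=None):
        self.text = text
        self.lemma = lemma

def lemma_or_text(word):
    """Fall back to the surface form when the lemmatizer produced nothing."""
    return word.lemma if word.lemma is not None else word.text

words = [Word("friends", "friend"), Word("everyday", None)]
print([lemma_or_text(w) for w in words])  # ['friend', 'everyday']
```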
Here is my `all_`:
[
[
{
"id": 1,
"text": "I",
"start_char": 1,
"end_char": 2
},
{
"id": 2,
"text": "go",
"start_char": 3,
"end_char": 5
},
{
"id": 3,
"text": "to",
"start_char": 6,
"end_char": 8
},
{
"id": 4,
"text": "school",
"start_char": 9,
"end_char": 15
},
{
"id": 5,
"text": "everyday",
"start_char": 16,
"end_char": 24
},
{
"id": 6,
"text": "with",
"start_char": 25,
"end_char": 29
},
{
"id": 7,
"text": "majority",
"start_char": 30,
"end_char": 38
},
{
"id": 8,
"text": "of",
"start_char": 39,
"end_char": 41
},
{
"id": 9,
"text": "my",
"start_char": 42,
"end_char": 44
},
{
"id": 10,
"text": "best",
"start_char": 45,
"end_char": 49
},
{
"id": 11,
"text": "friends",
"start_char": 50,
"end_char": 57
},
{
"id": 12,
"text": ".",
"start_char": 57,
"end_char": 58
}
]
]
Expected behavior
As of Stanza 1.5.0 (`pip install stanza==1.5 --quiet`), here is the behavior I used to get:
```python
all_ = smp(d)
```
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/tokenize/lines.pt: 100%
629k/629k [00:00<00:00, 8.91MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/pos/lines.pt: 100%
33.3M/33.3M [00:00<00:00, 60.7MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/lemma/lines.pt: 100%
2.51M/2.51M [00:00<00:00, 21.5MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/pretrain/lines.pt: 100%
107M/107M [00:01<00:00, 63.7MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/forward_charlm/1billion.pt: 100%
22.7M/22.7M [00:00<00:00, 45.8MB/s]
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.5.0/models/backward_charlm/1billion.pt: 100%
22.7M/22.7M [00:00<00:00, 47.0MB/s]
INFO:stanza:Loading these models for language: en (English):
=======================
| Processor | Package |
-----------------------
| tokenize | lines |
| pos | lines |
| lemma | lines |
=======================
INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!
```python
print(all_)
```
[
[
{
"id": 1,
"text": "I",
"lemma": "I",
"upos": "PRON",
"xpos": "PERS-P1SG-NOM",
"feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
"start_char": 1,
"end_char": 2
},
{
"id": 2,
"text": "go",
"lemma": "go",
"upos": "VERB",
"xpos": "PRES",
"feats": "Mood=Ind|Tense=Pres|VerbForm=Fin",
"start_char": 3,
"end_char": 5
},
{
"id": 3,
"text": "to",
"lemma": "to",
"upos": "ADP",
"start_char": 6,
"end_char": 8
},
{
"id": 4,
"text": "school",
"lemma": "school",
"upos": "NOUN",
"xpos": "SG-NOM",
"feats": "Number=Sing",
"start_char": 9,
"end_char": 15
},
{
"id": 5,
"text": "everyday",
"lemma": "everyday",
"upos": "ADV",
"start_char": 16,
"end_char": 24
},
{
"id": 6,
"text": "with",
"lemma": "with",
"upos": "ADP",
"start_char": 25,
"end_char": 29
},
{
"id": 7,
"text": "majority",
"lemma": "majority",
"upos": "NOUN",
"xpos": "SG-NOM",
"feats": "Number=Sing",
"start_char": 30,
"end_char": 38
},
{
"id": 8,
"text": "of",
"lemma": "of",
"upos": "ADP",
"start_char": 39,
"end_char": 41
},
{
"id": 9,
"text": "my",
"lemma": "I",
"upos": "PRON",
"xpos": "P1SG-GEN",
"feats": "Number=Sing|Person=1|Poss=Yes|PronType=Prs",
"start_char": 42,
"end_char": 44
},
{
"id": 10,
"text": "best",
"lemma": "good",
"upos": "ADJ",
"xpos": "SPL",
"feats": "Degree=Sup",
"start_char": 45,
"end_char": 49
},
{
"id": 11,
"text": "friends",
"lemma": "friend",
"upos": "NOUN",
"xpos": "PL-NOM",
"feats": "Number=Plur",
"start_char": 50,
"end_char": 57
},
{
"id": 12,
"text": ".",
"lemma": ".",
"upos": "PUNCT",
"xpos": "Period",
"start_char": 57,
"end_char": 58
}
]
]
```python
for vsnt in all_.sentences:
    for vw in vsnt.words:
        print(vw.lemma)
```
I
go
to
school
everyday
with
majority
of
I
good
friend
.
Environment (please complete the following information):
- OS: Ubuntu 16.04
- Python version: 3.10.12 (Google Colab, via `!python --version`)
- Stanza version: 1.5.1 (vs. 1.5.0, where this worked)
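For completeness, a workaround sketch I have not fully verified: drop the pinned `ftb` package for Finnish so the default Finnish package resolves instead. This presumably sidesteps the warning at the cost of loading different Finnish models:

```python
# Same configuration as in the reproduction above, but with no explicit
# "package" for Finnish, letting stanza fall back to its default Finnish
# models (whether the defaults restore lemmas is an untested assumption).
lang_configs_fallback = {
    "en": {"processors": "tokenize,lemma,pos", "package": "lines", "tokenize_no_ssplit": True},
    "sv": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "da": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "ru": {"processors": "tokenize,lemma,pos", "tokenize_no_ssplit": True},
    "fi": {"processors": "tokenize,lemma,pos,mwt", "tokenize_no_ssplit": True},  # no "package" pin
    "de": {"processors": "tokenize,lemma,pos", "package": "hdt", "tokenize_no_ssplit": True},
}
```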
What am I actually doing wrong here?
Cheers,
clemsciences