Phraser max NPMI score > 1 #3042

joachimdb opened this issue Feb 9, 2021 · 2 comments

@joachimdb

Problem description

I trained an NPMI phraser on the latest Wikipedia dump. My understanding is that scores should be <= 1.0, but I get a higher score.

Steps/code/corpus to reproduce

from gensim.corpora import WikiCorpus
from gensim.models import Phrases
from gensim.models.phrases import Phraser

wiki_corpus = WikiCorpus("enwiki-latest-pages-articles-multistream.xml.bz2", dictionary={})

ENGLISH_CONNECTOR_WORDS = frozenset(
    " a an the "  # articles; we never care about these in MWEs
    " for of with without at from to in on by "  # prepositions; incomplete on purpose, to minimize FNs
    " and or "  # conjunctions; incomplete on purpose, to minimize FNs
    .split()
)

phrases = Phrases(wiki_corpus.get_texts(), scoring='npmi', threshold=0.75, min_count=5, common_terms=ENGLISH_CONNECTOR_WORDS, max_vocab_size=80000000)
phraser = Phraser(phrases)

Then:

In[2]: max(phraser.phrasegrams.values())
Out[2]: 1.2003355030351979

Versions

Linux-3.10.0-1160.6.1.el7.x86_64-x86_64-with-centos-7.9.2009-Core
Python 3.7.9 (default, Aug 31 2020, 12:42:55)
[GCC 7.3.0]
Bits 64
NumPy 1.19.2
gensim 3.8.0
FAST_VERSION 1
@piskvorky
Owner

Yeah, that's weird. AFAIR the NPMI scores should be in [-1, 1]. Can you check in Gensim 4.0.0 please? (pip install -U gensim)
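For reference, here is a minimal sketch of the standard NPMI definition computed from raw counts. It is not Gensim's scorer, and the function name and toy counts are made up for illustration. It shows that the score stays within [-1, 1] as long as the bigram count does not exceed either word's count, and that a hypothetical inconsistent set of counts pushes it above 1. Whether that is what happens in this issue is not established.

```python
from math import log


def npmi(worda_count, wordb_count, bigram_count, corpus_word_count):
    """Standard NPMI from raw counts: ln(p(a,b) / (p(a) * p(b))) / -ln p(a,b)."""
    total = float(corpus_word_count)
    pa = worda_count / total
    pb = wordb_count / total
    pab = bigram_count / total
    return log(pab / (pa * pb)) / -log(pab)


# Consistent counts: the bigram is never more frequent than either word.
perfect = npmi(10, 10, 10, 1000)        # the two words always co-occur
independent = npmi(100, 100, 10, 1000)  # p(a,b) equals p(a) * p(b)

# Hypothetical inconsistent counts: the bigram was counted more often
# than either of its component words.
inconsistent = npmi(5, 5, 10, 1000)

print(perfect, independent, inconsistent)
```

With consistent counts the first value is about 1 and the second about 0; the inconsistent case comes out above 1.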

Could you inspect what the underlying words and word counts are, for the affected bigram? Maybe that will shed some light, help us debug. Thanks.
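A rough sketch of that inspection, written against plain dictionaries rather than the Gensim objects themselves. The helper name, the `scores` layout (joined phrase to score) and the `vocab` layout (word or joined phrase to count) are assumptions; the actual layout of `phraser.phrasegrams` and `phrases.vocab` may differ between Gensim versions (for example in key type), so the data would need adapting first.

```python
def explain_top_phrase(scores, vocab, delimiter="_"):
    """Return the highest-scoring phrase with the raw counts behind its score.

    `scores` maps a delimiter-joined phrase to its score, and `vocab` maps
    words and joined phrases to counts. Both layouts are assumptions.
    """
    phrase, score = max(scores.items(), key=lambda item: item[1])
    tokens = phrase.split(delimiter)
    return {
        "phrase": phrase,
        "score": score,
        # First and last token, so phrases spanning connector words also work.
        "worda_count": vocab.get(tokens[0], 0),
        "wordb_count": vocab.get(tokens[-1], 0),
        "bigram_count": vocab.get(phrase, 0),
    }


# Toy data, invented purely to exercise the helper.
toy_scores = {"new_york": 0.9, "bank_of_america": 1.2}
toy_vocab = {
    "new": 50, "york": 20, "new_york": 18,
    "bank": 4, "america": 30, "bank_of_america": 6,
}

report = explain_top_phrase(toy_scores, toy_vocab)
print(report)
```

If the reported bigram count is larger than one of the word counts, that would be a useful thing to mention here.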

@piskvorky added the bug and need info labels on Feb 9, 2021
@piskvorky
Owner

piskvorky commented Feb 9, 2021

Also, looking at the NPMI docs, I don't understand why the formula talks about prop (?) but then refers to prob on the same line. That's weird too.

EDIT: that formula seems to have been introduced in 5677ab3#diff-b792e36e52289f193a1ef84cc9f58884b95dc1a29bdb21ad8f7769daf0a3dbb0R670 . I'm leaning toward a simple typo – reviews were more lax at that time than they are now.
