
Incoherent topic word distributions after malletmodel2ldamodel #2069

Closed
Wolfi-101 opened this issue May 28, 2018 · 21 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills

Comments

@Wolfi-101

Wolfi-101 commented May 28, 2018

Hi everyone,
first off, many thanks for providing such an awesome module! I am using gensim for topic modeling with LDA and encountered the following bug. I have already read about it on the mailing list, but apparently no issue has been created on GitHub yet.

Description

After training an LDA model with the gensim mallet wrapper I converted the model to a native gensim LDA model via the malletmodel2ldamodel function provided with the wrapper. Before and after the conversion the topic word distributions are quite different. The ldamallet version returns comprehensible topics with sensible weights, whereas the topic word distribution after conversion is nearly uniform, leading to topics without a clear focus.

I am assuming that the resulting topics are supposed to be at least somewhat similar before and after conversion. Am I doing something wrong? What could be causing this behaviour?

Steps/Code/Corpus to Reproduce

import gensim
from sklearn.datasets import fetch_20newsgroups

# select five quite distinct categories from the 20 newsgroups
cat = ['soc.religion.christian', 'comp.graphics', 'rec.motorcycles', 
       'sci.space', 'talk.politics.guns']

# keep and use only the main text
newsgroups_train = fetch_20newsgroups(subset='all', categories=cat,
                                      remove=('headers', 'footers', 'quotes'))

tokenized = [gensim.utils.simple_preprocess(doc) for doc in newsgroups_train.data]
dictionary = gensim.corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(text) for text in tokenized]

lda_mallet = gensim.models.wrappers.ldamallet.LdaMallet(
        'c:/mallet/bin/mallet', corpus=corpus, 
        num_topics=5, id2word=dictionary, iterations=1000)

lda_gensim = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(
        lda_mallet, iterations=1000)

for topic in lda_mallet.show_topics(num_topics=5, num_words=10):
    print(topic)
for topic in lda_gensim.show_topics(num_topics=5, num_words=10):
    print(topic)

Expected Results

These are the results I get from the mallet wrapper using lda_mallet.show_topics(num_topics=5, num_words=10). Those are what one would expect considering the chosen categories from 20newsgroups:

(0, '0.021*"god" + 0.009*"people" + 0.007*"jesus" + 0.007*"church" + 0.006*"christ" + 0.005*"life" + 0.005*"christian" + 0.005*"bible" + 0.004*"christians" + 0.004*"man"')
(1, '0.014*"don" + 0.011*"ve" + 0.009*"good" + 0.008*"bike" + 0.007*"time" + 0.007*"back" + 0.007*"make" + 0.006*"ll" + 0.006*"problem" + 0.006*"thing"')
(2, '0.017*"space" + 0.006*"nasa" + 0.006*"earth" + 0.005*"system" + 0.005*"launch" + 0.004*"shuttle" + 0.004*"orbit" + 0.003*"years" + 0.003*"mission" + 0.003*"moon"')
(3, '0.012*"people" + 0.011*"gun" + 0.005*"guns" + 0.005*"government" + 0.005*"state" + 0.005*"law" + 0.005*"fire" + 0.005*"control" + 0.004*"don" + 0.004*"fbi"')
(4, '0.013*"image" + 0.009*"graphics" + 0.008*"jpeg" + 0.007*"file" + 0.006*"images" + 0.006*"data" + 0.006*"bit" + 0.006*"software" + 0.006*"ftp" + 0.006*"mail"')

Actual Results

These are the results I get from the converted native gensim model using lda_gensim.show_topics(num_topics=5, num_words=10). The word probabilities are all very low and not very distinctive, resulting in mostly incoherent topics:

(0, '0.000*"tribunal" + 0.000*"insruance" + 0.000*"damper" + 0.000*"unfurl" + 0.000*"urinalisys" + 0.000*"saturnation" + 0.000*"stupider" + 0.000*"improved" + 0.000*"waltons" + 0.000*"t_ng"')
(1, '0.000*"ott" + 0.000*"raved" + 0.000*"warped" + 0.000*"onesies" + 0.000*"speculating" + 0.000*"irrigate" + 0.000*"bodies" + 0.000*"inherant" + 0.000*"illustrations" + 0.000*"filler"')
(2, '0.000*"datasets" + 0.000*"addiction" + 0.000*"lr" + 0.000*"overturning" + 0.000*"supertrapp" + 0.000*"collision" + 0.000*"nl__" + 0.000*"someone" + 0.000*"switch" + 0.000*"pirate"')
(3, '0.000*"inbetweens" + 0.000*"hostname" + 0.000*"obsevatory" + 0.000*"dscharge" + 0.000*"ecclesiates" + 0.000*"drills" + 0.000*"ranching" + 0.000*"metz" + 0.000*"omnivorous" + 0.000*"normals"')
(4, '0.000*"uad" + 0.000*"undecidable" + 0.000*"eroded" + 0.000*"summarized" + 0.000*"reposition" + 0.000*"sttod" + 0.000*"sanctas" + 0.000*"broadest" + 0.000*"inception" + 0.000*"turntable"')

Versions

  • Windows-7-6.1.7601-SP1
  • Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 16:13:55) [MSC v.1900 64 bit (AMD64)]
  • NumPy 1.14.2
  • SciPy 1.0.1
  • gensim 3.4.0
  • FAST_VERSION 1
  • mallet version 2.0.8

Thanks in advance for any help! Cheers,
Wolfgang

@groceryheist

groceryheist commented Jul 28, 2018

I am also having this, or a related, problem with gensim 3.1.
I am trying gensim 3.5 now and will update if the issue still occurs.
I feel this bug should be fixable.

@groceryheist

I tested with gensim 3.5 and encountered the same problem. This essentially makes malletmodel2ldamodel worthless.

@menshikh-iv menshikh-iv changed the title Incoherent topic word distributions after converting ldamallet model to native gensim lda model Incoherent topic word distributions after malletmodel2ldamodel Jul 30, 2018
@menshikh-iv menshikh-iv added bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills labels Jul 30, 2018
@menshikh-iv
Contributor

menshikh-iv commented Jul 30, 2018

@Wolfi-101 thanks for the report, issue reproduced with gensim==3.5.0 👍

@mikeyearworth

Any news on this? I switched to using mallet for a study I'm doing, but would still like to use pyLDAvis for consistency with previous work. I'm stuck with either
AttributeError: 'LdaMallet' object has no attribute 'inference'
or
gensim.models.wrappers.ldamallet.malletmodel2ldamodel()
returning random terms in the topics.
Using gensim 3.5, mallet 2.0.8.

@menshikh-iv
Contributor

@mikeyearworth this looks unrelated to the current issue; can you please provide a full code example for reproducing your error (with all needed data, of course)?

@mikeyearworth

mikeyearworth commented Aug 10, 2018

model = gensim.models.wrappers.LdaMallet('/opt/local/bin/mallet', corpus=mikeycorpus, num_topics=num_topics, id2word=mikeydictionary, workers=3)
data = pyLDAvis.gensim.prepare(model, mikeycorpus, mikeydictionary, mds='pcoa')
pyLDAvis.display(data)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-83610b8776d0> in <module>()
    130 
    131 model = gensim.models.wrappers.LdaMallet('/opt/local/bin/mallet', corpus=mikeycorpus, num_topics=num_topics, id2word=mikeydictionary, workers=3)
--> 132 data = pyLDAvis.gensim.prepare(model, mikeycorpus, mikeydictionary, mds='pcoa')
    133 pyLDAvis.display(data)
    134 

/anaconda2/lib/python2.7/site-packages/pyLDAvis/gensim.pyc in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
    109     See `pyLDAvis.prepare` for **kwargs.
    110     """
--> 111     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
    112     return vis_prepare(**opts)

/anaconda2/lib/python2.7/site-packages/pyLDAvis/gensim.pyc in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
     40           gamma = topic_model.inference(corpus)
     41       else:
---> 42           gamma, _ = topic_model.inference(corpus)
     43       doc_topic_dists = gamma / gamma.sum(axis=1)[:, None]
     44 

AttributeError: 'LdaMallet' object has no attribute 'inference'

Whereas

ldamodel = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(model)
data = pyLDAvis.gensim.prepare(ldamodel, mikeycorpus, mikeydictionary, mds='pcoa')
pyLDAvis.display(data)

generates
[screenshot: a topic from the converted model, all terms with near-zero weights]

etc for all topics, compared to the actual model.

@menshikh-iv
Contributor

@mikeyearworth as far as I know, pyLDAvis supports only LdaModel and LdaMulticore, not LdaMallet.

To visualize an LdaMallet model you need to convert it to an LdaModel using malletmodel2ldamodel first (and the current thread is about that function: it doesn't work correctly).

@groceryheist

Jeri Wieringa wrote a tutorial on using pyLDAvis with mallet. You load the model directly from the state file and do some transformations. I was able to make this work by adapting her code.

http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/#comment-4018495276

So here is a workaround until malletmodel2ldamodel is fixed.
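The core of that approach is to tally word-topic counts from the state file, smooth them with MALLET's beta prior, and normalize each topic's row into the probability distribution pyLDAvis expects. A minimal numpy sketch of the smoothing-and-normalization step (the function name and toy counts here are illustrative, not taken from the tutorial):

```python
import numpy as np

def topic_term_dists(word_topic_counts, beta=0.01):
    """Smooth raw topic-word counts (num_topics x vocab_size) with a
    symmetric beta prior and normalize each row into a distribution."""
    smoothed = word_topic_counts + beta
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# toy counts for 2 topics over a 3-word vocabulary
counts = np.array([[8.0, 1.0, 1.0],
                   [0.0, 5.0, 5.0]])
dists = topic_term_dists(counts)
print(dists.sum(axis=1))  # each row sums to 1
```

The beta smoothing keeps zero-count words from producing exact zeros, which matters because pyLDAvis takes logs of these distributions.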

@mikeyearworth

thanks @groceryheist (and Jeri Wieringa) that works fine.

Do you know if there is a way to force the gensim wrapper for mallet to use a specific state filename, or to return it? mallet writes a new state file each run to an obscure location /var/folders/1x/93zy0_k93gj_xvrk4v_j96_m0000gp/T/XXXXXX_state.mallet.gz, where XXXXXX is random each time.

@groceryheist

The prefix parameter in the wrapper does this.
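For reference, a sketch of how prefix is used (the path below is hypothetical): the wrapper prepends the prefix to every intermediate file it writes, so the state file lands at a predictable location instead of a random temp directory.

```python
# hypothetical prefix; the wrapper prepends it to every file it writes
prefix = '/tmp/mallet_run_'

# lda = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus,
#                                        num_topics=5, id2word=dictionary,
#                                        prefix=prefix)

# the state file then lands at a predictable path:
state_path = prefix + 'state.mallet.gz'
print(state_path)  # /tmp/mallet_run_state.mallet.gz
```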

@mikeyearworth

Great! Thanks @groceryheist.

@idoDavid

idoDavid commented Dec 5, 2018

@menshikh-iv is there any update on this? what is this bug fix priority? Thanks :)

@horpto
Contributor

horpto commented Dec 6, 2018

I have started to work on this issue.

horpto added a commit to horpto/gensim that referenced this issue Dec 7, 2018
`malletmodel2ldamodel` sets up expElogbeta attribute
but LdaModel.show_topics uses inner not dirichleted state instead.
And moreover LdaState and LdaModel were not synced.
@Elpiro

Elpiro commented Dec 7, 2018

Thanks @horpto, it works!

horpto added a commit to horpto/gensim that referenced this issue Dec 12, 2018
`malletmodel2ldamodel` sets up expElogbeta attribute
but LdaModel.show_topics uses inner not dirichleted state instead.
And moreover LdaState and LdaModel were not synced.
menshikh-iv pushed a commit that referenced this issue Jan 8, 2019
* Fixes #2069: wrong malletmodel2ldamodel

`malletmodel2ldamodel` sets up expElogbeta attribute
but LdaModel.show_topics uses inner not dirichleted state instead.
And moreover LdaState and LdaModel were not synced.

* add test

* fix linter

* replace sklearn with gensim + use larger dataset & num topics (for more strict check)

* remove sklearn import
@kvvaldez

kvvaldez commented Feb 13, 2019

from gensim.models.ldamodel import LdaModel
import numpy

def ldaMalletConvertToldaGen(mallet_model):
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0, iterations=1000,
        gamma_threshold=0.001,
        dtype=numpy.float32
    )
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

:)
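For context on why this works: per the merged fix, the converted model's sufficient statistics (state.sstats) and its cached expElogbeta had drifted apart, and sync_state recomputes expElogbeta from the state as exp(E[log beta]) under Dirichlet(eta + sstats). A standalone numpy sketch of that recomputation (toy counts; this mirrors gensim's dirichlet_expectation helper but is my own simplification):

```python
import numpy as np
from scipy.special import psi  # digamma

def dirichlet_expectation(alpha):
    """Row-wise E[log theta] for theta ~ Dirichlet(alpha)."""
    return psi(alpha) - psi(alpha.sum(axis=1, keepdims=True))

eta = 0.01
sstats = np.array([[8.0, 1.0, 1.0],   # toy topic-word counts
                   [0.0, 5.0, 5.0]])
lam = eta + sstats                    # posterior Dirichlet parameters
expElogbeta = np.exp(dirichlet_expectation(lam))
```

If sstats is overwritten without a sync_state afterwards, expElogbeta still reflects the old randomly initialized counts, which is exactly the near-uniform, incoherent topic output reported above.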

@horpto
Contributor

horpto commented Feb 13, 2019

Hi @kvvaldez, do you want to add a remark, or did you find another bug?

@gladmortal

Hi @horpto, I can see this issue is closed, but I am still facing the exact same issue @Wolfi-101 reported.

I am using the latest gensim==3.7.1.

After conversion I am getting very rare keywords; here is my malletmodel2ldamodel conversion and pyLDAvis implementation.

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=13, id2word=dictionary)
model = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(ldamallet)
model.save('ldamallet.gensim')

dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda_mallet = gensim.models.wrappers.LdaMallet.load('ldamallet.gensim')
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda_mallet, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

[screenshot: pyLDAvis view of the converted model, topics dominated by rare keywords]

Here is the output from gensim original implementation:

[screenshot: pyLDAvis view from gensim's native LDA implementation, with coherent topics]

@horpto
Contributor

horpto commented Feb 15, 2019

Hi @gladmortal
Are the steps with save and load necessary? Does this error appear in previous versions?
Can you share the corpus? I'll try to reproduce your error a bit later.

@gladmortal

gladmortal commented Feb 18, 2019

Hi @gladmortal
Are the steps with save and load necessary? Does this error appear in previous versions?
Can you share the corpus? I'll try to reproduce your error a bit later.

  1. I tried without the save and load steps and it gives the same issue; no change.
  2. I am using gensim==3.7.1; I didn't try any other version.
  3. I'll try to share part of the corpus in a while.

@paulmattheww

@kvvaldez
A version of your code worked for me:

import numpy as np
from gensim.models.ldamodel import LdaModel

def mallet_to_lda(mallet_model):
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0, iterations=1000,
        gamma_threshold=0.001,
        dtype=np.float32
    )
    # copy mallet's counts first, then sync so expElogbeta matches them
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

@deathrc

deathrc commented Dec 11, 2020

@kvvaldez
A version of your code worked for me:

import numpy as np
from gensim.models.ldamodel import LdaModel

def mallet_to_lda(mallet_model):
    model_gensim = LdaModel(
        id2word=mallet_model.id2word, num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha, eta=0, iterations=1000,
        gamma_threshold=0.001,
        dtype=np.float32
    )
    # copy mallet's counts first, then sync so expElogbeta matches them
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim

This works for me, thanks :)
