Incoherent topic word distributions after malletmodel2ldamodel
#2069
Comments
I am also having this, or a related, problem with gensim 3.1.
I tested with gensim 3.5 and encountered the same problem. This essentially makes `malletmodel2ldamodel` unusable.
@Wolfi-101 thanks for the report; issue reproduced.
Any news on this? I switched to using mallet for a study I'm doing, but would still like to use pyLDAvis for consistency with previous work. I'm stuck with either option.
@mikeyearworth this looks unrelated to the current issue; can you provide a full code example for reproducing your error, please (with all needed data, of course)?
```python
model = gensim.models.wrappers.LdaMallet('/opt/local/bin/mallet', corpus=mikeycorpus, num_topics=num_topics, id2word=mikeydictionary, workers=3)
data = pyLDAvis.gensim.prepare(model, mikeycorpus, mikeydictionary, mds='pcoa')
pyLDAvis.display(data)
```
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-83610b8776d0> in <module>()
    130
    131 model = gensim.models.wrappers.LdaMallet('/opt/local/bin/mallet', corpus=mikeycorpus, num_topics=num_topics, id2word=mikeydictionary, workers=3)
--> 132 data = pyLDAvis.gensim.prepare(model, mikeycorpus, mikeydictionary, mds='pcoa')
    133 pyLDAvis.display(data)
    134

/anaconda2/lib/python2.7/site-packages/pyLDAvis/gensim.pyc in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs)
    109     See `pyLDAvis.prepare` for **kwargs.
    110     """
--> 111     opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs)
    112     return vis_prepare(**opts)

/anaconda2/lib/python2.7/site-packages/pyLDAvis/gensim.pyc in _extract_data(topic_model, corpus, dictionary, doc_topic_dists)
     40         gamma = topic_model.inference(corpus)
     41     else:
---> 42         gamma, _ = topic_model.inference(corpus)
     43     doc_topic_dists = gamma / gamma.sum(axis=1)[:, None]
     44

AttributeError: 'LdaMallet' object has no attribute 'inference'
```

Whereas:

```python
ldamodel = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(model)
data = pyLDAvis.gensim.prepare(ldamodel, mikeycorpus, mikeydictionary, mds='pcoa')
pyLDAvis.display(data)
```

runs, but gives the degenerate output, and so on for all topics, compared to the actual model.
@mikeyearworth as far as I know, for viz …
Jeri Wieringa wrote a tutorial on using pyLDAvis with MALLET: you load the model directly from the state object and do some transformations. I was able to make this work by adapting her code. http://jeriwieringa.com/2018/07/17/pyLDAviz-and-Mallet/#comment-4018495276 So here is a workaround until malletmodel2ldamodel is fixed.
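To give a flavor of that workaround: the idea is to skip `malletmodel2ldamodel` entirely and rebuild the topic-word distribution from MALLET's own gzipped state file. The sketch below is my adaptation of the approach, not Jeri's exact code; the function name and the default `beta` smoothing value are mine. Data lines in a MALLET state file have six whitespace-separated fields (`doc source pos typeindex type topic`), and lines starting with `#` are headers or hyperparameters.

```python
import gzip

import numpy as np


def topic_word_weights(state_path, num_topics, beta=0.01):
    """Rebuild a smoothed topic-word matrix from a gzipped MALLET state file.

    Each data line looks like: doc source pos typeindex type topic
    """
    counts = {}  # (topic_id, word_id) -> token count
    vocab = {}   # word -> word_id (assigned in order of first appearance)
    with gzip.open(state_path, 'rt') as f:
        for line in f:
            if line.startswith('#'):
                continue  # skip the header and the alpha/beta lines
            parts = line.split()
            if len(parts) != 6:
                continue
            word, topic = parts[4], int(parts[5])
            word_id = vocab.setdefault(word, len(vocab))
            counts[(topic, word_id)] = counts.get((topic, word_id), 0) + 1
    # Smooth with beta, then normalize each topic row to a distribution.
    phi = np.full((num_topics, len(vocab)), beta)
    for (t, w), c in counts.items():
        phi[t, w] += c
    return phi / phi.sum(axis=1, keepdims=True), vocab
```

The resulting `phi` matrix and vocabulary can then be fed to `pyLDAvis.prepare` directly, alongside document lengths and doc-topic proportions computed the same way.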
Thanks @groceryheist (and Jeri Wieringa), that works fine. Do you know if there is a way to force the gensim wrapper for mallet to specify the state filename, or return it? mallet writes a new state file each run to an obscure location, /var/folders/1x/93zy0_k93gj_xvrk4v_j96_m0000gp/T/XXXXXX_state.mallet.gz, where XXXXXX is random each time.
The prefix parameter in the wrapper does this. |
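A minimal sketch of that, assuming gensim < 4.0 (which still ships the MALLET wrapper) and a local MALLET install; the mallet path, prefix, and toy corpus below are placeholders, and the training call is guarded so the sketch is a no-op when no binary is present:

```python
import os

mallet_path = '/opt/local/bin/mallet'  # adjust to your install
prefix = '/tmp/mallet_run_'            # all intermediate files land under this prefix

if os.path.exists(mallet_path):  # guard: skip training when MALLET is absent
    from gensim.corpora import Dictionary
    from gensim.models.wrappers import LdaMallet

    docs = [['human', 'machine', 'interface'], ['graph', 'trees', 'minors']]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    model = LdaMallet(mallet_path, corpus=corpus, num_topics=2,
                      id2word=dictionary, prefix=prefix)
    # The wrapper derives the (now predictable) state file path from the prefix:
    print(model.fstate())  # '/tmp/mallet_run_state.mallet.gz'
```

`fstate()` simply appends `state.mallet.gz` to the prefix, which is what makes the location stable across runs.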
Great! Thanks @groceryheist. |
@menshikh-iv is there any update on this? What priority does this bug fix have? Thanks :)
I've started working on this issue.
`malletmodel2ldamodel` sets up the `expElogbeta` attribute, but `LdaModel.show_topics` uses the inner (non-Dirichlet-ed) state instead. Moreover, `LdaState` and `LdaModel` were not synced.
Thanks @horpto, it works!
* Fixes #2069: wrong malletmodel2ldamodel (`malletmodel2ldamodel` set up the `expElogbeta` attribute, but `LdaModel.show_topics` used the inner, non-Dirichlet-ed state instead; moreover, `LdaState` and `LdaModel` were not synced)
* add test
* fix linter
* replace sklearn with gensim + use larger dataset & num topics (for a stricter check)
* remove sklearn import
```python
from gensim.models.ldamodel import LdaModel

def ldaMalletConvertToldaGen(mallet_model):
    ...
```
Hi @kvvaldez, do you want to add a remark, or did you find another bug?
Hi @horpto, I can see this issue is closed, but I am still facing the exact same issue @Wolfi-101 reported. I am using the latest gensim. After conversion I am getting very rare keywords; here is my output:
Here is the output from the original gensim implementation:
Hi @gladmortal |
@kvvaldez
This works for me, thanks :) |
Hi everyone,
first off, many thanks for providing such an awesome module! I am using gensim to do topic modeling with LDA and encountered the following bug/issue. I have already read about it on the mailing list, but apparently no issue has been created on GitHub.

Description
After training an LDA model with the gensim MALLET wrapper, I converted it to a native gensim LDA model via the `malletmodel2ldamodel` function provided with the wrapper. Before and after the conversion, the topic-word distributions are quite different: the ldamallet version returns comprehensible topics with sensible weights, whereas the topic-word distribution after conversion is nearly uniform, leading to topics without a clear focus.

I am assuming that the resulting topics are supposed to be at least somewhat similar before and after conversion. Am I doing something wrong? What could be causing this behaviour?
Steps/Code/Corpus to Reproduce
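The original reproduction script did not survive in this copy of the thread. A hypothetical stand-in that follows the same shape (the MALLET path and the tiny corpus are placeholders; it assumes gensim < 4.0 and is guarded so it does nothing when no MALLET binary is present):

```python
import os

mallet_path = '/path/to/mallet'  # stand-in; point at your MALLET binary

docs = [['space', 'orbit', 'nasa'],
        ['hockey', 'team', 'game'],
        ['windows', 'file', 'driver']]

if os.path.exists(mallet_path):  # guard: skip training when MALLET is absent
    from gensim.corpora import Dictionary
    from gensim.models.wrappers import LdaMallet
    from gensim.models.wrappers.ldamallet import malletmodel2ldamodel

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    lda_mallet = LdaMallet(mallet_path, corpus=corpus, num_topics=3,
                           id2word=dictionary)
    lda_gensim = malletmodel2ldamodel(lda_mallet)

    # Compare the word distributions topic by topic; with the buggy
    # conversion these differ wildly.
    print(lda_mallet.show_topics(num_topics=3, num_words=10))
    print(lda_gensim.show_topics(num_topics=3, num_words=10))
```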
Expected Results

These are the results I get from the mallet wrapper using `lda_mallet.show_topics(num_topics=5, num_words=10)`. They are what one would expect given the chosen categories from 20newsgroups.

Actual Results
These are the results I get from the converted native gensim model using `lda_gensim.show_topics(num_topics=5, num_words=10)`. The word probabilities are all very low and not very distinctive, resulting in mostly incoherent topics.

Versions
Thanks in advance for any help! Cheers,
Wolfgang