
Commit 1a650d9

Include an example of HMM-LDA (dongwookim-ml#4).
The model seems to work in terms of log-likelihood, but the qualitative analysis does not show plausible results. More work is needed on the hyper-parameters or on an n-th order HMM.
1 parent 4f16071 commit 1a650d9

12 files changed: +497 / -126 lines
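The commit message judges the model by its training log-likelihood. A minimal sketch of how one might watch that signal, assuming (as the example notebook's logging setup suggests) that HMM_LDA reports progress through the 'HMM_LDA' logger when verbose=True; the exact log messages are an assumption and are not shown in this commit:

import logging

# The example notebook silences this logger with propagate=False; attaching
# a handler instead makes per-iteration output visible during fit().
# Assumption: the sampler logs its log-likelihood here when verbose=True.
logging.basicConfig(format='%(name)s: %(message)s', level=logging.INFO)
logging.getLogger('HMM_LDA').setLevel(logging.INFO)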

README.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ Current implementations
 * Relational topic model (VI)
   * Exponential link function
 * Author-Topic model
-* HMM-LDA
+* [HMM-LDA](http://nbviewer.jupyter.org/github/arongdari/python-topic-model/blob/master/notebook/HMM_LDA_example.ipynb)
 * Discrete infinite logistic normal (DILN)
   * Variational inference
 * Supervised topic model

notebook/HMM_LDA_example.ipynb

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Example of HMM-LDA"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "from ptm.nltk_corpus import get_reuters_token_list_by_sentence\n",
+    "from ptm import HMM_LDA\n",
+    "from ptm.utils import get_top_words\n",
+    "\n",
+    "logger = logging.getLogger('HMM_LDA')\n",
+    "logger.propagate = False"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Read corpus"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`corpus` is a nested list of documents, sentences, and word tokens, respectively."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Vocabulary size 3851\n"
+     ]
+    }
+   ],
+   "source": [
+    "n_docs = 1000\n",
+    "voca, corpus = get_reuters_token_list_by_sentence(num_doc=n_docs)\n",
+    "print('Vocabulary size', len(voca))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Training HMM-LDA"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "n_docs = len(corpus)\n",
+    "n_voca = len(voca)\n",
+    "n_topic = 50\n",
+    "n_class = 20\n",
+    "max_iter = 100\n",
+    "alpha = 0.1\n",
+    "beta = 0.01\n",
+    "gamma = 0.1\n",
+    "eta = 0.1\n",
+    "model = HMM_LDA(n_docs, n_voca, n_topic, n_class, alpha=alpha, beta=beta, gamma=gamma, eta=eta, verbose=False)\n",
+    "model.fit(corpus, max_iter=max_iter)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Print top 10 words for each class and topic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Topic 0 : will,on,its,must,throughout,same,by,traditional,loss,background\n",
+      "Topic 1 : future,should,are,charge,higher,sulphur,first,an,company,letter\n",
+      "Topic 2 : ready,same,be,basis,it,will,for,at,registered,capital\n",
+      "Topic 3 : alone,great,specialty,would,unreasonable,falling,say,formed,top,declined\n",
+      "Topic 4 : offer,do,although,on,over,would,much,by,fiscal,objective\n",
+      "Topic 5 : barring,did,bearing,may,but,its,narrow,target,leading,same\n",
+      "Topic 6 : for,two,meeting,may,still,at,six,whose,become,marked\n",
+      "Topic 7 : stimulate,each,under,satisfied,at,transition,distribution,activity,for,provision\n",
+      "Topic 8 : is,difficulty,effect,top,from,nine,price,deficit,agreed,only\n",
+      "Topic 9 : for,country,pressure,increasing,will,government,its,quietly,nil,report\n",
+      "Topic 10 : petroleum,per,expectation,pollard,weight,textile,from,cocoa,absorbing,remainder\n",
+      "Topic 11 : should,but,set,shipment,much,term,same,be,practice,its\n",
+      "Topic 12 : offer,present,at,this,they,help,name,an,time,show\n",
+      "Topic 13 : would,rating,current,landing,year,long,market,after,when,its\n",
+      "Topic 14 : six,goods,national,were,commodity,massive,use,merge,confirmed,days\n",
+      "Topic 15 : trade,it,agreement,industry,those,town,from,we,number,with\n",
+      "Topic 16 : other,year,worked,be,give,it,ago,are,proposal,progress\n",
+      "Topic 17 : speculation,it,deficit,its,this,despite,an,up,large,government\n",
+      "Topic 18 : cash,corn,over,preferred,with,about,still,least,association,overseas\n",
+      "Topic 19 : trade,will,is,accrual,consider,similar,pressure,chairman,parcel,with\n",
+      "Topic 20 : trade,current,fault,or,group,week,this,an,half,one\n",
+      "Topic 21 : fiscal,for,turned,tone,similar,average,annual,it,closed,why\n",
+      "Topic 22 : weak,raising,special,contract,profit,by,while,he,would,block\n",
+      "Topic 23 : following,growth,crude,up,an,leading,business,fiscal,floating,impact\n",
+      "Topic 24 : given,another,reserve,contract,harvest,ahead,an,textile,message,dividend\n",
+      "Topic 25 : today,put,but,debt,market,seen,interest,concern,franc,week\n",
+      "Topic 26 : no,from,six,market,particularly,earn,one,measured,tender,suspension\n",
+      "Topic 27 : most,new,are,percentage,definitive,adequate,bread,business,minister,us\n",
+      "Topic 28 : trade,central,chairman,beginning,last,had,condition,when,subject,added\n",
+      "Topic 29 : certain,period,be,nil,end,issue,quarter,billion,vague,investigatory\n",
+      "Topic 30 : expire,market,underground,it,reaction,sharply,together,nil,everything,government\n",
+      "Topic 31 : exercisable,its,federal,growth,both,would,last,long,much,year\n",
+      "Topic 32 : state,unchanged,quarter,increase,want,several,rolled,we,if,for\n",
+      "Topic 33 : trade,with,being,more,is,total,principally,likely,number,margin\n",
+      "Topic 34 : posted,this,share,next,subject,dealer,executive,two,interview,which\n",
+      "Topic 35 : rise,group,friendly,be,sale,it,also,bank,for,or\n",
+      "Topic 36 : based,premium,most,from,number,last,had,fourth,make,also\n",
+      "Topic 37 : yen,stability,they,offering,billion,week,cut,under,trading,this\n",
+      "Topic 38 : nil,about,bill,re,bank,chairman,be,strong,false,closed\n",
+      "Topic 39 : trade,year,operating,line,say,equal,approach,price,search,strength\n",
+      "Topic 40 : cake,be,move,here,budget,were,should,development,shortly,by\n",
+      "Topic 41 : outstanding,exploration,its,government,number,for,all,account,monthly,week\n",
+      "Topic 42 : industrial,them,short,its,loss,be,it,from,concern,each\n",
+      "Topic 43 : had,responsible,an,unit,we,situation,well,ready,field,not\n",
+      "Topic 44 : settle,trading,see,its,from,much,output,interbank,government,for\n",
+      "Topic 45 : trade,situation,because,cost,priced,but,as,would,its,urgency\n",
+      "Topic 46 : am,major,sugar,pose,t,by,memorandum,dropping,division,were\n",
+      "Topic 47 : who,two,spokesman,cash,loss,kept,it,month,equity,daily\n",
+      "Topic 48 : saw,he,gallon,would,we,sale,season,for,year,not\n",
+      "Topic 49 : permit,be,billion,they,by,concerned,forward,overall,if,from\n"
+     ]
+    }
+   ],
+   "source": [
+    "for ti in range(n_topic):\n",
+    "    top_words = get_top_words(model.TW, voca, ti, n_words=10)\n",
+    "    print('Topic', ti, ':', ','.join(top_words))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Class 1 : were,on,per,be,it,will,an,is,year,company\n",
+      "Class 2 : at,was,have,billion,not,an,is,will,it,be\n",
+      "Class 3 : trade,will,on,is,loss,be,have,it,from,this\n",
+      "Class 4 : by,also,would,for,will,were,this,have,are,from\n",
+      "Class 5 : the,about,be,on,year,company,would,by,with,loss\n",
+      "Class 6 : the,he,billion,is,be,it,will,an,not,at\n",
+      "Class 7 : the,he,from,were,an,loss,be,it,will,nil\n",
+      "Class 8 : one,with,for,company,an,nil,billion,it,be,loss\n",
+      "Class 9 : the,be,as,was,not,will,it,nil,at,an\n",
+      "Class 10 : on,last,for,at,company,will,it,billion,be,by\n",
+      "Class 11 : the,year,for,would,from,was,be,it,will,an\n",
+      "Class 12 : or,are,it,will,for,not,at,billion,by,its\n",
+      "Class 13 : as,is,not,company,were,will,it,be,loss,at\n",
+      "Class 14 : was,its,it,be,quarter,for,billion,from,would,on\n",
+      "Class 15 : market,last,is,with,on,would,share,by,billion,be\n",
+      "Class 16 : last,on,an,its,loss,be,it,will,company,is\n",
+      "Class 17 : the,trade,this,be,was,it,will,company,for,not\n",
+      "Class 18 : the,last,will,from,billion,an,loss,be,it,its\n",
+      "Class 19 : the,of,to,in,said,and,a,for,s,on\n"
+     ]
+    }
+   ],
+   "source": [
+    "for ci in range(1, n_class):\n",
+    "    top_words = get_top_words(model.CW, voca, ci, n_words=10)\n",
+    "    print('Class', ci, ':', ','.join(top_words))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Function words belong to classes and content words belong to topics.**\n",
+    "\n",
+    "In this example, the function words are not divided very well by their syntactic roles. As in the original paper, fine-tuning or sampling the hyper-parameters, or using an n-th order Markovian assumption, may improve the results."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.4.3"
+  },
+  "toc": {
+   "toc_cell": true,
+   "toc_number_sections": true,
+   "toc_threshold": 4,
+   "toc_window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
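The notebook above stresses that `corpus` is a nested list: documents contain sentences, and sentences contain word tokens. A minimal sketch of that shape with made-up data, under the assumption that tokens are integer indices into `voca` (consistent with get_top_words mapping indices back to words):

# A hypothetical two-document corpus in the shape the notebook describes.
corpus = [
    [[0, 5, 2], [7, 1]],   # document 0: two sentences of word-token ids
    [[3, 3, 8, 2]],        # document 1: one sentence
]

print(len(corpus))         # number of documents -> 2
print(len(corpus[0]))      # sentences in document 0 -> 2
print(len(corpus[0][0]))   # tokens in its first sentence -> 3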

notebook/LDA_example.ipynb

Lines changed: 25 additions & 9 deletions
@@ -1,5 +1,13 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Table of Contents\n",
+    " <p><div class=\"lev1\"><a href=\"#Example-of-GibbsLDA-and-vbLDA\"><span class=\"toc-item-num\">1 - </span>Example of GibbsLDA and vbLDA</a></div><div class=\"lev2\"><a href=\"#Loading-Reuter-corpus-from-NLTK\"><span class=\"toc-item-num\">1.1 - </span>Loading Reuter corpus from NLTK</a></div><div class=\"lev2\"><a href=\"#Inferencen-through-the-Gibbs-sampling\"><span class=\"toc-item-num\">1.2 - </span>Inference through the Gibbs sampling</a></div><div class=\"lev3\"><a href=\"#Print-top-10-probability-words-for-each-topic\"><span class=\"toc-item-num\">1.2.1 - </span>Print top 10 probability words for each topic</a></div><div class=\"lev2\"><a href=\"#Inferencen-through-the-Variational-Bayes\"><span class=\"toc-item-num\">1.3 - </span>Inference through the Variational Bayes</a></div><div class=\"lev3\"><a href=\"#Print-top-10-probability-words-for-each-topic\"><span class=\"toc-item-num\">1.3.1 - </span>Print top 10 probability words for each topic</a></div>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -20,7 +28,7 @@
    "cell_type": "code",
    "execution_count": 1,
    "metadata": {
-    "collapsed": false
+    "collapsed": true
    },
    "outputs": [],
    "source": [
@@ -29,7 +37,7 @@
     "import numpy as np\n",
     "from ptm import GibbsLDA\n",
     "from ptm import vbLDA\n",
-    "from ptm.nltk_corpus import get_reuters_cnt_ids\n",
+    "from ptm.nltk_corpus import get_reuters_ids_cnt\n",
     "from ptm.utils import convert_cnt_to_list, get_top_words"
@@ -64,7 +72,7 @@
    ],
    "source": [
     "n_doc = 1000\n",
-    "voca, doc_ids, doc_cnt = get_reuters_cnt_ids(num_doc=n_doc, max_voca=10000)\n",
+    "voca, doc_ids, doc_cnt = get_reuters_ids_cnt(num_doc=n_doc, max_voca=10000)\n",
     "docs = convert_cnt_to_list(doc_ids, doc_cnt)\n",
     "n_voca = len(voca)\n",
     "print('Vocabulary size:%d' % n_voca)"
@@ -204,9 +212,7 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
+   "metadata": {},
    "source": [
     "### Print top 10 probability words for each topic"
    ]
@@ -372,9 +378,7 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {
-    "collapsed": true
-   },
+   "metadata": {},
    "source": [
     "### Print top 10 probability words for each topic"
    ]
@@ -436,6 +440,18 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.4.3"
+  },
+  "toc": {
+   "toc_cell": true,
+   "toc_number_sections": true,
+   "toc_threshold": 4,
+   "toc_window_display": true
+  },
+  "toc_position": {
+   "left": "1120px",
+   "right": "20px",
+   "top": "120px",
+   "width": "299px"
   }
  },
 "nbformat": 4,

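The rename from get_reuters_cnt_ids to get_reuters_ids_cnt changes the LDA example's data loading. A condensed sketch of the corrected calls, taken from the diff above (the comments are inferred from the function names, not from documentation):

from ptm.nltk_corpus import get_reuters_ids_cnt  # renamed from get_reuters_cnt_ids
from ptm.utils import convert_cnt_to_list

n_doc = 1000
# Load the vocabulary plus per-document word ids and their counts.
voca, doc_ids, doc_cnt = get_reuters_ids_cnt(num_doc=n_doc, max_voca=10000)
# Expand the (ids, counts) pairs into flat token lists for the Gibbs sampler.
docs = convert_cnt_to_list(doc_ids, doc_cnt)
print('Vocabulary size:%d' % len(voca))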
notebook/SupervisedTopicModel_example.ipynb

Lines changed: 19 additions & 3 deletions
@@ -1,5 +1,15 @@
 {
  "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "toc": "true"
+   },
+   "source": [
+    "# Table of Contents\n",
+    " <p><div class=\"lev1\"><a href=\"#Supervised-Topic-Model\"><span class=\"toc-item-num\">1 - </span>Supervised Topic Model</a></div><div class=\"lev2\"><a href=\"#Read-and-tokenize-movie-review-dataset\"><span class=\"toc-item-num\">1.1 - </span>Read and tokenize movie review dataset</a></div><div class=\"lev2\"><a href=\"#Infer-topics-with-SupervisedLDA\"><span class=\"toc-item-num\">1.2 - </span>Infer topics with SupervisedLDA</a></div>"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -42,7 +52,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Read and tokenize moview review dataset"
+    "## Read and tokenize movie review dataset"
    ]
   },
   {
@@ -110,7 +120,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Infer topics with SupervisedLDA"
+    "## Infer topics with SupervisedLDA"
    ]
   },
   {
@@ -309,7 +319,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The review about one movie, so the topics does not seem to be clearly distinguishable. At least, however, the most negative topics contain words such as `bad`, `never`, and `dull`. And the most positive topics contain word like `great`, `best`, and `masterpeice`."
+    "**The reviews are all about one movie, so the topics do not seem clearly distinguishable. Even so, the most negative topics contain words such as `bad`, `never`, and `dull`, and the most positive topics contain words like `great`, `best`, and `masterpiece`.**"
    ]
   },
   {
@@ -339,6 +349,12 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.4.3"
+  },
+  "toc": {
+   "toc_cell": true,
+   "toc_number_sections": true,
+   "toc_threshold": 4,
+   "toc_window_display": true
   }
  },
 "nbformat": 4,

ptm/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@
 from .ctm import CorrelatedTopicModel
 from .rtm import RelationalTopicModel
 from .diln import DILN
-
+from .hmm_lda import HMM_LDA
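With HMM_LDA now exported from the package root, the end-to-end usage from the new notebook condenses to the sketch below. Arguments are passed positionally as in the notebook, and the hyper-parameter values are the notebook's example settings, not tuned recommendations:

from ptm import HMM_LDA
from ptm.nltk_corpus import get_reuters_token_list_by_sentence

# corpus is a nested list: documents -> sentences -> word tokens.
voca, corpus = get_reuters_token_list_by_sentence(num_doc=1000)

# 50 content-word topics, 20 HMM classes, and the notebook's Dirichlet priors.
model = HMM_LDA(len(corpus), len(voca), 50, 20,
                alpha=0.1, beta=0.01, gamma=0.1, eta=0.1, verbose=False)
model.fit(corpus, max_iter=100)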
