This is a Python 2 implementation of the Geometric Dirichlet Means (GDM) algorithm for topic inference (M. Yurochkin, X. Nguyen, NIPS 2016) and the Conic Scan-and-Cover (CoSAC) algorithms for nonparametric topic modeling (M. Yurochkin, A. Guha, X. Nguyen, NIPS 2017). Code written by Mikhail Yurochkin.
This is a simple demonstration of GDM, CoSAC, and a Gibbs sampler (from the lda package) on simulated data. A more extensive guide is in preparation.
all_func.py implements data simulation according to the LDA model, the GDM algorithm, and the projection estimate of topic proportions.
geom_tm.py implements the CoSAC algorithm for a sparse document-term matrix and wraps it as a scikit-learn class.
tester_CoSAC.py contains a simulated example.
The implementation is designed to be used interactively (e.g. in a Python IDE such as Spyder).
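The LDA generative process used for data simulation can be sketched in plain Python as follows. This is a minimal illustration with no dependencies, not the code in all_func.py; the helper names `dirichlet`, `sample_index` and `simulate_lda`, and all parameter names and default values, are assumptions made for this example.

```python
import random

def dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    u = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if u < cum:
            return i
    return len(probs) - 1  # guard against floating-point round-off

def simulate_lda(M, N, V, K, alpha=0.1, eta=0.1, seed=0):
    """Simulate M documents of N words each over a vocabulary of size V.

    Hypothetical helper: returns (topics, proportions, counts), where
    counts[m][v] is how often word v occurs in document m.
    """
    rng = random.Random(seed)
    topics = [dirichlet([eta] * V, rng) for _ in range(K)]         # K x V
    proportions = [dirichlet([alpha] * K, rng) for _ in range(M)]  # M x K
    counts = [[0] * V for _ in range(M)]
    for m in range(M):
        for _ in range(N):
            z = sample_index(proportions[m], rng)  # topic of this word
            w = sample_index(topics[z], rng)       # word drawn from topic z
            counts[m][w] += 1
    return topics, proportions, counts

topics, proportions, counts = simulate_lda(M=20, N=50, V=30, K=3)
```

Each row of `counts` is one simulated document; the true `topics` and `proportions` are returned so that estimates can be compared against them.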
gdm(wdfn, K, ncores=-1)
wdfn:
K: number of topics to fit
ncores: number of CPUs to use for k-means
Returns: topic estimates
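In the GDM paper, topic estimates are obtained from k-means centroids by extending each centroid away from the data center. The sketch below shows only that extension step, assuming the centroids and the center are already available and a single extension factor `m` is supplied by hand. The function name `extend_centroids` and the toy values are hypothetical, and how gdm itself chooses the extension is not shown here.

```python
def extend_centroids(centroids, cent, m):
    """Move each centroid away from cent along its ray by the factor m."""
    return [[c + m * (mu_j - c) for mu_j, c in zip(mu, cent)]
            for mu in centroids]

# Toy values: three centroids around the center of a 3-word simplex.
cent = [1.0 / 3, 1.0 / 3, 1.0 / 3]
centroids = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]
topics = extend_centroids(centroids, cent, m=2.0)
```

Because both the center and the centroids sum to one, each extended point still sums to one; a value of `m` that is too large would push entries below zero, so a real implementation has to bound the extension.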
geom_tm(delta=0.4, prop_discard=0.5, prop_n=0.01, verbose=False)
Parameters:
delta: cosine cone radius
prop_discard: quantile to compute
prop_n: proportion of data to be used as outlier threshold
verbose: if True, plots as in Figure 2 will be printed
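To illustrate what a cosine cone radius means, the sketch below selects the points whose direction from the data center lies within cosine distance `delta` of a given direction `v`. It illustrates the concept only and is not the logic in geom_tm.py; `cosine_distance`, `in_cone` and the toy data are assumptions made for this example.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); both must be nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def in_cone(points, cent, v, delta):
    """Indices of points p with cosine_distance(v, p - cent) < delta."""
    selected = []
    for i, p in enumerate(points):
        centered = [a - c for a, c in zip(p, cent)]
        if cosine_distance(v, centered) < delta:
            selected.append(i)
    return selected

# Toy data: four normalized documents over a 3-word vocabulary.
points = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
cent = [sum(col) / len(points) for col in zip(*points)]
v = [a - c for a, c in zip(points[0], cent)]  # direction of the first point
cone_members = in_cone(points, cent, v, delta=0.4)
```

A larger `delta` widens the cone, so more points are grouped with the same direction; a smaller `delta` yields narrower cones.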
Methods:
fit_a(data, cent)
data: sparse document-term matrix
cent: data mean
Returns:
a_betas_: topic estimates from Algorithm 2 without the spherical k-means step
K_: estimated number of topics
fit_sph(data, cent, init=None, it=10)
data: sparse document-term matrix
cent: data mean
init, it: if init is None and fit_a was run, completes Algorithm 2, running spherical k-means for the number of iterations given by the it argument
Returns:
sph_betas_: updated topics
sph_clust_: cluster assignments
fit_all(data, cent, it=5)
Full run of Algorithm 2, followed by spherical k-means post-processing for the number of iterations given by the it argument
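The spherical k-means iterations mentioned above can be sketched generically: points and centroids are scaled to unit length, each point is assigned to the centroid with the largest cosine similarity, and each centroid is re-estimated as the normalized sum of its members. This is a textbook version in plain Python, not the code behind fit_sph; the names `normalize` and `spherical_kmeans` are assumptions, and details such as centering the data at cent are left out.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length (v must be nonzero)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def spherical_kmeans(points, init, it=5):
    """Generic spherical k-means started from the directions in init."""
    units = [normalize(p) for p in points]
    centroids = [normalize(c) for c in init]
    labels = [0] * len(units)
    for _ in range(it):
        # Assignment step: nearest centroid by cosine similarity.
        for i, u in enumerate(units):
            sims = [sum(a * b for a, b in zip(u, c)) for c in centroids]
            labels[i] = sims.index(max(sims))
        # Update step: re-estimate each direction from its members.
        for k in range(len(centroids)):
            members = [u for u, lab in zip(units, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                total = [sum(col) for col in zip(*members)]
                centroids[k] = normalize(total)
    return centroids, labels

points = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
centroids, labels = spherical_kmeans(points, init=[[1.0, 0.0], [0.0, 1.0]], it=5)
```

Here `init` plays the role of initial topic directions and `labels` the role of cluster assignments.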