Code for HKUST MATH 5472 Final Project
A Python implementation of LDA, based on the paper "Latent Dirichlet Allocation".
from preprocess import *
from main import *
corpus = preprocessing(M=200)
preprocess.py provides text preprocessing for the Associated Press (AP) corpus. preprocessing(M=200) processes the first M documents of the AP corpus and returns a list in which each element represents one document coded by {0,1}.
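One possible reading of "coded by {0,1}" is that every word in a document becomes a one-hot row over the vocabulary. The sketch below illustrates that reading only; the function name `one_hot_encode` and the token-list input format are assumptions for illustration and are not taken from preprocess.py.

```python
def one_hot_encode(docs):
    """Encode tokenized documents as lists of 0/1 rows (one row per word).

    docs: list of documents, each a list of string tokens (assumed format).
    Returns (vocab, encoded), where encoded[d][n] is a one-hot list over vocab.
    """
    # Sorted vocabulary gives every distinct token a stable column index.
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}

    encoded = []
    for doc in docs:
        rows = []
        for tok in doc:
            row = [0] * len(vocab)
            row[index[tok]] = 1  # mark the column of this token
            rows.append(row)
        encoded.append(rows)
    return vocab, encoded


if __name__ == "__main__":
    vocab, encoded = one_hot_encode([["topic", "model"], ["model"]])
    print(vocab)
    print(encoded)
```

A dense one-hot layout like this grows with document length times vocabulary size, so it is only practical for small M; the actual representation used by the project may differ.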
alpha, beta = LDA.parameter_estimation(corpus, k=10, tol=1e-6, max_iter=100)
LDA.parameter_estimation performs variational EM to estimate the Dirichlet parameter alpha and the word probabilities beta. The number of topics k must be specified.
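To make the variational step concrete, here is a minimal sketch of the per-document E-step from the standard LDA formulation: phi is updated in proportion to beta times exp(digamma(gamma)), then gamma is set to alpha plus the column sums of phi. It is not the code in main.py. The names `e_step` and `digamma` are hypothetical, the document is assumed to be a list of word indices rather than the {0,1} coding above, and a small hand-written digamma approximation stands in for a library routine so the block has no dependencies.

```python
import math


def digamma(x):
    """Approximate digamma(x) for x > 0 via recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:  # shift upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def e_step(doc, alpha, beta, tol=1e-6, max_iter=100):
    """Variational E-step for one document (illustrative sketch).

    doc:   list of word indices (assumed format)
    alpha: list of k positive Dirichlet parameters
    beta:  k x V nested list, beta[i][w] = p(word w | topic i), assumed > 0
    Returns (gamma, phi): gamma has length k, phi has one length-k row per word.
    """
    k, n_words = len(alpha), len(doc)
    gamma = [a + n_words / k for a in alpha]  # standard initialization
    phi = [[1.0 / k] * k for _ in doc]

    for _ in range(max_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        for n, w in enumerate(doc):
            row = [beta[i][w] * weights[i] for i in range(k)]
            total = sum(row)
            phi[n] = [r / total for r in row]  # normalize over topics
        new_gamma = [alpha[i] + sum(p[i] for p in phi) for i in range(k)]
        change = max(abs(a - b) for a, b in zip(new_gamma, gamma))
        gamma = new_gamma
        if change < tol:  # stop once gamma has converged
            break
    return gamma, phi


if __name__ == "__main__":
    beta = [[0.9, 0.1], [0.1, 0.9]]
    gamma, phi = e_step([0, 0, 0, 0], [1.0, 1.0], beta)
    print(gamma)
```

In a full variational EM loop, an M-step would then re-estimate alpha and beta from the gamma and phi values collected over all documents, and the two steps would alternate until the bound changes by less than a tolerance such as tol.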
See the notebooks sim_data.ipynb and ap_modeling.ipynb to reproduce the examples in the report lda.pdf.