⚡ Blazing Fast topic modelling over short texts utilizing the power of 🔢 Numpy and 🐍 Numba.
- Fast ⚡
- Scalable 💥
- High consistency and coherence 🎯
- High quality topics 🔥
- Easy visualization and inspection 👀
- Full scikit-learn compatibility 🔩
Install from PyPI:
pip install tweetopic
To use the pyLDAvis visualization features, install the package with its optional dependencies:
pip install tweetopic[viz]
👩‍💻 Usage (documentation)
For easy topic modelling, tweetopic provides the TopicPipeline class:
from tweetopic import TopicPipeline, DMM
from sklearn.feature_extraction.text import CountVectorizer
# Creating a vectorizer for extracting a document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)
# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_clusters=30, n_iterations=100, alpha=0.1, beta=0.1)
# Creating topic pipeline
pipeline = TopicPipeline(vectorizer, dmm)
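The min_df and max_df arguments decide which terms reach the topic model. In scikit-learn, an integer min_df is an absolute document count, while a float max_df is a proportion of all documents. The snippet below is a rough pure-Python sketch of that filtering idea, not scikit-learn's implementation; the helper name filter_vocabulary, the whitespace tokenization, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=15, max_df=0.1):
    """Return the sorted terms whose document frequency passes both bounds.

    Hypothetical helper: mimics the idea behind min_df (absolute count)
    and max_df (share of documents), not the real CountVectorizer code.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * n_docs  # a float max_df is a share of all documents
    return sorted(
        term for term, df in doc_freq.items() if min_df <= df <= max_count
    )


# Toy corpus of 200 documents (made up for this example).
docs = ["the vaccine"] * 16 + ["the taliban"] * 3 + ["the election"] * 181
kept = filter_vocabulary(docs)
print(kept)
```

With these settings, "taliban" is dropped for appearing in fewer than 15 documents, and "the" and "election" are dropped for appearing in more than 10% of them, so only "vaccine" is kept.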
You may fit the model with a stream of short texts:
pipeline.fit(texts)
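To give a feeling for what n_clusters, n_iterations, alpha and beta control, here is a simplified pure-Python sketch of a collapsed Gibbs sampler for a Dirichlet Multinomial Mixture, in the spirit of the sampler described by Yin and Wang (2014). It is not tweetopic's Numba implementation; the function fit_dmm, its internals, and the toy documents are hypothetical and exist only for illustration.

```python
import math
import random
from collections import Counter


def fit_dmm(docs, n_clusters, n_iterations, alpha, beta, seed=0):
    """Assign each tokenized document to one cluster (illustrative sketch)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]
    labels = [rng.randrange(n_clusters) for _ in docs]
    cluster_docs = [0] * n_clusters    # number of documents per cluster
    cluster_words = [0] * n_clusters   # number of tokens per cluster
    cluster_term = [Counter() for _ in range(n_clusters)]  # term counts

    def move(d, z, sign):
        # Add (sign=+1) or remove (sign=-1) document d from cluster z.
        cluster_docs[z] += sign
        cluster_words[z] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            cluster_term[z][w] += sign * c

    for d, z in enumerate(labels):
        move(d, z, +1)

    for _ in range(n_iterations):
        for d in range(len(docs)):
            move(d, labels[d], -1)  # take the document out of its cluster
            log_p = []
            for z in range(n_clusters):
                # alpha keeps empty clusters reachable.
                lp = math.log(cluster_docs[z] + alpha)
                # beta smooths the per-cluster word distribution.
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(cluster_term[z][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(cluster_words[z] + vocab_size * beta + i)
                log_p.append(lp)
            top = max(log_p)  # subtract the maximum for numerical stability
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(n_clusters), weights=weights)[0]
            move(d, labels[d], +1)
    return labels


# Toy, pre-tokenized documents (made up for this example).
docs = [
    ["vaccine", "coronavirus"],
    ["vaccine", "vaccines"],
    ["taliban", "afghanistan"],
    ["afghanistan", "taliban", "says"],
]
labels = fit_dmm(docs, n_clusters=3, n_iterations=20, alpha=0.1, beta=0.1)
print(labels)
```

In this sketch each document belongs to exactly one cluster, which is the assumption that makes the mixture model a natural fit for short texts. A larger alpha makes it easier for a document to move into a sparsely populated cluster, and a larger beta makes documents with few shared words more likely to end up together.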
To examine the structure of the topics you can either look at the most frequently occurring words:
pipeline.top_words(top_n=3)
-----------------------------------------------------------------
[
{'vaccine': 1011.0, 'coronavirus': 428.0, 'vaccines': 396.0},
{'afghanistan': 586.0, 'taliban': 509.0, 'says': 464.0},
{'man': 362.0, 'prison': 310.0, 'year': 288.0},
{'police': 567.0, 'floyd': 444.0, 'trial': 393.0},
{'media': 331.0, 'twitter': 321.0, 'facebook': 306.0},
...
{'pandemic': 432.0, 'year': 427.0, 'new': 422.0},
{'election': 759.0, 'trump': 573.0, 'republican': 527.0},
{'women': 91.0, 'heard': 84.0, 'depp': 76.0}
]
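If the return value has the shape printed above (a list with one dictionary per topic, mapping words to their weights), it can be turned into a compact summary with plain Python. The following is a minimal sketch under that assumption; describe_topics is a hypothetical helper, not part of tweetopic, and the sample data is copied from the first three topics of the output above.

```python
def describe_topics(top_words):
    """Format one line per topic, with words ordered by descending weight."""
    lines = []
    for index, words in enumerate(top_words):
        ranked = sorted(words.items(), key=lambda item: item[1], reverse=True)
        lines.append(f"Topic {index}: " + ", ".join(word for word, _ in ranked))
    return lines


# Sample copied from the output shown above (first three topics only).
sample = [
    {'vaccine': 1011.0, 'coronavirus': 428.0, 'vaccines': 396.0},
    {'afghanistan': 586.0, 'taliban': 509.0, 'says': 464.0},
    {'man': 362.0, 'prison': 310.0, 'year': 288.0},
]
summary = describe_topics(sample)
for line in summary:
    print(line)
```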
Or use rich visualizations provided by pyLDAvis:
pipeline.visualize(texts)
Note: pyLDAvis is an optional dependency; install it with pip install tweetopic[viz] before calling visualize.
- Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.