centre-for-humanities-computing/tweetopic

tweetopic: Blazing Fast Topic modelling for Short Texts

⚡ Blazing Fast topic modelling over short texts utilizing the power of 🔢 NumPy and 🐍 Numba.

Features

  • Fast ⚡
  • Scalable 💥
  • High consistency and coherence 🎯
  • High quality topics 🔥
  • Easy visualization and inspection 👀
  • Full scikit-learn compatibility 🔩

🛠 Installation

Install from PyPI:

pip install tweetopic

To use the pyLDAvis visualization features, install the package with its optional dependencies:

pip install tweetopic[viz]

👩‍💻 Usage (documentation)

For easy topic modelling, tweetopic provides the TopicPipeline class:

from tweetopic import TopicPipeline, DMM
from sklearn.feature_extraction.text import CountVectorizer

# Creating a vectorizer for extracting document-term matrix from the
# text corpus.
vectorizer = CountVectorizer(min_df=15, max_df=0.1)

# Creating a Dirichlet Multinomial Mixture Model with 30 components
dmm = DMM(n_clusters=30, n_iterations=100, alpha=0.1, beta=0.1)

# Creating topic pipeline
pipeline = TopicPipeline(vectorizer, dmm)

You may fit the model with a stream of short texts:

pipeline.fit(texts)
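
The snippet above does not define `texts`. As a minimal sketch, assuming the corpus arrives as one short text per line, a small generator can clean the input before fitting. The helper name `iter_texts` and the sample lines are hypothetical and not part of tweetopic:

```python
from typing import Iterable, Iterator


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield one cleaned text per non-empty input line (hypothetical helper)."""
    for line in lines:
        text = line.strip()
        if text:  # skip blank lines
            yield text


# Any iterable of strings works as input, including an open file object.
raw_lines = ["Vaccine rollout begins\n", "\n", "  Police trial continues  \n"]

# Materialize the stream so the same texts can be reused across several calls.
texts = list(iter_texts(raw_lines))
```

A generator can only be consumed once, so collecting the texts into a list is the safer choice when the same corpus is needed again after fitting, for example for visualization.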

To examine the structure of the topics, you can either look at the most frequently occurring words:

pipeline.top_words(top_n=3)
-----------------------------------------------------------------

[
    {'vaccine': 1011.0, 'coronavirus': 428.0, 'vaccines': 396.0},
    {'afghanistan': 586.0, 'taliban': 509.0, 'says': 464.0},
    {'man': 362.0, 'prison': 310.0, 'year': 288.0},
    {'police': 567.0, 'floyd': 444.0, 'trial': 393.0},
    {'media': 331.0, 'twitter': 321.0, 'facebook': 306.0},
    ...
    {'pandemic': 432.0, 'year': 427.0, 'new': 422.0},
    {'election': 759.0, 'trump': 573.0, 'republican': 527.0},
    {'women': 91.0, 'heard': 84.0, 'depp': 76.0}
]
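
To make the shape of this output concrete, here is a rough pure-Python sketch of how such a list could be derived from per-topic word counts. It assumes the values shown are word counts per topic, and it is not tweetopic's implementation; `top_words_sketch`, `vocabulary`, and the toy counts are hypothetical:

```python
from typing import Dict, List, Sequence


def top_words_sketch(
    topic_word_counts: Sequence[Sequence[float]],
    vocabulary: Sequence[str],
    top_n: int = 3,
) -> List[Dict[str, float]]:
    """Return, for each topic, its top_n words mapped to their counts."""
    result = []
    for counts in topic_word_counts:
        # Rank vocabulary indices by descending count within this topic.
        ranked = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
        result.append({vocabulary[i]: float(counts[i]) for i in ranked[:top_n]})
    return result


# Toy data: rows are topics, columns follow the order of `vocabulary`.
vocabulary = ["vaccine", "police", "trial", "coronavirus"]
topic_word_counts = [
    [10, 0, 1, 5],
    [0, 7, 6, 0],
]

top = top_words_sketch(topic_word_counts, vocabulary, top_n=2)
```

Each dictionary preserves the ranking order, so the first key of every entry is that topic's most frequent word.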

Or use rich visualizations provided by pyLDAvis:

pipeline.visualize(texts)

[Figure: pyLDAvis visualization]

Note: You must install the optional dependencies (pip install tweetopic[viz]) if you intend to use pyLDAvis.

🎓 References

  • Yin, J., & Wang, J. (2014). A Dirichlet Multinomial Mixture Model-Based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233–242). Association for Computing Machinery.