
v0.12 #668

Merged: 30 commits into master from v0.12 on Sep 11, 2022

Conversation

@MaartenGr (Owner) commented on Aug 10, 2022

Highlights:

  • Online/incremental topic modeling with .partial_fit
  • Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
    • Several parameters were added to potentially improve representations:
      • bm25_weighting
      • reduce_frequent_words
  • Expose attributes for easier access to internal data
  • Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
  • Added an example of combining BERTopic with KeyBERT (a minimal sketch follows this list)
  • Added many tests to make development more stable
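
One way to combine the two, as a minimal sketch (this assumes a fitted topic_model and the keybert package, and is not necessarily the exact recipe from the documentation):

from keybert import KeyBERT

# Sketch: refine each topic's keywords by running KeyBERT over that topic's
# representative documents (assumes `topic_model` has already been fitted)
kw_model = KeyBERT()
keybert_keywords = {}
for topic in topic_model.get_topics():
    if topic == -1:  # skip the outlier topic
        continue
    representative_docs = topic_model.get_representative_docs(topic)
    keybert_keywords[topic] = kw_model.extract_keywords(" ".join(representative_docs),
                                                        stop_words="english")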

Online/Incremental topic modeling:

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic

# Prepare documents
all_docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"]
doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]

# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Incrementally fit the topic model by training on 1000 documents at a time
for docs in doc_chunks:
    topic_model.partial_fit(docs)

Only the topics for the most recent batch of documents are tracked. If you want to use online topic modeling not for a streaming setting but merely for low-memory use cases, then it is advised to also update the .topics_ attribute yourself, as variations such as hierarchical topic modeling will not work otherwise:

# Incrementally fit the topic model by training on 1000 documents at a time and track the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics
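
After the incremental runs, the fitted model can be inspected as usual; a small usage sketch:

# Inspect the trained topics after all batches have been processed
topic_model.get_topic_info()   # overview of all topics and their sizes
topic_model.get_topic(0)       # top terms and c-TF-IDF scores for topic 0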

c-TF-IDF model:

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
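
The two new parameters mentioned in the highlights can also be combined; a sketch (whether this actually improves the representations depends on the data):

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Sketch: enable both new weighting options at once
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)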

Attributes:

| Attribute | Description |
|---|---|
| .topics_ | The topics that are generated for each document after training or updating the topic model. |
| .probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| .topic_sizes_ | The size of each topic. |
| .topic_mapper_ | A class for tracking topics and their mappings anytime they are merged or reduced. |
| .topic_representations_ | The top n terms per topic and their respective c-TF-IDF values. |
| .c_tf_idf_ | The topic-term matrix as calculated through c-TF-IDF. |
| .topic_labels_ | The default labels for each topic. |
| .custom_labels_ | Custom labels for each topic. |
| .topic_embeddings_ | The embeddings for each topic. |
| .representative_docs_ | The representative documents for each topic. |
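
A quick sketch of how these attributes are accessed after training (reusing the 20 newsgroups documents from above):

# Train a model, then read the exposed attributes directly
topics, probs = topic_model.fit_transform(all_docs)

topic_model.topic_sizes_            # number of documents per topic
topic_model.topic_representations_  # top n terms per topic with c-TF-IDF values
topic_model.c_tf_idf_               # sparse topic-term matrix as calculated through c-TF-IDF
topic_model.representative_docs_    # representative documents per topic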

Fixes:

@MaartenGr mentioned this pull request on Sep 8, 2022
@MaartenGr merged commit 09c1732 into master on Sep 11, 2022
@MaartenGr deleted the v0.12 branch on May 4, 2023