
v0.12 #668

Merged: 30 commits into master from v0.12 on Sep 11, 2022

Conversation

@MaartenGr (Owner) commented on Aug 10, 2022

Highlights:

  • Online/incremental topic modeling with .partial_fit
  • Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
    • Several parameters were added to potentially improve representations:
      • bm25_weighting
      • reduce_frequent_words
  • Expose attributes for easier access to internal data
  • Major changes to the Algorithm page of the documentation, which now contains three overviews of the algorithm
  • Added an example of combining BERTopic with KeyBERT (a minimal sketch follows this list)
  • Added many tests to make development more stable
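
One way to combine the two, as a minimal sketch (this assumes a fitted topic_model and the keybert package, and is not necessarily the exact recipe from the documentation):

from keybert import KeyBERT

# Sketch: refine each topic's keywords by running KeyBERT over that topic's
# representative documents (assumes `topic_model` has already been fitted)
kw_model = KeyBERT()
keybert_keywords = {}
for topic in topic_model.get_topics():
    if topic == -1:  # skip the outlier topic
        continue
    representative_docs = topic_model.get_representative_docs(topic)
    keybert_keywords[topic] = kw_model.extract_keywords(" ".join(representative_docs),
                                                        stop_words="english")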

Online/Incremental topic modeling:

from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic

# Prepare documents
all_docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"]
doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]

# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Incrementally fit the topic model by training on 1000 documents at a time
for docs in doc_chunks:
    topic_model.partial_fit(docs)

Only the topics for the most recent batch of documents are tracked. If you want to use online topic modeling not for a streaming setting but merely for low-memory use cases, then it is advised to also update the .topics_ attribute yourself, as variations such as hierarchical topic modeling will not work otherwise:

# Incrementally fit the topic model by training on 1000 documents at a time and track the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics
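
After the incremental runs, the fitted model can be inspected as usual; a small usage sketch:

# Inspect the trained topics after all batches have been processed
topic_model.get_topic_info()   # overview of all topics and their sizes
topic_model.get_topic(0)       # top terms and c-TF-IDF scores for topic 0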

c-TF-IDF model:

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
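
The two new parameters mentioned in the highlights can also be combined; a sketch (whether this actually improves the representations depends on the data):

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

# Sketch: enable both new weighting options at once
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)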

Attributes:

| Attribute | Description |
|---|---|
| .topics_ | The topics that are generated for each document after training or updating the topic model. |
| .probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| .topic_sizes_ | The size of each topic. |
| .topic_mapper_ | A class for tracking topics and their mappings anytime they are merged or reduced. |
| .topic_representations_ | The top n terms per topic and their respective c-TF-IDF values. |
| .c_tf_idf_ | The topic-term matrix as calculated through c-TF-IDF. |
| .topic_labels_ | The default labels for each topic. |
| .custom_labels_ | Custom labels for each topic. |
| .topic_embeddings_ | The embeddings for each topic. |
| .representative_docs_ | The representative documents for each topic. |
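
A quick sketch of how these attributes are accessed after training (reusing the 20 newsgroups documents from above):

# Train a model, then read the exposed attributes directly
topics, probs = topic_model.fit_transform(all_docs)

topic_model.topic_sizes_            # number of documents per topic
topic_model.topic_representations_  # top n terms per topic with c-TF-IDF values
topic_model.c_tf_idf_               # sparse topic-term matrix as calculated through c-TF-IDF
topic_model.representative_docs_    # representative documents per topic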

Fixes:

@MaartenGr mentioned this pull request on Sep 8, 2022
@MaartenGr merged commit 09c1732 into master on Sep 11, 2022
@MaartenGr deleted the v0.12 branch on May 4, 2023