Description
After training a model that uses the RAPIDS (cuML) UMAP and HDBSCAN and saving it to disk, reloading it and calling .transform() crashes my kernel, and I have to rerun everything from scratch. The strange thing is that this does not happen when I use the model right after training; it only happens once I have saved the model to local disk and reloaded it. I initially thought it was a memory issue, but running inference on even a single document triggers the crash and ruins all progress.
My system: Ubuntu 20.04
RAPIDS 22.12, Python 3.9.15
BERTopic 0.13
GPU: RTX 3090 Ti
I am training and running inference on ~2 million documents built from tweets. If I do not use RAPIDS, everything works fine; the problem only appears when I switch to the RAPIDS models.
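For reference, the CPU-only setup that works is essentially the same pipeline with umap-learn and hdbscan instead of cuML. This is only a rough sketch; I am assuming the parameters mirror the cuML ones shown below:

# CPU-only baseline (no crash) - sketch, parameters assumed to mirror
# the cuML models used in the GPU version further down.
from umap import UMAP
from hdbscan import HDBSCAN

umap_model_cpu = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model_cpu = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)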
The code looks like this. I am not showing the embedding creation, as that code is quite long: I use a custom HFTransformerBackend wrapping a BERT model optimized for tweets.
from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from bertopic import BERTopic
umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)
# create a custom CountVectorizer to handle the German stop words and the topic representations
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words, ngram_range=(1, 2), max_features=20000)
topic_model = BERTopic(
    embedding_model=HFTransformerBackend,  # custom backend instance (creation code not shown)
    umap_model=umap_model,                 # cuML UMAP
    hdbscan_model=hdbscan_model,           # cuML HDBSCAN
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    diversity=0.2,
    nr_topics=75,
    # calculate_probabilities=True
    # min_topic_size=int(0.001 * len(docs))
)
topics, probs = topic_model.fit_transform(docs)
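The save / reload / transform step that triggers the crash looks roughly like this. This is only a sketch; the path and the new-document list are placeholders, not my exact script:

# Sketch of the failing step (path and documents are placeholders):
# training and fit_transform work, but transform() after reloading crashes the kernel.
topic_model.save("tweet_topic_model")               # save the fitted model to local disk

loaded_model = BERTopic.load("tweet_topic_model")   # reload in a fresh session
new_topics, new_probs = loaded_model.transform(["ein einzelner Tweet"])  # <- kernel dies here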
I think the main culprit here is the HDBSCAN model, since that is the step where GPU usage maxes out at 100% before everything breaks. Please help; I have already wasted a couple of days just trying to figure this out.