
topic_model.transform() breaks the kernel and I have to restart the whole notebook again #1092

Closed
@hrishbhdalal

Description

After training the model with the RAPIDS-based UMAP and HDBSCAN and saving it to disk, reloading the model and calling .transform() crashes my kernel and I have to run the entire thing again. The strange thing is that this does not happen when I use the model right after training; it only happens after the model has been saved to the local disk and reloaded. I initially thought it was a memory issue, but running inference on even a single document also triggers the crash and ruins all the progress.

My system: Ubuntu 20.04
RAPIDS 22.12, Python 3.9.15
BERTopic 0.13
GPU: RTX 3090 Ti

I am training and running inference on roughly 2 million documents built from tweets. If I do not use RAPIDS everything works fine, but it breaks when I use RAPIDS.

The code looks like this. I am not showing the embedding creation because that code is very long; I am using a custom HFTransformerBackend for one of the BERT models optimized for tweets. A rough sketch of what such a backend looks like is given below, followed by the actual training code.
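For context only, a custom backend of this kind is usually a thin wrapper around BERTopic's BaseEmbedder. The sketch below is hypothetical, not the author's actual code: the model name, pooling strategy, and batch size are placeholders.

# Hypothetical sketch of a custom Hugging Face backend for BERTopic.
# Model name, pooling, and batch size are placeholders.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from bertopic.backend import BaseEmbedder

class HFTransformerBackend(BaseEmbedder):
    def __init__(self, model_name="cardiffnlp/twitter-xlm-roberta-base", device="cuda"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device).eval()
        self.device = device

    def embed(self, documents, verbose=False):
        # Mean-pool the last hidden state in small batches and return a numpy
        # array, which is what BERTopic expects from a backend.
        embeddings = []
        with torch.no_grad():
            for i in range(0, len(documents), 64):
                batch = self.tokenizer(documents[i:i + 64], padding=True, truncation=True,
                                       max_length=128, return_tensors="pt").to(self.device)
                hidden = self.model(**batch).last_hidden_state
                mask = batch["attention_mask"].unsqueeze(-1)
                pooled = (hidden * mask).sum(1) / mask.sum(1)
                embeddings.append(pooled.cpu().numpy())
        return np.vstack(embeddings)

The actual training code: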

from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from bertopic import BERTopic

# GPU-accelerated dimensionality reduction and clustering from cuML
umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)

# CountVectorizer to handle the custom stop words and topic representations
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words, ngram_range=(1, 2), max_features=20000)

topic_model = BERTopic(
    embedding_model=HFTransformerBackend,  # custom Hugging Face embedding backend (see sketch above)
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    diversity=0.2,
    nr_topics=75,
    # calculate_probabilities=True
    # min_topic_size=int(0.001*len(docs))
    )
topics, probs = topic_model.fit_transform(docs)
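The part that actually crashes is the reload-and-transform step afterwards, roughly like this. The save path and the test document are just placeholders; save() and load() are BERTopic's standard calls.

# Roughly the steps that trigger the crash; path and test document are placeholders.
topic_model.save("bertopic_tweets_model")  # works fine right after training

# ... later, in a fresh kernel ...
from bertopic import BERTopic
loaded_model = BERTopic.load("bertopic_tweets_model")

# even a single document brings the kernel down here
new_topics, new_probs = loaded_model.transform(["ein kurzer test tweet"])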

I think the main culprit here is the HDBSCAN model, as that is the step where the GPU maxes out at 100% and then breaks. Please help; I have already wasted a couple of days just trying to figure this out.
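One way to narrow this down is to call the reloaded sub-models one at a time, so it is clear whether the cuML UMAP transform or the HDBSCAN prediction step kills the kernel. This is only a debugging sketch: the embedding size (768) and the import path for approximate_predict are assumptions and may differ between cuML versions.

# Hypothetical debugging sketch: run the two GPU steps separately on the
# reloaded model to see which one crashes the kernel. The embedding size
# and the cuML import path for approximate_predict are assumptions.
import numpy as np
from bertopic import BERTopic
from cuml.cluster import hdbscan as cuml_hdbscan

loaded_model = BERTopic.load("bertopic_tweets_model")
test_embedding = np.random.rand(1, 768).astype(np.float32)  # stand-in for one real embedding

# Step 1: cuML UMAP reduction of the new embedding
reduced = loaded_model.umap_model.transform(test_embedding)
print("UMAP transform OK:", reduced.shape)

# Step 2: cuML HDBSCAN prediction on the reduced embedding
labels, strengths = cuml_hdbscan.approximate_predict(loaded_model.hdbscan_model, reduced)
print("HDBSCAN approximate_predict OK:", labels)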
