
topic_model.transform() breaks the kernel and I have to restart the whole notebook again #1092

Closed
@hrishbhdalal

Description

After training the model with the RAPIDS-based UMAP and HDBSCAN and saving it to disk, reloading the model and calling .transform() crashes my kernel and I have to run the entire thing again. The strange thing is that this does not happen when I use the model right after training; it only happens after the model has been saved to the local disk and reloaded. I initially thought it was a memory issue, but running inference on even a single document also triggers the crash and ruins all the progress.

My system: Ubuntu 20.04
RAPIDS 22.12, Python 3.9.15
BERTopic 0.13
GPU: RTX 3090 Ti

I am training and running inference on roughly 2 million documents built from tweets. If I do not use RAPIDS everything works fine, but it breaks when I use RAPIDS.

The code looks like this. I am not showing the embedding creation because that code is very long; I am using a custom HFTransformerBackend for one of the BERT models optimized for tweets. A rough sketch of what such a backend looks like is given below, followed by the actual training code.
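For context only, a custom backend of this kind is usually a thin wrapper around BERTopic's BaseEmbedder. The sketch below is hypothetical, not the author's actual code: the model name, pooling strategy, and batch size are placeholders.

# Hypothetical sketch of a custom Hugging Face backend for BERTopic.
# Model name, pooling, and batch size are placeholders.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from bertopic.backend import BaseEmbedder

class HFTransformerBackend(BaseEmbedder):
    def __init__(self, model_name="cardiffnlp/twitter-xlm-roberta-base", device="cuda"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(device).eval()
        self.device = device

    def embed(self, documents, verbose=False):
        # Mean-pool the last hidden state in small batches and return a numpy
        # array, which is what BERTopic expects from a backend.
        embeddings = []
        with torch.no_grad():
            for i in range(0, len(documents), 64):
                batch = self.tokenizer(documents[i:i + 64], padding=True, truncation=True,
                                       max_length=128, return_tensors="pt").to(self.device)
                hidden = self.model(**batch).last_hidden_state
                mask = batch["attention_mask"].unsqueeze(-1)
                pooled = (hidden * mask).sum(1) / mask.sum(1)
                embeddings.append(pooled.cpu().numpy())
        return np.vstack(embeddings)

The actual training code: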

from cuml.manifold import UMAP
from cuml.cluster import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from bertopic import BERTopic

# GPU-accelerated dimensionality reduction and clustering from cuML
umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0)
hdbscan_model = HDBSCAN(min_samples=2, gen_min_span_tree=True, prediction_data=True)

# CountVectorizer to handle the custom stop words and topic representations
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words, ngram_range=(1, 2), max_features=20000)

topic_model = BERTopic(
    embedding_model=HFTransformerBackend,  # custom Hugging Face embedding backend (see sketch above)
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    diversity=0.2,
    nr_topics=75,
    # calculate_probabilities=True
    # min_topic_size=int(0.001*len(docs))
    )
topics, probs = topic_model.fit_transform(docs)
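The part that actually crashes is the reload-and-transform step afterwards, roughly like this. The save path and the test document are just placeholders; save() and load() are BERTopic's standard calls.

# Roughly the steps that trigger the crash; path and test document are placeholders.
topic_model.save("bertopic_tweets_model")  # works fine right after training

# ... later, in a fresh kernel ...
from bertopic import BERTopic
loaded_model = BERTopic.load("bertopic_tweets_model")

# even a single document brings the kernel down here
new_topics, new_probs = loaded_model.transform(["ein kurzer test tweet"])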

I think the main culprit here is the HDBSCAN model, as that is the step where the GPU maxes out at 100% and then breaks. Please help; I have already wasted a couple of days just trying to figure this out.
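One way to narrow this down is to call the reloaded sub-models one at a time, so it is clear whether the cuML UMAP transform or the HDBSCAN prediction step kills the kernel. This is only a debugging sketch: the embedding size (768) and the import path for approximate_predict are assumptions and may differ between cuML versions.

# Hypothetical debugging sketch: run the two GPU steps separately on the
# reloaded model to see which one crashes the kernel. The embedding size
# and the cuML import path for approximate_predict are assumptions.
import numpy as np
from bertopic import BERTopic
from cuml.cluster import hdbscan as cuml_hdbscan

loaded_model = BERTopic.load("bertopic_tweets_model")
test_embedding = np.random.rand(1, 768).astype(np.float32)  # stand-in for one real embedding

# Step 1: cuML UMAP reduction of the new embedding
reduced = loaded_model.umap_model.transform(test_embedding)
print("UMAP transform OK:", reduced.shape)

# Step 2: cuML HDBSCAN prediction on the reduced embedding
labels, strengths = cuml_hdbscan.approximate_predict(loaded_model.hdbscan_model, reduced)
print("HDBSCAN approximate_predict OK:", labels)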
