Description
Hello,
I am working with a very large corpus of around 3M documents, so I wanted to increase min_cluster_size in HDBSCAN to 500 to decrease the number of topics. Moreover, small topics with only a few documents have no value in my research (I am looking for trending Twitter topics); a topic only matters to me if roughly 500 documents relate to it.
However, when I ran .fit_transform on these documents after setting min_cluster_size of the HDBSCAN algorithm to 500, my Google Colab runtime crashed. I have no clue what went wrong, because the crash is instantaneous and afterwards I cannot inspect the resource usage or any error output.
I assume the crash is caused by the resources, either the GPU RAM or the system RAM, being fully used, but I do not know enough to understand why that happens when min_cluster_size is increased.
Next, I tried decreasing it from 500 to 100 and got the same crash. With min_cluster_size set to 50, it did work. Apart from min_cluster_size, I used the default parameters for both UMAP and HDBSCAN (roughly as in the sketch below).
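For reference, this is roughly my setup; docs is a placeholder for my list of ~3M tweet strings, and everything not shown is left at its default:

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN

# Custom HDBSCAN with a larger min_cluster_size;
# crashed at 500 and 100, worked at 50.
hdbscan_model = HDBSCAN(min_cluster_size=500)

# UMAP and all other BERTopic settings are left at their defaults.
topic_model = BERTopic(hdbscan_model=hdbscan_model)

topics, probs = topic_model.fit_transform(docs)  # docs: list of ~3M tweet strings
```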
The issue with min_cluster_size=50 is that most of the resulting topics have fewer than 500 documents: out of 4,553 topics, 4,124 have fewer than 500 docs, so only 429 topics are "useful". In a way that is fine, because I don't want 4,553 topics anyway (that is too many), but I don't know what to do next. Is it a good idea to use .reduce_topics and set it to around 429, or will this not solve the problem? I could also keep only the 429 "big" topics and discard the rest, but then I think I lose quite a bit of information. (Both options are sketched below.)
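To be concrete, the two options I am weighing look roughly like this; I am not sure about the exact .reduce_topics signature, as it seems to differ between BERTopic versions:

```python
# Option A: merge similar topics until roughly 429 remain
# (older BERTopic versions also take the topics/probabilities here).
topic_model.reduce_topics(docs, nr_topics=429)

# Option B: keep only the "big" topics (>= 500 documents) and discard the rest.
info = topic_model.get_topic_info()
big_topics = info[(info["Topic"] != -1) & (info["Count"] >= 500)]["Topic"].tolist()
```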
One last question: out of the 3,014,471 tweets, 1,677,106 were assigned to topic -1. I know HDBSCAN takes noise into account, but this looks like a lot of noise. Tweets are short, so maybe the problem lies there, but is there a way to reduce the amount of noise without removing it entirely? I know I could use other algorithms like k-means, but I think the incorporation of noise is quite useful in my analysis, so I was wondering whether there is a "middle way".
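One idea I was considering as such a middle way, though I am not sure it is the recommended approach, is to compute soft topic probabilities and reassign only the -1 tweets to their most likely topic, roughly like this (I expect calculate_probabilities=True to be expensive on 3M documents):

```python
import numpy as np
from bertopic import BERTopic

# Same setup as above, but also return the full document-topic probability matrix.
topic_model = BERTopic(hdbscan_model=hdbscan_model, calculate_probabilities=True)
topics, probs = topic_model.fit_transform(docs)

# Reassign only the outlier tweets (-1) to their most probable topic,
# leaving the confidently clustered tweets untouched.
topics = np.array(topics)
outliers = topics == -1
topics[outliers] = np.argmax(probs[outliers], axis=1)
```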
Thank you in advance!!