Runtime crashes when increasing min_cluster_size  #1180

Closed
@sophvaladou

Hello,

I am working with a very large corpus of around 3M documents, so I wanted to increase min_cluster_size in HDBSCAN to 500 to reduce the number of topics. Small topics with only a few documents have no value for my research (I am looking for trending Twitter topics); a topic only matters if about 500 documents relate to it.

However, when I ran .fit_transform on these documents with HDBSCAN's min_cluster_size set to 500, my Google Colab runtime crashed. It crashes instantly, so I cannot see an error message or what happens to the resources.

I am fairly sure the cause is that either the GPU RAM or the system RAM is fully used. However, I do not know enough to understand why that would happen when increasing min_cluster_size.

Next, I tried decreasing it from 500 to 100, and the runtime crashed in the same way. With a value of 50, it did work. I used the default parameters for UMAP and HDBSCAN, apart from min_cluster_size.
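For reference, the setup described above can be sketched roughly as follows. This is a minimal sketch, not my exact notebook: the helper name `build_topic_model` is made up for illustration, it assumes the `bertopic` and `hdbscan` packages are installed, and it leaves every HDBSCAN argument other than min_cluster_size at that library's default.

```python
def build_topic_model(min_cluster_size=500):
    """Hypothetical helper: BERTopic with a custom HDBSCAN model.

    Imports are done inside the function so the definition itself
    does not require bertopic/hdbscan to be importable.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN

    # Only min_cluster_size is changed; everything else is the
    # hdbscan library default (an assumption about the setup).
    hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size)

    # UMAP is not passed, so BERTopic falls back to its own default.
    return BERTopic(hdbscan_model=hdbscan_model)


# Usage (docs would be the list of ~3M tweet strings):
# topic_model = build_topic_model(min_cluster_size=500)
# topics, probs = topic_model.fit_transform(docs)
```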

The issue now is that most of the topics have under 500 documents. Out of 4553 topics, 4124 have fewer than 500 documents, so only 429 topics are "useful". This is a good thing, because I don't want 4553 topics (that's too many), but I don't know what to do next. Is it a good idea to use .reduce_topics and set it to around 429, or will this not solve the problem? I could also keep only the 429 "big" topics and discard the rest, but then I think I would lose quite a lot of information.
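The split into "big" and "small" topics can be computed along these lines. This is only a sketch: it assumes `topics` is the per-document list of topic ids returned by .fit_transform, that -1 marks outliers, and that outliers should not count as a topic; the function name `count_large_topics` is made up for illustration.

```python
from collections import Counter


def count_large_topics(topics, min_docs=500, outlier_label=-1):
    """Return (number of topics, {topic_id: size} for topics with >= min_docs)."""
    # Count documents per topic, skipping the outlier label.
    sizes = Counter(t for t in topics if t != outlier_label)
    large = {t: n for t, n in sizes.items() if n >= min_docs}
    return len(sizes), large


# Toy data standing in for the real assignments:
# topic 0 has 5 docs, topic 1 has 2, topic 2 has 3, and 4 docs are outliers.
toy_topics = [0] * 5 + [1] * 2 + [2] * 3 + [-1] * 4
total, large = count_large_topics(toy_topics, min_docs=3)
print(total, large)
```

On the real data, `len(large)` with `min_docs=500` is the number I would consider passing to .reduce_topics.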

One last question: out of the 3014471 tweets, 1677106 were assigned to topic -1. I know HDBSCAN takes noise into account, but this looks like a lot of noise. Tweets are short, so maybe the problem lies there, but is there a way to reduce the noise without removing it entirely? I know I could use other algorithms like k-means, but I think modelling noise is quite useful in my analysis, so I was wondering whether there is a "middle way".
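To put a number on "a lot of noise", the share of outliers can be worked out as below. The counts are the ones quoted above; the helper `outlier_share` is a hypothetical name, and it assumes -1 is the outlier label in the topic list.

```python
def outlier_share(topics, outlier_label=-1):
    """Fraction of documents assigned to the outlier label."""
    if not topics:
        raise ValueError("topics must not be empty")
    n_outliers = sum(1 for t in topics if t == outlier_label)
    return n_outliers / len(topics)


# Using the counts reported in this issue directly:
reported_share = 1_677_106 / 3_014_471
print(f"{reported_share:.1%}")  # a bit more than half of all tweets
```

So slightly more than half of the corpus ends up in topic -1.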

Thank you in advance!!
