[FEA] Reduce peak memory requirements of HDBSCAN fit with large min_samples #5357

Open
@beckernick

Description

As min_cluster_size increases, cuML requires significantly more memory to fit HDBSCAN than the CPU version does. This can cause users to unexpectedly run into memory issues when switching from CPU to GPU execution (e.g., here).

In the example below, cuML requires roughly the same amount of peak memory as the CPU version when min_cluster_size=5 (the default). However, cuML requires twice as much peak memory as the CPU version when min_cluster_size=500 (20 GB vs. 10 GB). For the large workloads in which people would be most likely to use cuML, this can cause out-of-memory errors. The pattern appears to persist, or worsen slightly, as the number of records increases: with 800,000 samples, cuML requires 36-37 GB of memory while the CPU version requires only 16 GB.

We should try to reduce the peak memory requirements for HDBSCAN with large min_cluster_size values. If this can only be done performantly in conjunction with something like rapidsai/raft#543, we should add a note to the documentation about memory and min_cluster_size and re-evaluate while exploring nn-descent.

The example below uses cuML 23.04 and hdbscan 0.8.29 in a Jupyter notebook (hence the %%memit IPython magic).

# !pip install memory_profiler
# https://github.com/pythonprofilers/memory_profiler

%load_ext memory_profiler

import cuml
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=400000,
    n_features=5,
    random_state=12
)
# Peak memory: 4 GB
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=5,
)
clusterer.fit(X)
# Peak memory: 20 GB
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=500,
)
clusterer.fit(X)
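The GPU peak-memory figures in the comments above were observed by monitoring the device externally. As a minimal sketch (assuming RMM's StatisticsResourceAdaptor and its allocation_counts property, as documented in recent RMM releases), the high-water mark of RMM allocations could also be captured programmatically:

import rmm
import cuml
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=400000, n_features=5, random_state=12)

# Wrap the default device resource so every RMM allocation is tracked
stats_mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.CudaMemoryResource())
rmm.mr.set_current_device_resource(stats_mr)

clusterer = cuml.cluster.hdbscan.HDBSCAN(min_cluster_size=500)
clusterer.fit(X)

# peak_bytes is the high-water mark of bytes allocated through RMM
print(stats_mr.allocation_counts["peak_bytes"] / 2**30, "GiB peak")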
%%memit
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,
)
clusterer.fit(X)

peak memory: 3886.86 MiB, increment: 146.93 MiB

%%memit
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=500,
)
clusterer.fit(X)

peak memory: 9864.25 MiB, increment: 6035.55 MiB

EDIT: See comment. This is related to the core point neighbors (min_samples).
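If cuML follows the reference hdbscan behavior, min_samples defaults to min_cluster_size when left unset, which would explain why a large min_cluster_size inflates the core-point neighbor computation. A possible mitigation (a sketch under that assumption, not a verified fix) is to pin min_samples to a small explicit value while keeping min_cluster_size large:

# Assumes min_samples=None falls back to min_cluster_size, as in the
# reference hdbscan library; pinning it keeps the neighbor search small.
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=500,
    min_samples=10,  # hypothetical small value instead of inheriting 500
)
clusterer.fit(X)

Note that min_samples also affects the density estimates, so this changes the clustering itself rather than just the memory footprint.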
