Description
cuML requires significantly more memory to fit HDBSCAN as min_cluster_size
increases than the CPU version does. This can cause users to unexpectedly run into memory issues when switching from CPU to GPU execution (e.g., here).
In the example below, cuML requires roughly the same amount of peak memory as the CPU version when min_cluster_size=5 (the default). However, cuML requires twice as much peak memory as the CPU version when min_cluster_size=500 (20 GB vs. 10 GB). For large workloads, where people would be most likely to use cuML, this can cause out-of-memory errors. This pattern appears to hold or get slightly worse as the number of records increases: with 800,000 samples, cuML requires 36-37 GB of memory while the CPU version requires only 16 GB.
We should try to reduce the peak memory requirements for HDBSCAN with large min_cluster_size values. If this can only be done performantly in conjunction with something like rapidsai/raft#543, we should add a note to the documentation about memory and min_cluster_size, and re-evaluate once nn-descent has been explored.
The example below uses cuML 23.04 and hdbscan 0.8.29 in a Jupyter Notebook (hence the use of the %%memit IPython magic).
# !pip install memory_profiler
# https://github.com/pythonprofilers/memory_profiler
%load_ext memory_profiler
import cuml
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs
X, y = make_blobs(
    n_samples=400000,
    n_features=5,
    random_state=12
)
# Peak memory: 4 GB
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=5,
)
clusterer.fit(X)
# Peak memory: 20 GB
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=500,
)
clusterer.fit(X)
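The cuML peak-memory figures above are only recorded as comments; the measurement code isn't shown. Below is a minimal sketch (my addition, not from the original report) of one way to approximate peak GPU memory by polling NVML from a background thread with pynvml. The helper name peak_gpu_memory and the polling interval are assumptions, and if an RMM pool allocator is enabled, NVML reports the pool footprint rather than individual allocations.

```python
# Hypothetical helper (my addition): approximate peak GPU memory during a
# call by polling NVML from a background thread.
import threading
import time

import pynvml


def peak_gpu_memory(fn, *args, interval=0.01, device_index=0, **kwargs):
    """Run fn(*args, **kwargs) and return (result, peak GPU bytes used)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    peak = 0
    stop = threading.Event()

    def poll():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, pynvml.nvmlDeviceGetMemoryInfo(handle).used)
            time.sleep(interval)

    poller = threading.Thread(target=poll, daemon=True)
    poller.start()
    try:
        result = fn(*args, **kwargs)
    finally:
        stop.set()
        poller.join()
        pynvml.nvmlShutdown()
    return result, peak


# Usage (hypothetical):
# _, peak_bytes = peak_gpu_memory(clusterer.fit, X)
# print(f"peak GPU memory: {peak_bytes / 1e9:.1f} GB")
```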
%%memit
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,
)
clusterer.fit(X)
peak memory: 3886.86 MiB, increment: 146.93 MiB
%%memit
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=500,
)
clusterer.fit(X)
peak memory: 9864.25 MiB, increment: 6035.55 MiB
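As a script-friendly alternative to the %%memit cell magic (my addition, reusing X and hdbscan from the cells above), memory_profiler's memory_usage function can drive a sweep over min_cluster_size values; the specific values swept here are illustrative.

```python
# Sketch (my addition): sweep min_cluster_size and record peak host RAM.
from memory_profiler import memory_usage

for min_cluster_size in (5, 50, 500):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    samples = memory_usage((clusterer.fit, (X,)), interval=0.1)
    print(f"min_cluster_size={min_cluster_size}: peak {max(samples):.0f} MiB")
```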
EDIT: See comment. This is related to the core point neighbors (min_samples).
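For context, a rough back-of-envelope estimate (my assumption-laden reasoning, not a confirmed breakdown of the 20 GB): if min_samples defaults to min_cluster_size when it is not set, then min_cluster_size=500 implies computing and storing 500 core-point neighbors per sample, so the core-distance kNN graph alone grows linearly with min_cluster_size.

```python
# Rough estimate (my addition) of how just the core-distance kNN graph
# scales with min_samples; actual cuML internals likely keep additional
# intermediate buffers, so this is only a lower bound on the extra memory.
n_samples = 400_000
min_samples = 500        # assumed to default to min_cluster_size when unset
bytes_per_entry = 4 + 8  # e.g., float32 distance + int64 neighbor index
knn_bytes = n_samples * min_samples * bytes_per_entry
print(f"kNN graph alone: ~{knn_bytes / 1e9:.1f} GB")  # ~2.4 GB
```

If that is the mechanism, explicitly passing a smaller min_samples (which does change the clustering results) could be a useful point of comparison when profiling.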