[FEA] Reduce peak memory requirements of HDBSCAN fit with large min_samples #5357

Open
@beckernick

Description

As min_cluster_size increases, cuML requires significantly more memory to fit HDBSCAN than the CPU version does. This can cause users to unexpectedly run into memory issues when switching from CPU to GPU execution (e.g., here).

In the example below, cuML requires roughly the same amount of peak memory as the CPU version when min_cluster_size=5 (the default). However, cuML requires twice as much peak memory as the CPU version when min_cluster_size=500 (20 GB vs. 10 GB). For the large workloads in which people would be most likely to use cuML, this can cause out-of-memory errors. The pattern appears to persist, or worsen slightly, as the number of records increases: with 800,000 samples, cuML requires 36-37 GB of memory while the CPU version requires only 16 GB.

We should try to reduce the peak memory requirements for HDBSCAN with large min_cluster_size values. If this can only be done performantly in conjunction with something like rapidsai/raft#543, we should add a note to the documentation about memory and min_cluster_size and re-evaluate while exploring nn-descent.

The example below uses cuML 23.04 and hdbscan 0.8.29 in a Jupyter notebook (hence the %%memit IPython magic).

# !pip install memory_profiler
# https://github.com/pythonprofilers/memory_profiler

%load_ext memory_profiler

import cuml
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=400000,
    n_features=5,
    random_state=12
)
# Peak memory: 4 GB
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=5,
)
clusterer.fit(X)
# Peak memory: 20 GB
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=500,
)
clusterer.fit(X)
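The GPU peak-memory figures in the comments above were observed by monitoring the device externally. As a minimal sketch (assuming RMM's StatisticsResourceAdaptor and its allocation_counts property, as documented in recent RMM releases), the high-water mark of RMM allocations could also be captured programmatically:

import rmm
import cuml
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=400000, n_features=5, random_state=12)

# Wrap the default device resource so every RMM allocation is tracked
stats_mr = rmm.mr.StatisticsResourceAdaptor(rmm.mr.CudaMemoryResource())
rmm.mr.set_current_device_resource(stats_mr)

clusterer = cuml.cluster.hdbscan.HDBSCAN(min_cluster_size=500)
clusterer.fit(X)

# peak_bytes is the high-water mark of bytes allocated through RMM
print(stats_mr.allocation_counts["peak_bytes"] / 2**30, "GiB peak")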
%%memit
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=5,
)
clusterer.fit(X)

peak memory: 3886.86 MiB, increment: 146.93 MiB

%%memit
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=500,
)
clusterer.fit(X)

peak memory: 9864.25 MiB, increment: 6035.55 MiB

EDIT: See comment. This is related to the core point neighbors (min_samples).
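If cuML follows the reference hdbscan behavior, min_samples defaults to min_cluster_size when left unset, which would explain why a large min_cluster_size inflates the core-point neighbor computation. A possible mitigation (a sketch under that assumption, not a verified fix) is to pin min_samples to a small explicit value while keeping min_cluster_size large:

# Assumes min_samples=None falls back to min_cluster_size, as in the
# reference hdbscan library; pinning it keeps the neighbor search small.
clusterer = cuml.cluster.hdbscan.HDBSCAN(
    min_cluster_size=500,
    min_samples=10,  # hypothetical small value instead of inheriting 500
)
clusterer.fit(X)

Note that min_samples also affects the density estimates, so this changes the clustering itself rather than just the memory footprint.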
