Modify HDBSCAN membership_vector batch_size check #5455

tarang-jain · 2023-06-05T18:34:21Z

python/cuml/cluster/hdbscan/prediction.pyx

csadorf

Thanks for the fix!

csadorf · 2023-06-05T19:38:00Z

python/cuml/cluster/hdbscan/prediction.pyx

@@ -160,6 +162,9 @@ def all_points_membership_vectors(clusterer, batch_size=4096):
        cluster ``j`` is in ``membership_vectors[i, j]``.
    """

+    if batch_size <= 0:
+        raise ValueError("batch_size should be in integer that is > 0")


There is a typo here, and I would also suggest slightly different wording, since this is not only a recommendation but a requirement.

Suggested change

-        raise ValueError("batch_size should be in integer that is > 0")
+        raise ValueError("batch_size must be > 0")
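
For illustration, here is a minimal standalone sketch of the check using the suggested wording. `validate_batch_size` is a hypothetical helper introduced only for this example; in the PR the check sits inline in `all_points_membership_vectors`, and nothing here is cuML's actual code.

```python
def validate_batch_size(batch_size):
    """Hypothetical helper mirroring the inline check proposed in this PR."""
    # Zero and negative values are rejected; any positive value passes through.
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    return batch_size


if __name__ == "__main__":
    print(validate_batch_size(4096))
    try:
        validate_batch_size(0)
    except ValueError as err:
        print(err)
```

The message states the requirement directly, so a caller who passes `0` or a negative value sees exactly which constraint was violated.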

csadorf · 2023-06-05T19:47:55Z

python/cuml/cluster/hdbscan/prediction.pyx

+        the prediction data is less than 4096, this defaults to the
+        number of rows. If a batch size larger than the number of rows in
+        the prediction data is passed, the batch size used is the number
+        of rows in the prediction data.


I would recommend the following wording for this docstring:

Lowers memory requirement by computing distance-based membership
in smaller batches of points in the prediction data. For example, a batch size
of 1,000 computes distance-based memberships for 1,000 points at a time.
The default batch size is 4,096.

I would argue that reducing the batch size to the number of points to predict, when that number is smaller, is self-evident and irrelevant to the user. If you want to keep that information in the docs, I would recommend the following phrasing:

The default batch size is 4,096 or the number of points to predict (whichever is smaller).
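
To make the "whichever is smaller" behavior concrete, here is a rough sketch of how the effective batch size could be derived and used to split rows into batches. `iter_batches` and `n_rows` are hypothetical names for this example only; the real implementation in `prediction.pyx` may be structured differently.

```python
def iter_batches(n_rows, batch_size=4096):
    """Yield (start, stop) row ranges. Hypothetical helper, not cuML code."""
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    # Clamp: a batch never needs to hold more rows than there are to predict.
    effective = min(batch_size, n_rows)
    if effective == 0:
        # Nothing to predict, so there are no batches to yield.
        return
    for start in range(0, n_rows, effective):
        yield start, min(start + effective, n_rows)


if __name__ == "__main__":
    print(list(iter_batches(10, batch_size=4)))  # three batches, last one short
    print(list(iter_batches(3)))                 # one batch covering all rows
```

With the default of 4,096 and only 3 rows, the sketch produces a single batch covering all rows, which matches the behavior the docstring describes.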

csadorf · 2023-06-05T19:48:11Z

python/cuml/cluster/hdbscan/prediction.pyx

@@ -300,6 +307,9 @@ def membership_vector(clusterer, points_to_predict, batch_size=4096, convert_dty
                         "Please call clusterer.fit again with "
                         "prediction_data=True")

+    if batch_size <= 0:
+        raise ValueError("batch_size should be in integer that is > 0")


Suggested change

-        raise ValueError("batch_size should be in integer that is > 0")
+        raise ValueError("batch_size must be > 0")

csadorf · 2023-06-06T19:48:20Z

/merge

Modify batch_size check

4172c2b

tarang-jain requested a review from a team as a code owner June 5, 2023 18:34

github-actions bot added the Cython / Python label Jun 5, 2023

tarang-jain changed the title ~~Modify batch_size check~~ Modify HDBSCAN membership_vector batch_size check Jun 5, 2023

tarang-jain added the non-breaking, bug, and 3 - Ready for Review labels Jun 5, 2023

beckernick reviewed Jun 5, 2023

python/cuml/cluster/hdbscan/prediction.pyx (outdated, resolved)

csadorf requested changes Jun 5, 2023

tarang-jain added 2 commits June 5, 2023 13:25

Updates after PR reviews

c88dffb

fix spacing

ba4ee06

csadorf approved these changes Jun 6, 2023

rapids-bot bot merged commit 20bd4c9 into rapidsai:branch-23.08 Jun 6, 2023

beckernick mentioned this pull request Jun 6, 2023

model.transform() throwing error when using cuml for HDBSCAN with calculate_probabilities=True MaartenGr/BERTopic#1317

Closed

tarang-jain deleted the bug-hdbscan-batchsize branch June 7, 2023 20:38
