Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify HDBSCAN membership_vector batch_size check #5455

Merged
merged 3 commits into from
Jun 6, 2023

Conversation

tarang-jain
Copy link
Contributor

Closes #5453

@tarang-jain tarang-jain requested a review from a team as a code owner June 5, 2023 18:34
@github-actions github-actions bot added the Cython / Python Cython or Python issue label Jun 5, 2023
@tarang-jain tarang-jain changed the title Modify batch_size check Modify HDBSCAN membership_vector batch_size check Jun 5, 2023
@tarang-jain tarang-jain added non-breaking Non-breaking change bug Something isn't working 3 - Ready for Review Ready for review by team labels Jun 5, 2023
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@@ -160,6 +162,9 @@ def all_points_membership_vectors(clusterer, batch_size=4096):
cluster ``j`` is in ``membership_vectors[i, j]``.
"""

if batch_size <= 0:
raise ValueError("batch_size should be in integer that is > 0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo here and in addition I would suggest a slightly different wording since this not only a recommendation, but a requirement.

Suggested change
raise ValueError("batch_size should be in integer that is > 0")
raise ValueError("batch_size must be > 0")

the prediction data is less than 4096, this defaults to the
number of rows. If a batch size larger than the number of rows in
the prediction data is passed, the batch size used is the number
of rows in the prediction data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend the following wording of this doc-string here:

Lowers memory requirement by computing distance-based membership
in smaller batches of points in the prediction data. For example, a batch size
of 1,000 computes distance based memberships for 1,000 points at a time.
The default batch size is 4,069.

I would argue that the fact that the batch size is reduced to the number of points to predict in case that it is smaller is self-evident and irrelevant to the user. If you want to keep that information in the docs, then I would recommend the following phrasing:

The default batch size is 4,096 or the number of points to predict (whichever is smaller).

@@ -300,6 +307,9 @@ def membership_vector(clusterer, points_to_predict, batch_size=4096, convert_dty
"Please call clusterer.fit again with "
"prediction_data=True")

if batch_size <= 0:
raise ValueError("batch_size should be in integer that is > 0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError("batch_size should be in integer that is > 0")
raise ValueError("batch_size must be > 0")

@csadorf
Copy link
Contributor

csadorf commented Jun 6, 2023

/merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working Cython / Python Cython or Python issue non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] HDBSCAN membership_vector doesn't gracefully handle points_to_predict smaller than batch_size
3 participants