-
Notifications
You must be signed in to change notification settings - Fork 564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modify HDBSCAN membership_vector batch_size check #5455
Modify HDBSCAN membership_vector batch_size check #5455
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix!
@@ -160,6 +162,9 @@ def all_points_membership_vectors(clusterer, batch_size=4096): | |||
cluster ``j`` is in ``membership_vectors[i, j]``. | |||
""" | |||
|
|||
if batch_size <= 0: | |||
raise ValueError("batch_size should be in integer that is > 0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a typo here and in addition I would suggest a slightly different wording since this not only a recommendation, but a requirement.
raise ValueError("batch_size should be in integer that is > 0") | |
raise ValueError("batch_size must be > 0") |
the prediction data is less than 4096, this defaults to the | ||
number of rows. If a batch size larger than the number of rows in | ||
the prediction data is passed, the batch size used is the number | ||
of rows in the prediction data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend the following wording of this doc-string here:
Lowers memory requirement by computing distance-based membership
in smaller batches of points in the prediction data. For example, a batch size
of 1,000 computes distance based memberships for 1,000 points at a time.
The default batch size is 4,069.
I would argue that the fact that the batch size is reduced to the number of points to predict in case that it is smaller is self-evident and irrelevant to the user. If you want to keep that information in the docs, then I would recommend the following phrasing:
The default batch size is 4,096 or the number of points to predict (whichever is smaller).
@@ -300,6 +307,9 @@ def membership_vector(clusterer, points_to_predict, batch_size=4096, convert_dty | |||
"Please call clusterer.fit again with " | |||
"prediction_data=True") | |||
|
|||
if batch_size <= 0: | |||
raise ValueError("batch_size should be in integer that is > 0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise ValueError("batch_size should be in integer that is > 0") | |
raise ValueError("batch_size must be > 0") |
/merge |
Closes #5453