Bugs/1592 Resolved two bugs in BatchParallel Clustering #1593

@mrfh92 mrfh92 commented Aug 1, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

Issue/s resolved: #1592

Changes proposed:

  • Output labels had dtype int32 as Heat arrays, but the underlying torch tensors were int64; this mismatch causes problems in computations that follow the clustering. The torch labels are now cast to int32. This is safe in practice, since nobody will use more cluster centers than int32 can index (2^31 - 1).
  • torch.multinomial, used for the K-Means++ initialization on each MPI process, is limited to 2^24 elements (at least on GPU). If a process holds more than 2^24 elements to cluster, we now draw a uniform subsample before running K-Means++.
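
The second change can be illustrated with a minimal sketch. It is not the code from this PR: it uses only the Python standard library, works on 1-D data, and lets `random.choices` stand in for `torch.multinomial`. The helper names `subsample_indices` and `kmeans_plus_plus_1d` are hypothetical; only the cap of 2**24 - 1 is taken from the `max_samples` default in the diff.

```python
import random

# Cap taken from the PR's default for max_samples; everything else is illustrative.
MAX_SAMPLES = 2**24 - 1


def subsample_indices(n_points, max_samples=MAX_SAMPLES, rng=None):
    """Return all indices if n_points fits under the cap,
    otherwise a uniform subsample drawn without replacement."""
    rng = rng or random.Random()
    if n_points <= max_samples:
        return list(range(n_points))
    return rng.sample(range(n_points), max_samples)


def kmeans_plus_plus_1d(points, n_clusters, max_samples=MAX_SAMPLES, seed=None):
    """K-Means++ seeding on 1-D data, restricted to a uniform subsample.

    random.choices is a stand-in for torch.multinomial (an assumption made
    so the sketch has no third-party dependency).
    """
    rng = random.Random(seed)
    idx = subsample_indices(len(points), max_samples, rng)
    candidates = [points[i] for i in idx]
    centers = [rng.choice(candidates)]
    while len(centers) < n_clusters:
        # Squared distance of each candidate to its nearest chosen center.
        weights = [min((x - c) ** 2 for c in centers) for x in candidates]
        if sum(weights) == 0:
            # All candidates coincide with a center; fall back to uniform choice.
            centers.append(rng.choice(candidates))
            continue
        centers.append(rng.choices(candidates, weights=weights, k=1)[0])
    return centers
```

Because the subsample is uniform, the weighted draw only ever sees at most `max_samples` candidates, which is the property the fix relies on to stay under the sampling limit.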

Type of change

Memory requirements

Performance

Does this change modify the behaviour of other functions? If so, which?

yes / no

@mrfh92 mrfh92 added bug Something isn't working cluster high-level functions High-level machine-learning algorithms labels Aug 1, 2024
mrfh92 added 2 commits August 5, 2024 09:23
…ustering_initialization_limited_to_2_24_elements_per_MPI-process_int32/int64-datatype-mismatch_in_labels
…ustering_initialization_limited_to_2_24_elements_per_MPI-process_int32/int64-datatype-mismatch_in_labels

github-actions bot commented Aug 8, 2024

Thank you for the PR!

codecov bot commented Aug 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.04%. Comparing base (00119f6) to head (3c636cd).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1593   +/-   ##
=======================================
  Coverage   92.04%   92.04%           
=======================================
  Files          83       83           
  Lines       12110    12113    +3     
=======================================
+ Hits        11147    11150    +3     
  Misses        963      963           
Flag Coverage Δ
unit 92.04% <100.00%> (+<0.01%) ⬆️

@mrfh92 mrfh92 requested review from mtar and JuanPedroGHM August 8, 2024 08:19

@JuanPedroGHM JuanPedroGHM left a comment

Left some comments, not sure if this is the proper way to handle the bugs.

@@ -289,7 +293,7 @@ def predict(self, x: DNDarray):

  local_labels = _parallel_batched_kmex_predict(
      x.larray, self._cluster_centers.larray, self._p
- )
+ ).to(torch.int32)
Member

Why not do it the other way? Set the heat array to the proper output type? I get the argument that it is an unlikely number of clusters, but it could theoretically happen.

Collaborator Author

I also thought about this and my arguments for the chosen solution were:

  • int32 saves 50% of memory compared to int64 when the clustering result is processed further
  • in theory, more cluster centers than int32 can index are conceivable, but in practice this is completely out of scope: the runtime of our clustering algorithms depends heavily on the number of cluster centers, and the usual reason for clustering is to gain insight into the structure of the data by grouping it into a comparatively small number of clusters.
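
As a rough check of both arguments, the arithmetic can be written out. This is a back-of-the-envelope sketch, not Heat code: the helper names are hypothetical, and it assumes one label per sample stored at 4 bytes (int32) or 8 bytes (int64).

```python
INT32_BYTES = 4          # size of one int32 label
INT64_BYTES = 8          # size of one int64 label
INT32_MAX = 2**31 - 1    # largest value an int32 label can hold


def label_memory_bytes(n_samples, bytes_per_label):
    """Memory needed to store one label per sample."""
    return n_samples * bytes_per_label


def fits_int32(n_clusters):
    """Labels run from 0 to n_clusters - 1, so the largest label must fit."""
    return n_clusters - 1 <= INT32_MAX


n = 10**9  # hypothetical number of samples
saved = 1 - label_memory_bytes(n, INT32_BYTES) / label_memory_bytes(n, INT64_BYTES)
```

For any number of samples the ratio is the same, so `saved` is 0.5, and `fits_int32` only fails once more than 2**31 clusters are requested.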

@@ -19,20 +19,24 @@
 """


-def _initialize_plus_plus(X, n_clusters, p, random_state=None):
+def _initialize_plus_plus(X, n_clusters, p, random_state=None, max_samples=2**24 - 1):
Member

Some unsuspecting user could try to change this value to something higher and run into the torch limit. Should we hard-code it?

Collaborator Author

@mrfh92 mrfh92 Aug 12, 2024

Actually, this is already hard-coded, since this is an auxiliary function that is not exposed to the user directly.
The reason for introducing max_samples as a parameter was to keep some flexibility for adapting it in the future.

Collaborator Author

I have added a comment in the function's description.

heat/cluster/batchparallelclustering.py (outdated review thread, resolved)
…ustering_initialization_limited_to_2_24_elements_per_MPI-process_int32/int64-datatype-mismatch_in_labels
…ustering_initialization_limited_to_2_24_elements_per_MPI-process_int32/int64-datatype-mismatch_in_labels
@mrfh92 mrfh92 merged commit 19a6bd9 into main Aug 13, 2024
6 checks passed
@mrfh92 mrfh92 deleted the bugs/1592-_Bug_Two_bugs_in_batch-parallel_clustering_initialization_limited_to_2_24_elements_per_MPI-process_int32/int64-datatype-mismatch_in_labels branch August 13, 2024 07:33