Skip to content

[FEA] Build Single Linkage Linkage API #820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 23 commits into
base: branch-25.06
Choose a base branch
from

Conversation

tarang-jain
Copy link
Contributor

@tarang-jain tarang-jain commented Apr 12, 2025

  • Expose functions for building the dendrogram on the mutual reachability graph

Copy link

copy-pr-bot bot commented Apr 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tarang-jain tarang-jain self-assigned this Apr 12, 2025
@github-actions github-actions bot added the cpp label Apr 12, 2025
@tarang-jain tarang-jain added feature request New feature or request non-breaking Introduces a non-breaking change labels Apr 17, 2025
@tarang-jain tarang-jain marked this pull request as ready for review April 18, 2025 03:15
@tarang-jain tarang-jain requested a review from a team as a code owner April 18, 2025 03:15
@tarang-jain
Copy link
Contributor Author

/ok to test

Copy link

copy-pr-bot bot commented Apr 18, 2025

/ok to test

@tarang-jain, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@tarang-jain
Copy link
Contributor Author

/ok to test 2738db7

@tarang-jain
Copy link
Contributor Author

Notes:
These functions have been migrated from HDBSCAN runner in cuML and exposed as a new public API, that is not tested directly here. Instead the full usage is tested end-to-end in HDBSCAN inside cuML. To veify correctness, the cuML PR (rapidsai/cuml#6560), that uses this, currently fetches cuVS from the branch of this cuVS PR and the CI status on that PR can be checked. Once this PR is merged, I will revert the CMake changes to find the correct cuvs version in rapidsai/cuml#6560

Copy link
Contributor

@jinsolp jinsolp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR Tarang! LGTM, just a few small suggestions : )

tarang-jain and others added 3 commits April 18, 2025 18:51
Co-authored-by: Jinsol Park <jinsolp@nvidia.com>
Co-authored-by: Jinsol Park <jinsolp@nvidia.com>
@tarang-jain
Copy link
Contributor Author

/ok to test

Copy link

copy-pr-bot bot commented Apr 19, 2025

/ok to test

@tarang-jain, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@tarang-jain
Copy link
Contributor Author

/ok to test a35ad32

@divyegala
Copy link
Member

/ok to test e0c2171

Copy link

copy-pr-bot bot commented Apr 22, 2025

/ok to test

@tarang-jain, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@tarang-jain
Copy link
Contributor Author

/ok to test e0c2171

cuvs::distance::DistanceType metric,
raft::device_vector_view<float, int> core_dists,
raft::device_vector_view<int, int> indptr,
raft::sparse::COO<float, int>& mutual_reachability_coo,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the public API, can we please use raft::sparse::device_coo_matrix_view for this? I'd prefer not to use deprecated RAFT APIs in cuVS public APIs

raft::device_vector_view<int, int> indptr,
raft::sparse::COO<float, int>& mutual_reachability_coo,
raft::device_vector_view<int, int> out_mst_src,
raft::device_vector_view<int, int> out_mst_dst,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a raft::sparse::device_coo_matrix_view for theout_mst_dst, out_mst_src, and out_mst_weights.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mst is sorted by weights (so the src and dst on the edges wont be in any ascending order). On the other hand, the rows and columns will be sorted if we have a COO, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok upon reading up a bit -- it looks like the COO representation does not have row and column index sorting as a hard requirement. I can definitely use a COO in this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A coo isn’t required to be sorted. It’s what we call an edge list. The new sparse matrix types are just convenience containers so we don’t have to pass around and manage 3 separate arrays everywhere.

* @param[out] out_delta distances of output
* @param[out] out_size cluster sizes of output
*/
void build_mutual_reachability_linkage(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this in the cuvs::cluster::agglomerative::helpers namespace to designate it as a public API that is more of a helper function.

@@ -21,6 +21,7 @@

#include <raft/core/device_mdspan.hpp>
#include <raft/core/resources.hpp>
#include <raft/sparse/coo.hpp>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I'm thinking about this- why is this even in the agglomerative APIs? If this is just the mutual reachaiblity piece, why isn't it in cuvs/neighbors/reachability here? And why can't it be consolidated with the existing mutual_reachaiblity API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think we might want to separate the connectivities from the linkage here- we have a 'mutual_reachability' API already. Can't we separate the mutual reachability (aka all-neighbors graph construction) from the linkage? We can still expose the linkage functions in cuVS, but putting them all together in cuvs seems too specific to hdbscan.

@@ -39,4 +39,34 @@ void single_linkage(raft::resources const& handle,
handle, X, dendrogram, labels, metric, n_clusters, c);
}
}

void build_mutual_reachability_linkage(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see- this has nothing to do w/ mutual reachability, this is just buildig a linkage. The mutual reachability part would have already been done upstream from here. I think you only need the linkage part here. This is essentialy just building the dendrogram. I would name it appropriately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we are indeed only doing the linkage here, but the connectivities functor used to connect components (to build the mst) is exclusively for Mutual Reachability. I wanted to capture the term "mutual_reachability" in the name to avoid confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't name things based on where they are used, we name based on what they are doing. If this isn't doing a mutual reachability then we don't include that in the name. Please name this accordingly or break apart as needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are primitives that are used to construct a linkage (aka knn graph + mst) and dendrogram. The reason we have primitives is so they can be composed in different ways for different algorithms. The more we can break them apart the more reusable they become. I asked you to bring these piece into cuVS because knn graph, mst, and thus single linkage cluster (aka nearest neighbors clustering) are all neighborhood methods.

@tarang-jain
Copy link
Contributor Author

/ok to test 074153a

@tarang-jain tarang-jain changed the title [FEA] Mutual Reachability Linkage [FEA] Build Single Linkage Linkage API Apr 30, 2025
@tarang-jain
Copy link
Contributor Author

/ok to test ee1d5b2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp feature request New feature or request non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants