
Wrapper for all-neighbors knn graph building #785


Open
wants to merge 51 commits into branch-25.06

Conversation

jinsolp
Contributor

@jinsolp jinsolp commented Mar 25, 2025

Description

This PR introduces a wrapper for building all-neighbors knn graphs.
The multi-GPU, batched all-neighbors knn graph building algorithm is part of this PR, which enables all-neighbors graph building for datasets that do not fit in device memory.

Batching is enabled by setting n_clusters > 1. The multi-GPU feature is enabled simply by setting CUDA_VISIBLE_DEVICES. Further details below!

Related Issue

[Supported ANN algorithms]

  1. IVFPQ: higher recall, but takes longer to run
  2. NN Descent: lower recall in general (and unacceptably low recall for some data distributions), but faster to run. Also uses less memory, since it stores data in half precision
  3. Brute force is planned as a separate PR in the future (only usable when n_clusters=1, because batching means we cannot guarantee 100% accuracy)

[Supported data types]
Currently only fp32 data is supported; we plan to explore and add fp16 support in the future.

[Number of nearest clusters and total number of clusters]
batch_ann::index has two important configurable parameters.

  • n_nearest_clusters: the number of clusters each data point is assigned to
  • n_clusters: the total number of clusters

Some important notes about these two parameters:

  1. The ratio n_nearest_clusters / n_clusters determines device memory usage (see the sketch after this list).
    • Approximately (n_nearest_clusters / n_clusters) * num_rows_in_entire_data rows will be put on device memory at once.
    • E.g. between (n_nearest_clusters / n_clusters) = 2/10 and 2/20, the latter will use less device memory.
  2. A larger n_nearest_clusters results in better accuracy of the final all-neighbors knn graph.
    • E.g. at similar device memory usage, (n_nearest_clusters / n_clusters) = 4/20 will have better accuracy than 2/10, at the cost of performance.
  3. When n_clusters = 1, no batching is performed.
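
As an illustration of note 1, here is a minimal sketch (not part of this PR; the dataset size and parameter values are made up) of the resident-row estimate:

// Rough estimate of how many rows are resident on device at once, per note 1 above.
#include <cstdint>
#include <cstdio>

int main()
{
  std::int64_t num_rows_in_entire_data = 100'000'000;  // hypothetical dataset size
  std::int64_t n_nearest_clusters      = 2;
  std::int64_t n_clusters              = 20;

  // ~ (n_nearest_clusters / n_clusters) * num_rows_in_entire_data
  std::int64_t approx_rows_on_device =
    num_rows_in_entire_data * n_nearest_clusters / n_clusters;

  std::printf("approximately %lld rows on device at once\n",
              static_cast<long long>(approx_rows_on_device));
  return 0;
}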

Dependent on...


How to Use

It can be used like the following:

// in test.cpp file
batch_ann::index_params params;
params.n_clusters         = 10; // total number of clusters
params.n_nearest_clusters = 2; // number of clusters each data is assigned to

// Option 1: For usage with IVFPQ
params.graph_build_params = batch_ann::graph_build_params::ivf_pq_params{};
// Option 2: For usage with NN Descent
params.graph_build_params = batch_ann::graph_build_params::nn_descent_params{};

auto index = batch_ann::build(handle_, host_data_view, k, params);
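
For completeness, here is a hedged sketch of how handle_, host_data_view, and k in the snippet above might be prepared using standard RAFT helpers (the sizes and the dataset loading are placeholders, not part of this PR):

// in test.cpp file, before calling batch_ann::build
#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/mdspan.hpp>

raft::device_resources handle_;

int64_t n_rows = 1'000'000;  // placeholder dataset shape
int64_t n_cols = 96;
auto host_data = raft::make_host_matrix<float, int64_t>(n_rows, n_cols);
// ... fill host_data with the dataset (loading omitted) ...
auto host_data_view = raft::make_const_mdspan(host_data.view());

int64_t k = 32;  // degree of the resulting all-neighbors knn graph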

batch_ann::build internally runs on multiple GPUs or a single GPU depending on the visible CUDA devices.

CUDA_VISIBLE_DEVICES=0 ./test          // runs single-GPU batching
CUDA_VISIBLE_DEVICES=0,1,2,3 ./test    // uses 4 GPUs for batching

Numbers

All experiments were run on L40 GPUs.
[Large dataset to check performance and memory usage]
Below are results from running on a 269 GB dataset (wiki-all).

  • IVFPQ takes longer (because it goes through build → search → refine steps) and uses more memory. It uses more memory because NN Descent puts data on device in half-precision floating point regardless of the given data type. We plan to test the impact of an fp16 IVFPQ on recall in the future.
  • Gains are better for IVFPQ because it does more work on the GPU compared to NN Descent.
  • Even with a longer end-to-end runtime and more memory usage (unless it uses fp16), IVFPQ is still worth using because of the recall issues with NN Descent.
[Screenshot: runtime and memory usage for IVFPQ vs. NN Descent on the 269 GB wiki-all dataset]

[Smaller dataset to verify recall]
It is infeasible to generate a ground-truth all-neighbors graph for a large dataset, so below are results on a 3.8 GB dataset (DEEP images) to check how n_nearest_clusters and n_clusters affect recall.

  • Recall is consistent regardless of how many GPUs are used
  • Note the recall difference between (n_nearest_clusters / n_clusters) = 2/20 and 4/40. These two configurations use a similar amount of device memory, but the latter has much better recall
  • Note that NN Descent recall is generally lower than IVFPQ recall (NN Descent results would have been much worse for other data distributions at this data size)
  • At this data size, adding more GPUs doesn't always help
[Screenshot: recall for various (n_nearest_clusters / n_clusters) configurations on the 3.8 GB DEEP dataset]

@jinsolp jinsolp requested review from a team as code owners March 25, 2025 21:18
@jinsolp jinsolp added feature request New feature or request non-breaking Introduces a non-breaking change and removed CMake labels Mar 25, 2025
@github-actions github-actions bot added the CMake label Mar 25, 2025
struct ivf_pq_params {
  cuvs::neighbors::ivf_pq::index_params build_params;
  cuvs::neighbors::ivf_pq::search_params search_params;
  float refinement_rate = 2.0;
Member

@cjnolet cjnolet Apr 1, 2025


We might want to turn off refinement initially (set it to 1.0) but I'll defer to you on the recall/perf impact there.
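
A minimal sketch of that suggestion, assuming graph_build_params is the std::variant used elsewhere in this PR (values shown are illustrative, not the PR's defaults):

// Turn refinement off initially by setting refinement_rate to 1.0
// (no extra candidates retrieved before the final top-k selection).
batch_ann::index_params params;
auto ivf_pq_build_params            = batch_ann::graph_build_params::ivf_pq_params{};
ivf_pq_build_params.refinement_rate = 1.0f;
params.graph_build_params           = ivf_pq_build_params;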

@github-actions github-actions bot removed the ci label Apr 16, 2025
@jinsolp
Contributor Author

jinsolp commented Apr 17, 2025

/ok to test

@jinsolp
Contributor Author

jinsolp commented Apr 17, 2025

/ok to test

@jinsolp jinsolp marked this pull request as ready for review April 17, 2025 20:17
Member

@divyegala divyegala left a comment


This is a non-blocking review, and it's mostly just bookkeeping questions.

*/
auto build(const raft::resources& handle,
           raft::host_matrix_view<const float, int64_t, row_major> dataset,
           int64_t k,
Member


Should this be an index param?

Contributor Author


I'm not sure if I understood your comment properly. Are you asking if it should be an index param because you think these parameters are not related to the index and therefore the name seems wrong?

Contributor


I agree that k should be separated from the index_params as it is the degree of the graph and not part of the params for building the original index.

Member


It is an index parameter in the sense that the size of the index built depends on the value of k, making it a hyperparameter. It's not the same function as that in a query API.

Contributor Author


Oh sorry, I think I was looking at the wrong line.
We have k inside the index struct. If we had to choose between having k inside index vs. index_params, I think the former makes more sense?

bool return_distances = false) -> index<int64_t, float>;

/**
* @brief Build a new all-neighbors index
Member


Am I correct to assume that this API builds an approximate pairwise distance and index matrix? If yes, can we add that to the docs?

Member


@divyegala it's an approximate knn, not an approximate pairwise distance matrix. All-neighbors just means we're computing the nearest neighbors for all the training vectors. I do agree we could define this more clearly in the docs.

Member


I couldn't think of a better term, but wanted it expressed in the docs that this API builds graphs of shape (nrow, nrow)

Contributor Author


This API builds a graph of shape (nrow x k) that holds the top-k nearest neighbors for each data point. So it's a nearest-neighbor finder on the entire dataset (there are no separate build and search functions).

void check_metric(const index_params& params)
{
  if (std::holds_alternative<graph_build_params::nn_descent_params>(params.graph_build_params)) {
    auto allowed_metrics = params.metric == cuvs::distance::DistanceType::L2Expanded ||
Member


Don't we allow inner product as well?

Contributor Author


Now depending on this PR

size_t num_cols = dataset.extent(1);

int num_ranks = 0;
cudaGetDeviceCount(&num_ranks);
Member


Ideally this would be part of the API for the user to tell us how many/which GPUs to use.

Contributor Author


(comment below)

{
  int num_ranks = 0;
  cudaGetDeviceCount(&num_ranks);
  if (num_ranks > 1) {
Member


I don't know if this is the intention but I'm not a fan of the fact that we do multi-GPU build without explicit request from the user. Did I miss somewhere in the docs if it is default behavior? And if the intention is for it to be default, can we add a log?

Contributor Author


The goal of this API is to dispatch a multi-GPU or single-GPU build depending on the visible CUDA devices (so that we don't have to add another parameter like n_gpus here, and also to UMAP).

Ideally, all the user needs to do to use this feature is change the visible GPUs when running the script, e.g. CUDA_VISIBLE_DEVICES=0,1,2,3..., without having to explicitly change any parameters.

Member


But a user can have multiple visible CUDA devices for many reasons. Just because they are visible, we should not assume the user wants us to use them all. Ideally this should be opt-in behavior, but otherwise at least add a log to let them know we're using all GPUs.
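
A minimal sketch of the kind of log being asked for here (illustrative only, not code from this PR):

int num_ranks = 0;
RAFT_CUDA_TRY(cudaGetDeviceCount(&num_ranks));
if (num_ranks > 1) {
  // Make the implicit multi-GPU dispatch visible to the user.
  RAFT_LOG_INFO("Detected %d visible CUDA devices; using all of them for the all-neighbors build. "
                "Set CUDA_VISIBLE_DEVICES to restrict which GPUs are used.",
                num_ranks);
}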

Member


I don't know if this is the intention but I'm not a fan of the fact that we do multi-GPU build without explicit request from the user.

100% agreed w/ divye here. In fact, @viclafargue built the single-node multi-GPU indexes for most of the ANN algorithms fairly quickly and the most time was spent on the API experience. We need to be very intentional and careful that we design the user experience and APIs to be consistent, flexible, and easy to use. Ideally we would be defining, configuring, and otherwise specifying multi-gpu configuration information in the underlying raft::resources as much as possible so that we can separate the concerns here. This is where I think you should take a look at the raft::resources APIs more deeply, @jinsolp (in raft/include/core/resource)

Contributor Author


Thanks @divyegala @cjnolet . Looking through, I realized that we had raft::device_resources_snmg (seems like a relatively new merge). Would we want to use this to handle resources without explicitly making calls like cudaSetDevice?
But if we want the user to specify how many / which GPUs they want to use, should we have the device ids as some form of vector as part of the index_params?

Member


But if we want the user to specify how many / which GPUs they want to use, should we have the device ids as some form of vector as part of the index_params?

Absolutely not. That's my point: we want the resources/environment to be completely independent of the machine learning hyperparameters. The index_params should be the same whether single- or multi-GPU. This is now making me realize that we need more than just device_resources_snmg, which always assumes NCCL. I think what we need is for the algorithm to tell us whether it needs an NCCL clique and for the device_resources to respond appropriately, lazily, at runtime, based on the needs of the algorithm. This is actually why we use the factory pattern inside the various raft::resource instances. Ideally the device_resources_snmg should carry with it ONLY the GPUs that the user wants to use (we can default to all available when unspecified), and the algorithm should ask for an nccl_clique when it needs it, or not. Does that make sense? We should be able to implement this with a moderately small change to the current device_resources_snmg.
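
For illustration, a rough sketch of that direction — the device-id constructor and the lazily created NCCL clique are assumptions about where device_resources_snmg could land, not code from this PR:

#include <raft/core/device_resources_snmg.hpp>
#include <vector>

// Carry only the GPUs the user wants to use (default-constructed: all visible devices).
std::vector<int> device_ids = {0, 2};              // assumed constructor argument
raft::device_resources_snmg snmg_res(device_ids);

// The algorithm would request the NCCL clique only when it actually needs inter-GPU
// communication, so single-GPU runs never initialize NCCL.
auto index = batch_ann::build(snmg_res, host_data_view, k, params);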

Contributor Author


I have made changes to the existing device_resources_snmg for this.
Related PR in raft: link
That raft PR is a breaking change that breaks cuVS (but no other RAPIDS libraries), and the corresponding fixes to cuVS are in this PR: link


copy-pr-bot bot commented Apr 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Labels
CMake, cpp, feature request (New feature or request), non-breaking (Introduces a non-breaking change), Python
Projects
Status: In Progress

4 participants