
Wrapper for all-neighbors knn graph building #785


Open
wants to merge 51 commits into branch-25.06

Conversation

jinsolp
Contributor

@jinsolp jinsolp commented Mar 25, 2025

Description

This PR introduces a wrapper for building all-neighbors knn graphs.
The multi-GPU, batched all-neighbors knn graph building algorithm is part of this PR, which enables all-neighbors graph building for datasets that do not fit in device memory.

Batching is enabled by setting n_clusters > 1. The multi-GPU feature is enabled simply by setting CUDA_VISIBLE_DEVICES. Further details below!

Related Issue

[Supported ANN algorithms]

  1. IVFPQ: higher recall, but takes longer to run
  2. NN Descent: lower recall in general (and unacceptably low recall for some data distributions), but faster to run. Also uses less memory, since it stores data in half precision
  3. Brute force is planned as a separate PR in the future (only usable when n_clusters=1, because batching means we cannot guarantee 100% accuracy)

[Supported data types]
Currently only fp32 data is supported; we plan to explore and add fp16 support in the future.

[Number of nearest clusters and total number of clusters]
batch_ann::index has two important configurable parameters.

  • n_nearest_clusters: the number of clusters each data point is assigned to
  • n_clusters: the total number of clusters

Some important notes about these two parameters:

  1. The ratio n_nearest_clusters / n_clusters determines device memory usage (see the sketch after this list).
    • Approximately (n_nearest_clusters / n_clusters) * num_rows_in_entire_data rows will be put on device memory at once.
    • E.g. between (n_nearest_clusters / n_clusters) = 2/10 and 2/20, the latter will use less device memory.
  2. A larger n_nearest_clusters results in better accuracy of the final all-neighbors knn graph.
    • E.g. at similar device memory usage, (n_nearest_clusters / n_clusters) = 4/20 will have better accuracy than 2/10, at the cost of performance.
  3. When n_clusters = 1, no batching is performed.
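
As an illustration of note 1, here is a minimal sketch (not part of this PR; the dataset size and parameter values are made up) of the resident-row estimate:

// Rough estimate of how many rows are resident on device at once, per note 1 above.
#include <cstdint>
#include <cstdio>

int main()
{
  std::int64_t num_rows_in_entire_data = 100'000'000;  // hypothetical dataset size
  std::int64_t n_nearest_clusters      = 2;
  std::int64_t n_clusters              = 20;

  // ~ (n_nearest_clusters / n_clusters) * num_rows_in_entire_data
  std::int64_t approx_rows_on_device =
    num_rows_in_entire_data * n_nearest_clusters / n_clusters;

  std::printf("approximately %lld rows on device at once\n",
              static_cast<long long>(approx_rows_on_device));
  return 0;
}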

Dependent on...


How to Use

It can be used like the following:

// in test.cpp file
batch_ann::index_params params;
params.n_clusters         = 10; // total number of clusters
params.n_nearest_clusters = 2; // number of clusters each data is assigned to

// Option 1: For usage with IVFPQ
params.graph_build_params = batch_ann::graph_build_params::ivf_pq_params{};
// Option 2: For usage with NN Descent
params.graph_build_params = batch_ann::graph_build_params::nn_descent_params{};

auto index = batch_ann::build(handle_, host_data_view, k, params);
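
For completeness, here is a hedged sketch of how handle_, host_data_view, and k in the snippet above might be prepared using standard RAFT helpers (the sizes and the dataset loading are placeholders, not part of this PR):

// in test.cpp file, before calling batch_ann::build
#include <raft/core/device_resources.hpp>
#include <raft/core/host_mdarray.hpp>
#include <raft/core/mdspan.hpp>

raft::device_resources handle_;

int64_t n_rows = 1'000'000;  // placeholder dataset shape
int64_t n_cols = 96;
auto host_data = raft::make_host_matrix<float, int64_t>(n_rows, n_cols);
// ... fill host_data with the dataset (loading omitted) ...
auto host_data_view = raft::make_const_mdspan(host_data.view());

int64_t k = 32;  // degree of the resulting all-neighbors knn graph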

batch_ann::build internally runs on multiple GPUs or a single GPU depending on the visible CUDA devices.

CUDA_VISIBLE_DEVICES=0 ./test          // runs single-GPU batching
CUDA_VISIBLE_DEVICES=0,1,2,3 ./test    // uses 4 GPUs for batching

Numbers

All experiments were run on L40 GPUs.
[Large dataset to check performance and memory usage]
Below are results from running on a 269 GB dataset (wiki-all).

  • IVFPQ takes longer (because it goes through build → search → refine steps) and uses more memory. It uses more memory because NN Descent puts data on device in half-precision floating point regardless of the given data type. We plan to test the impact of an fp16 IVFPQ on recall in the future.
  • Gains are better for IVFPQ because it does more work on the GPU compared to NN Descent.
  • Even with a longer end-to-end runtime and more memory usage (unless it uses fp16), IVFPQ is still worth using because of the recall issues with NN Descent.
[Screenshot: runtime and memory usage for IVFPQ vs. NN Descent on the 269 GB wiki-all dataset]

[Smaller dataset to verify recall]
It is infeasible to generate a ground-truth all-neighbors graph for a large dataset, so below are results on a 3.8 GB dataset (DEEP images) to check how n_nearest_clusters and n_clusters affect recall.

  • Recall is consistent regardless of how many GPUs are used
  • Note the recall difference between (n_nearest_clusters / n_clusters) = 2/20 and 4/40. These two configurations use a similar amount of device memory, but the latter has much better recall
  • Note that NN Descent recall is generally lower than IVFPQ recall (NN Descent results would have been much worse for other data distributions at this data size)
  • At this data size, adding more GPUs doesn't always help
[Screenshot: recall for various (n_nearest_clusters / n_clusters) configurations on the 3.8 GB DEEP dataset]

@jinsolp jinsolp requested review from a team as code owners March 25, 2025 21:18
@jinsolp jinsolp added feature request New feature or request non-breaking Introduces a non-breaking change and removed CMake labels Mar 25, 2025
@github-actions github-actions bot added the CMake label Mar 25, 2025
struct ivf_pq_params {
  cuvs::neighbors::ivf_pq::index_params build_params;
  cuvs::neighbors::ivf_pq::search_params search_params;
  float refinement_rate = 2.0;
Member

@cjnolet cjnolet Apr 1, 2025


We might want to turn off refinement initially (set it to 1.0) but I'll defer to you on the recall/perf impact there.
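
A minimal sketch of that suggestion, assuming graph_build_params is the std::variant used elsewhere in this PR (values shown are illustrative, not the PR's defaults):

// Turn refinement off initially by setting refinement_rate to 1.0
// (no extra candidates retrieved before the final top-k selection).
batch_ann::index_params params;
auto ivf_pq_build_params            = batch_ann::graph_build_params::ivf_pq_params{};
ivf_pq_build_params.refinement_rate = 1.0f;
params.graph_build_params           = ivf_pq_build_params;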

@github-actions github-actions bot removed the ci label Apr 16, 2025
@jinsolp
Contributor Author

jinsolp commented Apr 17, 2025

/ok to test

@jinsolp
Contributor Author

jinsolp commented Apr 17, 2025

/ok to test

@jinsolp jinsolp marked this pull request as ready for review April 17, 2025 20:17
Member

@divyegala divyegala left a comment


This is a non-blocking review, and it's mostly just bookkeeping questions.

*/
auto build(const raft::resources& handle,
           raft::host_matrix_view<const float, int64_t, row_major> dataset,
           int64_t k,
Member


Should this be an index param?

Contributor Author


I'm not sure if I understood your comment properly. Are you asking if it should be an index param because you think these parameters are not related to the index and therefore the name seems wrong?

Contributor


I agree that k should be separated from the index_params as it is the degree of the graph and not part of the params for building the original index.

Member


It is an index parameter in the sense that the size of the index built depends on the value of k, making it a hyperparameter. It's not the same function as that in a query API.

Contributor Author


Oh sorry, I think I was looking at the wrong line.
We have k inside the index struct. If we had to choose between having k inside index vs. index_params, I think the former makes more sense?

bool return_distances = false) -> index<int64_t, float>;

/**
* @brief Build a new all-neighbors index
Member


Am I correct to assume that this API builds an approximate pairwise distance and index matrix? If yes, can we add that to the docs?

Member


@divyegala it's an approximate knn, not an approximate pairwise distance matrix. All-neighbors just means we're computing the nearest neighbors for all the training vectors. I do agree we could define this more clearly in the docs.

Member


I couldn't think of a better term, but wanted it expressed in the docs that this API builds graphs of shape (nrow, nrow)

Contributor Author


This API builds a graph of shape (nrow x k) that holds the top-k nearest neighbors for each data point. So it's a nearest-neighbor finder on the entire dataset (there are no separate build and search functions).

void check_metric(const index_params& params)
{
  if (std::holds_alternative<graph_build_params::nn_descent_params>(params.graph_build_params)) {
    auto allowed_metrics = params.metric == cuvs::distance::DistanceType::L2Expanded ||
Member


Don't we allow inner product as well?

Contributor Author


Now depending on this PR

size_t num_cols = dataset.extent(1);

int num_ranks = 0;
cudaGetDeviceCount(&num_ranks);
Member


Ideally this would be part of the API for the user to tell us how many/which GPUs to use.

Contributor Author


(comment below)

{
  int num_ranks = 0;
  cudaGetDeviceCount(&num_ranks);
  if (num_ranks > 1) {
Member


I don't know if this is the intention but I'm not a fan of the fact that we do multi-GPU build without explicit request from the user. Did I miss somewhere in the docs if it is default behavior? And if the intention is for it to be default, can we add a log?

Contributor Author


The goal of this API is to dispatch a multi-GPU or single-GPU build depending on the visible CUDA devices (so that we don't have to add another parameter like n_gpus here, and also to UMAP).

Ideally, all the user needs to do to use this feature is change the visible GPUs when running the script, e.g. CUDA_VISIBLE_DEVICES=0,1,2,3..., without having to explicitly change any parameters.

Member


But a user can have multiple visible CUDA devices for many reasons. Just because they are visible, we should not assume the user wants us to use them all. Ideally this should be opt-in behavior, but otherwise at least add a log to let them know we're using all GPUs.
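
A minimal sketch of the kind of log being asked for here (illustrative only, not code from this PR):

int num_ranks = 0;
RAFT_CUDA_TRY(cudaGetDeviceCount(&num_ranks));
if (num_ranks > 1) {
  // Make the implicit multi-GPU dispatch visible to the user.
  RAFT_LOG_INFO("Detected %d visible CUDA devices; using all of them for the all-neighbors build. "
                "Set CUDA_VISIBLE_DEVICES to restrict which GPUs are used.",
                num_ranks);
}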

Member


I don't know if this is the intention but I'm not a fan of the fact that we do multi-GPU build without explicit request from the user.

100% agreed w/ divye here. In fact, @viclafargue built the single-node multi-GPU indexes for most of the ANN algorithms fairly quickly and the most time was spent on the API experience. We need to be very intentional and careful that we design the user experience and APIs to be consistent, flexible, and easy to use. Ideally we would be defining, configuring, and otherwise specifying multi-gpu configuration information in the underlying raft::resources as much as possible so that we can separate the concerns here. This is where I think you should take a look at the raft::resources APIs more deeply, @jinsolp (in raft/include/core/resource)

Contributor Author


Thanks @divyegala @cjnolet . Looking through, I realized that we had raft::device_resources_snmg (seems like a relatively new merge). Would we want to use this to handle resources without explicitly making calls like cudaSetDevice?
But if we want the user to specify how many / which GPUs they want to use, should we have the device ids as some form of vector as part of the index_params?

Member


But if we want the user to specify how many / which GPUs they want to use, should we have the device ids as some form of vector as part of the index_params?

Absolutely not. That's my point: we want the resources/environment to be completely independent of the machine learning hyperparameters. The index_params should be the same whether single- or multi-GPU. This is now making me realize that we need more than just device_resources_snmg, which always assumes NCCL. I think what we need is for the algorithm to tell us whether it needs an NCCL clique and for the device_resources to respond appropriately, lazily, at runtime, based on the needs of the algorithm. This is actually why we use the factory pattern inside the various raft::resource instances. Ideally the device_resources_snmg should carry with it ONLY the GPUs that the user wants to use (we can default to all available when unspecified), and the algorithm should ask for an nccl_clique when it needs it, or not. Does that make sense? We should be able to implement this with a moderately small change to the current device_resources_snmg.
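
For illustration, a rough sketch of that direction — the device-id constructor and the lazily created NCCL clique are assumptions about where device_resources_snmg could land, not code from this PR:

#include <raft/core/device_resources_snmg.hpp>
#include <vector>

// Carry only the GPUs the user wants to use (default-constructed: all visible devices).
std::vector<int> device_ids = {0, 2};              // assumed constructor argument
raft::device_resources_snmg snmg_res(device_ids);

// The algorithm would request the NCCL clique only when it actually needs inter-GPU
// communication, so single-GPU runs never initialize NCCL.
auto index = batch_ann::build(snmg_res, host_data_view, k, params);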

Contributor Author


I have made changes to the existing device_resources_snmg for this.
Related PR in raft: link
That raft PR is a breaking change that breaks cuVS (but no other RAPIDS libraries), and the corresponding fixes to cuVS are in this PR: link


copy-pr-bot bot commented Apr 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Labels
CMake, cpp, feature request (New feature or request), non-breaking (Introduces a non-breaking change), Python
Projects
Status: In Progress

4 participants