yan-zaretskiy

This example shows how to build a CAGRA graph index by streaming host batches:

  1. Stage the first few batches to train an IVF-PQ index.
  2. Incrementally extend the IVF-PQ index with every batch.
  3. Run IVF-PQ search over the full dataset to form an intermediate k-NN graph.
  4. Optimize that graph into the final fixed-degree CAGRA graph.
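
The four steps above can be sketched on the CPU with only the standard library. This is a toy stand-in under loud assumptions, not the cuVS API and not the code in this PR: `ToyIvfIndex`, `train`, `extend`, `knn_graph`, `truncate_to_degree`, and `streaming_build` are hypothetical names, the index stores uncompressed vectors (no product quantization), and the final "optimize" step is replaced by a plain truncation to a fixed degree rather than CAGRA's actual graph optimization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec   = std::vector<float>;
using Graph = std::vector<std::vector<std::size_t>>;

// Squared Euclidean distance between two equally sized vectors.
inline float sq_dist(const Vec& a, const Vec& b) {
  assert(a.size() == b.size());
  float d = 0.f;
  for (std::size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
  return d;
}

// Toy inverted-file index: coarse centers plus one id list per center.
struct ToyIvfIndex {
  std::vector<Vec> centers;
  std::vector<std::vector<std::size_t>> lists;
  std::vector<Vec> data;  // all inserted vectors; id == position

  std::size_t nearest_center(const Vec& v) const {
    std::size_t best = 0;
    for (std::size_t c = 1; c < centers.size(); ++c)
      if (sq_dist(v, centers[c]) < sq_dist(v, centers[best])) best = c;
    return best;
  }
};

// Step 1: train coarse centers on the staged batches (a few Lloyd iterations).
ToyIvfIndex train(const std::vector<Vec>& staged, std::size_t n_lists, int iters = 5) {
  assert(n_lists > 0 && n_lists <= staged.size());
  ToyIvfIndex idx;
  for (std::size_t c = 0; c < n_lists; ++c)
    idx.centers.push_back(staged[c * staged.size() / n_lists]);
  for (int it = 0; it < iters; ++it) {
    std::vector<Vec> sum(n_lists, Vec(staged[0].size(), 0.f));
    std::vector<std::size_t> cnt(n_lists, 0);
    for (const Vec& v : staged) {
      const std::size_t c = idx.nearest_center(v);
      ++cnt[c];
      for (std::size_t j = 0; j < v.size(); ++j) sum[c][j] += v[j];
    }
    for (std::size_t c = 0; c < n_lists; ++c)
      if (cnt[c] > 0)
        for (std::size_t j = 0; j < sum[c].size(); ++j)
          idx.centers[c][j] = sum[c][j] / static_cast<float>(cnt[c]);
  }
  idx.lists.resize(n_lists);
  return idx;
}

// Step 2: append one batch, assigning each vector to its nearest list.
void extend(ToyIvfIndex& idx, const std::vector<Vec>& batch) {
  for (const Vec& v : batch) {
    idx.lists[idx.nearest_center(v)].push_back(idx.data.size());
    idx.data.push_back(v);
  }
}

// Step 3: search the index with every stored vector to get a k-NN graph.
Graph knn_graph(const ToyIvfIndex& idx, std::size_t k, std::size_t n_probes) {
  Graph graph(idx.data.size());
  for (std::size_t q = 0; q < idx.data.size(); ++q) {
    std::vector<std::pair<float, std::size_t>> ranked, cand;
    for (std::size_t c = 0; c < idx.centers.size(); ++c)
      ranked.emplace_back(sq_dist(idx.data[q], idx.centers[c]), c);
    std::sort(ranked.begin(), ranked.end());
    for (std::size_t p = 0; p < std::min(n_probes, ranked.size()); ++p)
      for (std::size_t id : idx.lists[ranked[p].second])
        if (id != q) cand.emplace_back(sq_dist(idx.data[q], idx.data[id]), id);
    std::sort(cand.begin(), cand.end());
    for (std::size_t i = 0; i < std::min(k, cand.size()); ++i)
      graph[q].push_back(cand[i].second);
  }
  return graph;
}

// Step 4 (stand-in): keep only the closest `degree` neighbors per node.
void truncate_to_degree(Graph& graph, std::size_t degree) {
  for (auto& nbrs : graph)
    if (nbrs.size() > degree) nbrs.resize(degree);
}

// Driver: stage the first batches for training, then stream every batch.
Graph streaming_build(const std::vector<std::vector<Vec>>& batches,
                      std::size_t n_train_batches, std::size_t n_lists,
                      std::size_t k_intermediate, std::size_t degree,
                      std::size_t n_probes) {
  std::vector<Vec> staged;
  for (std::size_t b = 0; b < std::min(n_train_batches, batches.size()); ++b)
    staged.insert(staged.end(), batches[b].begin(), batches[b].end());
  ToyIvfIndex idx = train(staged, n_lists);
  for (const auto& batch : batches) extend(idx, batch);
  Graph graph = knn_graph(idx, k_intermediate, n_probes);
  truncate_to_degree(graph, degree);
  return graph;
}
```

In this sketch only the staged training batches and the current batch need to be in the build's working set during steps 1 and 2, which is the point of streaming; the toy index still keeps every vector uncompressed, whereas the real example relies on IVF-PQ compression.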

Closes #1146


copy-pr-bot bot commented Sep 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cjnolet added the improvement and non-breaking labels Sep 29, 2025
@@ -0,0 +1,310 @@
/*
* Copyright (c) 2024-2025, NVIDIA CORPORATION.
Member

Since this is a new file, the copyright header should only use the current year.


namespace {

void make_host_dataset(raft::host_matrix_view<float, int64_t, raft::row_major> dataset)
Member

@cjnolet cjnolet Sep 29, 2025

Can we maybe call this "generate host dataset"? Honestly, I think I'd prefer to use make_blobs from raft rather than generating completely random uniform vectors. make_blobs at least gives the vector space some locality that IVF-PQ and CAGRA can exploit. It would also look a little cleaner to have a single call to make_blobs followed by a copy to host.
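
To make the locality point concrete, here is a minimal host-only sketch of blob-style data generation using just the standard library. It is an illustration, not raft's make_blobs and not its signature: the function name `generate_host_blobs`, its parameters, and the round-robin cluster assignment are all assumptions made for this example. In the real example one would presumably call raft's make_blobs on the device and copy the result to host, as suggested above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper (NOT raft::random::make_blobs): fills a row-major
// n_rows x n_cols buffer with Gaussian blobs around n_clusters centers.
// Centers are drawn uniformly from [box_min, box_max) per coordinate.
std::vector<float> generate_host_blobs(std::size_t n_rows,
                                       std::size_t n_cols,
                                       std::size_t n_clusters,
                                       float cluster_std  = 1.0f,
                                       float box_min      = -10.0f,
                                       float box_max      = 10.0f,
                                       std::uint64_t seed = 0) {
  assert(n_clusters > 0 && cluster_std > 0.0f && box_min < box_max);
  std::mt19937_64 rng(seed);
  std::uniform_real_distribution<float> center_dist(box_min, box_max);
  std::normal_distribution<float> noise(0.0f, cluster_std);

  // Draw the cluster centers first so they do not depend on n_rows.
  std::vector<float> centers(n_clusters * n_cols);
  for (float& c : centers) c = center_dist(rng);

  // Each row is its cluster center plus isotropic Gaussian noise.
  std::vector<float> out(n_rows * n_cols);
  for (std::size_t i = 0; i < n_rows; ++i) {
    const std::size_t c = i % n_clusters;  // round-robin cluster assignment
    for (std::size_t j = 0; j < n_cols; ++j)
      out[i * n_cols + j] = centers[c * n_cols + j] + noise(rng);
  }
  return out;
}
```

Unlike uniform random vectors, rows that share a cluster land close together, so coarse clustering in an IVF-style index has real structure to find.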


} // namespace

void streaming_cagra_build_example(
Member

Can we pull everything above this point into a corresponding cagra_streaming_example.hpp header for readability? Ideally the user would be able to follow through the meat of the example first, and then refer to the header for all the implementation details.

The other benefit of this approach is that users can essentially drop the header into their own project and, ideally, just copy/paste the relevant blocks from the source file into their own applications.

Author

Sure thing. I was trying to follow the existing examples as closely as possible, but I like your idea.

Member

cjnolet commented Sep 30, 2025

/ok to test 877f61d

Labels: improvement, non-breaking

Successfully merging this pull request may close this issue:
[FEA] Establish "streaming batched build" example for CAGRA

2 participants