ldematte
Contributor

@ldematte ldematte commented Jun 19, 2025

This PR adds the ability to define a Dataset directly over a MemorySegment, "wrapping" it instead of allocating a new one.
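Here is a minimal sketch of the "wrap instead of allocate" distinction, using only the Java 22 Foreign Function & Memory API. The `FloatMatrixView` record and the `copyFromHeap`/`wrap` helpers are hypothetical illustrations, not the cuvs-java `Dataset` API. The copying path allocates a second native buffer and copies every row into it. The wrapping path only keeps a reference to memory that already exists.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical sketch (Java 22+): illustrates wrapping vs. copying, not the cuvs-java API.
public final class WrapVersusCopy {

  /** Row-major rows x cols float matrix view over a MemorySegment (hypothetical type). */
  public record FloatMatrixView(MemorySegment segment, int rows, int cols) {
    public FloatMatrixView {
      long needed = (long) rows * cols * Float.BYTES;
      if (rows < 0 || cols < 0 || segment.byteSize() < needed) {
        throw new IllegalArgumentException("segment too small for " + rows + "x" + cols);
      }
    }

    public float get(int row, int col) {
      return segment.getAtIndex(ValueLayout.JAVA_FLOAT, (long) row * cols + col);
    }
  }

  /** Copying path: allocates a new native buffer and copies each on-heap row into it. */
  public static FloatMatrixView copyFromHeap(float[][] data, Arena arena) {
    int rows = data.length;
    int cols = rows == 0 ? 0 : data[0].length;
    MemorySegment copy = arena.allocate(ValueLayout.JAVA_FLOAT, (long) rows * cols);
    for (int r = 0; r < rows; r++) {
      if (data[r].length != cols) {
        throw new IllegalArgumentException("ragged row " + r);
      }
      // Bulk copy of one row: dstOffset is in bytes, the element count is in floats.
      MemorySegment.copy(data[r], 0, copy, ValueLayout.JAVA_FLOAT, (long) r * cols * Float.BYTES, cols);
    }
    return new FloatMatrixView(copy, rows, cols);
  }

  /** Wrapping path: no allocation and no copy; the view just references existing memory. */
  public static FloatMatrixView wrap(MemorySegment existing, int rows, int cols) {
    return new FloatMatrixView(existing, rows, cols);
  }

  private WrapVersusCopy() {}
}
```

Because `wrap` shares memory with the caller, later writes to the original segment are visible through the view. The caller also remains responsible for keeping the segment's arena open while the view is in use.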

  • Depends on [Java] Add Java API benchmarks #1033 and [Java] Encapsulate on-heap float arrays into Dataset #1024
  • The new API has an Object memorySegment parameter, as we target Java 21 for the API (but 22 for the implementation); it works, but it's definitely a hack and we need to sort this out
    • As discussed, we want to keep targeting Java 21 for the API. This means the API will return a MethodHandle, through which the Java 22 implementation exposes a factory method that builds a Dataset from a MemorySegment.
    • This factory method can then be used as shown in the tests (see the DatasetHelper convenience class/method).
  • Benchmarks show a sizeable speedup -- it is still tiny relative to the "big picture" (index build time), but there is an improvement, and above all we avoid a whole extra copy of the input data (halving the memory requirements).
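The Java 21 API / Java 22 implementation split described above can be sketched with `java.lang.invoke`. Every name here is hypothetical: `Dataset`, `SegmentDataset`, `fromSegment`, `newDatasetFromSegmentHandle` and `fromMemorySegment` are stand-ins and are not the real cuvs-java types, nor the `DatasetHelper` used in the tests. For brevity, everything is compiled as a single Java 22 file. In the real split, the Java 21 API could only mention `MethodHandle`, never `MemorySegment`. The handle type `(MemorySegment, int, int)Dataset` is then checked only at run time, when a Java 22 caller invokes it.

```java
import java.lang.foreign.MemorySegment;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch (compiled on Java 22+): all type and method names are illustrative only.
public final class MethodHandleFactorySketch {

  /** Stand-in for an API type whose signature never mentions MemorySegment. */
  public interface Dataset {
    int rows();
    int cols();
  }

  /** Implementation-side type that wraps an existing segment without copying it. */
  private record SegmentDataset(MemorySegment segment, int rows, int cols) implements Dataset {}

  /** The static factory the handle points at. */
  static Dataset fromSegment(MemorySegment segment, int rows, int cols) {
    return new SegmentDataset(segment, rows, cols);
  }

  /** What a Java 21-typed API could hand out: a handle of type (MemorySegment, int, int)Dataset. */
  public static MethodHandle newDatasetFromSegmentHandle() {
    try {
      return MethodHandles.lookup().findStatic(
          MethodHandleFactorySketch.class,
          "fromSegment",
          MethodType.methodType(Dataset.class, MemorySegment.class, int.class, int.class));
    } catch (NoSuchMethodException | IllegalAccessException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Caller-side convenience wrapper, in the spirit of a test helper. */
  public static Dataset fromMemorySegment(MemorySegment segment, int rows, int cols) {
    MethodHandle factory = newDatasetFromSegmentHandle();
    try {
      // invokeExact: static argument types and the cast return type must match the handle type exactly.
      return (Dataset) factory.invokeExact(segment, rows, cols);
    } catch (RuntimeException | Error e) {
      throw e;
    } catch (Throwable t) {
      throw new IllegalStateException(t);
    }
  }

  private MethodHandleFactorySketch() {}
}
```

`invokeExact` requires the call site's static types to match the handle type exactly. That is why the result is cast to `Dataset` and the arguments are passed as `MemorySegment`, `int` and `int`. A small wrapper like `fromMemorySegment` keeps this detail out of user code.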

Fixes #698

copy-pr-bot bot commented Jun 19, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ldematte
Contributor Author

I've refactored the code a bit to better handle MemorySegment allocation/deallocation; now I am able to benchmark Dataset creation separately:


Benchmark                                          (dims)  (size)  Mode  Cnt      Score       Error  Units
CagraIndexBenchmarks.testDatasetFromHeap             1024     100  avgt    5  45798.418 ± 12296.887  ns/op
CagraIndexBenchmarks.testDatasetFromMemorySegment    1024     100  avgt    5      2.736 ±     0.016  ns/op

Of course, this is a tiny part of building an index; even with a small index (size 100, dims 1024), the time to build the index still dominates:

Benchmark                                           (dims)  (size)   Mode  Cnt    Score   Error  Units
CagraIndexBenchmarks.testIndexingFromHeap             1024     100  thrpt    5  123.619 ± 1.854  ops/s
CagraIndexBenchmarks.testIndexingFromMemorySegment    1024     100  thrpt    5  184.753 ± 2.229  ops/s

But there is still a benefit, and that is on my dev machine, which has about the weakest GPU you could pick (a 2070 mobile).
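For scale, here is a back-of-the-envelope reading of the two JMH tables above. It uses only the reported mean scores and ignores the error bars.

```math
\begin{aligned}
\frac{45798.418\ \text{ns}}{2.736\ \text{ns}} &\approx 1.67 \times 10^{4} && \text{Dataset creation, heap vs. MemorySegment} \\
\frac{1}{123.619\ \text{ops/s}} &\approx 8.09\ \text{ms/op} && \text{index build from heap} \\
\frac{1}{184.753\ \text{ops/s}} &\approx 5.41\ \text{ms/op} && \text{index build from MemorySegment} \\
\frac{184.753}{123.619} &\approx 1.49 && \text{throughput ratio} \\
\frac{45.8\ \mu\text{s}}{8.09\ \text{ms}} &\approx 0.57\,\% && \text{heap Dataset creation as a share of one heap build}
\end{aligned}
```

The per-build gap (about 2.7 ms) is larger than the Dataset-creation gap alone (about 46 µs). The difference in the indexing benchmark therefore does not come from Dataset construction alone, and the thread does not break it down further.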

@ldematte
Contributor Author

When I bump up the data size (-p size=N), the margin naturally becomes smaller and smaller as index build time dominates. Still, if the data comes from off-heap memory (e.g. a memory-mapped file), we potentially use half the memory for the input data (or 1/3 in some cases), which I think is the key point.
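A memory-mapped file is the typical off-heap source mentioned here. The sketch below uses only the Java 22 standard library; `MappedVectors` and its methods are hypothetical helpers. It writes a small row-major float matrix to a file and maps it back with `FileChannel.map(..., Arena)`. The resulting read-only `MemorySegment` is the kind of memory that could be wrapped directly, instead of being copied onto the Java heap and then again into a native buffer.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helpers (Java 22+): persist vectors to a file, then map them off-heap.
public final class MappedVectors {

  /** Writes a row-major float matrix in native byte order, matching ValueLayout.JAVA_FLOAT. */
  public static void write(Path file, float[][] data) throws IOException {
    int cols = data.length == 0 ? 0 : data[0].length;
    ByteBuffer buf = ByteBuffer
        .allocate(Math.toIntExact((long) data.length * cols * Float.BYTES))
        .order(ByteOrder.nativeOrder());
    for (float[] row : data) {
      for (float v : row) {
        buf.putFloat(v);
      }
    }
    buf.flip();
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      while (buf.hasRemaining()) {
        ch.write(buf);
      }
    }
  }

  /** Maps the whole file read-only; the segment stays valid until the arena is closed. */
  public static MemorySegment map(Path file, Arena arena) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // Closing the channel does not unmap; the mapping's lifetime is tied to the arena.
      return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
    }
  }

  /** Reads element (row, col) straight from the mapped pages, with no on-heap copy. */
  public static float get(MemorySegment mapped, int cols, int row, int col) {
    return mapped.getAtIndex(ValueLayout.JAVA_FLOAT, (long) row * cols + col);
  }

  private MappedVectors() {}
}
```

The file is written in native byte order because `ValueLayout.JAVA_FLOAT` reads in native order. A file produced on a machine with a different byte order would need a layout with an explicit order (e.g. via `ValueLayout.JAVA_FLOAT.withOrder(...)`).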

@ChrisHegarty
Contributor

Thanks for doing this @ldematte. I think this is a case for revisiting the Java 21/22 decision that we previously made. I'll take a closer look and give it some thought.

ldematte added 6 commits July 2, 2025 12:10
…segment-dataset

# Conflicts:
#	java/cuvs-java/src/main/java/com/nvidia/cuvs/CagraIndex.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BruteForceIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/DatasetImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java
#	java/cuvs-java/src/test/java/com/nvidia/cuvs/CagraBuildAndSearchIT.java
# Conflicts:
#	java/cuvs-java/src/main/java/com/nvidia/cuvs/Dataset.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/BruteForceIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/CagraIndexImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/DatasetImpl.java
#	java/cuvs-java/src/main/java22/com/nvidia/cuvs/spi/JDKProvider.java
@mythrocks
Contributor

mythrocks commented Jul 10, 2025

Hmm. There seem to be java test failures here:

Error:  Failures: 
Error:    CagraBuildAndSearchIT.testIndexingAndSearchingFlow:145->queryAndCompare:464->CuVSTestCase.checkResults:148 expected:<[{0=0.83774555, 2=0.3590463, 3=0.038782578}, {0=0.12472608, 1=0.31918612, 2=0.21700792}, {0=0.48305473, 2=0.20332818, 3=0.047766715}, {0=0.59063464, 1=0.15224178, 3=0.5986642}]> but was:<[{0=0.23431994, 2=0.23431946, 3=0.23431946}, {0=0.68692017, 2=0.6869197, 3=0.6869197}, {0=0.33261782, 2=0.3326173, 3=0.3326173}, {0=0.3379389, 2=0.33793885, 3=0.33793885}]>
Error:  Errors: 
Error:    CagraBuildAndSearchIT.testIndexing:233->runConcurrently:85 ? Execution java.lang.AssertionError: Exception while executing runnable: java.lang.RuntimeException: cuvsCagraBuild returned 0[RAFT failure at file=/tmp/conda-bld-output/bld/rattler-build_libcuvs/work/cpp/src/neighbors/detail/cagra/graph_core.cuh line=1406: Could not generate an intermediate CAGRA graph because the initial kNN graph contains too many invalid or duplicated neighbor nodes. This error can occur, for example, if too many overflows occur during the norm computation between the dataset vectors.
Obtained 6 stack frames
#1 in /opt/conda/envs/java/lib/jvm/lib/server/../../../libcuvs.so(+0x49117d) [0x7c03a02ec17d]
#2 in /opt/conda/envs/java/lib/jvm/lib/server/../../../libcuvs.so: void cuvs::neighbors::cagra::detail::graph::optimize<unsigned int, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >(raft::resources const&, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, bool, bool) +0x1aa3 [0x7c03a0d3b4d3]
#3 in /opt/conda/envs/java/lib/jvm/lib/server/../../../libcuvs.so: cuvs::neighbors::cagra::index<float, unsigned int> cuvs::neighbors::cagra::detail::build<float, unsigned int, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >(raft::resources const&, cuvs::neighbors::cagra::index_params const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >) +0xd81 [0x7c03a0d629b1]
#4 in /opt/conda/envs/java/lib/jvm/lib/server/../../../libcuvs.so: cuvs::neighbors::cagra::build(raft::resources const&, cuvs::neighbors::cagra::index_params const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >) +0x21 [0x7c03a0d1d761]
#5 in /opt/conda/envs/java/lib/jvm/lib/server/../../../libcuvs_c.so: cuvsCagraBuild +0x806 [0x7c04352dbb96]
#6 in [0x7c045c0a6ec7]

@ldematte
Contributor Author

> Hmm. There seem to be java test failures here:

Yeah, it's a chicken-and-egg problem :) I fixed those in another PR (#1089), but that PR depends on this one being merged first. These tests fail randomly because they use resources and threading incorrectly.
I will cherry-pick the test fixes from #1089 so we can unblock the situation.

@ldematte
Contributor Author

(BTW, the issue already exists; it just happened to fail randomly in this PR. Better to fix it sooner rather than later.)
@mythrocks, I cherry-picked the test fix; it is good to go again.

@mythrocks
Contributor

/ok to test 8582abb

@ldematte
Contributor Author

ldematte commented Jul 14, 2025

@mythrocks is there anything else I can do to unblock this PR?

@mythrocks
Contributor

/ok to test 63c4932

@mythrocks
Contributor

The CI builds should be fixed once #1114 is resolved.

@benfred
Member

benfred commented Jul 15, 2025

/ok to test 2e65a55

This was referenced Jul 15, 2025
@mythrocks
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 32d6d83 into rapidsai:branch-25.08 Jul 15, 2025
53 checks passed
punAhuja added a commit to SearchScale/cuvs that referenced this pull request Jul 15, 2025
This PR adds the ability to define a Dataset directly over a MemorySegment, "wrapping" it instead of allocating a new one.

- Depends on rapidsai#1033 and rapidsai#1024
- ~~The new API has an `Object memorySegment` parameter, as we target Java 21 for the API (but 22 for the implementation); it works, but it's definitely a hack and we need to sort this out~~
   - As discussed, we want to keep targeting Java 21 for the API. This means the API will return a `MethodHandle`, and the Java 22 implementation will use it to return a factory method to build a Dataset from a MemorySegment.
   - This factory method can then be used as shown in the tests (see the `DatasetHelper` convenience class/method).
- Benchmarks show a sizeable speedup -- it is still tiny relative to the "big picture" (index build time), but there is an improvement, and above all we avoid a whole extra copy of the input data (halving the memory requirements).

Fixes rapidsai#698

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)
  - Ben Frederickson (https://github.com/benfred)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1034
punAhuja added a commit to SearchScale/cuvs that referenced this pull request Jul 15, 2025
punAhuja added a commit to SearchScale/cuvs that referenced this pull request Jul 16, 2025
rapids-bot bot pushed a commit that referenced this pull request Jul 26, 2025
In #902 and #1034 we introduced a `Dataset` interface to support on-heap and off-heap ("native") memory seamlessly as inputs for cagra and bruteforce index building.

As we expand the functionality of cuvs-java, we realized we have similar needs for outputs (see e.g. #1105 / #1102 or #1104).

This PR extends `Dataset` to support being used as an output, wrapping native (off-heap) memory in a convenient and efficient way, and providing common utilities to transform to and from on-heap memory.
This work is inspired by the existing raft `mdspan` and `DLTensor` data structures, but tailored to our needs (2D only, just three data types, etc.). The PR deliberately keeps the current implementation simple and minimal, but structures it so that it is easy to extend.
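Here is a minimal sketch of the "output" direction, assuming a 2D row-major float matrix only. Native memory is allocated up front so that native code could write results into it, and small utilities copy data to and from the Java heap. `HostFloatMatrix`, `allocate`, `fromHeap` and `toHeap` are hypothetical names chosen for illustration; they are not the `Dataset` methods described here.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

/** Hypothetical 2D row-major float matrix in host (native) memory; owns its own confined arena. */
public final class HostFloatMatrix implements AutoCloseable {
  private final Arena arena = Arena.ofConfined();
  private final MemorySegment segment;
  private final int rows;
  private final int cols;

  private HostFloatMatrix(int rows, int cols) {
    this.rows = rows;
    this.cols = cols;
    this.segment = arena.allocate(ValueLayout.JAVA_FLOAT, (long) rows * cols);
  }

  /** Allocates an output buffer that native code could fill in place. */
  public static HostFloatMatrix allocate(int rows, int cols) {
    return new HostFloatMatrix(rows, cols);
  }

  /** Copies an on-heap float[rows][cols] into a new native buffer. */
  public static HostFloatMatrix fromHeap(float[][] data) {
    int cols = data.length == 0 ? 0 : data[0].length;
    for (float[] row : data) {
      if (row.length != cols) {
        throw new IllegalArgumentException("ragged input");
      }
    }
    HostFloatMatrix m = new HostFloatMatrix(data.length, cols);
    for (int r = 0; r < data.length; r++) {
      MemorySegment.copy(data[r], 0, m.segment, ValueLayout.JAVA_FLOAT, (long) r * cols * Float.BYTES, cols);
    }
    return m;
  }

  /** Copies the native contents back into a fresh on-heap float[rows][cols]. */
  public float[][] toHeap() {
    float[][] out = new float[rows][cols];
    for (int r = 0; r < rows; r++) {
      MemorySegment.copy(segment, ValueLayout.JAVA_FLOAT, (long) r * cols * Float.BYTES, out[r], 0, cols);
    }
    return out;
  }

  public MemorySegment segment() { return segment; }
  public int rows() { return rows; }
  public int cols() { return cols; }

  @Override
  public void close() {
    arena.close(); // frees the native memory; the segment must not be used afterwards
  }
}
```

Tying the native memory to an `AutoCloseable` owner keeps deallocation explicit. A device-memory variant would need a different allocator and copy path, which this sketch does not attempt.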

By itself, the PR is just a refactoring that extends the `Dataset` implementation and reorganizes the implementation classes; its real value will show when it is used in the PRs mentioned above (in fact, this PR has been extracted from #1105).
The implementation class hierarchy is designed with future extensions in mind: at the moment we have one `HostMemoryDatasetImpl`, but we are already planning a corresponding `DeviceMemoryDatasetImpl` that will wrap and manage views on GPU memory. This would avoid (in some cases) extra copies of data from GPU memory to CPU memory just to process them or forward them to another algorithm (e.g. quantization followed by indexing).

Future work will also include support for (and refactoring around) allocating and managing GPU memory and DLTensors (e.g. improving or refactoring `prepareTensor`).

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1111
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Aug 22, 2025
Labels: improvement (Improves an existing functionality), Java, non-breaking (Introduces a non-breaking change)
Projects: Status: Done
Development: successfully merging this pull request may close these issues:
cuvs-java: Support providing indexing data off-heap
5 participants