Add introductory documentation by ashkrisk · Pull Request #597 · datastax/jvector

ashkrisk · 2026-01-20T13:47:11Z

Add better documentation for anyone getting started with JVector.

marianotepper

Looks good in general. Some minor comments here and there.
There is one section that may need an update following some recent changes by @jshook.

docs/benchmarking.md

marianotepper · 2026-01-23T18:06:21Z

docs/benchmarking.md

+- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth.
+
+General procedure for running benchmarks:
+- Specify the dataset names to benchmark in `datasets.yml`.


This is unclear. Do we need to modify `datasets.yml`, or choose from it? If the latter, where do we specify the chosen ones?

Added a link to a later section that describes how "specifying datasets" works, and reworked this section to make it clearer. Let me know if you'd prefer I elaborate a bit here.

docs/benchmarking.md

marianotepper · 2026-01-23T18:09:36Z

docs/benchmarking.md

+
+### Using Fvec/Ivec datasets
+
+Using fvec/ivec datasets requires them to be configured in `MultiFileDatasource.java`. Some datasets are already pre-configured; these will be downloaded and used automatically on running the benchmark.


This section needs some tweaking following the latest changes from @jshook. Maybe @jshook can help to adjust this section?

docs/draft-hello.md

marianotepper · 2026-01-23T18:25:22Z

jvector-examples/src/main/java/io/github/jbellis/jvector/example/VectorIntro.java

+import io.github.jbellis.jvector.vector.types.VectorFloat;
+import io.github.jbellis.jvector.vector.types.VectorTypeSupport;
+
+public class VectorIntro {


There should be a way to grab sections of this file from draft-hello.md, so that we do not have to manually sync both files.

A couple of quick searches didn't turn up a simple way to do this that works in all contexts. I don't think we should spend too much energy on this, because anyone changing VectorIntro.java should make sure that intro-tutorial.md stays in sync, considering that its content relies heavily on VectorIntro. I'll add a comment to VectorIntro.java making the link clearer.
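If manual syncing ever becomes a burden, one low-tech option is a check that pulls a marked region out of the Java source and verifies that the tutorial contains it verbatim. The sketch below is only an illustration of that idea: the `// snippet:start NAME` and `// snippet:end NAME` markers and the `SnippetExtractor` class are hypothetical and are not an existing jvector convention.

```java
import java.util.ArrayList;
import java.util.List;

final class SnippetExtractor {
    // Hypothetical marker comments; not an existing jvector convention.
    private static final String START = "// snippet:start ";
    private static final String END = "// snippet:end ";

    /** Returns the dedented lines between the named start and end markers, markers excluded. */
    static String extract(String source, String name) {
        List<String> region = new ArrayList<>();
        boolean inside = false;
        for (String line : source.split("\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.equals(START + name)) {
                inside = true;
                continue;
            }
            if (inside && trimmed.equals(END + name)) {
                return dedent(region);
            }
            if (inside) {
                region.add(line);
            }
        }
        throw new IllegalArgumentException("snippet not found or not closed: " + name);
    }

    /** Removes the common leading whitespace so the region matches a markdown code block. */
    static String dedent(List<String> lines) {
        int min = Integer.MAX_VALUE;
        for (String l : lines) {
            if (l.isBlank()) continue;
            min = Math.min(min, l.length() - l.stripLeading().length());
        }
        List<String> out = new ArrayList<>();
        for (String l : lines) {
            out.add(l.isBlank() ? "" : l.substring(min));
        }
        return String.join("\n", out);
    }

    /** True if the markdown text contains the named snippet from the source verbatim. */
    static boolean isInSync(String markdown, String source, String name) {
        return markdown.contains(extract(source, name));
    }
}
```

A check like this could run as an ordinary unit test that reads VectorIntro.java and intro-tutorial.md from disk; it reports drift rather than preventing it, which keeps the markdown readable in every viewer.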

docs/benchmarking.md

MarkWolters · 2026-01-26T18:46:13Z

docs/benchmarking.md

+    - `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.
+    The files can be named however you like.
+- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
+- Edit `MultiFileDatasource.java` to configure a new dataset and its associated files:


This section also needs to describe how we use the environment variable `DATASET_HASH`, and that it must be set on the target system or the downloads will fail.

This would be applicable to specific pre-configured datasets (cohere 1M-10M, dpr etc) that are defined in MultiFileDatasource. If I'm not mistaken, accessing these datasets requires not only the dataset hash, but also the credentials of the corresponding S3 bucket. I think it's alright to skip it for now considering that the affected datasets are not public.

Note that this won't prevent the user from externally downloading and using any dataset they have access to.
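For anyone preparing their own files, here is a minimal sketch of the commonly used fvec/ivec record layout: each vector is stored as a little-endian 32-bit integer giving its dimension, followed by that many 32-bit floats (fvec) or 32-bit integers (ivec). The sketch assumes this standard layout and works on in-memory byte arrays; `VecCodec` is a hypothetical helper, not part of jvector, and jvector's own loader is not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

final class VecCodec {
    /** Encodes float vectors as fvec records: [int32 dim][dim x float32], little-endian. */
    static byte[] encodeFvecs(float[][] vectors) {
        int size = 0;
        for (float[] v : vectors) size += 4 + 4 * v.length;
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        for (float[] v : vectors) {
            buf.putInt(v.length);
            for (float x : v) buf.putFloat(x);
        }
        return buf.array();
    }

    /** Decodes fvec records back into float vectors. */
    static List<float[]> decodeFvecs(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        List<float[]> out = new ArrayList<>();
        while (buf.hasRemaining()) {
            float[] v = new float[buf.getInt()];
            for (int i = 0; i < v.length; i++) v[i] = buf.getFloat();
            out.add(v);
        }
        return out;
    }

    /** Encodes integer vectors (e.g. ground-truth neighbor ordinals) as ivec records. */
    static byte[] encodeIvecs(int[][] vectors) {
        int size = 0;
        for (int[] v : vectors) size += 4 + 4 * v.length;
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.LITTLE_ENDIAN);
        for (int[] v : vectors) {
            buf.putInt(v.length);
            for (int x : v) buf.putInt(x);
        }
        return buf.array();
    }

    /** Decodes ivec records back into integer vectors. */
    static List<int[]> decodeIvecs(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        List<int[]> out = new ArrayList<>();
        while (buf.hasRemaining()) {
            int[] v = new int[buf.getInt()];
            for (int i = 0; i < v.length; i++) v[i] = buf.getInt();
            out.add(v);
        }
        return out;
    }
}
```

The resulting byte arrays can be written with `java.nio.file.Files.write` to a sub-folder of the `fvec` directory, as the quoted documentation recommends. For large datasets a streaming writer would be preferable to building the whole array in memory.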

docs/intro-tutorial.md

MarkWolters · 2026-01-26T18:54:15Z

docs/intro-tutorial.md

+int topK = 10;  // number of approximate nearest neighbors to fetch
+// You can provide a filter to the query as a bit mask.
+// In this case we want the actual topK neighbors without filtering,
+// so we pass in a virtual bit mask representing all ones.


An example of what is meant by filtering when something other than Bits.ALL is used might be useful here.

I wanted to avoid digressing too much at this point. I've only mentioned filtering since I can't run a search without passing in some Bits instance, and I didn't want to do that with zero commentary. I was planning to elaborate further in other docs, but do you think it's too confusing without it?
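To make the question concrete, here is a rough, self-contained illustration of what a filter other than "all ones" would do to a top-K result. It is a brute-force sketch and does not call jvector: `AcceptMask` is a hypothetical stand-in for the role that `Bits` plays in the tutorial, and `FilteredSearchSketch` is an illustrative name only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FilteredSearchSketch {
    /** Hypothetical stand-in for a bit mask over vector ordinals; not jvector's Bits. */
    interface AcceptMask {
        boolean get(int ordinal);
    }

    /** Mask that accepts every ordinal, analogous to passing "all ones". */
    static final AcceptMask ALL = ordinal -> true;

    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Exact top-K by brute force, considering only ordinals the mask accepts. */
    static List<Integer> topK(float[][] base, float[] query, int topK, AcceptMask mask) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < base.length; i++) {
            if (mask.get(i)) candidates.add(i);
        }
        // Stable sort: ties keep ascending ordinal order.
        candidates.sort(Comparator.comparingDouble(i -> squaredDistance(base[i], query)));
        return new ArrayList<>(candidates.subList(0, Math.min(topK, candidates.size())));
    }
}
```

With base vectors at x = 0, 1, 2, 3 and a query at the origin, the unfiltered top 2 are ordinals 0 and 1, while a mask that accepts only even ordinals yields 0 and 2: filtered-out vectors never appear in the result, even when they are closer to the query.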

jvector-examples/src/main/java/io/github/jbellis/jvector/example/VectorIntro.java

Add introductory documentation

63ea529

marianotepper reviewed Jan 23, 2026

MarkWolters requested changes Jan 26, 2026

ashkrisk added 4 commits January 27, 2026 19:01

Refine benchmark docs

dd392d6

Fixes to intro tutorial

d2a029a

Fix license and clarify link b/w tutorial and code

c838797

Add disk tutorial

723aad2

