Skip to content

Add introductory documentation#597

Draft
ashkrisk wants to merge 5 commits intodatastax:mainfrom
ashkrisk:docs
Draft

Add introductory documentation#597
ashkrisk wants to merge 5 commits intodatastax:mainfrom
ashkrisk:docs

Conversation

@ashkrisk
Copy link
Contributor

Add better documentation for anyone getting started with JVector.

Copy link
Contributor

@marianotepper marianotepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general. Some minor comments here and there.
There is one section that may need an update following some recent changes by @jshook.

- **HDF5**: The dataset consists of a single HDF5 file with three datasets labelled `train`, `test` and `neighbors`, representing the base vectors, query vectors and the ground truth.

General procedure for running benchmarks:
- Specify the dataset names to benchmark in `datasets.yml`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unclear. Do we need to modify datasets.yml, or chose from it? If the latter, where do we specify the chosen ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a link to a following section which describes how "specifying datasets" works, and also reworked the section to make it more clear. Let me know if you'd prefer I elaborate a bit here.


### Using Fvec/Ivec datasets

Using fvec/ivec datasets requires them to be configured in `MultiFileDatasource.java`. Some datasets are already pre-configured; these will be downloaded and used automatically on running the benchmark.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section needs some tweaking following the latest changes from @jshook. Maybe @jshook can help to adjust this section?

import io.github.jbellis.jvector.vector.types.VectorFloat;
import io.github.jbellis.jvector.vector.types.VectorTypeSupport;

public class VectorIntro {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a way to grab sections of this file from draft-hello.md, so that we do not have to manually sync both files.

Copy link
Contributor Author

@ashkrisk ashkrisk Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick couple of searches didn't turn up a simple way to do this such that it works in all contexts. I don't think we should spend too much energy on this, because anyone changing VectorIntro.java should be making sure that intro-tutorial.md is also in sync, considering that it's content is heavily reliant on VectorIntro. I'll add comment to VectorIntro.java making the link more clear.

- `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.
The files can be named however you like.
- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
- Edit `MultiFileDatasource.java` to configure a new dataset and it's associated files:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section also needs to describe how we are using the environment variable DATASET_HASH and how it needs to be set on the target system or the downloads will fail

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be applicable to specific pre-configured datasets (cohere 1M-10M, dpr etc) that are defined in MultiFileDatasource. If I'm not mistaken, accessing these datasets requires not only the dataset hash, but also the credentials of the corresponding S3 bucket. I think it's alright to skip it for now considering that the affected datasets are not public.

Note that this won't prevent the user from externally downloading and using any dataset they have access to.

int topK = 10; // number of approximate nearest neighbors to fetch
// You can provide a filter to the query as a bit mask.
// In this case we want the actual topK neighbors without filtering,
// so we pass in a virtual bit mask representing all ones.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of what is meant by filtering when something other than Bits.ALL is used might be useful here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid digressing too much at this point. I've only mentioned filtering since I can't run a search without passing in some Bits instance, and I didn't want to do that with zero commentary. I was planning to elaborate further in other docs, but do you think it's too confusing without it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants