
vecbench: download dataset as separate files #149940


Merged 1 commit on Jul 16, 2025


andy-kimball
Contributor

Update DatasetLoader (used by vecbench and the vecann workload) to download the train, test, and neighbors vector sets as separate files rather than as a single file. Keeping the sets in separate files will allow huge train datasets to be split across multiple files. For the new files, use the .fbin and .ibin formats:

.fbin
[num_vectors (uint32), vector_dim (uint32), vector_array (float32)]

.ibin
[num_vectors (uint32), num_neighbors_per_vector (uint32), neighbor_array (int32)]
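
As a rough illustration of the .fbin layout above, here is a minimal Go sketch that round-trips a small vector set through that format. The names `EncodeFbin` and `DecodeFbin` are hypothetical and are not taken from this PR, and little-endian byte order is an assumption, since the description does not state it. An .ibin file would follow the same shape, with an int32 payload in place of float32.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// EncodeFbin serializes vectors as
// [num_vectors (uint32), vector_dim (uint32), vector_array (float32)].
// Hypothetical helper; little-endian byte order is an assumption.
func EncodeFbin(vectors [][]float32) ([]byte, error) {
	dim := 0
	if len(vectors) > 0 {
		dim = len(vectors[0])
	}
	var buf bytes.Buffer
	header := []uint32{uint32(len(vectors)), uint32(dim)}
	if err := binary.Write(&buf, binary.LittleEndian, header); err != nil {
		return nil, err
	}
	for i, v := range vectors {
		// Every vector must have the dimension recorded in the header.
		if len(v) != dim {
			return nil, fmt.Errorf("vector %d has %d dims, want %d", i, len(v), dim)
		}
		if err := binary.Write(&buf, binary.LittleEndian, v); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// DecodeFbin parses data produced by EncodeFbin and rejects payloads whose
// size does not match the count and dimension declared in the header.
func DecodeFbin(data []byte) ([][]float32, error) {
	r := bytes.NewReader(data)
	header := make([]uint32, 2)
	if err := binary.Read(r, binary.LittleEndian, header); err != nil {
		return nil, fmt.Errorf("reading header: %w", err)
	}
	count, dim := int(header[0]), int(header[1])
	want := int64(count) * int64(dim) * 4
	if int64(r.Len()) != want {
		return nil, fmt.Errorf("payload is %d bytes, want %d", r.Len(), want)
	}
	vectors := make([][]float32, count)
	for i := range vectors {
		vectors[i] = make([]float32, dim)
		if err := binary.Read(r, binary.LittleEndian, vectors[i]); err != nil {
			return nil, fmt.Errorf("reading vector %d: %w", i, err)
		}
	}
	return vectors, nil
}
```

Because the header carries both the count and the dimension, each file is self-describing, which is what would make it possible to split a large train set across several files.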

Epic: CRDB-42943
Release note: None

@andy-kimball andy-kimball requested review from DrewKimball and mw5h July 10, 2025 23:36
@andy-kimball andy-kimball requested a review from a team as a code owner July 10, 2025 23:36
@andy-kimball andy-kimball requested review from herkolategan and golgeek and removed request for a team July 10, 2025 23:36
@cockroach-teamcity
Member

This change is Reviewable

Collaborator

@DrewKimball DrewKimball left a comment


:lgtm:

Reviewed 7 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @golgeek, @herkolategan, and @mw5h)


pkg/workload/vecann/datasets.go line 145 at r1 (raw file):

		trainCount = trainSet.Count
	} else {
		// Only read the header to get the train vector count.

Maybe there should be a single metadata file with the counts + dimensions?

Contributor Author

@andy-kimball andy-kimball left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @golgeek, @herkolategan, and @mw5h)


pkg/workload/vecann/datasets.go line 145 at r1 (raw file):

Previously, DrewKimball (Drew Kimball) wrote…

Maybe there should be a single metadata file with the counts + dimensions?

I think that would add some complexity that wouldn't be worth the benefit. It's easy and efficient to read the first few bytes of the file. We can revisit if we find some good reason to do that, though.
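
For reference, reading just the header might look roughly like the sketch below. This is not the code in datasets.go: `ParseHeader` and `ReadFileHeader` are hypothetical names, and little-endian byte order is assumed. Only the first 8 bytes of the file are read, so the cost does not depend on the size of the train set.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// headerSize is the size of the two uint32 fields that start an
// .fbin or .ibin file.
const headerSize = 8

// ParseHeader decodes the vector count and the per-vector width (dimension
// for .fbin, neighbors per vector for .ibin) from the first 8 bytes.
// Hypothetical helper; little-endian byte order is an assumption.
func ParseHeader(b []byte) (count, width uint32, err error) {
	if len(b) < headerSize {
		return 0, 0, fmt.Errorf("header needs %d bytes, got %d", headerSize, len(b))
	}
	count = binary.LittleEndian.Uint32(b[0:4])
	width = binary.LittleEndian.Uint32(b[4:8])
	return count, width, nil
}

// ReadFileHeader opens the file at path and reads only its header,
// leaving the vector payload untouched.
func ReadFileHeader(path string) (count, width uint32, err error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	var buf [headerSize]byte
	if _, err := io.ReadFull(f, buf[:]); err != nil {
		return 0, 0, fmt.Errorf("reading header of %s: %w", path, err)
	}
	return ParseHeader(buf[:])
}
```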

@andy-kimball
Contributor Author

bors r=drewkimball

@craig
Contributor

craig bot commented Jul 16, 2025

@craig craig bot merged commit 7fca691 into cockroachdb:master Jul 16, 2025
22 checks passed