[yugabyte#23998] Update third-party dependencies and enable SimSIMD in Usearch

Summary:
- Upgrade SimSIMD and Usearch
- Enable SimSIMD in Usearch in as many configurations as possible
- Add options to save index to a file and load it from a file in hnsw_tool

Test Plan:
Jenkins

Inline third-party dependencies will be added to the diff using `build-support/thirdparty_tool --sync-inline-thirdparty` and tested on Jenkins before this diff lands.

Reviewers: sergei

Reviewed By: sergei

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38349
mbautin committed Oct 1, 2024
1 parent fa38152 commit 6baf188
Showing 13 changed files with 215 additions and 43 deletions.
7 changes: 5 additions & 2 deletions CMakeLists.txt
@@ -404,8 +404,11 @@ if(IS_GCC)
# unnecessary. This does not happen in debug.
# https://gist.githubusercontent.com/mbautin/02d955abbee29f58c0d0d9cf7ab3291d/raw
ADD_CXX_FLAGS("-Wno-uninitialized")

# Also the use-after-free detector complains about Boost multi index container.
endif()
if ("${COMPILER_VERSION}" MATCHES "^1[23][.].*$" AND
"${YB_BUILD_TYPE}" STREQUAL "release")
# Also the use-after-free detector complains about Boost multi index container in both GCC 12
# and GCC 13.
# https://gist.githubusercontent.com/mbautin/de18543ea85d46db49dfa4b4b7df082a/raw
ADD_CXX_FLAGS("-Wno-use-after-free")
endif()
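
The version check in this hunk relies on the regular expression `^1[23][.].*$`. As a rough illustration, the sketch below reproduces the condition with Python's `re` module standing in for CMake's `MATCHES`. The helper name `needs_no_use_after_free` is hypothetical and not part of the build system.

```python
import re

# Same pattern as in the CMake condition; Python's `re` stands in for CMake's MATCHES.
GCC_12_OR_13 = re.compile(r"^1[23][.].*$")


def needs_no_use_after_free(compiler_version: str, build_type: str) -> bool:
    """Hypothetical helper mirroring the CMake condition from the hunk above."""
    return bool(GCC_12_OR_13.match(compiler_version)) and build_type == "release"


if __name__ == "__main__":
    for version in ("11.4.0", "12.2.1", "13.1.0", "14.0.1"):
        print(version, needs_no_use_after_free(version, "release"))
```

The pattern requires a dot after the major version, so a bare `12` would not match, and only major versions 12 and 13 are selected.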
6 changes: 3 additions & 3 deletions build-support/inline_thirdparty.yml
@@ -16,7 +16,7 @@
dependencies:
- name: usearch
git_url: https://github.com/unum-cloud/usearch
commit: 240fe9c298100f9e37a2d7377b1595be6ba1f412
commit: 191d9bb46fe5e2a44d1505ce7563ed51c7e55868
src_dir: include
dest_dir: usearch

@@ -33,7 +33,7 @@ dependencies:
dest_dir: hnswlib/hnswlib

- name: simsimd
git_url: https://github.com/ashvardanian/simsimd
git_url: https://github.com/yugabyte/simsimd
src_dir: include
dest_dir: simsimd
tag: v5.1.0
tag: v5.4.3-yb-1
89 changes: 68 additions & 21 deletions build-support/thirdparty_archives.yml
@@ -1,101 +1,148 @@
sha: b6b07342fdfd4a65ee2608d75dd31e4b0ecc0737
sha: d4cf290eedeb2daeaa8617f9c129bcd1044b4448
archives:

- os_type: almalinux8
architecture: x86_64
compiler_type: clang17
tag: v20240713003527-b6b07342fd-almalinux8-x86_64-clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072648-224279f6e8-almalinux8-x86_64-clang17

- os_type: almalinux8
architecture: x86_64
compiler_type: clang18
tag: v20240713003521-b6b07342fd-almalinux8-x86_64-clang18
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072646-224279f6e8-almalinux8-x86_64-clang18

- os_type: almalinux8
architecture: x86_64
compiler_type: gcc11
tag: v20240713003520-b6b07342fd-almalinux8-x86_64-gcc11
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072647-224279f6e8-almalinux8-x86_64-gcc11

- os_type: almalinux8
architecture: x86_64
compiler_type: gcc12
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072654-224279f6e8-almalinux8-x86_64-gcc12

- os_type: almalinux8
architecture: x86_64
compiler_type: gcc13
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072650-224279f6e8-almalinux8-x86_64-gcc13

- os_type: almalinux9
architecture: x86_64
compiler_type: clang17
tag: v20240713003540-b6b07342fd-almalinux9-x86_64-clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072659-224279f6e8-almalinux9-x86_64-clang17

- os_type: almalinux9
architecture: x86_64
compiler_type: gcc12
tag: v20240713003537-b6b07342fd-almalinux9-x86_64-gcc12
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072655-224279f6e8-almalinux9-x86_64-gcc12

- os_type: almalinux9
architecture: x86_64
compiler_type: gcc13
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072658-224279f6e8-almalinux9-x86_64-gcc13

- os_type: amzn2
architecture: aarch64
compiler_type: clang17
tag: v20240713003725-b6b07342fd-amzn2-aarch64-clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072848-224279f6e8-amzn2-aarch64-clang17

- os_type: amzn2
architecture: aarch64
compiler_type: clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
lto_type: full
tag: v20240713003827-b6b07342fd-amzn2-aarch64-clang17-full-lto
tag: v20240922073005-224279f6e8-amzn2-aarch64-clang17-full-lto

- os_type: amzn2
architecture: aarch64
compiler_type: clang18
tag: v20240713003831-b6b07342fd-amzn2-aarch64-clang18
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922073006-224279f6e8-amzn2-aarch64-clang18

- os_type: amzn2
architecture: aarch64
compiler_type: clang18
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
lto_type: full
tag: v20240713003853-b6b07342fd-amzn2-aarch64-clang18-full-lto
tag: v20240922072959-224279f6e8-amzn2-aarch64-clang18-full-lto

- os_type: amzn2
architecture: x86_64
compiler_type: clang17
tag: v20240713003538-b6b07342fd-amzn2-x86_64-clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072704-224279f6e8-amzn2-x86_64-clang17

- os_type: amzn2
architecture: x86_64
compiler_type: clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
lto_type: full
tag: v20240713003542-b6b07342fd-amzn2-x86_64-clang17-full-lto
tag: v20240922072706-224279f6e8-amzn2-x86_64-clang17-full-lto

- os_type: amzn2
architecture: x86_64
compiler_type: clang18
tag: v20240713003544-b6b07342fd-amzn2-x86_64-clang18
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072707-224279f6e8-amzn2-x86_64-clang18

- os_type: amzn2
architecture: x86_64
compiler_type: clang18
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
lto_type: full
tag: v20240713003540-b6b07342fd-amzn2-x86_64-clang18-full-lto
tag: v20240922072704-224279f6e8-amzn2-x86_64-clang18-full-lto

- os_type: macos
architecture: arm64
compiler_type: clang
tag: v20240713011052-b6b07342fd-macos-arm64
tag: v20240925171709-d4cf290eed-macos-arm64

- os_type: macos
architecture: x86_64
compiler_type: clang
tag: v20240713003540-b6b07342fd-macos-x86_64
tag: v20240925171933-d4cf290eed-macos-x86_64

- os_type: ubuntu20.04
architecture: x86_64
compiler_type: clang16
tag: v20240713003520-b6b07342fd-ubuntu2004-x86_64-clang16
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072638-224279f6e8-ubuntu2004-x86_64-clang16

- os_type: ubuntu22.04
architecture: x86_64
compiler_type: clang17
tag: v20240713003517-b6b07342fd-ubuntu2204-x86_64-clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072641-224279f6e8-ubuntu2204-x86_64-clang17

- os_type: ubuntu22.04
architecture: x86_64
compiler_type: gcc11
tag: v20240713003527-b6b07342fd-ubuntu2204-x86_64-gcc11
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072641-224279f6e8-ubuntu2204-x86_64-gcc11

- os_type: ubuntu22.04
architecture: x86_64
compiler_type: gcc12
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072638-224279f6e8-ubuntu2204-x86_64-gcc12

- os_type: ubuntu24.04
architecture: x86_64
compiler_type: clang17
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072648-224279f6e8-ubuntu2404-x86_64-clang17

- os_type: ubuntu23.04
- os_type: ubuntu24.04
architecture: x86_64
compiler_type: gcc13
tag: v20240713003516-b6b07342fd-ubuntu2304-x86_64-gcc13
sha: 224279f6e85b8e54b55bf37a834a2d92295fd811
tag: v20240922072700-224279f6e8-ubuntu2404-x86_64-gcc13
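
Each new Linux entry pairs a full `sha` with a `tag` that appears to embed the first ten characters of that SHA. The sketch below checks that relationship for two entries copied from this diff. The tag layout (`v<timestamp>-<sha prefix>-<platform suffix>`) is inferred from the values shown here and is an assumption, not a documented format; `tag_matches_sha` is a hypothetical helper.

```python
from typing import Dict


def tag_matches_sha(entry: Dict[str, str]) -> bool:
    """Return True if the second dash-separated tag field is the 10-character SHA prefix.

    Assumes the layout v<timestamp>-<sha prefix>-<rest>, inferred from the diff above.
    """
    fields = entry["tag"].split("-")
    return len(fields) >= 3 and fields[1] == entry["sha"][:10]


entries = [
    {
        "sha": "224279f6e85b8e54b55bf37a834a2d92295fd811",
        "tag": "v20240922072648-224279f6e8-almalinux8-x86_64-clang17",
    },
    {
        # The macOS entries carry no per-entry sha, so the top-level sha is used here.
        "sha": "d4cf290eedeb2daeaa8617f9c129bcd1044b4448",
        "tag": "v20240925171709-d4cf290eed-macos-arm64",
    },
]

for entry in entries:
    print(entry["tag"], tag_matches_sha(entry))
```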
3 changes: 2 additions & 1 deletion cmake_modules/YugabyteFindThirdParty.cmake
@@ -320,7 +320,8 @@ macro(yb_find_third_party_dependencies)
# TODO: display this only if using a devtoolset compiler on CentOS, and ideally only if the error
# actually happens.
message("Note: if Boost fails to find Threads, you might need to install the "
"devtoolset-N-libatomic-devel package for the devtoolset you are using.")
"gcc-toolset-N-libatomic-devel package, or devtoolset-N-libatomic-devel package for "
"older RedHat/CentOS versions, where N is the toolset version number.")
endif()

# Find Boost static libraries.
9 changes: 8 additions & 1 deletion python/yugabyte/inline_thirdparty.py
@@ -236,9 +236,16 @@ def make_commit(
logging.info(f"Created an automatic commit for {dep.name}")


def sync_inline_thirdparty() -> None:
def sync_inline_thirdparty(deps_to_sync: List[str] = []) -> None:
config = read_yaml(INLINE_THIRDPARTY_CONFIG_PATH)
validate_config(config)
if deps_to_sync:
all_dep_names = {dep.name for dep in config.dependencies}
invalid_deps = [dep for dep in deps_to_sync if dep not in all_dep_names]
if invalid_deps:
raise ValueError(
f"The following specified dependencies do not exist: {', '.join(invalid_deps)}")
config.dependencies = [dep for dep in config.dependencies if dep.name in deps_to_sync]
if not git_util.is_git_clean(common_util.YB_SRC_ROOT):
raise RuntimeError(f"Local changes exist, cannot update inline third-party dependencies.")
clone_and_copy_subtrees(config.dependencies)
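
The filtering added to `sync_inline_thirdparty` can be sketched in isolation as follows. This is a simplified illustration, not the project's code: `Dep` and `select_deps` are hypothetical names, the dependency list is illustrative, and the sketch uses `None` rather than a mutable `[]` as the default argument, which is the more common Python idiom.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dep:
    """Hypothetical stand-in for an entry in inline_thirdparty.yml."""
    name: str


def select_deps(all_deps: List[Dep], deps_to_sync: Optional[List[str]] = None) -> List[Dep]:
    """Return the dependencies to sync, rejecting names that are not configured."""
    if not deps_to_sync:
        # No filter given: sync everything.
        return list(all_deps)
    known = {dep.name for dep in all_deps}
    invalid = [name for name in deps_to_sync if name not in known]
    if invalid:
        raise ValueError(
            f"The following specified dependencies do not exist: {', '.join(invalid)}")
    # Keep the configured order, regardless of the order the names were requested in.
    return [dep for dep in all_deps if dep.name in deps_to_sync]


deps = [Dep("usearch"), Dep("hnswlib"), Dep("simsimd")]
print([d.name for d in select_deps(deps, ["simsimd", "usearch"])])
```

Unknown names fail fast with a `ValueError` before any repository state is touched, which matches the order of checks in the hunk above.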
2 changes: 1 addition & 1 deletion python/yugabyte/thirdparty_tool.py
@@ -56,7 +56,7 @@ def main() -> None:
return

if args.sync_inline_thirdparty:
inline_thirdparty.sync_inline_thirdparty()
inline_thirdparty.sync_inline_thirdparty(args.inline_thirdparty_deps)
return

metadata = load_metadata()
5 changes: 5 additions & 0 deletions python/yugabyte/thirdparty_tool_impl.py
@@ -125,6 +125,11 @@ def parse_args() -> argparse.Namespace:
inline_thirdparty.INLINE_THIRDPARTY_SRC_DIR,
inline_thirdparty.INLINE_THIRDPARTY_CONFIG_PATH
))
parser.add_argument(
'--inline-thirdparty-deps',
nargs='+',
help='Names of inline third-party dependencies to sync. If not specified, all dependencies '
'will be synced.')

if len(sys.argv) == 1:
parser.print_help(sys.stderr)
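
Because `--inline-thirdparty-deps` uses `nargs='+'` with no default, `argparse` leaves the attribute as `None` when the flag is omitted, and a truthiness check such as `if deps_to_sync:` treats that the same as an empty list. A minimal standalone sketch follows; the `build_parser` helper and the `store_true` action for `--sync-inline-thirdparty` are assumptions made for illustration, since that option's real definition is outside this hunk.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumed to be a boolean switch; the real definition is not shown in this hunk.
    parser.add_argument('--sync-inline-thirdparty', action='store_true')
    parser.add_argument(
        '--inline-thirdparty-deps',
        nargs='+',
        help='Names of inline third-party dependencies to sync. If not specified, all '
             'dependencies will be synced.')
    return parser


parser = build_parser()
# Flag omitted: the attribute is None.
print(parser.parse_args(['--sync-inline-thirdparty']).inline_thirdparty_deps)
# Flag given: the attribute is a list of the names that follow it.
print(parser.parse_args(
    ['--sync-inline-thirdparty', '--inline-thirdparty-deps', 'usearch', 'simsimd']
).inline_thirdparty_deps)
```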
45 changes: 38 additions & 7 deletions src/yb/tools/hnsw_tool.cc
@@ -92,7 +92,9 @@ struct BenchmarkArguments {
size_t num_index_shards = 1;
std::string build_vecs_path;
std::string ground_truth_path;
std::string load_index_from_path;
std::string query_vecs_path;
std::string save_index_to_path;
ANNMethodKind ann_method;

std::string ToString() const {
@@ -167,6 +169,11 @@ std::unique_ptr<OptionsDescription> BenchmarkOptions() {
"Input file containing vectors to build the index on, in the fvecs/bvecs/ivecs format.")
(OPTIONAL_ARG_FIELD(query_vecs_path),
"Input file containing vectors to query the dataset with, in the fvecs/bvecs/ivecs format.")
(OPTIONAL_ARG_FIELD(save_index_to_path),
"Save the index to this path.")
(OPTIONAL_ARG_FIELD(load_index_from_path),
"Load the index from this path, or read it from disk without loading fully into memory, "
"if the index supports it. This supersedes the index build procedure.")
(OPTIONAL_ARG_FIELD(ground_truth_path),
"Input file containing integer vectors of correct nearest neighbor vector identifiers "
"(0-based in the input dataset) for each query.")
@@ -196,7 +203,9 @@ std::unique_ptr<OptionsDescription> BenchmarkOptions() {
(HNSW_OPTION_ARG(ml),
"The scaling factor used to randomly select the level for a newly added vertex. "
"Setting this to 1 / log(2), or ~1.44, results in the average number of points at every "
"level being half of the number of points at the level below it. Higher values of this ")
"level being half of the number of points at the level below it. Higher values of this "
"parameter result in a more compact graph with fewer levels, but may increase the search "
"time.")
(HNSW_OPTION_ARG(ef_construction),
"The number of closest neighbors at each level that are used to determine the candidates "
"used for constructing the neighborhood of a newly added vertex. Higher values result in "
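
The statement about `ml = 1 / log(2)` in the option description above can be checked numerically. The sketch below assumes the level-assignment rule from the original HNSW algorithm, `level = floor(-ln(u) * ml)` with `u` uniform in (0, 1]; whether the indexes used by this tool apply exactly this rule is an assumption. Under that rule, the probability that a point reaches level k or higher is exp(-k / ml), which equals 2^-k when ml = 1 / ln 2. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def random_level(ml: float, rng: random.Random) -> int:
    """Draw a level using floor(-ln(u) * ml), with u uniform in (0, 1]."""
    u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1] and log(u) is defined.
    return int(-math.log(u) * ml)


def points_per_level(num_points: int, ml: float, seed: int = 0) -> list:
    """Count points present at each level; a point with top level L is on levels 0..L."""
    rng = random.Random(seed)
    top_levels = Counter(random_level(ml, rng) for _ in range(num_points))
    counts = [0] * (max(top_levels) + 1)
    for level, n in top_levels.items():
        for present_on in range(level + 1):
            counts[present_on] += n
    return counts


counts = points_per_level(200_000, ml=1 / math.log(2))
for level in range(1, 5):
    # Each ratio should be close to 0.5 for ml = 1 / ln 2.
    print(level, counts[level] / counts[level - 1])
```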
@@ -275,7 +284,7 @@ class BenchmarkTool {
const BenchmarkArguments& args,
PreVectorIndexFactory<IndexedVector, IndexedDistanceResult> index_factory)
: args_(args),
index_factory_(std::move(index_factory)) {
index_pre_factory_(std::move(index_factory)) {
}

Status Execute() {
@@ -315,7 +324,7 @@ class BenchmarkTool {
PrintConfiguration();

if (args_.num_index_shards > 1) {
index_factory_ = [pre_factory = index_factory_, num_shards = args_.num_index_shards](
index_pre_factory_ = [pre_factory = index_pre_factory_, num_shards = args_.num_index_shards](
const HNSWOptions& options) {
return [factory = pre_factory(options), num_shards]() {
return std::make_unique<ShardedVectorIndex<IndexedVector, IndexedDistanceResult>>(
@@ -324,9 +333,26 @@ class BenchmarkTool {
};
}

vector_index_ = index_factory_(hnsw_options())();
vector_index_ = index_pre_factory_(hnsw_options())();

RETURN_NOT_OK(BuildIndex());
RETURN_NOT_OK(vector_index_->Reserve(num_points_to_insert()));
if (!args_.load_index_from_path.empty()) {
LOG(INFO) << "Loading index from " << args_.load_index_from_path;
auto load_start_time = MonoTime::Now();
RETURN_NOT_OK(vector_index_->AttachToFile(args_.load_index_from_path));
LOG(INFO) << "Loaded index from " << args_.load_index_from_path
<< " in " << MonoTime::Now().GetDeltaSince(load_start_time);
} else {
RETURN_NOT_OK(BuildIndex());
if (!args_.save_index_to_path.empty()) {
auto save_start_time = MonoTime::Now();
LOG(INFO) << "Saving index to " << args_.save_index_to_path;
RETURN_NOT_OK(vector_index_->SaveToFile(args_.save_index_to_path));
LOG(INFO) << "Saved index to " << args_.save_index_to_path
<< " in " << MonoTime::Now().GetDeltaSince(save_start_time);

}
}

RETURN_NOT_OK(Validate());
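
The control flow added in this hunk is a load-or-build pattern: if a load path is given, the build step is skipped entirely; otherwise the index is built and optionally saved. A sketch of that branching in Python follows. `ToyIndex`, its JSON file format, and `prepare_index` are hypothetical and only mirror the branching and timing, not the real `VectorIndexIf` interface.

```python
import json
import os
import tempfile
import time
from typing import List, Optional


class ToyIndex:
    """Hypothetical stand-in for a vector index; stores vectors in a plain list."""

    def __init__(self) -> None:
        self.vectors: List[List[float]] = []

    def build(self, vectors: List[List[float]]) -> None:
        self.vectors = [list(v) for v in vectors]

    def save_to_file(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(self.vectors, f)

    def attach_to_file(self, path: str) -> None:
        with open(path) as f:
            self.vectors = json.load(f)


def prepare_index(vectors: List[List[float]], load_from: Optional[str] = None,
                  save_to: Optional[str] = None) -> ToyIndex:
    index = ToyIndex()
    if load_from:
        # Loading supersedes building, as the option description states.
        start = time.monotonic()
        index.attach_to_file(load_from)
        print(f"Loaded index from {load_from} in {time.monotonic() - start:.3f}s")
    else:
        index.build(vectors)
        if save_to:
            start = time.monotonic()
            index.save_to_file(save_to)
            print(f"Saved index to {save_to} in {time.monotonic() - start:.3f}s")
    return index


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.json")
    built = prepare_index([[0.0, 1.0], [2.0, 3.0]], save_to=path)
    loaded = prepare_index([], load_from=path)
    print(loaded.vectors == built.vectors)
```

Note that in this flow a save path is ignored when a load path is also given, since saving happens only on the build branch.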

@@ -546,7 +572,12 @@ class BenchmarkTool {
num_vectors_to_load, total_mem_required_mb, args_.max_memory_for_loading_vectors_mb);
}

MonoTime load_start_time = MonoTime::Now();
LOG(INFO) << "Loading vectors from " << indexed_vector_source_->file_path() << "...";
input_vectors_ = VERIFY_RESULT(indexed_vector_source_->LoadVectors(num_vectors_to_load));
LOG(INFO) << "Loaded " << input_vectors_.size() << " vectors in "
<< (MonoTime::Now() - load_start_time);

size_t num_points_used = 0;
const size_t max_to_insert = max_num_vectors_to_insert();

@@ -650,7 +681,7 @@ class BenchmarkTool {
// Source for vectors to run validation queries on.
std::unique_ptr<VectorSource<InputVector>> query_vector_source_;

PreVectorIndexFactory<IndexedVector, IndexedDistanceResult> index_factory_;
PreVectorIndexFactory<IndexedVector, IndexedDistanceResult> index_pre_factory_;
VectorIndexIfPtr<IndexedVector, IndexedDistanceResult> vector_index_;

// Atomics used in multithreaded index construction.
@@ -712,7 +743,7 @@ Status BenchmarkExecute(const BenchmarkArguments& args) {
((Usearch, L2Squared, float, float )) \
((Usearch, L2Squared, uint8_t, float )) \
((Hnswlib, L2Squared, float, float )) \
((Hnswlib, L2Squared, uint8_t, uint8_t)) \
((Hnswlib, L2Squared, uint8_t, float)) \
/* Cosine similarity */ \
((Usearch, Cosine, float, float )) \
((Usearch, Cosine, uint8_t, float )) \
21 changes: 21 additions & 0 deletions src/yb/vector/hnswlib_wrapper.cc
@@ -66,10 +66,31 @@ class HnswlibIndex : public VectorIndexIf<Vector, DistanceResult> {
}

Status Insert(VertexId vertex_id, const Vector& v) override {
CHECK_NOTNULL(hnsw_);
hnsw_->addPoint(v.data(), vertex_id);
return Status::OK();
}

Status SaveToFile(const std::string& file_path) const override {
try {
hnsw_->saveIndex(file_path);
} catch (std::exception& e) {
return STATUS_FORMAT(
IOError, "Failed to save Hnswlib index to file $0: $1", file_path, e.what());
}
return Status::OK();
}

Status AttachToFile(const std::string& file_path) override {
try {
hnsw_->loadIndex(file_path, space_.get());
} catch (std::exception& e) {
return STATUS_FORMAT(
IOError, "Failed to load Hnswlib index from file $0: $1", file_path, e.what());
}
return Status::OK();
}

std::vector<VertexWithDistance<DistanceResult>> Search(
const Vector& query_vector, size_t max_num_results) const override {
std::vector<VertexWithDistance<DistanceResult>> result;