Bug: cannot open old database (created with 2.9.2) with new version (2.12.0) #423

zh217 · 2024-05-26T13:58:44Z

Describe the bug

Newer versions of the library cannot open databases created with older versions.

Steps to reproduce

With the Python client, version 2.12.0:

usearch.index.Index.restore(idx_path, view=True)

results in

ValueError: Unsupported metric!

where the datafile was created with version 2.9.2 with:

usearch.index.Index(ndim=1024, metric='ip')

Version 2.9.2 can open the datafile without problems.

On further testing, all versions from 2.10.0 onwards fail to open the database.

Expected behavior

Version 2.12.0 should be able to open a database created with version 2.9.2, as the version numbers do not indicate any breaking changes.

USearch version

v2.12.0

Operating System

Ubuntu 22.04

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

No response

Are you open to being tagged as a contributor?

I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct


ashvardanian · 2024-05-26T15:31:35Z

There were no changes in the file format, but the number of checks and assertions grew. Apparently, one of those checks is hurting us here.

Does it also fail if you create an arbitrary index, and then call .load - reinitializing it with a different file?

zh217 · 2024-05-26T17:06:19Z

It fails with a different error:

RuntimeError: Key type doesn't match, consider rebuilding

triggered by the following code:

idx = usearch.index.Index(ndim=1024, metric='ip')
idx.load(idx_path)

which runs fine if downgraded to 2.9.2.

ashvardanian · 2024-05-26T19:15:21Z

Interesting. Any chance the file was corrupted somewhere in between?

zh217 · 2024-05-27T02:58:22Z

No. Here is a minimal example that you can test:

# run with usearch-2.9.2 installed

import usearch.index

idx = usearch.index.Index(ndim=1024, metric='ip')
idx.save('index')

# run with usearch-2.12.0 installed

import usearch.index

# will throw an error in usearch-2.12.0
idx = usearch.index.Index.restore('index', view=True)

There's no need to insert anything into the database to trigger the error. Presumably the metadata written by the old version is what trips up the new one.

zh217 · 2024-06-03T09:08:53Z

Update: this breaks in both directions. Old versions cannot open databases created by the new version either.

zh217 · 2024-06-03T09:57:54Z

> There were no changes in the file format, but the number of checks and assertions grew. Apparently, one of those checks is hurting us here.

In fact the file format changed due to a subtle change in code.

Compare:

usearch/include/usearch/index_plugins.hpp

Lines 122 to 142 in 5ea48c8

    
enum class scalar_kind_t : std::uint8_t {
    unknown_k = 0,
    // Custom:
    b1x8_k = 1,
    u40_k = 2,
    uuid_k = 3,
    // Common:
    f64_k = 10,
    f32_k = 11,
    f16_k = 12,
    f8_k = 13,
    // Common Integral:
    u64_k = 14,
    u32_k = 15,
    u16_k = 16,
    u8_k = 17,
    i64_k = 20,
    i32_k = 21,
    i16_k = 22,
    i8_k = 23,
};

with:

usearch/include/usearch/index_plugins.hpp

Lines 128 to 148 in f79d818

    
enum class scalar_kind_t : std::uint8_t {
    unknown_k = 0,
    // Custom:
    b1x8_k,
    u40_k,
    uuid_k,
    // Common:
    f64_k,
    f32_k,
    f16_k,
    f8_k,
    // Common Integral:
    u64_k,
    u32_k,
    u16_k,
    u8_k,
    i64_k,
    i32_k,
    i16_k,
    i8_k,
};

so different versions interpret enums in the metadata differently.
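To make the difference concrete, here is a small Python sketch (not part of USearch) that transcribes both numberings and lists the members whose stored byte differs. The class names are made up for illustration and are keyed by the commit hash each listing was quoted from, so the sketch makes no claim about which one is older.

```python
from enum import IntEnum


class ScalarKindAt5ea48c8(IntEnum):
    """Explicit values, transcribed from the listing quoted at 5ea48c8."""
    unknown_k = 0
    b1x8_k = 1
    u40_k = 2
    uuid_k = 3
    f64_k = 10
    f32_k = 11
    f16_k = 12
    f8_k = 13
    u64_k = 14
    u32_k = 15
    u16_k = 16
    u8_k = 17
    i64_k = 20
    i32_k = 21
    i16_k = 22
    i8_k = 23


# Implicit values: C++ numbers each unlabelled enumerator as "previous + 1",
# so the listing quoted at f79d818 is simply 0, 1, 2, ... in declaration order.
# Both listings declare the members in the same order.
ScalarKindAtF79d818 = IntEnum(
    "ScalarKindAtF79d818",
    [(member.name, position) for position, member in enumerate(ScalarKindAt5ea48c8)],
)

# Members whose on-disk byte differs: name -> (implicit value, explicit value).
diverging = {
    member.name: (ScalarKindAtF79d818[member.name].value, member.value)
    for member in ScalarKindAt5ea48c8
    if ScalarKindAtF79d818[member.name].value != member.value
}

if __name__ == "__main__":
    for name, (implicit, explicit) in diverging.items():
        print(f"{name}: {implicit} vs {explicit}")
```

Only `unknown_k`, `b1x8_k`, `u40_k` and `uuid_k` keep the same value. Every other member moves, and some bytes are valid in both numberings with different meanings: a stored 10 reads as `f64_k` under the explicit numbering and as `u16_k` under the implicit one.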

As the metadata stored on disk also carries version information, we can make new versions of the library open old databases by mapping the old enum values to the new ones. There seems to be no easy fix for the reverse direction, however.

As this definitely breaks compatibility between versions (affecting all f16, f32, f64 indices and all languages), this should be marked as a breaking change.

zh217 · 2024-06-03T10:14:50Z

We can localize the damage by changing what is returned by this function:

usearch/include/usearch/index_dense.hpp

Lines 176 to 236 in 5ea48c8

    
inline index_dense_metadata_result_t index_dense_metadata_from_path(char const* file_path) noexcept {
    index_dense_metadata_result_t result;
    std::unique_ptr<std::FILE, int (*)(std::FILE*)> file(std::fopen(file_path, "rb"), &std::fclose);
    if (!file)
        return result.failed(std::strerror(errno));

    // Read the header
    std::size_t read = std::fread(result.head_buffer, sizeof(index_dense_head_buffer_t), 1, file.get());
    if (!read)
        return result.failed(std::feof(file.get()) ? "End of file reached!" : std::strerror(errno));

    // Check if the file immediately starts with the index, instead of vectors
    result.config.exclude_vectors = true;
    if (std::memcmp(result.head_buffer, default_magic(), std::strlen(default_magic())) == 0)
        return result;

    if (std::fseek(file.get(), 0L, SEEK_END) != 0)
        return result.failed("Can't infer file size");

    // Check if it starts with 32-bit
    std::size_t const file_size = std::ftell(file.get());

    std::uint32_t dimensions_u32[2]{0};
    std::memcpy(dimensions_u32, result.head_buffer, sizeof(dimensions_u32));
    std::size_t offset_if_u32 = std::size_t(dimensions_u32[0]) * dimensions_u32[1] + sizeof(dimensions_u32);

    std::uint64_t dimensions_u64[2]{0};
    std::memcpy(dimensions_u64, result.head_buffer, sizeof(dimensions_u64));
    std::size_t offset_if_u64 = std::size_t(dimensions_u64[0]) * dimensions_u64[1] + sizeof(dimensions_u64);

    // Check if it starts with 32-bit
    if (offset_if_u32 + sizeof(index_dense_head_buffer_t) < file_size) {
        if (std::fseek(file.get(), static_cast<long>(offset_if_u32), SEEK_SET) != 0)
            return result.failed(std::strerror(errno));
        read = std::fread(result.head_buffer, sizeof(index_dense_head_buffer_t), 1, file.get());
        if (!read)
            return result.failed(std::feof(file.get()) ? "End of file reached!" : std::strerror(errno));

        result.config.exclude_vectors = false;
        result.config.use_64_bit_dimensions = false;
        if (std::memcmp(result.head_buffer, default_magic(), std::strlen(default_magic())) == 0)
            return result;
    }

    // Check if it starts with 64-bit
    if (offset_if_u64 + sizeof(index_dense_head_buffer_t) < file_size) {
        if (std::fseek(file.get(), static_cast<long>(offset_if_u64), SEEK_SET) != 0)
            return result.failed(std::strerror(errno));
        read = std::fread(result.head_buffer, sizeof(index_dense_head_buffer_t), 1, file.get());
        if (!read)
            return result.failed(std::feof(file.get()) ? "End of file reached!" : std::strerror(errno));

        // Check if it starts with 64-bit
        result.config.exclude_vectors = false;
        result.config.use_64_bit_dimensions = true;
        if (std::memcmp(result.head_buffer, default_magic(), std::strlen(default_magic())) == 0)
            return result;
    }

    return result.failed("Not a dense USearch index!");
}

Since the result is returned in various places inside the function, maybe it is best to add a method on index_dense_metadata_result_t to "upgrade" its version to the new enum by mutating its headers appropriately.

I can make a pull request for it if that's OK.
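As a rough illustration of why a single "upgrade" hook is convenient, the probing logic can be re-sketched in Python so that all three successful exits funnel through one post-processing step. This is not USearch code: `MAGIC` and `HEAD_SIZE` are assumed stand-ins for `default_magic()` and `sizeof(index_dense_head_buffer_t)`, little-endian dimensions are assumed, and `upgrade_head` is a hypothetical placeholder that leaves the header untouched.

```python
import struct
from typing import NamedTuple, Optional

MAGIC = b"usearch"  # ASSUMPTION: stand-in for default_magic()
HEAD_SIZE = 64      # ASSUMPTION: stand-in for sizeof(index_dense_head_buffer_t)


class Probe(NamedTuple):
    head: bytes
    exclude_vectors: bool
    use_64_bit_dimensions: Optional[bool]  # None when the file holds no vectors


def _probe(data: bytes) -> Probe:
    """Mirror the three places where the quoted C++ returns a successful result."""
    head = data[:HEAD_SIZE]
    if len(head) < HEAD_SIZE:
        raise ValueError("End of file reached!")
    # Case 1: the file starts with the index header, no vectors in front of it.
    if head.startswith(MAGIC):
        return Probe(head, True, None)
    # Cases 2 and 3: a (rows, cols) pair of 32-bit or 64-bit integers, then the
    # vectors, then the header. Both guesses are decoded from the first bytes.
    for fmt, is_64_bit in (("<2I", False), ("<2Q", True)):
        rows, cols = struct.unpack_from(fmt, head)
        offset = rows * cols + struct.calcsize(fmt)
        if offset + HEAD_SIZE < len(data):
            candidate = data[offset:offset + HEAD_SIZE]
            if candidate.startswith(MAGIC):
                return Probe(candidate, False, is_64_bit)
    raise ValueError("Not a dense USearch index!")


def upgrade_head(head: bytes) -> bytes:
    """Hypothetical hook: rewrite legacy enum bytes here. Currently a no-op."""
    return head


def metadata_from_bytes(data: bytes) -> Probe:
    """Single exit point: whichever branch matched, the header is upgraded once."""
    probe = _probe(data)
    return probe._replace(head=upgrade_head(probe.head))
```

With this shape, the version-dependent translation lives in one function instead of being repeated before each `return result;`.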

ashvardanian · 2024-06-03T14:13:11Z

Good catch @zh217! I think a good solution would be a custom function to convert enum to integer and vice-versa, with respect to the file version. Can you add it in index_plugins?
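A minimal Python sketch of such a version-aware pair of conversion functions is shown below. It is only an illustration of the idea, not the code that landed in `index_plugins`: the function names are hypothetical, and it assumes that the implicit numbering (0, 1, 2, ...) is the pre-2.10 layout and that the explicit numbering applies from 2.10.0 on. If the direction is the other way around, the two tables swap.

```python
from typing import Dict, Tuple

Version = Tuple[int, int, int]

# Member names in declaration order, identical in both quoted listings.
_NAMES = (
    "unknown_k", "b1x8_k", "u40_k", "uuid_k",
    "f64_k", "f32_k", "f16_k", "f8_k",
    "u64_k", "u32_k", "u16_k", "u8_k",
    "i64_k", "i32_k", "i16_k", "i8_k",
)
# Explicit values from the listing quoted at 5ea48c8.
_EXPLICIT = (0, 1, 2, 3, 10, 11, 12, 13, 14, 15, 16, 17, 20, 21, 22, 23)

# ASSUMPTION: files written before this version use the implicit numbering.
_FIRST_EXPLICIT_VERSION: Version = (2, 10, 0)


def _table(file_version: Version) -> Dict[str, int]:
    """Pick the name -> integer table that matches the version in the file header."""
    if file_version >= _FIRST_EXPLICIT_VERSION:
        return dict(zip(_NAMES, _EXPLICIT))
    return dict(zip(_NAMES, range(len(_NAMES))))


def scalar_kind_to_int(name: str, file_version: Version) -> int:
    """Encode a scalar kind the way a library of `file_version` would store it."""
    return _table(file_version)[name]


def scalar_kind_from_int(raw: int, file_version: Version) -> str:
    """Decode a stored byte; values not used by that version map to unknown_k."""
    for name, value in _table(file_version).items():
        if value == raw:
            return name
    return "unknown_k"
```

For example, under these assumptions `scalar_kind_from_int(5, (2, 9, 2))` and `scalar_kind_from_int(11, (2, 12, 0))` both yield `"f32_k"`, while the same byte 11 read with the 2.9.2 table would decode as `"u8_k"`.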

brittlewis12 · 2024-07-30T18:10:43Z

@ashvardanian any update on introducing the backwards compatibility for pre-2.10 indexes in #438?

This would be valuable to me, as it would let me avoid recomputing all old indexes. If it's not expected to be introduced, I will go ahead and recompute them.

ashvardanian · 2024-07-30T21:10:20Z

Working on it today.

Fix: Compatibility with pre-v2.10 (Closes #423)

zh217 added the bug Something isn't working label May 26, 2024

zh217 added a commit to zh217/usearch that referenced this issue Jun 3, 2024

Compatibility with pre-2.10 versions unum-cloud#423

4b08b79

zh217 mentioned this issue Jun 3, 2024

Compatibility with pre-2.10 versions https://github.com/unum-cloud/us… #438

Merged

Stefano-t mentioned this issue Jun 26, 2024

Bug: 'Illegal Instruction' while restoring an index from disk #446

Closed

3 tasks

ashvardanian closed this as completed in da7a86c Aug 6, 2024
