This repository was archived by the owner on Apr 4, 2023. It is now read-only.
Refactor the Facets databases to enable incremental indexing #619
Merged
60 commits
c3f49f7
Prepare refactor of facets database
7913d63
Update Facets indexing to be compatible with new database structure
63ef0ab
Start porting facet distribution and sort to new database structure
b8a1caa
Add range search and incremental indexing algorithm
5a904cf
Reintroduce facet distribution functionality
6cc9182
Remove unused heed codec files
22d80ee
Reintroduce facet deletion functionality
39a4a0a
Reintroduce filter range search and facet extractors
bd2c0e1
Remove unused code
e570c23
Reintroduce asc/desc functionality
fb8d23d
Reintroduce db_snap! for facet databases
e8a156d
Reorganise facets database indexing code
d30c89e
Fix compile error+warnings in new tests
85824ee
Try to make facet indexing incremental
68cbcdf
Fix compile errors/warnings in http-ui and infos
6125224
Fix some facet indexing bugs
07ff92c
Add more snapshots from facet tests
36296bb
Add facet incremental indexing snapshot tests + fix bug
a7201ec
cargo fmt
afdf87f
Fix bugs in asc/desc criterion and facet indexing
079ed4a
Add more snapshots
982efab
Fix encoding bugs in facet databases
3d145d7
Merge the two <facetttype>_faceted_documents_ids methods into one
9b55e58
Add FacetsUpdate type that wraps incremental and bulk indexing methods
485a723
Refactor facet-related codecs
330c9eb
Rename facet codecs and refine FacetsUpdate API
9026867
Give same interface to bulk and incremental facet indexing types
b2f01ad
Refactor facet database tests
bee3c23
Add comparison benchmark between bulk and incremental facet indexing
27454e9
Document and refine facet indexing algorithms
fca4577
Return original string in facet distributions, work on facet tests
3d7ed32
Fix bug in string facet distribution with few candidates
b1ab091
Remove outdated TODOs
985a94a
cargo fmt
de52a9b
Improve documentation of some facet-related algorithms
86d9f50
Fix bugs in incremental facet indexing with variable parameters
3baa34d
Fix compiler errors/warnings
cb8442a
Further unify facet databases of f64s and strings
51961e1
Polish some details
1ecd3bb
Fix bug in FieldDocIdFacetCodec
a2270b7
Change fuzzcheck dependency to point to git repository
d010962
Fix a bug in facet_range_search and add documentation
0ade699
Don't crash when failing to decode using StrRef codec
1165ba2
Make facet deletion incremental
a034a1e
Move StrRefCodec and ByteSliceRefCodec to their own files
loiclec acc8cae
Add link to GitHub PR to document of update/facet module
loiclec 2295e0e
Use real delete function in facet indexing fuzz tests
loiclec ee1abfd
Ignore files generated by fuzzcheck
loiclec d885de1
Add option to avoid soft deletion of documents
ab5e56f
Add document deletion snapshot tests and tests for hard-deletion
e3ba1fc
Make deletion tests for both soft-deletion and hard-deletion
f198b20
Add facet deletion tests that use both the incremental and bulk methods
loiclec 206a3e0
cargo fmt
loiclec 14ca804
Add some documentation on how to run the facet db fuzzer
loiclec 3b1f908
Revert behaviour of facet distribution to what it was before
loiclec b7f2428
Fix formatting and warning after rebasing from main
loiclec 2741756
Merge remote-tracking branch 'origin/main' into facet-levels-refactor
loiclec 631e991
Depend on released version of fuzzcheck from crates.io
loiclec 2fa85a2
Remove outdated files from http-ui/ and infos/
loiclec 54c0cf9
Merge remote-tracking branch 'origin/main' into facet-levels-refactor
loiclec
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| use std::borrow::Cow; | ||
| | ||
| use heed::{BytesDecode, BytesEncode}; | ||
| | ||
| /// A codec for values of type `&[u8]`. Unlike `ByteSlice`, its `EItem` and `DItem` associated | ||
| /// types are equivalent (= `&'a [u8]`) and these values can reside within another structure. | ||
| pub struct ByteSliceRefCodec; | ||
| | ||
| impl<'a> BytesEncode<'a> for ByteSliceRefCodec { | ||
| type EItem = &'a [u8]; | ||
| | ||
| fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>> { | ||
| Some(Cow::Borrowed(item)) | ||
| } | ||
| } | ||
| | ||
| impl<'a> BytesDecode<'a> for ByteSliceRefCodec { | ||
| type DItem = &'a [u8]; | ||
| | ||
| fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> { | ||
| Some(bytes) | ||
| } | ||
| } |
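
The codec above is an identity mapping over byte slices: encoding borrows the input and decoding returns the stored bytes unchanged. A minimal sketch of that round trip follows. To keep it compilable without the `heed` crate, the two traits are hypothetical stand-ins reduced to what the sketch needs, and `round_trip` is an illustrative helper that does not appear in the pull request.

```rust
use std::borrow::Cow;

// Hypothetical stand-ins for heed's `BytesEncode` / `BytesDecode` traits,
// reduced to the parts this sketch uses (assumption: not the real definitions).
pub trait BytesEncode<'a> {
    type EItem: 'a;
    fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>>;
}

pub trait BytesDecode<'a> {
    type DItem: 'a;
    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem>;
}

pub struct ByteSliceRefCodec;

impl<'a> BytesEncode<'a> for ByteSliceRefCodec {
    // Same type as `DItem` below: a borrowed byte slice.
    type EItem = &'a [u8];

    fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>> {
        // No copy is made: the encoded form borrows the caller's bytes.
        Some(Cow::Borrowed(item))
    }
}

impl<'a> BytesDecode<'a> for ByteSliceRefCodec {
    type DItem = &'a [u8];

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        Some(bytes)
    }
}

/// Illustrative helper (hypothetical): encodes a slice, then decodes the result.
pub fn round_trip<'a>(key: &'a &'a [u8]) -> Option<&'a [u8]> {
    match ByteSliceRefCodec::bytes_encode(key)? {
        // The encoder always borrows, so this is the arm taken in practice.
        Cow::Borrowed(bytes) => ByteSliceRefCodec::bytes_decode(bytes),
        // Not produced by this codec; handled only to keep the match exhaustive.
        Cow::Owned(_) => None,
    }
}
```

Because both associated types are `&'a [u8]`, a decoded value can be passed straight back to the encoder, and in this sketch the slice returned by `round_trip` points at the same memory as the input.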
50 changes: 0 additions & 50 deletions
milli/src/heed_codec/facet/facet_string_level_zero_codec.rs
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,90 +0,0 @@ | ||
| use std::borrow::Cow; | ||
| use std::convert::TryInto; | ||
| use std::{marker, str}; | ||
| | ||
| use crate::error::SerializationError; | ||
| use crate::heed_codec::RoaringBitmapCodec; | ||
| use crate::{try_split_array_at, try_split_at, Result}; | ||
| | ||
| pub type FacetStringLevelZeroValueCodec = StringValueCodec<RoaringBitmapCodec>; | ||
| | ||
| /// A codec that encodes a string in front of a value. | ||
| /// | ||
| /// The usecase is for the facet string levels algorithm where we must know the | ||
| /// original string of a normalized facet value, the original values are stored | ||
| /// in the value to not break the lexicographical ordering of the LMDB keys. | ||
| pub struct StringValueCodec<C>(marker::PhantomData<C>); | ||
| | ||
| impl<'a, C> heed::BytesDecode<'a> for StringValueCodec<C> | ||
| where | ||
| C: heed::BytesDecode<'a>, | ||
| { | ||
| type DItem = (&'a str, C::DItem); | ||
| | ||
| fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> { | ||
| let (string, bytes) = decode_prefix_string(bytes)?; | ||
| C::bytes_decode(bytes).map(|item| (string, item)) | ||
| } | ||
| } | ||
| | ||
| impl<'a, C> heed::BytesEncode<'a> for StringValueCodec<C> | ||
| where | ||
| C: heed::BytesEncode<'a>, | ||
| { | ||
| type EItem = (&'a str, C::EItem); | ||
| | ||
| fn bytes_encode((string, value): &'a Self::EItem) -> Option<Cow<[u8]>> { | ||
| let value_bytes = C::bytes_encode(value)?; | ||
| | ||
| let mut bytes = Vec::with_capacity(2 + string.len() + value_bytes.len()); | ||
| encode_prefix_string(string, &mut bytes).ok()?; | ||
| bytes.extend_from_slice(&value_bytes[..]); | ||
| | ||
| Some(Cow::Owned(bytes)) | ||
| } | ||
| } | ||
| | ||
| pub fn decode_prefix_string(value: &[u8]) -> Option<(&str, &[u8])> { | ||
| let (original_length_bytes, bytes) = try_split_array_at(value)?; | ||
| let original_length = u16::from_be_bytes(original_length_bytes) as usize; | ||
| let (string, bytes) = try_split_at(bytes, original_length)?; | ||
| let string = str::from_utf8(string).ok()?; | ||
| Some((string, bytes)) | ||
| } | ||
| | ||
| pub fn encode_prefix_string(string: &str, buffer: &mut Vec<u8>) -> Result<()> { | ||
| let string_len: u16 = | ||
| string.len().try_into().map_err(|_| SerializationError::InvalidNumberSerialization)?; | ||
| buffer.extend_from_slice(&string_len.to_be_bytes()); | ||
| buffer.extend_from_slice(string.as_bytes()); | ||
| Ok(()) | ||
| } | ||
| | ||
| #[cfg(test)] | ||
| mod tests { | ||
| use heed::types::Unit; | ||
| use heed::{BytesDecode, BytesEncode}; | ||
| use roaring::RoaringBitmap; | ||
| | ||
| use super::*; | ||
| | ||
| #[test] | ||
| fn deserialize_roaring_bitmaps() { | ||
| let string = "abc"; | ||
| let docids: RoaringBitmap = (0..100).chain(3500..4398).collect(); | ||
| let key = (string, docids.clone()); | ||
| let bytes = StringValueCodec::<RoaringBitmapCodec>::bytes_encode(&key).unwrap(); | ||
| let (out_string, out_docids) = | ||
| StringValueCodec::<RoaringBitmapCodec>::bytes_decode(&bytes).unwrap(); | ||
| assert_eq!((out_string, out_docids), (string, docids)); | ||
| } | ||
| | ||
| #[test] | ||
| fn deserialize_unit() { | ||
| let string = "def"; | ||
| let key = (string, ()); | ||
| let bytes = StringValueCodec::<Unit>::bytes_encode(&key).unwrap(); | ||
| let (out_string, out_unit) = StringValueCodec::<Unit>::bytes_decode(&bytes).unwrap(); | ||
| assert_eq!((out_string, out_unit), (string, ())); | ||
| } | ||
| } | ||
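
The deleted codec stores the original string in front of the value as a big-endian `u16` length followed by the UTF-8 bytes. The sketch below reproduces that layout using only the standard library. It is an approximation, not the removed code: it returns `Option` instead of the crate's `Result`, and it replaces the crate helpers `try_split_array_at` and `try_split_at` with explicit length checks.

```rust
use std::convert::TryFrom;
use std::str;

/// Appends `string` to `buffer` as a big-endian `u16` length followed by its UTF-8 bytes.
/// Returns `None` when the string is longer than `u16::MAX` bytes.
/// (Assumption: the deleted code returned the crate's `Result` type instead.)
pub fn encode_prefix_string(string: &str, buffer: &mut Vec<u8>) -> Option<()> {
    let len = u16::try_from(string.len()).ok()?;
    buffer.extend_from_slice(&len.to_be_bytes());
    buffer.extend_from_slice(string.as_bytes());
    Some(())
}

/// Splits `value` into the length-prefixed string and the bytes that follow it.
/// Returns `None` if the input is truncated or the string is not valid UTF-8.
pub fn decode_prefix_string(value: &[u8]) -> Option<(&str, &[u8])> {
    // The first two bytes hold the string length.
    if value.len() < 2 {
        return None;
    }
    let (len_bytes, rest) = value.split_at(2);
    let len = u16::from_be_bytes([len_bytes[0], len_bytes[1]]) as usize;

    // The next `len` bytes are the string; whatever remains is the wrapped value.
    if rest.len() < len {
        return None;
    }
    let (string, rest) = rest.split_at(len);
    let string = str::from_utf8(string).ok()?;
    Some((string, rest))
}
```

For example, encoding `"abc"` and then appending a one-byte payload `0xFF` yields the bytes `00 03 61 62 63 FF`, and decoding that buffer returns `"abc"` together with the remaining slice `[0xFF]`.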