You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As can be seen, both relative_fst_offset and fst_size are defined as uint32.
// The byte offset of the Finite State Transducer (FST) data relative to the `base_offset`.
uint32 relative_fst_offset = 4;
// The size in bytes of the FST data.
uint32 fst_size = 5;
This implies that if the total size of the bitmaps exceeds 4.3GB, the relative_fst_offset will overflow, causing the FST to be located at an incorrect position when read.
Let's examine the possibility that the total bitmap size exceeds 4.3GB:
Suppose there are 100k distinct values in the column, each occupying an average of 43KB of bitmap size.
Since the bitmap is implemented with the simplest bitvec, 43KB actually represents a maximum of 344k segments, which is 352 million rows.
This is achievable, so this issue urgently needs to be addressed.
Some possible solutions are:
Optimize bitmap implementation, compress the bitmap, or adopt a roaring bitmap.
Introduce additional levels of indexing, SST -> Row Group -> Segment, to control the number of indexing targets.
Note: Simply adjusting relative_fst_offset to uint64 is insufficient because the Value that the FST maps to represents the position and size of the bitmap, which also uses u32. Therefore, the positioning of the bitmap will also be affected by the total size of the bitmap.
The text was updated successfully, but these errors were encountered:
zhongzc
changed the title
Meta of the inverted index overflows, causing the index to become unavailable
Metainfo of the inverted index may overflow, causing the index to become unavailable
May 20, 2024
In another aspect, maybe we should restrict the total row number of an SST file. We don't want to generate a very large parquet file and @v0y4g3r is working on this.
We may also skip building the inverted index for a column if the number of distinct is too large but I'm not sure what the appropriate threshold is.
Introduce additional levels of indexing, SST -> Row Group -> Segment, to control the number of indexing targets.
Maybe we can build an inverted index per row group since the number of rows inside a row group won't be too large. We need to enlarge the row group size (rows in a row group) to reduce the number of row groups per file.
What type of bug is this?
Unexpected error
What subsystems are affected?
Storage Engine
Problem Report
Below is the definition of
InvertedIndexMeta
:https://github.com/GreptimeTeam/greptime-proto/blob/a11db14b8502f55ca5348917fd18e6fcf140f55e/proto/greptime/v1/index/inverted_index.proto#L41-L68
As can be seen, both
relative_fst_offset
andfst_size
are defined asuint32
.According to the layout of the inverted index:
This implies that if the total size of the bitmaps exceeds 4.3GB, the
relative_fst_offset
will overflow, causing the FST to be located at an incorrect position when read.Let's examine the possibility that the total bitmap size exceeds 4.3GB:
This is achievable, so this issue urgently needs to be addressed.
Some possible solutions are:
Note: Simply adjusting
relative_fst_offset
touint64
is insufficient because the Value that the FST maps to represents the position and size of the bitmap, which also uses u32. Therefore, the positioning of the bitmap will also be affected by the total size of the bitmap.The text was updated successfully, but these errors were encountered: