Power of 2 fixed size chunks#7934
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7934 +/- ##
============================================
- Coverage 71.07% 71.05% -0.02%
- Complexity 4112 4132 +20
============================================
Files 1593 1595 +2
Lines 82372 82410 +38
Branches 12270 12270
============================================
+ Hits 58545 58560 +15
- Misses 19872 19896 +24
+ Partials 3955 3954 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
2c1e8cf to
418e77c
Compare
418e77c to
01ddc8b
Compare
...n/java/org/apache/pinot/segment/local/io/writer/impl/FixedByteChunkSVForwardIndexWriter.java
Outdated
Show resolved
Hide resolved
| */ | ||
| protected BaseChunkSVForwardIndexWriter(File file, ChunkCompressionType compressionType, int totalDocs, | ||
| int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version) | ||
| int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version, boolean fixed) |
There was a problem hiding this comment.
I don't think we need this extra fixed here for the version validation. Var-length V4 won't use this writer, but I don't think we should validate that here.
There was a problem hiding this comment.
I think it would be calamitous if this were used for variable length data by mistake, given that V4 for variable length data has a different layout. The only resolution would be to delete the data. So I felt it was important to validate.
|
Notes for merging: this will conflict with #7920 and has been kept separate because I tend to create a PR per change, but both need the benchmark. Once one of the two is merged I will need to rebase the benchmark in the other. |
|
What are the blockers here? |
This introduces a simple evolution of the raw index format for fixed width data types which merely enforces that chunk sizes are a power of 2. For example, if a chunk size of 1000 is chosen, the writer will round up to 1024. This allows the reader to assume that the chunk size is a power of 2 and replace integer remainder calculations and divisions with masks and shifts respectively. The format is otherwise identical.
This has a good impact when the index is compressed and the accesses are non-contiguous but there are many accesses per chunk: