-
Notifications
You must be signed in to change notification settings - Fork 98
feat: add VarBin compressor to btrblocks string compressor #5360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Add VarBinScheme as a new compression option for string data in the btrblocks compressor. This scheme converts VarBinViewArray to the more compact VarBinArray representation using i32 offsets. The implementation: - Adds VarBinScheme struct and StringCode(1) - Uses ArrayAccessor for efficient iteration during conversion - Estimates compression ratio using sampling - Updates scheme codes sequentially (Dict: 1->2, FSST: 2->3, etc.) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
Update VarBinScheme to use the canonical Arrow conversion path via to_arrow_preferred() instead of manual iterator-based conversion. This approach: - Uses the existing VarBinView -> Arrow conversion - Leverages Arrow's GenericByteArray -> VarBinArray conversion - Removes dependency on VarBinBuilder - Is more consistent with other conversion patterns in the codebase Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
Change VarBinScheme to use into_arrow() with explicit Arrow DataType (Utf8 or Binary) instead of into_arrow_preferred() which returns StringView/BinaryView arrays. This ensures proper conversion: - VarBinViewArray -> Arrow GenericByteArray (Utf8/Binary) - Arrow GenericByteArray -> VarBinArray Also adds arrow-schema dependency to vortex-btrblocks. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
CodSpeed Performance ReportMerging #5360 will not alter performanceComparing Summary
Benchmarks breakdown
Footnotes
|
Move VarBinScheme to StringCode(5) to avoid renumbering existing schemes: - UNCOMPRESSED_SCHEME: 0 (unchanged) - DICT_SCHEME: 1 (unchanged) - FSST_SCHEME: 2 (unchanged) - CONSTANT_SCHEME: 3 (unchanged) - SPARSE_SCHEME: 4 (unchanged) - VARBIN_SCHEME: 5 (new) This preserves backward compatibility with existing compressed data. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
Benchmarks: Random AccessSummary
|
Benchmarks: TPC-H SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: FineWeb NVMeSummary
Detailed Results Table
|
Benchmarks: FineWeb S3Summary
Detailed Results Table
|
Benchmarks: TPC-H SF=1 on S3Summary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on NVMESummary
Detailed Results Table
|
Benchmarks: CompressionSummary
Detailed Results Table
|
Benchmarks: TPC-DS SF=1 on NVMESummary
Detailed Results Table
|
Benchmarks: TPC-H SF=10 on S3Summary
Detailed Results Table
|
vortex-btrblocks/src/string.rs
Outdated
| estimate_compression_ratio_with_sampling( | ||
| self, | ||
| stats, | ||
| is_sample, | ||
| allowed_cascading, | ||
| excludes, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can actually estimate this pretty easily by iterating the views, counting total bytes + offsets. That saves us actually having to materialize
Benchmarks: Clickbench on NVMESummary
Detailed Results Table
|
Benchmarks: Statistical and Population GeneticsSummary
Detailed Results Table
|
Replace sampling-based compression ratio estimation with actual size calculation comparing VarBinView and VarBin representations. Implementation: - VarBinView size: uses nbytes() for accurate measurement including views buffer and all data buffers - VarBin size: calculated from buffer sizes + offset array size (len * 4 bytes for i32 offsets) - Returns varbinview_size / varbin_size ratio This provides a more accurate estimate of compression benefit without the overhead of full conversion during estimation. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
Fix spelling error in variable name. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
Calculate VarBin offset array size based on the total string bytes to determine whether i32 or i64 offsets will be used: - If total string bytes < u32::MAX: use 4 bytes per offset (i32) - Otherwise: use 8 bytes per offset (i64) Offset array size = (array_len + 1) * offset_type_size This matches how VarBinArray::from_vec selects the offset type. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
VarBinScheme is a format conversion utility, not a compression scheme. VarBinView is generally more efficient than VarBin for most workloads, especially with small strings that can be inlined (≤12 bytes). Return 0.0 from expected_compression_ratio to indicate this scheme should not be automatically selected by the compressor. It can still be used explicitly when VarBin format is specifically needed. This fixes test failures where VarBin was being selected and causing incompatibilities with Arrow operations that expect consistent types. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-Authored-By: Claude <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…ferences Updated test expectations to account for fragments returning `string` instead of `string_view`: - Modified test_fragment_to_table to expect `pa.string()` instead of `pa.string_view()` - Updated test_get_fragments to cast fragment tables to match dataset schema before comparison - Fixed doctests in file.py and scan.py to expect `string` type instead of `string_view` This resolves test failures where PyArrow comparison failed due to different string encodings between dataset-level and fragment-level operations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
|
This will always kick in over uncompressed. We need to work out when we should include this. |
Summary
Adds
VarBinSchemeas a new compression option for string data in the btrblocks compressor. This scheme convertsVarBinViewArrayto the more compactVarBinArrayrepresentation using i32 offsets.Changes:
VarBinSchemestruct and registered it withStringCode(1)Schemetrait with compression ratio estimation via samplingArrayAccessorpattern for efficient VarBinView to VarBin conversionImplementation Details:
compressmethod usesVarBinBuilder<i32>for i32 offset arraysArrayAccessor::with_iteratorfor zero-copy iteration when possibleestimate_compression_ratio_with_samplingconsistent with other schemesTest plan
cargo build -p vortex-btrblockscargo clippy --all-targets --all-features -p vortex-btrblockscargo +nightly fmt --allbenchmarktag🤖 Generated with Claude Code