docs/src/format/file/encoding.md: 55 additions & 3 deletions
@@ -548,6 +548,19 @@ options. However, they can also be set in the field metadata in the schema.
### Configuration Details
+#### Compression Scheme
+
+The `lance-encoding:compression` setting selects the general-purpose compression algorithm to apply. Available schemes:
+
+- **`lz4`**: Fast compression with good compression ratios. Runs in the LZ4 library's default mode, which is currently fast mode.
+- **`zstd`**: High compression ratios with configurable levels (0-22). Compresses better than `lz4` but is slower.
+- **`none`**: No general compression is applied (default).
+- **`fsst`**: Fast Static Symbol Table compression for string data.
+
+General compression is applied on top of other encoding techniques (RLE, BSS, bitpacking, etc.) to further reduce
+data size. For mini-block layouts, compression is applied to entire mini-blocks. For full-zip layouts with large values
+(≥32KiB), compression is automatically applied per-value.
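
To make the scheme setting concrete, here is a minimal sketch of building and validating the metadata entry for a field. The helper `compression_metadata` is hypothetical and not part of Lance; the `lance-encoding:compression-level` key name and the use of string values are assumptions that should be checked against the Lance documentation.

```python
# Hypothetical helper for illustration only; it is not part of Lance's API.
COMPRESSION_KEY = "lance-encoding:compression"
# Assumed key name for the level setting; verify it before relying on it.
COMPRESSION_LEVEL_KEY = "lance-encoding:compression-level"

# Schemes listed in the section above.
SCHEMES = {"none", "lz4", "zstd", "fsst"}
# zstd levels 0-22 inclusive, per the list above.
ZSTD_LEVELS = range(0, 23)


def compression_metadata(scheme="none", level=None):
    """Build a field-metadata dict that selects a compression scheme."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown compression scheme: {scheme!r}")
    metadata = {COMPRESSION_KEY: scheme}
    if level is not None:
        # Only zstd is described as having a configurable level.
        if scheme != "zstd":
            raise ValueError(f"{scheme!r} does not take a compression level")
        if level not in ZSTD_LEVELS:
            raise ValueError("zstd level must be in the range 0-22")
        # Assumption: metadata values are stored as strings.
        metadata[COMPRESSION_LEVEL_KEY] = str(level)
    return metadata


print(compression_metadata("zstd", level=3))
```

A dictionary like this could then be attached to a schema field, for example through the `metadata` argument of `pyarrow.field`, if the schema is built with PyArrow.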
#### Compression Level
The compression level is scheme-dependent. The following schemes currently support these levels:
@@ -557,20 +570,52 @@ The compression level is scheme dependent. Currently the following schemes suppo
|`zstd`|[`zstd`](https://crates.io/crates/zstd)|`0-22`|`crate dependent` (3 as of this writing) |
|`lz4`|[`lz4`](https://crates.io/crates/lz4)| N/A | The LZ4 crate has two modes (fast and high compression) and currently this is not exposed to configuration. The LZ4 crate wraps a C library and the default is dependent on the C library. The default as of this writing is fast |
-#### Run Length Encoding Threshold
+Higher compression levels generally provide better compression at the cost of slower encoding speed. Decoding speed
+is typically less affected by the compression level.
+
+#### Run Length Encoding (RLE) Threshold
The RLE threshold determines whether run-length encoding is applied. The encoder computes a ratio by dividing
the number of runs by the number of values. If the ratio is less than the threshold, run-length encoding is
applied. The default threshold is 0.5, which means run-length encoding is applied if the number of runs
is less than half the number of values.
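
The decision rule described above can be sketched in a few lines. This is an illustration of the ratio test only, not Lance's implementation; the names `count_runs` and `should_use_rle` are hypothetical, and returning `False` for empty input is an assumption.

```python
from itertools import groupby

DEFAULT_RLE_THRESHOLD = 0.5  # default stated in the text above


def count_runs(values):
    """Count maximal runs of consecutive equal values."""
    return sum(1 for _ in groupby(values))


def should_use_rle(values, threshold=DEFAULT_RLE_THRESHOLD):
    """Return True when run_count / num_values is below the threshold."""
    if not values:
        # Assumption: with no values there is nothing to run-length encode.
        return False
    return count_runs(values) / len(values) < threshold


# Two runs over eight values gives a ratio of 0.25, below the default 0.5.
print(should_use_rle([7, 7, 7, 7, 8, 8, 8, 8]))
# Four runs over four values gives a ratio of 1.0, so RLE is not selected.
print(should_use_rle([1, 2, 3, 4]))
```

Because the comparison is strict, data whose ratio is exactly equal to the threshold (for example `[1, 1, 2, 2]` with the default of 0.5) is not run-length encoded under this reading.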
+**Key points:**
+- RLE is automatically selected when data has sufficient repetition (`run_count / num_values < threshold`)