Commit b22a548

docs: add docs for newly added options (#4976)
This PR adds docs for the encoding options newly added in the 2.1 format.
1 parent 17aa141 commit b22a548

1 file changed

+55
-3
lines changed

docs/src/format/file/encoding.md

Lines changed: 55 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -548,6 +548,19 @@ options. However, they can also be set in the field metadata in the schema.
548548

549549
### Configuration Details
550550

551+
#### Compression Scheme
552+
553+
The `lance-encoding:compression` setting selects the general-purpose compression algorithm to apply. Available schemes:
554+
555+
- **`lz4`**: Fast compression with good compression ratios. Uses the LZ4 fast mode by default.
556+
- **`zstd`**: High compression ratios with configurable levels (0-22). Better compression than LZ4 but slower.
557+
- **`none`**: No general compression applied (default).
558+
- **`fsst`**: Fast Static Symbol Table compression for string data.
559+
560+
General compression is applied on top of other encoding techniques (RLE, BSS, bitpacking, etc.) to further reduce
561+
data size. For mini-block layouts, compression is applied to entire mini-blocks. For full-zip layouts with large values
562+
(≥32KiB), compression is automatically applied per-value.
563+
551564
#### Compression Level
552565

553566
The compression level is scheme dependent. The schemes currently support the following levels:
@@ -557,20 +570,52 @@ The compression level is scheme dependent. Currently the following schemes suppo
557570
| `zstd` | [`zstd`](https://crates.io/crates/zstd) | `0-22` | `crate dependent` (3 as of this writing) |
558571
| `lz4` | [`lz4`](https://crates.io/crates/lz4) | N/A | The LZ4 crate has two modes (fast and high compression) and currently this is not exposed to configuration. The LZ4 crate wraps a C library and the default is dependent on the C library. The default as of this writing is fast |
559572

560-
#### Run Length Encoding Threshold
573+
Higher compression levels generally provide better compression at the cost of slower encoding speed. Decoding speed
574+
is typically less affected by the compression level.
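
As a rough illustration of how the scheme and level fit together, the sketch below builds a field-metadata dictionary and rejects combinations the table above rules out. The helper `compression_metadata` is hypothetical, and the `lance-encoding:compression-level` key name is an assumption that this section does not confirm; only `lance-encoding:compression`, the scheme names, and the `0-22` zstd range come from the text.

```python
# Hypothetical helper, not part of any Lance API.
VALID_SCHEMES = {"lz4", "zstd", "none", "fsst"}
ZSTD_LEVELS = range(0, 23)  # 0-22 inclusive, per the table above


def compression_metadata(scheme, level=None):
    """Build field metadata for a compression scheme and optional level."""
    if scheme not in VALID_SCHEMES:
        raise ValueError(f"unknown compression scheme: {scheme!r}")
    metadata = {"lance-encoding:compression": scheme}
    if level is not None:
        # Only zstd exposes a configurable level; lz4 is listed as N/A.
        if scheme != "zstd":
            raise ValueError("only zstd has a configurable level")
        if level not in ZSTD_LEVELS:
            raise ValueError("zstd level must be in the range 0-22")
        # Assumed key name, labeled as such; metadata values are strings.
        metadata["lance-encoding:compression-level"] = str(level)
    return metadata
```

A dictionary like this could then be attached to a field in the schema, since the options can be set in field metadata.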
575+
576+
#### Run Length Encoding (RLE) Threshold
561577

562578
The RLE threshold is used to determine whether or not to apply run-length encoding. The threshold is a ratio
563579
calculated by dividing the number of runs by the number of values. If the ratio is less than the threshold then
564580
we apply run-length encoding. The default is 0.5 which means we apply run-length encoding if the number of runs
565581
is less than half the number of values.
566582

583+
**Key points:**
584+
- RLE is automatically selected when data has sufficient repetition (run_count / num_values < threshold)
585+
- Supported types: All fixed-width primitives (u8, i8, u16, i16, u32, i32, f32, u64, i64, f64)
586+
- Maximum chunk size: 2048 values per mini-block
587+
- Setting threshold to `0.0` effectively disables RLE
588+
- Setting threshold to `1.0` makes RLE very aggressive (used whenever any runs exist)
589+
590+
RLE is particularly effective for:
591+
- Sorted or partially sorted data
592+
- Columns with many repeated values (status codes, categories, etc.)
593+
- Low-cardinality columns
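
The threshold rule above can be sketched in a few lines of Python. This is a minimal illustration of the `run_count / num_values < threshold` check, not the actual encoder; the function names are hypothetical.

```python
def count_runs(values):
    """Count maximal runs of equal adjacent values."""
    if not values:
        return 0
    return 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)


def should_use_rle(values, threshold=0.5):
    """Apply the ratio test: runs / values must be below the threshold."""
    if not values:
        return False
    return count_runs(values) / len(values) < threshold


def rle_encode(values):
    """Collapse each run into a (value, run_length) pair."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return [(v, n) for v, n in encoded]


status_codes = [200] * 6 + [404] * 2 + [200] * 2  # 3 runs over 10 values
print(should_use_rle(status_codes))  # ratio 0.3 is below the 0.5 default
print(rle_encode(status_codes))
```

With a threshold of `0.0` the ratio can never be below the threshold, so the check always fails; with `1.0` it passes as soon as any run is longer than one value.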
594+
567595
#### Byte Stream Split (BSS)
568596

569597
The configuration variable for BSS is a simple enum. A value of `off` means to never apply BSS, a value of `on`
570598
means to always apply BSS, and a value of `auto` means to apply BSS based on an entropy calculation (see code for
571599
details).
572600

573-
BSS is only applied when the `lance-encoding:compression` variable is also set (to a non-`none` value).
601+
**Important:** BSS is only applied when the `lance-encoding:compression` variable is also set (to a non-`none` value).
602+
BSS is a data transformation that makes floating-point data more compressible; it does not reduce size on its own.
603+
604+
**Key points:**
605+
- Supported types: Only 32-bit and 64-bit data (f32, f64, timestamps)
606+
- Maximum chunk sizes: 1024 values (f32), 512 values (f64)
607+
- `auto` mode: Uses entropy analysis with 0.5 sensitivity threshold
608+
- `on` mode: Always applies BSS for supported types
609+
- `off` mode: Never applies BSS
610+
611+
BSS works by splitting multi-byte values by byte position, creating separate byte streams. This clusters similar
612+
bits together (especially mantissa bits in floating-point numbers), which general compression algorithms can then
613+
compress more effectively.
614+
615+
BSS is particularly effective for:
616+
- Floating-point measurements with similar ranges
617+
- Time-series data with consistent precision
618+
- Scientific data with correlated mantissa patterns
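
To make the byte-splitting step concrete, here is a minimal sketch of the transformation using only the Python standard library. It is not the Lance implementation: the function names are hypothetical, it ignores chunking and the `auto` entropy check, and `zlib` stands in for `lz4`/`zstd` purely because it ships with Python.

```python
import struct
import zlib


def bss_encode(values, fmt="f"):
    """Split little-endian values into one stream per byte position."""
    width = struct.calcsize("<" + fmt)
    raw = struct.pack(f"<{len(values)}{fmt}", *values)
    # Stream k holds byte k of every value; streams are concatenated.
    return b"".join(raw[k::width] for k in range(width))


def bss_decode(data, fmt="f"):
    """Interleave the byte streams back into the original values."""
    width = struct.calcsize("<" + fmt)
    count = len(data) // width
    raw = bytearray(len(data))
    for k in range(width):
        raw[k::width] = data[k * count:(k + 1) * count]
    return list(struct.unpack(f"<{count}{fmt}", bytes(raw)))


def compressed_size(data):
    """Size after general-purpose compression (zlib as a stand-in)."""
    return len(zlib.compress(data))
```

The output of `bss_encode` has exactly the same length as the input, which is why BSS only pays off when a general-purpose compressor runs afterwards. Comparing `compressed_size` on the raw packed bytes and on the split bytes for a given column is one way to see whether the transformation helps for that data.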
574619

575620
#### Dictionary Divisor
576621

@@ -580,8 +625,15 @@ threshold. If the number of unique values is less than the threshold then we app
580625
configuration variable defines the divisor that we apply and it defaults to 2 which means we apply dictionary
581626
encoding if we estimate that less than half the values are unique.
582627

628+
Dictionary encoding is effective for columns with low cardinality where the same values repeat many times.
629+
The dictionary is stored once per page and indices are stored in place of the actual values.
630+
583631
This is likely to change in future versions.
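
The divisor rule described above can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the sketch counts unique values exactly, whereas the text says the encoder works from an estimate.

```python
def should_dictionary_encode(values, divisor=2):
    """Use a dictionary if unique values < len(values) / divisor."""
    if not values:
        return False
    threshold = len(values) / divisor
    # Exact count here; the real encoder estimates cardinality.
    return len(set(values)) < threshold


def dictionary_encode(values):
    """Return (dictionary, indices), storing each distinct value once."""
    dictionary = []
    positions = {}
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


colors = ["red", "green", "red", "red", "green", "blue", "red", "red"]
if should_dictionary_encode(colors):  # 3 unique values < 8 / 2
    dictionary, indices = dictionary_encode(colors)
    print(dictionary, indices)
```

A larger divisor lowers the threshold, so dictionary encoding is chosen only for columns with fewer distinct values.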
584632

585633
#### Packed Struct Encoding
586634

587-
Packed struct encoding is a semi-structural transformation described above.
635+
Packed struct encoding is a semi-structural transformation described above. When enabled, struct values are stored
636+
in row-major format rather than the default columnar format. This reduces the number of I/O operations needed for
637+
random access but prevents reading individual fields independently.
638+
639+
This is always opt-in and should only be used when all struct fields are typically accessed together.
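
The trade-off between the two layouts can be illustrated with a small sketch. The struct shape (`id: i32`, `score: f32`) and all function names are hypothetical, and the byte layout is only a stand-in for the idea of row-major versus columnar storage, not the on-disk Lance format.

```python
import struct

# Hypothetical struct {id: i32, score: f32}, little-endian, no padding.
ROW = struct.Struct("<if")


def pack_row_major(rows):
    """Store all fields of a row next to each other (packed layout)."""
    return b"".join(ROW.pack(i, s) for i, s in rows)


def pack_columnar(rows):
    """Store each field in its own buffer (default columnar layout)."""
    ids = struct.pack(f"<{len(rows)}i", *(r[0] for r in rows))
    scores = struct.pack(f"<{len(rows)}f", *(r[1] for r in rows))
    return ids, scores


def read_row(packed, index):
    """Fetch a whole row with a single contiguous read."""
    return ROW.unpack_from(packed, index * ROW.size)
```

In the row-major buffer, `read_row` touches one contiguous 8-byte range per row, while the columnar layout needs one read per field buffer. The columnar layout, in turn, lets a reader load `ids` without touching `scores` at all.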
