
Consider FSST12 (12 bit codes, avoid escape character) #142

@alamb


I am filing this to capture a potential requirement; I do not currently have a use case for it or a plan to implement it.

There is something referred to as "FSST12" in the FastLanes File Format paper. The only real reference I know of is here: https://github.com/cwida/fsst

FSST12 is an alternative version of FSST that uses 12-bit codes, and hence can encode up to 4096 symbols (each at most 8 bytes long). It does not need an escaping mechanism, as the first 256 codes are single-byte symbols consisting of only that byte. These symbols ensure that FSST12 can always find some symbol matching the next input, but a code is 1.5 bytes (12 bits) and those symbols are 1 byte, so there is still compression loss when that happens (though in FSST8 the penalty for an escape is heavier: a 2x compression loss).
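
To make the "a code is 1.5 bytes" point concrete, here is a minimal sketch of how two 12-bit codes could share three bytes. The function names `pack12` and `unpack12` and the bit order are assumptions for illustration only; I have not checked which layout the cwida implementation actually uses.

```rust
/// Pack 12-bit codes into bytes, two codes per three bytes.
/// Hypothetical layout: low 8 bits of the first code, then a byte shared
/// between the two codes, then the high 8 bits of the second code.
fn pack12(codes: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(codes.len() * 3 / 2 + 1);
    for pair in codes.chunks(2) {
        let a = pair[0] & 0x0FFF;
        // A trailing unpaired code is padded with four zero bits.
        let b = if pair.len() == 2 { pair[1] & 0x0FFF } else { 0 };
        out.push((a & 0xFF) as u8);
        out.push(((a >> 8) as u8) | (((b & 0x0F) as u8) << 4));
        if pair.len() == 2 {
            out.push((b >> 4) as u8);
        }
    }
    out
}

/// Recover `n` 12-bit codes from bytes produced by `pack12`.
fn unpack12(bytes: &[u8], n: usize) -> Vec<u16> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        let base = (i / 2) * 3;
        let code = if i % 2 == 0 {
            (bytes[base] as u16) | (((bytes[base + 1] & 0x0F) as u16) << 8)
        } else {
            ((bytes[base + 1] >> 4) as u16) | ((bytes[base + 2] as u16) << 4)
        };
        out.push(code);
    }
    out
}
```

Under this layout, an input byte that only matches its single-byte symbol costs 12 bits of output (1.5x expansion), whereas an escaped byte in 8-bit FSST costs two bytes (2x expansion).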

At the moment this repo only supports the "classic" 8-bit codes from the paper:

fsst/src/lib.rs

Lines 134 to 147 in 45aae6e

/// A packed type containing a code value, as well as metadata about the symbol referred to by
/// the code.
///
/// Logically, codes can range from 0-255 inclusive. This type holds both the 8-bit code as well as
/// other metadata bit-packed into a `u16`.
///
/// The bottom 8 bits contain EITHER a code for a symbol stored in the table, OR a raw byte.
///
/// The interpretation depends on the 9th bit: when toggled off, the value stores a raw byte, and when
/// toggled on, it stores a code. Thus if you examine the bottom 9 bits of the `u16`, you have an extended
/// code range, where the values 0-255 are raw bytes, and the values 256-510 represent codes 0-254. 511 is
/// a placeholder for the invalid code here.
///
/// Bits 12-15 store the length of the symbol (values ranging from 0-8).
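
For readers unfamiliar with the layout that comment describes, here is a rough sketch of the bit packing. `PackedCode` and its methods are hypothetical names, not the crate's actual API, and storing a length of 1 for raw bytes is my assumption.

```rust
/// Hypothetical stand-in for the packed `u16` described in the doc comment.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedCode(u16);

impl PackedCode {
    /// The 9th bit (bit index 8): set for a symbol-table code, clear for a raw byte.
    const CODE_FLAG: u16 = 1 << 8;

    /// Pack a symbol-table code together with its symbol length (0-8).
    fn from_symbol(code: u8, len: u8) -> Self {
        debug_assert!(len <= 8);
        PackedCode((code as u16) | Self::CODE_FLAG | ((len as u16) << 12))
    }

    /// Pack a raw byte. Assumption: a raw byte is stored with length 1.
    fn from_raw_byte(byte: u8) -> Self {
        PackedCode((byte as u16) | (1 << 12))
    }

    fn is_raw_byte(self) -> bool {
        (self.0 & Self::CODE_FLAG) == 0
    }

    /// Bottom 8 bits: either the code or the raw byte.
    fn value(self) -> u8 {
        (self.0 & 0xFF) as u8
    }

    /// Bottom 9 bits: 0-255 are raw bytes, 256-510 are codes 0-254.
    fn extended(self) -> u16 {
        self.0 & 0x1FF
    }

    /// Bits 12-15: the symbol length.
    fn len(self) -> u8 {
        (self.0 >> 12) as u8
    }
}
```

One observation, offered tentatively: a 12-bit code plus a 4-bit length would fill a `u16` exactly, with no bit left for a raw-byte flag, which seems compatible with FSST12 not needing an escape. Whether that is the right representation for this crate is an open design question.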

It would be interesting to also support FSST12 in this crate.
