Description
I am writing this down to capture a potential requirement; I do not currently have a use case or a plan to implement it.
There is something referred to as "FSST12" in the FastLanes File Format paper. The only real reference I know of is here: https://github.com/cwida/fsst
FSST12 is an alternative version of FSST that uses 12-bit codes, and hence can encode up to 4096 symbols (each at most 8 bytes long). It does not need an escaping mechanism, because the first 256 codes are single-byte symbols consisting of only that byte. These symbols ensure that FSST12 can always find some symbol matching the next input. However, a code is 1.5 bytes (12 bits) and those symbols are 1 byte, so there is still a compression loss when that happens (though in FSST8 the penalty for an escape is heavier: a 2x compression loss).
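To make the 1.5-bytes-per-code cost concrete, here is a minimal sketch of packing 12-bit codes two per three bytes. The function names (`pack_codes12`, `unpack_codes12`) and the bit order are assumptions made for illustration; they are not taken from cwida/fsst or from this crate, and the reference implementation may lay out its output differently.

```rust
/// Pack 12-bit codes into bytes, two codes per three bytes.
/// Assumed layout (hypothetical): within each 24-bit group the first code
/// occupies the low 12 bits and the second code the high 12 bits.
fn pack_codes12(codes: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity((codes.len() * 3 + 1) / 2);
    for pair in codes.chunks(2) {
        let a = u32::from(pair[0] & 0x0FFF);
        let b = u32::from(pair.get(1).copied().unwrap_or(0) & 0x0FFF);
        let group = a | (b << 12);
        out.push(group as u8);
        out.push((group >> 8) as u8);
        // A trailing unpaired code only needs two bytes (4 bits of padding).
        if pair.len() == 2 {
            out.push((group >> 16) as u8);
        }
    }
    out
}

/// Inverse of `pack_codes12`; the caller supplies the number of codes,
/// since a trailing half-group is otherwise ambiguous.
fn unpack_codes12(bytes: &[u8], n_codes: usize) -> Vec<u16> {
    let mut out = Vec::with_capacity(n_codes);
    for i in 0..n_codes {
        let base = (i / 2) * 3;
        let code = if i % 2 == 0 {
            u16::from(bytes[base]) | (u16::from(bytes[base + 1] & 0x0F) << 8)
        } else {
            u16::from(bytes[base + 1] >> 4) | (u16::from(bytes[base + 2]) << 4)
        };
        out.push(code);
    }
    out
}
```

Under this layout, an input byte that only matches one of the 256 single-byte symbols costs 12 bits of output for 8 bits of input, which is the 1.5x loss described above.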
At the moment this repo only supports the "classic" 8-bit codes from the paper:
Lines 134 to 147 in 45aae6e
/// A packed type containing a code value, as well as metadata about the symbol referred to by
/// the code.
///
/// Logically, codes can range from 0-255 inclusive. This type holds both the 8-bit code as well as
/// other metadata bit-packed into a `u16`.
///
/// The bottom 8 bits contain EITHER a code for a symbol stored in the table, OR a raw byte.
///
/// The interpretation depends on the 9th bit: when toggled off, the value stores a raw byte, and when
/// toggled on, it stores a code. Thus if you examine the bottom 9 bits of the `u16`, you have an extended
/// code range, where the values 0-255 are raw bytes, and the values 256-510 represent codes 0-254. 511 is
/// a placeholder for the invalid code here.
///
/// Bits 12-15 store the length of the symbol (values ranging from 0-8).
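As a reading aid, the layout described in that doc comment could be sketched roughly as follows. `PackedCode` and its methods are hypothetical names used only for illustration, not the crate's actual type or API, and the choice to record a length of 1 for raw bytes is an assumption.

```rust
/// Hypothetical stand-in for the crate's packed code type (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedCode(u16);

impl PackedCode {
    /// The 9th bit (bit index 8): set for a table code, clear for a raw byte.
    const CODE_FLAG: u16 = 1 << 8;

    /// A code for a symbol in the table, with the symbol length in bits 12-15.
    fn from_symbol(code: u8, len: u8) -> Self {
        debug_assert!(len <= 8);
        PackedCode(u16::from(code) | Self::CODE_FLAG | (u16::from(len) << 12))
    }

    /// A raw byte; storing length 1 here is an assumption.
    fn from_raw_byte(byte: u8) -> Self {
        PackedCode(u16::from(byte) | (1 << 12))
    }

    /// Bottom 9 bits: 0-255 are raw bytes, 256-510 are codes 0-254.
    fn extended_code(self) -> u16 {
        self.0 & 0x01FF
    }

    fn is_raw_byte(self) -> bool {
        (self.0 & Self::CODE_FLAG) == 0
    }

    /// Symbol length stored in bits 12-15.
    fn len(self) -> u8 {
        (self.0 >> 12) as u8
    }
}
```

With this layout only the bottom 9 bits carry the code, so a 12-bit code would not fit without changes. One possible option, stated here only as an observation, is that a 12-bit code plus a 4-bit length fills a `u16` exactly, and FSST12 would not need the raw-byte flag because it has no escapes.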
It would be interesting to also support FSST12 in this crate.