Thrown together implementation of LLMZip from the paper https://arxiv.org/pdf/2306.04050.pdf
It is a lossless text compression algorithm that uses large language models. Large language models are good at compression due to their natural ability to predict the next token in a sequence. LLMZip uses this ability to compress text.
It works by tokenizing the input taking a sequence of the first 4 tokens and evaluating the generated logits for the next token. The logits are then sorted and used to determine the rank of the next token. Storing each rank in a sequence of ranks. Taking the original 4 tokens and the generated sequence of ranks this can then be compressed with traditional methods.
Decompression must be performed under the same model and conditions. Re-evaluating the original 4 tokens and selecting the logit by index from the sequence of ranks for that epoch. Storing this token into a token sequence which to be detokenized back into the original text lossless.
Different models will produce different results.
# prompt:
"""
BSD Zero Clause License
Copyright (c) [2023] [thefatcheetah]
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.A
"""
# model & outputs : bytes
original text : 672
text brotli compression : 351
# sequence of ranks > brotli compression : bytes
> openhermes-2.5-mistral-7b.Q4_K_M.gguf : 196
> openhermes-2.5-mistral-7b.Q2_K.gguf : 187
> juanako-7b-una.Q2_K.gguf : 179
> juanako-7b-una.Q4_K_M.gguf : 170
> open-llama-3b-v2-wizard-evol-instuct-v2-196k.Q3_K_S.gguf : 161
# ranks variable length encdoded bytes > brotli compression : bytes
> open-llama-3b-v2-wizard-evol-instuct-v2-196k.Q3_K_S.gguf : 123 *
https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html
Lossless Text Compression using Large Language Models - https://arxiv.org/abs/2306.04050- Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai
Variable-length integer encoding is a method of encoding integers in a way that uses less space. One common form of variable-length integer encoding is "Base 128 Varint" used in Google's Protocol Buffers. In this encoding, each byte of the integer is stored in 7 bits of an output byte, and the 8th bit is used as a continuation bit. The continuation bit is set to 1 for all bytes except the last, which signals the end of the number.