Skip to content

Ed-von-Schleck/shoco

Repository files navigation

shoco

shoco compresses short strings to be even smaller. It is very fast, but the most remarkable property is: If your string is ASCII, shoco guarantees that the compressed string will never be bigger than the input string. In fact, an ASCII string is proper input for the decompressor (and of course, it decompresses to the exact same string). shoco comes with sane defaults, but includes the tools to train the compressor for your specific type of input data. The maximum compression rate is 50%, but usually in the range of 30%-35% (although if you have very special data, and trained shoco accordingly, it might even be better).

shoco's focus lies less on maximum compression (on might want to device an arithmetic coder or something similar to achieve this), but on speed and simplicity of code. Even when compressing and decompressing millions of strings, you should not experience a noticable impact as compared to using plain strings.

Installation

Simply copy shoco.c, shoco.h and shoco_table.h over to your project. Then, include shoco.h to use it.

The standard table is trained for optimal compression of common english words. If that isn't your typical input, you might want to copy one of the other table files from the tables directory and rename it to shoco_table.h. If none of the tables work well for you, you can use the generate_successor_table.py script to generate a table file that's better suited to your input (provided you have good training data). For details on table generation, see below.

API

size_t shoco_compress(const char * const in, size_t len, char * const out, size_t bufsize);
size_t shoco_decompress(const char * const in, size_t len, char * const out, size_t bufsize);

That's it. If the len argument for shoco_compress is 0, the input char is assumed to be \0-terminated. If it's a positive integer, parsing the input will stop after this length, or at a \0-char, whatever comes first. shoco_decompress however will need a positive integer for len (most likely you should pass the return value of shoco_compress).

The return value is the number of bytes written. If it is less than bufsize, all is well. In case of decompression, a \0-terminator is written. If the return value is exactly bufsize, the output is all there, but not \0-terminated. It is up to you to decide if that's an error or not. If the buffer is not large enough for the output, the return value will be bufsize + 1. You might want to allocate a bigger output buffer. The compressed string will never be \0-terminated.

If you are sure that the input data is plain ASCII, your out buffer for shoco_compress only needs to be as large as the input string. Otherwise, the output buffer may need to be up to 2x as large as the input, if it's a 1-byte encoding, or even larger for multi-byte encodings like UTF-8.

For the standard values of shoco, maximum compression is 50%, so the out buffer for shoco_decompress needs to be a maximum of twice the size of the compressed string.

Generating Tables

It's easy to generate tables suited for your kind of data: shoco comes with a script that takes your training data (one or more files, or stdin if none are provided), and outputs a header file suitable as a replacement for the included shoco_table.h. An example that trains against a dictionary (btw., not the best kind of training data, because it's dominated by uncommon words):

$ ./generate_successor_table.py /usr/share/dict/words -o shoco_table.h

There are options on how to chunk and strip the input data – for example, if we want to train shoco with the words in this readme, but without punctuation and whitespace, we could do

./generate_successor_table.py README.md --split=whitespace --strip=punctuation

Since we haven't specified an output file, the resulting table file is printed on stdout.

generate_successor_table.py --help prints a friendly help message:

usage: generate_successor_table.py [-h] [-o OUTPUT]
                                   [--split {newline,whitespace,none}]
                                   [--strip {whitespace,punctuation,none}]
                                   [--max-leading-char-bits MAX_LEADING_CHAR_BITS]
                                   [--max-successor-bits MAX_SUCCESSOR_BITS]
                                   [--encoding-types {1,2,3}]
                                   [--optimize-encoding]
                                   [file [file ...]]

Generate a succession table for 'shoco'.

positional arguments:
  file                  The training data file(s). If no input file is
                        specified, the input is read from STDIN.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file for the resulting succession table.
  --split {newline,whitespace,none}
                        Split the input into chunks at this separator.
                        Default: newline
  --strip {whitespace,punctuation,none}
                        Remove leading and trailing characters from each
                        chunk. Default: whitespace

table and encoding generation arguments:
  Higher values may provide for better compression ratios, but will make
  compression/decompression slower. Likewise, lower numbers make
  compression/decompression faster, but will likely make hurt the
  compression ratio. The default values are mostly a good compromise.

  --max-leading-char-bits MAX_LEADING_CHAR_BITS
                        The maximum amount of bits that may be used for
                        representing a leading character. Default: 5
  --max-successor-bits MAX_SUCCESSOR_BITS
                        The maximum amount of bits that may be used for
                        representing a successor character. Default: 4
  --encoding-types {1,2,3}
                        The number of different encoding schemes. If your
                        input strings are very short, consider lower values.
                        Default: 3
  --optimize-encoding   Find the optimal packing structure for the training
                        data. This rarely leads to different results than the
                        default values, and it is *slow*. Use it for very
                        unusual input strings, or when you use non-default
                        table generation arguments.

Since generating tables can be slow if your input data is large, and especially so if you use the --otimize-encoding option, using pypy can significantly speed up the process.

Comparison With Other Compressors

smaz

There's another good small string compressor out there: smaz. Smaz seems to be dictionary based, while shoco is an entropy encoder. As a result, smaz will often do better than shoco when compressing common english terms. However, shoco typically beats smaz for more obscure input, as long as it's ASCII. Smaz may enlarge your string for uncommon words (like numbers), shoco will never do that for ASCII strings.

Performance-wise, shoco is typically faster by at least a factor of 2. As an example, compressing and decompressing all words in /usr/dict/share/words with smaz takes around 0.325s on my computer and compresses on average by 28%, while shoco has a compression average of 33% (with the standard table; an optimized table will be even better) and takes around 0.140s. shoco is especially fast at decompression.

shoco can be trained with user data, while smaz's dictionary is built-in. That said, the maximum compression rate of smaz is hard to reach for shoco, so depending on your input type, you might fare better with smaz (there's no way around it: You have to measure it yourself).

gzip, xz

shoco's compression ratio can't (and doesn't want to) compete with gzip et al. for string sizes bigger than, say, a hundred bytes or so. But for strings up to that size, shoco is a good contender, and for very very small strings, it will always be better than standard compressors.

The performance of shoco should always be several times faster than about any standard compression tool. For testing purposes, there's a binary called shoco included that compresses and decompresses single files. The following timings were made with this command line tool. The data is /usr/share/dict/words (size: 4,953,680), compressing it as a whole (not a strong point of shoco):

               | shoco     | gzip      | xz

-------------------|-----------|-----------|------- compression time | 0.070s | 0.470s | 3.300s decompression time | 0.010s | 0.048s | 0.148s compressed size | 3,393,975 | 1,476,083 | 1,229,980

This demonstates quite clearly that shoco's compression rate sucks, but also that it's very fast.

Use Cases

As of now, there are no known uses of shoco in real-life projects. If you do use shoco, I would love to hear about it! Possible use cases might include i18n tools like gettext (strings appearing in GUIs tend to be rather short, and should compress quite well), or transfering short messages over a slow network (Twitter?), especially if the cpus on either side are too undepowered to run full-blown compressors (like embedded devices sometimes are).

To Do

shoco is stable, and it works well – but I'd have only tested it with gcc on x86_64 Linux. Feedback on how it runs on other OSes, compilers and architectures would be highly appreciated! If it fails, it's a bug (and given the size of the project, it should be easy to fix). Other than that, there's a few issues that could stand some improvements:

  • There should be more tests, because there's never enough tests. Ever. Patches are very welcome!
  • Tests should include table generation. As that involves re-compilation, these should probably written as a Makefile, or in bash or Python (maybe using ctypes to call the shoco-functions directly).
  • The Python script for table generation should see some clean-up, as well as documentation. Also it should utilize all cpu cores (presumably via the multiprocess-module). This is a good task for new contributers!
  • Again for table generation: Investigate why pypy isn't as fast as should be expected (jitviewer might be of help here).
  • Generate Javascript code with emscripten, so that shoco can be used with web services.
  • The current SSE2 optimization is probably not optimal. Anyone who loves to tinker with these kinds of micro-optimizations is invited to try his or her hand here.

Feedback

If you use shoco, or like it for whatever reason, I'd really love to hear from you! Also, a nice way of saying thanks is to support me financially via git tip or flattr.

License

shoco is published under the MIT license; see the LICENSE file for details.

Authors

shoco is written by Christian Schramm.