Detection-based compression that finds patterns zstd, brotli, and bzip2 miss.
Store the formula, not the data.
mzip detects mathematical structure in your data and compresses it optimally. Where other compressors see bytes, mzip sees patterns.
| Pattern Detected | mzip | Best Alternative | Advantage |
|---|---|---|---|
| Sequential IDs (1, 2, 3, ...) | 32 bytes | bzip2: 3.4KB | 106x better |
| Repeating templates (JSON APIs) | 10KB | brotli: 49KB | 4.9x better |
| Audio PCM waveforms | 2.1KB | bzip2: 4KB | 2x better |
| Image gradients | 124 bytes | brotli: 397B | 3.2x better |
Result: 75% win rate across 250 tests (50 data types × 5 sizes) against 8 compressors including brotli, bzip2, xz, 7z, and rar.
| Data Category | Win Rate | Why |
|---|---|---|
| Numeric sequences | 100% | Formula compression: v[i] = a + b*i beats any LZ77 |
| Structured JSON/XML | 90% | Template extraction captures repeating structure |
| Audio/sensor data | 100% | Delta encoding exploits temporal correlation |
| Log files | 80% | Columnar separation + BWT on each column |
| Large files (>256KB) | 86% | More data = more patterns to detect |
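
To make the delta-encoding row concrete, here is a minimal sketch of the general technique. It is not mzip's implementation: `delta_encode` and `delta_decode` are hypothetical names, and signed 16-bit PCM samples are an assumption. Each sample is replaced by its difference from the previous one, so a smooth waveform becomes a run of small numbers that an entropy coder can pack tightly.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Replace each sample with its difference from the previous sample.
// Deltas are widened to 32 bits so the subtraction can never overflow.
std::vector<std::int32_t> delta_encode(const std::vector<std::int16_t>& samples) {
    std::vector<std::int32_t> deltas;
    deltas.reserve(samples.size());
    std::int32_t prev = 0;  // implicit sample before the first one
    for (std::int16_t s : samples) {
        deltas.push_back(s - prev);
        prev = s;
    }
    return deltas;
}

// Inverse of delta_encode: a running sum restores the original samples.
std::vector<std::int16_t> delta_decode(const std::vector<std::int32_t>& deltas) {
    std::vector<std::int16_t> samples;
    samples.reserve(deltas.size());
    std::int32_t prev = 0;
    for (std::int32_t d : deltas) {
        prev += d;
        // Deltas produced by delta_encode always sum back into the 16-bit range.
        assert(prev >= std::numeric_limits<std::int16_t>::min() &&
               prev <= std::numeric_limits<std::int16_t>::max());
        samples.push_back(static_cast<std::int16_t>(prev));
    }
    return samples;
}
```

For the samples 100, 104, 109, 113, 110 the encoder emits 100, 4, 5, 4, -3. A production coder would store the deltas in a narrower or variable-length form; the 32-bit output here only keeps the sketch free of overflow handling.
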
| Scenario | Winner | Why |
|---|---|---|
| Text/code (small margins) | bzip2 | BWT tuning differences (typically <100 bytes) |
| Random/encrypted data | zstd | No patterns to detect, just raw entropy coding |
Most compressors treat all data as an opaque stream of bytes. But real data has structure:
| Pattern | Example | What mzip does |
|---|---|---|
| Sequential values | 1, 2, 3, 4, ... | Store formula v[i] = start + i × step |
| Repeating templates | Same function 100x with different IDs | Store template once + variable list |
| Columnar data | Log files with fixed columns | Separate columns, compress each optimally |
| Audio samples | Smooth waveforms | Delta encoding exploits sample-to-sample correlation |
zstd-19 compresses 1MB of sequential IDs to 8KB. mzip compresses it to 32 bytes.
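
The idea behind "store the formula" can be sketched in a few lines. This is an illustration under stated assumptions, not mzip's detector: `LinearFormula`, `fit_linear`, and `expand` are hypothetical names, values are assumed to be 64-bit integers, and differences are assumed not to overflow. If every value satisfies v[i] = start + i × step, three numbers are enough to rebuild the whole array, whatever its length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

// Hypothetical descriptor for a sequence v[i] = start + i * step.
struct LinearFormula {
    std::int64_t start;
    std::int64_t step;
    std::size_t count;
};

// Returns the formula if every value fits v[i] = start + i * step,
// otherwise std::nullopt (the caller would fall back to another strategy).
std::optional<LinearFormula> fit_linear(const std::vector<std::int64_t>& v) {
    if (v.size() < 2) return std::nullopt;  // too short to infer a step
    const std::int64_t step = v[1] - v[0];
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] - v[i - 1] != step) return std::nullopt;
    }
    return LinearFormula{v[0], step, v.size()};
}

// Rebuilds the original values from the stored formula.
std::vector<std::int64_t> expand(const LinearFormula& f) {
    std::vector<std::int64_t> out;
    out.reserve(f.count);
    for (std::size_t i = 0; i < f.count; ++i) {
        out.push_back(f.start + static_cast<std::int64_t>(i) * f.step);
    }
    return out;
}

// Usage sketch: 1, 2, 3, ..., 1000 collapses to three numbers and expands back.
void linear_round_trip_demo() {
    std::vector<std::int64_t> ids(1000);
    std::iota(ids.begin(), ids.end(), 1);
    const auto formula = fit_linear(ids);
    assert(formula && formula->step == 1);
    assert(expand(*formula) == ids);
}
```

The stored size is constant in the input length, which is why the ratio for sequential data keeps growing as the file gets larger.
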
All benchmarks run on synthetic data generated by generators.hpp. Click sample links to download the exact input/output files.
| Compressor | Avg Ratio | Range | MB/s | Wins | Win% | Score | Rank |
|---|---|---|---|---|---|---|---|
| mzip | 8.16x | 1.0-32768x | 0.6 | 188 | 75.2% | 153.6 | 1 |
| bzip2:9 | 5.66x | 1.0-1001x | 0.6 | 63 | 25.2% | 39.5 | 2 |
| zstd:19 | 5.14x | 1.0-2641x | 1.4 | 30 | 12.0% | 21.3 | 3 |
| rar:m5 | 5.97x | 1.0-1014x | 2.6 | 0 | 0.0% | 6.6 | 4 |
| xz:9 | 5.89x | 1.0-997x | 2.3 | 0 | 0.0% | 6.4 | 5 |
| 7z:mx9 | 5.88x | 1.0-922x | 2.3 | 0 | 0.0% | 6.4 | 6 |
| gzip:9 | 4.78x | 1.0-240x | 0.8 | 0 | 0.0% | 4.7 | 7 |
Score = ratio × speed^0.1 × (1 + 0.1×wins). Total input: 66.60 MB. lz4 and snappy are excluded because they target speed rather than ratio.
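
The score formula can be checked against the table with a small helper. The function name `score` and its parameter names are assumptions for illustration; the formula itself is the one stated above, with `wins` taken as the raw win count.

```cpp
#include <cassert>
#include <cmath>

// Score = ratio × speed^0.1 × (1 + 0.1 × wins).
// avg_ratio:  average compression ratio (e.g. 8.16 for 8.16x)
// speed_mb_s: throughput in MB/s, must be positive
// wins:       number of tests won
double score(double avg_ratio, double speed_mb_s, int wins) {
    assert(speed_mb_s > 0.0);  // a fractional power of a non-positive speed is not meaningful
    return avg_ratio * std::pow(speed_mb_s, 0.1) * (1.0 + 0.1 * wins);
}
```

Plugging in the mzip row (8.16, 0.6, 188) gives about 153.5, in line with the 153.6 listed; the small gap is presumably because the table inputs are rounded. The exponent 0.1 means speed barely moves the score, while each win adds 10% of the base value, so the ranking is dominated by win count.
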
| Compressor | Time (ms) | Speed (MB/s) |
|---|---|---|
| zstd | 96.4 | 690.8 |
| mzip | 3285.2 | 20.3 |
zstd decompresses 34.1x faster than mzip (96.4 ms vs 3285.2 ms).
| Size | Wins | Total | Win% |
|---|---|---|---|
| 4KB | 44 | 50 | 88.0% |
| 16KB | 27 | 50 | 54.0% |
| 64KB | 30 | 50 | 60.0% |
| 256KB | 38 | 50 | 76.0% |
| 1MB | 49 | 50 | 98.0% |
| Type | mzip | 2nd Best | Advantage |
|---|---|---|---|
| Database IDs (1MB) | 32B (32768x) | 3.4KB | 106.8x better |
| Timestamps (1MB) | 32B (32768x) | 2.7KB | 84.2x better |
| Database IDs (256KB) | 32B (8192x) | 937B | 29.3x better |
| Timestamps (256KB) | 32B (8192x) | 772B | 24.1x better |
| Database IDs (64KB) | 32B (2048x) | 301B | 9.4x better |
| Timestamps (64KB) | 32B (2048x) | 287B | 9.0x better |
| Image gradient (256KB) | 53B (4946x) | 323B | 6.1x better |
| Image gradient (64KB) | 39B (1680x) | 212B | 5.4x better |
| Timestamps (16KB) | 32B (512x) | 160B | 5.0x better |
| JSON API (1MB) | 10KB (104x) | 49KB | 4.9x better |
bzip2's BWT implementation occasionally beats mzip by small margins on text/code files.
| Type | mzip | Best | Gap |
|---|---|---|---|
| Metrics (1MB) | 121KB | bzip2: 120KB | +1KB |
| Nginx log (256KB) | 23.7KB | bzip2: 23.5KB | +250B |
| .env file (256KB) | 73KB | bzip2: 73KB | +184B |
| CSS (64KB) | 4.3KB | bzip2: 4.2KB | +106B |
| TOML config (4KB) | 1064B | bzip2: 969B | +95B |
| Natural text (4KB) | 741B | bzip2: 653B | +88B |
| INI config (64KB) | 10.4KB | bzip2: 10.3KB | +87B |
| Unicode text (256KB) | 7.8KB | bzip2: 7.7KB | +83B |
| Unicode text (16KB) | 1181B | bzip2: 1101B | +80B |
| Python (16KB) | 2585B | bzip2: 2505B | +80B |
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Timestamps | 32B vs 287B | 32B vs 772B | 32B vs 2.6KB | 64k 256k 1m |
| Database IDs | 32B vs 301B | 32B vs 937B | 32B vs 3.3KB | 64k 256k 1m |
| Integer array | 3.3KB vs 4.4KB | 12KB vs 17KB | 51KB vs 67KB | 64k 256k 1m |
| GPS coordinates | 9.7KB vs 11KB | 38KB vs 44KB | 154KB vs 179KB | 64k 256k 1m |
| Float temperature | 11KB vs 22KB | 40KB vs 87KB | 151KB vs 331KB | 64k 256k 1m |
| Sensor 16-bit | 26KB vs 27KB | 107KB vs 111KB | 430KB vs 445KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| GraphQL queries | 2.8KB vs 2.8KB | 7.8KB vs 7.8KB | 26KB vs 28KB | 64k 256k 1m |
| SQL dump | 4.8KB vs 4.7KB | 15KB vs 15KB | 54KB vs 56KB | 64k 256k 1m |
| JSON API | 1016B vs 3.7KB | 2.8KB vs 12KB | 9.8KB vs 48KB | 64k 256k 1m |
| XML document | 1020B vs 2.2KB | 2.9KB vs 8.0KB | 10KB vs 29KB | 64k 256k 1m |
| CSV data | 7.1KB vs 9.8KB | 23KB vs 33KB | 88KB vs 122KB | 64k 256k 1m |
| Base64 data | 47KB vs 48KB | 189KB vs 192KB | 758KB vs 771KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| JavaScript | 4.4KB vs 5.2KB | 11KB vs 12KB | 37KB vs 43KB | 64k 256k 1m |
| Python | 5.4KB vs 5.3KB | 12KB vs 12KB | 35KB vs 40KB | 64k 256k 1m |
| TypeScript | 4.4KB vs 4.4KB | 12KB vs 12KB | 43KB vs 45KB | 64k 256k 1m |
| HTML | 5.7KB vs 5.7KB | 17KB vs 18KB | 65KB vs 68KB | 64k 256k 1m |
| CSS | 4.2KB vs 4.1KB | 11KB vs 11KB | 41KB vs 43KB | 64k 256k 1m |
| Go | 3.4KB vs 3.4KB | 8.4KB vs 8.5KB | 26KB vs 28KB | 64k 256k 1m |
| Rust | 3.5KB vs 3.5KB | 9.1KB vs 9.1KB | 29KB vs 31KB | 64k 256k 1m |
| Java | 3.9KB vs 3.9KB | 10KB vs 10KB | 32KB vs 36KB | 64k 256k 1m |
| C | 5.2KB vs 5.2KB | 15KB vs 15KB | 50KB vs 54KB | 64k 256k 1m |
| Bash | 3.7KB vs 3.7KB | 10KB vs 10KB | 35KB vs 37KB | 64k 256k 1m |
| PHP | 3.3KB vs 3.3KB | 8.4KB vs 8.6KB | 25KB vs 27KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Docker Compose | 2.2KB vs 2.1KB | 5.5KB vs 5.6KB | 18KB vs 19KB | 64k 256k 1m |
| Terraform | 3.4KB vs 3.0KB | 11KB vs 10KB | 42KB vs 40KB | 64k 256k 1m |
| K8s manifests | 3.3KB vs 3.3KB | 7.6KB vs 7.6KB | 21KB vs 24KB | 64k 256k 1m |
| YAML config | 3.8KB vs 3.8KB | 11KB vs 11KB | 38KB vs 41KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Access log | 6.3KB vs 6.8KB | 21KB vs 24KB | 85KB vs 94KB | 64k 256k 1m |
| Nginx access log | 6.7KB vs 6.8KB | 22KB vs 22KB | 83KB vs 87KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Image gradient | 39B vs 212B | 53B vs 323B | 124B vs 397B | 64k 256k 1m |
| Audio PCM | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 64k 256k 1m |
| Sparse bitmap | 689B vs 880B | 2.6KB vs 3.0KB | 10KB vs 11KB | 64k 256k 1m |
| Protobuf-like | 40KB vs 41KB | 160KB vs 163KB | 640KB vs 650KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Natural text | 7.3KB vs 7.2KB | 28KB vs 28KB | 111KB vs 111KB | 64k 256k 1m |
| Markdown docs | 3.8KB vs 3.8KB | 11KB vs 11KB | 39KB vs 41KB | 64k 256k 1m |
| Email headers | 6.8KB vs 6.8KB | 21KB vs 21KB | 76KB vs 79KB | 64k 256k 1m |
| Unicode text | 2.5KB vs 2.4KB | 7.6KB vs 7.5KB | 27KB vs 28KB | 64k 256k 1m |
| Syslog | 8.8KB vs 9.4KB | 32KB vs 34KB | 126KB vs 133KB | 64k 256k 1m |
| Metrics | 7.8KB vs 7.8KB | 29KB vs 29KB | 118KB vs 117KB | 64k 256k 1m |
| JSON log | 7.2KB vs 8.0KB | 29KB vs 30KB | 116KB vs 122KB | 64k 256k 1m |
| Timestamps (jitter) | 14KB vs 15KB | 56KB vs 61KB | 224KB vs 244KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
| Type | 64KB | 256KB | 1MB | Samples |
|---|---|---|---|---|
| Makefile | 3.8KB vs 4.6KB | 12KB vs 16KB | 46KB vs 61KB | 64k 256k 1m |
| package.json | 4.8KB vs 4.8KB | 15KB vs 15KB | 55KB vs 57KB | 64k 256k 1m |
| Cargo.toml | 3.5KB vs 3.5KB | 10KB vs 10KB | 36KB vs 38KB | 64k 256k 1m |
Format: mzip vs 2nd-best. Bold = winner.
Benchmarked against 47 real files (7.2MB) from GitHub: actual source code from React, the Linux kernel, Django, Bootstrap, and 20+ programming languages.
| File | Size | mzip | Best | Result |
|---|---|---|---|---|
| events.csv | 592KB | 7.07x | bzip2 7.05x | mzip +0.3% |
| lodash.js | 545KB | 7.69x | bzip2 7.69x | tie |
| app.log | 475KB | 7.72x | bzip2 7.70x | mzip +0.3% |
Brotli's 120KB static dictionary is optimized for common web/code patterns.
| File | Size | mzip | brotli | Gap |
|---|---|---|---|---|
| apache_log_sample.log | 2.3MB | 18.71x | 19.99x | -7% |
| bootstrap.css | 280KB | 10.83x | 11.45x | -6% |
| k8s_deployments.yaml | 22KB | 17.55x | 20.85x | -19% |
| terraform_main.tf | 6KB | 3.11x | 3.56x | -14% |
mzip excels on structured data with patterns (logs, CSV, JSON with templates, sequences). On general source code, brotli's pre-built dictionary gives it an edge.
| Category | Note |
|---|---|
| Numeric sequences | mzip wins 100% (formula compression) |
| Structured logs/CSV | mzip wins or ties (BWT competitive) |
| Small code (<30KB) | brotli wins 10-20% (dictionary) |
| Config files (K8s, TF) | brotli wins 15-20% (domain keywords) |
Key insight: mzip excels on structured/templated data (logs, CSV, repeated patterns). For small source code files, brotli's pre-built dictionary gives it an edge mzip can't match without shipping a dictionary.
mzip automatically detects the best strategy for your data:
| Strategy | What it does | Best for | Example ratio |
|---|---|---|---|
| LINEAR_GEN | Stores v[i] = a + b×i formula | Sequential IDs, timestamps, counters | 32768x |
| NUMERIC | Delta/strided encoding | Audio PCM, sensor data, floats | 485x |
| COLUMNAR | Separates fixed-width columns | Access logs, nginx logs, CSV | 12x |
| SECTION_TEMPLATE | Extracts multi-line template + variables | Repeated code blocks with IDs | 100x |
| BWT_TEXT | Burrows-Wheeler Transform | General text, source code | 20x |
| RAW | Falls back to zstd-19 | Random/encrypted data | 1x |
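
As an illustration of the columnar idea in the table above, the sketch below transposes fixed-width records into per-column byte streams. It is a simplified stand-in, not mzip's COLUMNAR code: `split_columns` and `join_columns` are hypothetical names, and it assumes every record has the same non-zero length, with one byte per column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits equal-length records into per-column streams:
// columns[c] holds byte c of every record, in record order.
// Returns an empty result if the records differ in length.
std::vector<std::string> split_columns(const std::vector<std::string>& records) {
    if (records.empty()) return {};
    const std::size_t width = records.front().size();
    std::vector<std::string> columns(width);
    for (const std::string& rec : records) {
        if (rec.size() != width) return {};  // not fixed-width, give up
        for (std::size_t c = 0; c < width; ++c) columns[c].push_back(rec[c]);
    }
    return columns;
}

// Inverse of split_columns: rebuilds the records from the column streams.
std::vector<std::string> join_columns(const std::vector<std::string>& columns) {
    if (columns.empty()) return {};
    const std::size_t count = columns.front().size();
    std::vector<std::string> records(count);
    for (const std::string& col : columns) {
        assert(col.size() == count);  // every column must cover every record
        for (std::size_t r = 0; r < count; ++r) records[r].push_back(col[r]);
    }
    return records;
}
```

After the split, bytes that play the same role in every record sit next to each other (for example, all status-code digits), so each stream is far more repetitive than the interleaved original and can be compressed on its own.
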
```cpp
// In ONE .cpp file:
#include <zstd.h>
#define MZIP_IMPLEMENTATION
#include "mzip_amalgamated.hpp"

// In other files:
#include "mzip_amalgamated.hpp"

// Usage
auto compressed = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());
```

```cpp
#include <zstd.h>  // Required: include zstd first
#include "mzip.hpp"

// Compress
std::vector<uint8_t> data = /* your data */;
auto compressed = mzip::compress(data.data(), data.size());

// Decompress
auto decompressed = mzip::decompress(compressed.data(), compressed.size());
```

Requires C++17 and zstd:

```bash
# Single-header (no libsais.c needed - it's bundled)
g++ -O3 -march=native -I/path/to/zstd/include \
    -L/path/to/zstd/lib -o mzip_cli mzip_cli.cpp -lzstd

# Separate headers
g++ -O3 -march=native -I/path/to/zstd/include \
    -L/path/to/zstd/lib -o mzip_cli mzip_cli.cpp libsais.c -lzstd
```

```bash
# Compress
./mzip_cli compress input.bin output.mzip

# Decompress
./mzip_cli decompress output.mzip restored.bin
```

```bash
# Build benchmark tool
g++ -O3 -march=native -I./zstd/include -L./zstd/lib \
    -o mzip_bench mzip_bench.cpp libsais.c -lzstd

# Run all benchmarks (46 types × 3 sizes)
./mzip_bench

# Quick test (64KB only)
./mzip_bench --quick

# Test specific type
./mzip_bench --type graphql
```

| File | Description |
|---|---|
| mzip.hpp | Main library (include this) |
| bwt_compress_*.hpp | BWT implementations |
| generators.hpp | Test data generators |
| libsais.c/h | BWT suffix array (Apache 2.0) |
| mzip_bench.cpp | Benchmark tool |
| mzip_cli.cpp | Command-line interface |
| samples/ | Sample files at 64KB/256KB/1MB |
Dual Licensed: AGPL-3.0 OR Commercial
- AGPL-3.0: Free for open source. Service deployment requires source release.
- Commercial: Contact for proprietary use.
Third-party: libsais (Apache 2.0), stb_image (Public Domain), zstd (BSD, external)