
NNLCB: A Holistic Benchmark Evaluation for Neural-Network-Based Lossless Universal Compressors

Overview

NNLCB is a benchmark for general-purpose (universal) lossless compression algorithms on multi-source data, with an emphasis on deep-neural-network-based compressors. Our benchmark currently covers 17 general-purpose lossless compressors, 8 NN-based and 9 traditional, evaluated on 28 datasets of differing types. Each lossless compressor was assessed on 19 performance measures, including compression robustness, compression strength, and the time and peak memory required for compression and decompression.

Benchmark Results

Performance comparison of different universal lossless compression tools on benchmark datasets

| Algorithm | WavgCR (bits/base) | AvgCR (bits/base) | WavgSSP (%) | AvgSSP (%) | CRP (%) | TotalCT (Hours) | TotalDT (Hours) | AvgCPM (GB) | AvgDPM (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NNCP | 4.183 | 2.521 | 47.713 | 68.476 | 13.084 | 942.928 | 926.049 | 0.111 | 0.111 |
| PAC | 4.327 | 2.638 | 45.912 | 67.019 | 12.720 | 74.398 | 116.868 | 6.102 | 6.295 |
| TRACE | 4.411 | 2.718 | 44.867 | 66.032 | 12.486 | 69.128 | 131.110 | 6.106 | 6.449 |
| DZip | 4.494 | 2.516 | 43.819 | 68.545 | 14.272 | 332.787 | 148.374 | 10.113 | 4.790 |
| DZip* | 4.562 | 3.802 | 42.971 | 52.476 | 11.158 | 332.787 | 148.374 | 10.113 | 4.790 |
| Lstm-compress | 5.395 | 2.786 | 32.563 | 65.168 | 14.543 | 492.869 | 474.498 | 0.009 | 0.009 |
| DeepZip* | 16.835 | 7.045 | -110.434 | 11.933 | 18.504 | 250.714 | 52.449 | 13.708 | 4.292 |
| DeepZip | 16.865 | 5.760 | -110.811 | 28.003 | 24.092 | 250.714 | 52.449 | 13.708 | 4.292 |
| BSC | 4.826 | 2.928 | 39.677 | 63.394 | 13.045 | 0.353 | 0.300 | 0.121 | 0.116 |
| Lzma2 | 4.912 | 3.122 | 38.590 | 60.967 | 12.289 | 0.584 | 0.030 | 1.264 | 0.427 |
| XZ | 4.923 | 3.118 | 38.463 | 61.021 | 12.365 | 0.879 | 0.040 | 1.612 | 0.504 |
| PPMD | 4.960 | 3.025 | 38.001 | 62.181 | 12.934 | 0.893 | 0.953 | 0.226 | 0.225 |
| PBzip2 | 5.052 | 3.275 | 36.845 | 59.062 | 11.798 | 0.024 | 0.016 | 0.115 | 0.084 |
| Gzip | 5.351 | 3.862 | 33.113 | 51.728 | 10.342 | 0.451 | 0.026 | 0.002 | 0.002 |
| LZ4-multi | 5.618 | 4.280 | 29.770 | 46.501 | 9.656 | 0.064 | 0.009 | 0.116 | 0.025 |
| SnZip | 5.981 | 5.100 | 25.235 | 36.244 | 7.473 | 0.031 | 0.021 | 0.003 | 0.003 |

Notes. "*": the NN model size is taken into account. "Avg/WavgCR (bits/base)": average / weighted-average compression ratio. "Avg/WavgSSP (%)": average / weighted-average storage saving percentage. "CRP (%)": compression robust performance. "TotalCT/DT (Hours)": total compression / decompression time. "AvgCPM/DPM (GB)": average compression / decompression peak memory.
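
For reference, the ratio and saving metrics are tied together as follows, assuming (consistently with the table) an uncompressed baseline of 8 bits/base; see the paper for the formal definitions: $\mathrm{CR} = 8 \cdot |\text{compressed}| / |\text{original}|$ and $\mathrm{SSP} = (1 - \mathrm{CR}/8) \times 100\%$. For example, NNCP's WavgCR of 4.183 bits/base corresponds to $(1 - 4.183/8) \times 100 \approx 47.71\%$, matching its WavgSSP of 47.713. A negative SSP, as for DeepZip, means the output is larger than the input.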

Benchmark Datasets

We benchmark on 28 widely studied datasets. These datasets cover various types of data, including text, images, audio, and genomic data. Please refer to our paper for detailed information about the data; the benchmark datasets are listed below:

| ID | Name | Data Type | Size (Bytes) | Description |
| --- | --- | --- | --- | --- |
| D1 | xml | text | 5345280 | Files in XML format |
| D2 | ooffice | heterogeneous | 6152192 | Files from Office programs |
| D3 | reymont | text | 6627202 | A PDF file with the contents of Reymont's book |
| D4 | sao | homogeneous | 7251944 | Files containing information on 258,996 stars |
| D5 | x-ray | image | 8474240 | 12-bit grayscale X-ray medical image of a child's hand |
| D6 | mr | image | 9970564 | A magnetic resonance medical image of the head |
| D7 | osdb | heterogeneous | 10085684 | Open-source database files for testing |
| D8 | dickens | text | 10192446 | Text file consisting of multiple novels by Dickens |
| D9 | samba | heterogeneous | 21606400 | An archive of the Samba open-source project |
| D10 | nci | homogeneous | 33553445 | Files in SDF format |
| D11 | webster | heterogeneous | 41458703 | An English dictionary stored in HTML format |
| D12 | mozilla | homogeneous | 51220480 | Executable files of Mozilla |
| D13 | enwik8 | text | 100000000 | First $10^8$ bytes of the 2006 English Wikipedia dump |
| D14 | text8 | text | 100000000 | First $10^8$ bytes of the 2006 English Wikipedia dump (text only) |
| D15 | MNIST | image | 54880032 | A widely studied dataset of handwritten digit images |
| D16 | CIFAR-10 | image | 186213868 | A standard dataset of images in multiple categories |
| D17 | ImageNet | image | 745823247 | Training data for task 3 of ILSVRC 2012 |
| D18 | ImageTest | image | 470611702 | A new 8-bit benchmark dataset for image compression evaluation |
| D19 | Silesia | heterogeneous | 211938580 | A heterogeneous corpus of 12 documents with various data types |
| D20 | Backup | heterogeneous | 1000000000 | A $10^9$-byte random extract from the disk backup used by TRACE |
| D21 | enwik9 | text | 1000000000 | First $10^9$ bytes of the 2006 English Wikipedia dump |
| D22 | Book | text | 1000000000 | First $10^9$ bytes of BookCorpus |
| D23 | ESC | audio | 220522000 | First 500 audio files of the ESC dataset |
| D24 | Command | audio | 327759206 | First 10,000 audio files of the Google Speech Commands dataset |
| D25 | LibriSpeech | audio | 359034309 | Development set ("clean" speech) of the LibriSpeech ASR corpus |
| D26 | LJSpeech | audio | 293847664 | First 10,000 audio files of the LJ Speech dataset |
| D27 | DNACorpus | genome | 685597124 | A corpus of DNA sequences from 15 different species |
| D28 | ERR7091247 | genome | 1926041160 | Genomic sequencing data in FastQ format |

Algorithm Details

In our comparison experiments, we benchmarked 8 advanced general-purpose NN-based compressors (Cmix, NNCP, Lstm-compress, DeepZip, DZip, TRACE, PAC, and LLMZip) and 9 traditional methods (Gzip, PBzip2, XZ, BSC, SnZip, Lzma2, PPMD, LZ4, and X3).


Algorithm details and commands

Cmix is a neural-network-based lossless compression algorithm that optimizes compression ratio at the cost of high CPU/memory usage; it uses thousands of context models followed by an NN-based mixer. We used Cmix V19 in our experiments.

# compression
cmix -c file file.cmix
# decompression
cmix -d file.cmix file.cmix.out

Lstm-compress is an LSTM-based lossless compression algorithm that uses the same LSTM module and preprocessing code as Cmix. Lstm-compress currently supports compression of only a single file. We used Lstm-compress V3 in our experiments. The detailed commands are as follows.

# compression
lstm-compress -c file file.lstm
# decompression
lstm-compress -d file.lstm file.lstm.out

NNCP is an experiment to build a practical lossless data compressor with neural networks. It is based on LSTM and supports multi-GPU parallel processing; the latest version uses a Transformer model. We used NNCP V2021-06-01 in our experiments. The detailed commands are as follows.

# compression
nncp c file file.nncp -T 16 --cuda
# decompression
nncp d file.nncp file.nncp.out -T 16 --cuda

DeepZip is a general-purpose compression algorithm based on recurrent neural networks; it is a static, pre-training-based method. The detailed commands for using DeepZip are shown below.

# compression
sh ./compress.sh file file.deepzip bs model
# decompression
sh ./decompress.sh file.deepzip file.deepzip.out bs model

DZip is an upgraded version of DeepZip, adding an extra, deeper network to improve compression. DZip offers two compression modes: combined mode and bootstrap mode. The detailed commands (combined mode) are as follows.

# compression
sh ./compress.sh file file.dzip com model
# decompression
sh ./decompress.sh file.dzip file.dzip.out com model
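
For the bootstrap mode, the mode argument presumably changes from com to bs, by analogy with the DeepZip script interface above (an assumption; check the DZip repository for the exact flag):

# compression in bootstrap mode (assumed mode flag, mirroring the combined-mode command)
sh ./compress.sh file file.dzip bs model
# decompression in bootstrap mode
sh ./decompress.sh file.dzip file.dzip.out bs model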

TRACE is a lossless compression algorithm based on Performer (a Transformer variant). TRACE uses byte grouping and shared FFNs and therefore has better execution efficiency. Since the original TRACE performs compression and decompression in a single run, we modified the source files so that the performance of compression and decompression could be tested separately.

# compression
python compressor.py --source file --comp file.trace
# decompression
python compressor.py --comp file.trace --decomp file.trace.out

PAC is a deep-learning-based compression algorithm that fuses an MLP with an Ordered Mask. Thanks to the MLP, PAC has a lower computational cost. As with TRACE, we separated PAC's compression and decompression processes, as shown in the commands below.

# compression
python compressor.py --source file --comp file.pac
# decompression
python compressor.py --comp file.pac --decomp file.pac.out

LLMZip uses LLaMA as a probabilistic predictor in combination with entropy coding (zlib, token-by-token, and arithmetic coding) to achieve general-purpose lossless compression. The compression and decompression commands of LLMZip are as follows.

# compression
torchrun --nproc_per_node 1 LLMzip_run.py --ckpt_dir llama2/llama-2-7b --tokenizer_path llama2/tokenizer.model --win_len 511 --text_file file --compression_folder file --encode_decode 0
# decompression
torchrun --nproc_per_node 1 LLMzip_run.py --ckpt_dir llama2/llama-2-7b --tokenizer_path llama2/tokenizer.model --win_len 511 --text_file file --compression_folder file --encode_decode 1

Gzip is a popular early general-purpose lossless compression program originally written by Jean-loup Gailly for the GNU project. The commands for Gzip are shown below.

# compression
gzip -9 -c file > file.gz
# decompression
gzip -d file.gz

PBzip2 is a parallel implementation of the Bzip2 block-sorting file compressor that uses pthreads and achieves near-linear speedup on SMP machines. It compresses files with the Burrows-Wheeler block-sorting algorithm together with Huffman coding. We used parallel Bzip2 V1.1.13 to compress data.

# compression
pbzip2 -9 -m2000 -p16 -c file > file.bz2
# decompression
pbzip2 -d -p16 -m2000 file.bz2

XZ Utils is free general-purpose data compression software with a high compression ratio. XZ Utils was written for POSIX-like systems but also works on some not-so-POSIX systems. It is the successor to LZMA Utils. In our experiments, we used XZ V5.5.0. The compression and decompression commands are as follows.

# compression
xz -9 -T16 -k file
# decompression
xz -d -T16 -c file.xz > file.xz.out

BSC is a high-performance file compressor based on lossless block-sorting data compression algorithms. We used BSC V3.3.2 to compress and decompress data.

# compression
bsc e file file.bsc -e2
# decompression
bsc d file.bsc file.bsc.out

SnZip is a traditional general-purpose lossless compression tool based on Snappy. It supports several file formats, including the framing format and the old framing format; the default is the framing format. The commands to run SnZip are as follows.

# compression
snzip -k -t snzip file
# decompression
snzip -kd -t snzip file.snz

LZMA2 improves the multi-threading capability and performance of the LZMA algorithm and handles incompressible data better, so its compression performance is slightly improved. We used the LZMA2 algorithm built into the 7-Zip application.

# compression
7zz a -m0=lzma2 -mx9 -mmt16 file.7z file
# decompression
7zz x -y -mx9 -mmt16 file.7z

PPMD is a context-based compressor whose core is the Prediction by Partial Matching (PPM) algorithm proposed by Cleary and Witten. PPM is a statistical modeling technique that uses a set of previous symbols in the input to predict the next symbol, reducing the entropy of the output data. Unlike dictionary-based methods, PPM predicts the next symbol rather than looking it up in a dictionary. We used the PPMd implementation in 7-Zip to compress data.

# compression
7zz a -m0=ppmd -mx9 -mmt16 file.7z file
# decompression
7zz x -y -mx9 -mmt16 file.7z

LZ4 is a lossless compression algorithm with compression speeds greater than 500 MB/s per core (greater than 0.15 bytes/cycle). Its decoder is extremely fast, reaching several GB/s per core (about 1 byte/cycle). The latest LZ4 releases also support multi-threaded operation.

# compression
lz4 -12 -T16 file file.lz4
# decompression
lz4 -12 -T16 -d file.lz4 file.lz4.reads
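Since losslessness is a prerequisite for all of the results above, every pipeline's round trip can be verified by comparing the decompressed output with the original input, for example with standard Unix tools (a generic sketch using the LZ4 file names from the block above):

# verify that decompression reproduced the input bit-for-bit
cmp file file.lz4.reads && echo "round trip OK"
# or compare checksums
sha256sum file file.lz4.reads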

Experimental Configuration

All experiments were conducted on a GPU server equipped with 4× Intel Xeon Silver 4310 CPUs (2.10 GHz, 48 cores in total), 4× NVIDIA GeForce RTX 4090 GPUs (16,384 CUDA cores and 24 GB of memory each), and 128 GB of DDR4 RAM. The server runs Ubuntu 20.04.6 LTS.
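
The time and peak-memory figures in the results table can be measured for any of the commands above with a wrapper such as GNU time. The following is a minimal sketch of this kind of measurement (using the Gzip command as an example), not the exact harness used in the paper:

# record wall-clock time and peak resident memory of one compression run
/usr/bin/time -v gzip -9 -c file > file.gz 2> stats.txt
# the report includes "Elapsed (wall clock) time" and "Maximum resident set size (kbytes)"
grep -E "Elapsed|Maximum resident" stats.txt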

Additional Information

Source-Version-Date: 2024.03.08, 2024.03.10.

Latest-Version-Date: 2024.07.28.

Authors: NBJL-AIGroup.

Contact us: https://nbjl.nankai.edu.cn, sunh@nbjl.nankai.edu.cn, and mahd@nbjl.nankai.edu.cn

Change Log

2024.07.29: Modified the README to include the X3 algorithm.