GCX is grammar compressor based on the induced suffix ordering technique introduced in GCIS. Our method builds on the factorization of the DC3 algorithm to create the rules of a straight line grammar for the input string.
GCX supports very fast extraction operation on the encoded grammar without the need to decompress the complete string.
This repository contains implementations of two algorithms for Grammar Compression modulo X:
- GC* (gc_/ folder): generates grammar rules with a fixed size X. This value must be provided as input.
- GCX (gcx/ folder): allows variable rule sizes X across grammar levels, based on the average Longest Common Prefix (LCP) of substrings in the text.
To generate the executables for both versions, simply run the following command:
./build.sh allThis command will:
- Compile the GC* project, creating the gc_star executable in the current directory.
- Compile the GCX project, creating the gcx executable in the current directory.
After successful execution, you will see success messages indicating that both executables have been generated.
To compile only the GC* project, run the following command:
./build.sh gc_starThis command will create the gc_star executable in the current directory.
To compile only the GCX project, run the following command:
./build.sh gcxThis command will create the gcx executable in the current directory
- To compress a text using GC*, execute:
./gc_star -c <plain_file> <output_file> <coverage>- To compress a text using GCX, execute:
./gcx -c <plain_file> <output_file>- To decompress a text using GC*, execute:
./gc_star -d <compressed_file> <output_file> <coverage>- To decompress a text using GCX, execute:
./gcx -d <compressed_file> <output_file>- To extract substrings from the compressed text using GC*, execute:
./gc_star -e <compressed_file> <output_file> <coverage> <queries_file>- To extract substrings from the compressed text using GCX, execute:
./gcx -e <compressed_file> <output_file> <queries_file>Note: The queries_file format is the same for both GC* and GCX extractions. The first line of the queries_file consists of two numbers. The first number represents the number of intervals, and the second number indicates the size of these intervals. The subsequent lines contain two numbers, which define the intervals, separated by a blank space.
The results of our experiments are stored in a separate repository: gcx-experiments, where you can find detailed analysis and output data. The repository helps track the performance and comparison of the algorithms under different conditions and datasets.
In the experiments/file_paths/ folder, we maintain links to some datasets used during the experiments. These datasets are used for testing the compression, decompression, and extraction processes.
The following table provides a general overview of the compression and decompression times, as well as the average compression ratio for different algorithms tested in the experiments using repetitive datasets.
| File | Input Size (MB) | bzip2 | RePair (32Fblc) | RePair (FblcFblc) | GCIS-ef | GCIS-s8b | GC4 | GC8 | GC16 | GC32 | GCX-y4 | GCX-y8 | GCX-y16 | GCX-y32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Escherichia_Coli | 107.47 | 4.74 | 19.39 | 19.39 | 5.39 | 5.13 | 2.85 | 2.64 | 2.81 | 2.57 | 5.23 | 4.98 | 5.69 | 5.55 |
| artificial-fib41 | 255.50 | 33.77 | 13.81 | 13.81 | 3.84 | 3.87 | 2.60 | 3.73 | 4.37 | 4.57 | 5.16 | 7.46 | 8.65 | 8.85 |
| artificial-rs.13 | 206.71 | 27.80 | 9.89 | 9.89 | 2.86 | 2.90 | 2.95 | 2.98 | 3.49 | 3.82 | 5.33 | 6.28 | 7.17 | 7.24 |
| artificial-tm29 | 256.00 | 49.87 | 12.95 | 12.95 | 3.42 | 3.46 | 1.74 | 1.69 | 2.04 | 4.27 | 3.98 | 3.54 | 4.10 | 8.48 |
| cere | 439.92 | 18.13 | 70.30 | 70.30 | 21.59 | 21.17 | 11.52 | 10.70 | 11.31 | 10.99 | 25.41 | 22.23 | 22.12 | 22.44 |
| coreutils | 195.77 | 8.30 | 45.60 | 45.60 | 8.83 | 8.69 | 5.80 | 5.09 | 4.96 | 4.78 | 13.15 | 10.19 | 9.82 | 9.53 |
| dblp.xml.00001.1 | 100.00 | 4.93 | 20.25 | 20.25 | 4.02 | 3.83 | 2.55 | 2.35 | 2.19 | 2.15 | 5.97 | 4.46 | 4.56 | 4.40 |
| dblp.xml.00001.2 | 100.00 | 4.92 | 18.82 | 18.82 | 4.09 | 3.90 | 2.50 | 2.23 | 2.28 | 2.24 | 5.89 | 4.61 | 4.59 | 4.49 |
| dblp.xml.0001.1 | 100.00 | 4.88 | 19.52 | 19.52 | 4.05 | 3.97 | 2.52 | 2.18 | 2.22 | 2.22 | 5.74 | 4.58 | 4.59 | 4.57 |
| dblp.xml.0001.2 | 100.00 | 4.83 | 18.31 | 18.31 | 4.12 | 3.96 | 2.52 | 2.21 | 2.24 | 2.18 | 5.89 | 4.55 | 4.57 | 4.57 |
| dna.001.1 | 100.00 | 4.31 | 28.77 | 28.77 | 4.33 | 4.21 | 2.14 | 2.26 | 2.32 | 2.38 | 4.99 | 4.66 | 4.81 | 4.81 |
| einstein.de.txt | 88.46 | 4.35 | 17.27 | 17.27 | 3.56 | 3.50 | 2.31 | 2.13 | 2.02 | 2.00 | 5.15 | 4.18 | 4.03 | 3.97 |
| einstein.en.txt | 445.96 | 22.73 | 94.05 | 94.05 | 22.78 | 22.49 | 12.50 | 11.35 | 10.95 | 10.76 | 28.90 | 22.52 | 21.31 | 20.71 |
| english.001.2 | 100.00 | 4.24 | 22.09 | 22.09 | 4.86 | 4.76 | 2.62 | 2.31 | 2.48 | 2.35 | 6.11 | 5.04 | 5.03 | 5.03 |
| influenza | 147.64 | 8.54 | 23.23 | 23.23 | 5.85 | 5.80 | 3.34 | 3.48 | 3.53 | 3.46 | 7.70 | 6.92 | 6.94 | 6.92 |
| kernel | 246.01 | 9.64 | 51.02 | 51.02 | 12.09 | 11.92 | 7.32 | 6.57 | 6.38 | 6.17 | 17.02 | 13.23 | 12.52 | 12.30 |
| para | 409.38 | 17.11 | 69.96 | 69.96 | 20.70 | 20.14 | 11.32 | 10.31 | 10.86 | 10.46 | 24.52 | 21.55 | 21.38 | 21.34 |
| proteins.001.1 | 100.00 | 4.62 | 24.56 | 24.56 | 5.05 | 4.81 | 2.65 | 2.26 | 2.47 | 2.31 | 6.13 | 4.72 | 4.87 | 4.53 |
| sources.001.2 | 100.00 | 3.99 | 20.63 | 20.63 | 4.33 | 4.11 | 2.50 | 2.28 | 2.24 | 2.39 | 5.95 | 4.92 | 4.88 | 4.92 |
| world_leaders | 44.79 | 1.18 | 6.14 | 6.14 | 1.21 | 1.18 | 0.78 | 0.74 | 0.81 | 0.88 | 1.67 | 1.43 | 1.53 | 1.72 |
| File | Input Size (MB) | bzip2 | RePair (32Fblc) | RePair (FblcFblc) | GCIS-ef | GCIS-s8b | GC4 | GC8 | GC16 | GC32 | GCX-y4 | GCX-y8 | GCX-y16 | GCX-y32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Escherichia_Coli | 107.47 | 2.19 | 0.86 | 0.86 | 0.82 | 0.82 | 0.39 | 0.21 | 0.44 | 0.38 | 0.43 | 0.21 | 0.46 | 0.44 |
| artificial-fib41 | 255.50 | 3.30 | 1.01 | 1.01 | 0.84 | 0.92 | 0.56 | 0.31 | 0.23 | 0.18 | 0.96 | 0.36 | 0.24 | 0.18 |
| artificial-rs.13 | 206.71 | 2.61 | 0.82 | 0.82 | 0.66 | 0.72 | 0.43 | 0.26 | 0.19 | 0.15 | 0.76 | 0.29 | 0.19 | 0.15 |
| artificial-tm29 | 256.00 | 3.29 | 1.04 | 1.04 | 0.78 | 0.85 | 0.55 | 0.32 | 0.23 | 0.18 | 0.52 | 0.32 | 0.23 | 0.18 |
| cere | 439.92 | 8.53 | 3.42 | 3.42 | 2.44 | 3.00 | 2.33 | 1.21 | 1.93 | 1.70 | 2.55 | 1.65 | 2.12 | 1.74 |
| coreutils | 195.77 | 2.61 | 1.41 | 1.41 | 1.07 | 1.38 | 0.80 | 0.56 | 0.56 | 0.56 | 1.13 | 0.58 | 0.58 | 0.54 |
| dblp.xml.00001.1 | 100.00 | 1.33 | 0.59 | 0.59 | 0.37 | 0.55 | 0.22 | 0.14 | 0.10 | 0.08 | 0.43 | 0.18 | 0.15 | 0.12 |
| dblp.xml.00001.2 | 100.00 | 1.35 | 0.58 | 0.58 | 0.38 | 0.54 | 0.22 | 0.14 | 0.10 | 0.08 | 0.44 | 0.18 | 0.15 | 0.11 |
| dblp.xml.0001.1 | 100.00 | 1.33 | 0.60 | 0.60 | 0.39 | 0.52 | 0.26 | 0.14 | 0.10 | 0.09 | 0.43 | 0.18 | 0.15 | 0.13 |
| dblp.xml.0001.2 | 100.00 | 1.44 | 0.61 | 0.61 | 0.40 | 0.53 | 0.25 | 0.14 | 0.10 | 0.08 | 0.43 | 0.18 | 0.15 | 0.14 |
| dna.001.1 | 100.00 | 2.05 | 0.79 | 0.79 | 0.54 | 0.65 | 0.25 | 0.14 | 0.12 | 0.10 | 0.43 | 0.19 | 0.16 | 0.23 |
| einstein.de.txt | 88.46 | 1.14 | 0.57 | 0.57 | 0.30 | 0.56 | 0.23 | 0.14 | 0.10 | 0.09 | 0.41 | 0.15 | 0.11 | 0.10 |
| einstein.en.txt | 445.96 | 5.71 | 3.10 | 3.10 | 1.44 | 2.71 | 1.20 | 0.72 | 0.53 | 0.47 | 1.99 | 0.78 | 0.54 | 0.45 |
| english.001.2 | 100.00 | 2.26 | 0.90 | 0.90 | 0.60 | 0.78 | 0.27 | 0.16 | 0.12 | 0.10 | 0.50 | 0.26 | 0.25 | 0.29 |
| influenza | 147.64 | 1.95 | 0.86 | 0.86 | 0.64 | 0.87 | 0.40 | 0.27 | 0.21 | 0.25 | 0.69 | 0.26 | 0.22 | 0.27 |
| kernel | 246.01 | 3.66 | 1.85 | 1.85 | 1.31 | 1.74 | 1.15 | 0.68 | 0.89 | 0.88 | 1.42 | 0.84 | 0.88 | 0.84 |
| para | 409.38 | 8.19 | 3.23 | 3.23 | 2.43 | 2.90 | 2.27 | 1.13 | 1.86 | 1.63 | 2.53 | 1.67 | 1.89 | 1.73 |
| proteins.001.1 | 100.00 | 2.15 | 0.77 | 0.77 | 0.55 | 0.80 | 0.26 | 0.16 | 0.12 | 0.10 | 0.48 | 0.19 | 0.14 | 0.16 |
| sources.001.2 | 100.00 | 1.81 | 0.84 | 0.84 | 0.57 | 0.71 | 0.27 | 0.16 | 0.11 | 0.10 | 0.50 | 0.26 | 0.19 | 0.29 |
| world_leaders | 44.79 | 0.39 | 0.27 | 0.27 | 0.17 | 0.18 | 0.12 | 0.07 | 0.05 | 0.06 | 0.19 | 0.09 | 0.06 | 0.06 |
| File | Input Size (MB) | bzip2 | RePair (32Fblc) | RePair (FblcFblc) | GCIS-ef | GCIS-s8b | GC4 | GC8 | GC16 | GC32 | GCX-y4 | GCX-y8 | GCX-y16 | GCX-y32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Escherichia_Coli | 107.47 | 26.98 | 56.44 | 48.43 | 16.68 | 13.16 | 50.06 | 31.96 | 91.03 | 95.15 | 39.59 | 32.49 | 78.94 | 85.64 |
| artificial-fib41 | 255.50 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.07 | 0.08 | 0.16 | 0.05 | 0.05 | 0.08 | 0.06 |
| artificial-rs.13 | 206.71 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.09 | 0.10 | 0.13 | 0.20 | 0.08 | 0.09 | 0.11 | 0.09 |
| artificial-tm29 | 256.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| cere | 439.92 | 25.22 | 13.12 | 11.37 | 4.44 | 3.40 | 33.67 | 26.33 | 57.56 | 68.88 | 37.43 | 31.56 | 54.27 | 60.36 |
| coreutils | 195.77 | 16.02 | 18.46 | 15.82 | 5.93 | 4.71 | 43.38 | 43.15 | 53.87 | 64.66 | 44.20 | 42.79 | 50.21 | 57.28 |
| dblp.xml.00001.1 | 100.00 | 10.97 | 1.64 | 1.32 | 0.47 | 0.34 | 0.59 | 0.66 | 0.75 | 0.94 | 1.66 | 17.42 | 12.31 | 26.68 |
| dblp.xml.00001.2 | 100.00 | 11.24 | 1.68 | 1.35 | 0.47 | 0.34 | 0.59 | 0.66 | 0.75 | 0.94 | 1.76 | 17.72 | 12.60 | 27.35 |
| dblp.xml.0001.1 | 100.00 | 11.03 | 2.81 | 2.25 | 0.90 | 0.66 | 1.01 | 1.16 | 1.44 | 1.94 | 2.08 | 17.58 | 12.53 | 26.96 |
| dblp.xml.0001.2 | 100.00 | 13.15 | 3.19 | 2.60 | 0.90 | 0.66 | 1.01 | 1.16 | 1.44 | 1.94 | 2.85 | 20.53 | 15.36 | 33.37 |
| dna.001.1 | 100.00 | 26.12 | 12.23 | 10.04 | 4.24 | 3.17 | 3.49 | 4.20 | 6.87 | 9.06 | 5.21 | 19.26 | 34.94 | 42.74 |
| einstein.de.txt | 88.46 | 4.32 | 1.25 | 0.98 | 0.34 | 0.25 | 7.69 | 9.41 | 13.11 | 14.41 | 7.96 | 8.59 | 11.22 | 13.23 |
| einstein.en.txt | 445.96 | 5.17 | 0.63 | 0.51 | 0.22 | 0.17 | 3.74 | 4.69 | 6.67 | 8.74 | 3.82 | 4.31 | 5.95 | 7.14 |
| english.001.2 | 100.00 | 37.32 | 16.37 | 13.73 | 4.87 | 3.66 | 4.48 | 5.52 | 6.92 | 9.12 | 12.12 | 27.00 | 50.79 | 82.67 |
| influenza | 147.64 | 6.59 | 16.99 | 14.27 | 5.66 | 4.31 | 16.49 | 18.80 | 24.61 | 33.73 | 16.84 | 17.36 | 23.15 | 31.35 |
| kernel | 246.01 | 21.74 | 8.41 | 7.21 | 2.69 | 2.12 | 38.90 | 39.51 | 48.41 | 66.19 | 40.91 | 36.55 | 44.34 | 53.91 |
| para | 409.38 | 26.15 | 18.32 | 16.10 | 5.73 | 4.52 | 37.50 | 28.42 | 65.02 | 77.31 | 42.01 | 35.72 | 55.95 | 67.27 |
| proteins.001.1 | 100.00 | 40.59 | 13.63 | 11.20 | 4.87 | 3.67 | 4.54 | 5.69 | 6.88 | 9.09 | 7.00 | 15.32 | 27.37 | 30.80 |
| sources.001.2 | 100.00 | 33.02 | 15.78 | 13.23 | 4.72 | 3.59 | 4.54 | 5.37 | 6.82 | 9.09 | 11.87 | 25.26 | 45.64 | 79.33 |
| world_leaders | 44.79 | 6.94 | 11.53 | 9.45 | 3.85 | 2.85 | 21.75 | 25.72 | 30.56 | 37.63 | 22.80 | 24.09 | 26.89 | 33.45 |
The following table provides an overview of the average extraction times for different algorithms used in the experiments with repetitive datasets.
| Algorithm | Extraction Time |
|---|---|
| GCX | M secs |
| GCX* | N secs |
| GCIS-ef | O secs |
| GCIS-s8b | P secs |
| Repair (PlainSlp_32Fblc) | Q secs |
| Repair (PlainSlp_FblcFblc) | R secs |
Note: Detailed experimental results and analyses can be found in the Experiments Repository.