This library implements a distributed-memory parallel algorithm for the
construction of suffix arrays, LCP arrays, and suffix trees. The algorithm is implemented in C++11
and MPI.
The algorithms implemented by this codebase are described in the following peer-reviewed publications. Please cite these papers, when using our code for academic purposes:
Flick, Patrick, and Srinivas Aluru. "Parallel distributed memory construction of suffix and longest common prefix arrays." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, 2015.
Flick, Patrick, and Srinivas Aluru. "Parallel Construction of Suffix Trees and the All-Nearest-Smaller-Values Problem." 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), IEEE, 2017.
Flick, Patrick, and Srinivas Aluru. "Distributed Enhanced Suffix Arrays: Efficient Algorithms for Construction and Querying." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, 2019.
include/contains the implementation of our algorithms in form of C++ template header files (a header-only library).src/contains the sources for binaries, which make use of the implementations ininclude/.testcontains unit tests for the components of the library.ext/contains external, third-party dependencies/libraries. See the README for details on the third-party libraries used.
cmakeversion >= 2.6- C++11 compatible compiler (tested with gcc and clang)
- an
MPIimplementation supportingMPI-2orMPI-3. - external (third-party) dependencies are included in the
ext/directory
To compile the executables and tests via cmake run the following:
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=Release ../
makeAfter compiling, there will a multiple binaries available in the build/bin
folder. Running --help on them will give more detailed usage information.
Here's a short overview over the different binaries and their function:
psacis our main executable. This will construct the suffix and LCP array of a given input file. Run withmpirunfor parallel execution.benchmark_sacbenchmarks multiple of our methods. Run withmpirun.dssis a wrapper aroundlibdivsufsortthat follows the same command line usage as our other binaries. This is a sequential program. No mpirun needed.psac-vs-dssruns both our suffix array construction andlibdivsufsort, verifies the results against each other and outputs run-times of both.test_*various test executables, testing a variety of our internal methods.
The psac executable can be used to create the Suffix Array and LCP array and save them as binary output files.
This small example shows how to do so:
cd build/bin
# create example input file with string S=mississippi
printf "mississippi" > text.txt
# in parallel, create suffix array for text.txt
# with options:
# -f text.txt: sets the input file
# -l: create LCP array alongside Suffix Array
# -c: run correctness test
# -o text.idx: sets basename for binary output files text.idx.sa64 & text.idx.lcp64
mpirun -np 4 ./psac -l -c -f text.txt -o text.idx
# verify output with ./print64
./print64 text.idx.sa64The last command outputs the suffix array for string S=mississippi in decimal format:
10
7
4
1
0
9
8
6
3
5
2Our code is licensed under the
Apache License 2.0 (see LICENSE).
The licensing does not apply to the ext folder, which contains external
dependencies which are under their own licensing terms.