Travis (Ubuntu artful, macOS) | AppVeyor (Windows) | Shippable (Debian sid) | Wercker (Alpine Linux, Arch, Fedora) |
---|---|---|---|

Coverity | Codacy | Coveralls | Codecov |
---|---|---|---|
This is a fast C++ implementation of some backend generation functionality used by Kameris.
libkameris supports the following representations of DNA sequences:
and the following distances:
- Euclidean
- Squared Euclidean
- Manhattan
- Cosine
- Pearson Correlation
- SSIM
- Approximate Information Distance
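For readers unfamiliar with these measures, the sketch below gives textbook definitions of five of them on dense vectors. It is not the libkameris implementation: the function names and the Vec alias are hypothetical, and the conventions chosen for Cosine and Pearson (one minus the similarity or correlation) are assumptions that may differ from what this library computes. SSIM and Approximate Information Distance are omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical alias for a dense vector; not a libkameris type.
using Vec = std::vector<double>;

// All functions assume a.size() == b.size() and at least one element.

// Sum of squared element-wise differences.
inline double squared_euclidean(const Vec& a, const Vec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Square root of the squared Euclidean distance.
inline double euclidean(const Vec& a, const Vec& b) {
    return std::sqrt(squared_euclidean(a, b));
}

// Sum of absolute element-wise differences.
inline double manhattan(const Vec& a, const Vec& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += std::abs(a[i] - b[i]);
    }
    return sum;
}

// Assumed convention: 1 - cosine similarity. Both norms must be non-zero.
inline double cosine(const Vec& a, const Vec& b) {
    const double dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
    const double norm_a = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0));
    const double norm_b = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0));
    return 1.0 - dot / (norm_a * norm_b);
}

// Assumed convention: 1 - Pearson correlation, computed as the cosine
// distance between the mean-centred vectors. Neither input may be constant.
inline double pearson(const Vec& a, const Vec& b) {
    const double n = static_cast<double>(a.size());
    const double mean_a = std::accumulate(a.begin(), a.end(), 0.0) / n;
    const double mean_b = std::accumulate(b.begin(), b.end(), 0.0) / n;
    Vec centred_a(a), centred_b(b);
    for (double& x : centred_a) x -= mean_a;
    for (double& x : centred_b) x -= mean_b;
    return cosine(centred_a, centred_b);
}
```

With a = (1, 2, 3) and b = (4, 6, 3), the element-wise differences are 3, 4, and 0, so the Euclidean distance is 5 and the Manhattan distance is 7.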
You may be interested in the following papers for further reference:
- An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes
- Mapping the Space of Genomic Signatures
- An investigation into inter- and intragenomic variations of graphic genomic signatures
- Additive methods for genomic signatures
Pre-built binaries for Windows, Linux, and macOS are available here.
For best performance, make sure you download the version corresponding to the latest instruction set your CPU supports (if unsure, you can check with CPU-Z, or just pick the SSE version and sacrifice performance).
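If you would rather check programmatically, the following is a minimal sketch that asks the compiler's runtime CPU detection which instruction sets are available. It relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports, which exist only on x86 targets; elsewhere it reports "unknown". The function name and the returned labels are hypothetical and are not guaranteed to match the names of the downloadable binaries. Note also that it tests only the AVX-512 Foundation subset, which may be less than an AVX-512 build actually requires.

```cpp
#include <string>

// Hypothetical helper, not part of this project: returns a label for the
// most advanced instruction set the current CPU reports, or "unknown".
inline std::string best_instruction_set() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // initialise the CPU feature data before querying
    // Check from newest to oldest so the first hit is the best match.
    if (__builtin_cpu_supports("avx512f")) return "avx512";
    if (__builtin_cpu_supports("avx2")) return "avx2";
    if (__builtin_cpu_supports("avx")) return "avx";
    if (__builtin_cpu_supports("sse4.1")) return "sse41";
    if (__builtin_cpu_supports("sse3")) return "sse3";
    return "unknown";
#else
    // No portable way to query CPU features on this compiler/architecture.
    return "unknown";
#endif
}
```

Printing the result of best_instruction_set() from a small main function is enough to see which variant to try first.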
Try putting some FASTA files in a directory fasta, creating a directory output, then running:
generation_cgr cgr fasta/ output/cgr-k=9.mm-repr 9 32
generation_dists output/cgr-k=9.mm-repr output/dists manhat
The format of output files is documented here. generation_cgr will generate files with the extension .mm-repr and generation_dists will generate files with the extension .mm-dist.
Libraries to read and write these formats, for C++ and Python, can be found in the kameris-formats folder.
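To give a rough idea of what a k-mer based representation contains, here is a minimal sketch that counts the occurrences of every k-mer in a single sequence. It is an illustration only and does not reproduce generation_cgr: the function names are hypothetical, the counts are stored as a flat vector indexed by a 2-bit encoding of each k-mer rather than in any particular CGR layout, the 32-bit counter type is an arbitrary choice, and skipping windows that contain non-ACGT characters is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Maps a nucleotide to a 2-bit code, or -1 for any other character (e.g. 'N').
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Hypothetical helper: returns a vector of 4^k counters, where the entry at
// index i counts the k-mer whose 2-bit encoding is i.
// Assumes 1 <= k <= 15 so the table stays a manageable size.
inline std::vector<std::uint32_t> kmer_counts(const std::string& seq, unsigned k) {
    std::vector<std::uint32_t> counts(std::size_t{1} << (2 * k), 0);
    const std::size_t mask = counts.size() - 1;  // keeps the lowest 2k bits
    std::size_t index = 0;  // packed code of the most recent valid bases
    std::size_t valid = 0;  // consecutive valid bases seen so far
    for (char c : seq) {
        const int code = base_code(c);
        if (code < 0) {  // unknown base: restart the sliding window
            valid = 0;
            index = 0;
            continue;
        }
        index = ((index << 2) | static_cast<std::size_t>(code)) & mask;
        if (++valid >= k) {
            ++counts[index];  // a full k-mer ends at this position
        }
    }
    return counts;
}
```

For the sequence ACGT with k = 2, the 2-mers AC, CG, and GT land at indices 1, 6, and 11 respectively, each with a count of 1.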
- benchmarks: Used to generate the benchmark results below
- external: External libraries and code
- libkameris: A reusable header-only library for working with the representations and distances of this project
  - distances: Distance computation for dense vectors/matrices
  - distances-sparse: Distance computation for sparse vectors/matrices
  - io: Reading and writing FASTA sequences and binary-encoded representations
  - representations: Generating representations (FCGR, Descriptors, ...)
  - utils: Other useful tools (parallel execution, timing, etc.)
- kameris-formats: Libraries to read and write the .mm-repr and .mm-dist file formats; this is a link to stephensolis/kameris-formats
- src: The code for two minimal command-line interfaces implementing CGR generation (generation_cgr) and distance matrix generation (generation_dists)
- tests: The test suite
Below, 'Mathematica' means stephensolis/modmap-generator at commit 212d0fda, and 'C++' means this repository at commit 8af2864a.
Tests were performed on an AWS c4.4xlarge (8 cores/16 threads of an Intel Xeon E5-2666 v3), using Mathematica 11.0.1 and the AVX2-release version of this software. Benchmarking code can be found in the benchmarks/ folder.
Test | C++ | Mathematica | Speedup |
---|---|---|---|
CGR | 0.704s | 36.95s | 52.5x |
Descriptors | 0.0169s | 2.253s | 133x |
Euclidean | 0.158s | 2.920s | 18.5x |
Manhattan | 0.139s | 2.871s | 20.7x |
Approximate Information Distance | 0.229s | 27.20s | 119x |
Cosine | 0.154s | 2.500s | 16.2x |
Pearson | 0.361s | 8.309s | 23.0x |
SSIM | 14.27s | 132.3s | 9.27x |
This project requires:
- a C++14-compatible compiler
- a fairly recent Boost (at least 1.61)
- CMake (at least 3.1)
- if you wish to run the full test suite: clang-format, clang-tidy, LCOV, diff, grep, and perl
The following compilers have been tested:
Compiler | Version |
---|---|
Microsoft Visual Studio | 2017 (19.1) |
Intel C++ Compiler | 2017 (17.0), 2018 (18.0) |
GCC | 5.3, 5.4, 6.2, 6.3, 6.4, 7.1 |
Clang | 3.9, 4.0, 5.0, 6.0; Apple LLVM 6.1, 7.0, 7.3, 8.0, 8.1, 9.0 |
Create a new directory, and inside the directory run cmake (<options>) <path to source code> followed by make (on a Unix platform) or nmake (on Windows).
On Windows, the default Visual Studio generators are untested and may not work. Use the NMake generator (-G"NMake Makefiles") instead.
You can set -DCMAKE_BUILD_TYPE= to one of the following possible values:
- Debug (default): build a single binary kameris-cli in debug mode (including debugging information)
- Release: build a single binary kameris-cli in release mode
- ReleaseIntel: use Intel's C++ compiler (icl/icpc) to build ten separate binaries, each optimized for a different platform:
  - generation_[cgr|dists]_sse3: optimized for SSE3
  - generation_[cgr|dists]_sse41: optimized for SSE4.1
  - generation_[cgr|dists]_avx: optimized for AVX
  - generation_[cgr|dists]_avx2: optimized for AVX2
  - generation_[cgr|dists]_avx512: optimized for AVX-512
- Coverage: used to compute line coverage for tests with the coverage target (see below)
On Windows when not using MSVC (e.g. Intel's compiler with -DCMAKE_BUILD_TYPE=ReleaseIntel), you likely need to set -DBOOST_ROOT=, -DBOOST_LIBRARYDIR=, and -DBoost_COMPILER=, for example as follows:
cmake -G"NMake Makefiles" -DBOOST_ROOT=C:\boost -DBOOST_LIBRARYDIR=C:\boost\lib64-msvc-14.1 -DBoost_COMPILER=-vc141
First run cmake as specified above, then run make <target> (on a Unix platform) or nmake <target> (on Windows), where <target> may be:
- check or check-all: runs all tests listed below except coverage
- check-tests: runs the test suite from the tests directory
- check-format: runs code style checks (this requires clang-format, diff, and perl)
- check-lint: runs static analysis checks (this requires clang-tidy, grep, and perl)
- coverage: computes line coverage for the tests in the tests directory, saving the result in LCOV format as coverage.info (this requires GCC as the compiler, lcov, and setting -DCMAKE_BUILD_TYPE=Coverage)
Note: the files in the external directory are licensed as specified in their respective header comments.
The MIT License (MIT)
Copyright (c) 2017 Stephen
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.