Use the KMC directly from code through the API
Besides using KMC via the command-line interface (CLI), it is also possible to use it directly from C++ code.
To use the API, include the kmc_runner.h header file and link the application against libkmc_core.a.
KMC depends on zlib and bz2, so these libraries must also be linked.
The simplest way to prepare all necessary files is to run:
git clone https://github.com/refresh-bio/KMC/
cd KMC
make bin/libkmc_core.a # -j<n_jobs> recommended for faster compilation
As a result, the files needed to use KMC are in the following locations:
include/kmc_runner.h
bin/libkmc_core.a
Here is a small working example of how to use KMC from C++ code:
#include "include/kmc_runner.h"
#include <iostream>

int main()
{
    try
    {
        KMC::Runner runner;

        KMC::Stage1Params stage1Params;
        stage1Params
            .SetKmerLen(31)
            .SetInputFiles({"test1.fq", "test2.fq"});
        auto stage1Result = runner.RunStage1(stage1Params);

        KMC::Stage2Params stage2Params;
        stage2Params
            .SetOutputFileName("31mers");
        auto stage2Result = runner.RunStage2(stage2Params);

        // print some stats
        std::cout << "total k-mers: " << stage2Result.nTotalKmers << "\n";
        std::cout << "total unique k-mers: " << stage2Result.nUniqueKmers << "\n";
    }
    catch (const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
}
Assuming the above code is in the example.cpp file, the following command may be used to compile it:
g++ -O3 example.cpp bin/libkmc_core.a -o example -lbz2 -lz -lpthread
As can be seen from the example above, a KMC run is divided into stages 1 and 2. Each stage has its own set of parameters (wrapped in the Stage1Params and Stage2Params types) and its own set of results (wrapped in the Stage1Results and Stage2Results types).
Everything is contained in the KMC namespace.
The execution is split into two parts to allow the API user to set some of the Stage2Params parameters based on the Stage1Results results.
For example, it is possible to determine the memory limit for the second stage based on the estimated number of unique counted k-mers.
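As a rough illustration of that idea, the helper below turns an estimated number of unique k-mers into a RAM limit in GB. The function name SuggestStage2RamGB, the bytes-per-k-mer estimate and the clamping bounds are assumptions made for this sketch. None of them is part of the KMC API, and the real memory needs depend on KMC internals.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not part of KMC): suggests a stage 2 RAM limit in GB
// from an estimated number of unique k-mers.
// Assumption: each k-mer needs ceil(2 * kmerLen / 8) bytes for a 2-bit-packed
// sequence plus counterBytes for its counter. minGB and maxGB are arbitrary
// bounds chosen for this sketch.
inline uint32_t SuggestStage2RamGB(uint64_t estimatedUniqueKmers,
                                   uint32_t kmerLen,
                                   uint32_t counterBytes = 4,
                                   uint32_t minGB = 1,
                                   uint32_t maxGB = 12)
{
    const uint64_t bytesPerKmer = (2ull * kmerLen + 7) / 8 + counterBytes;
    const uint64_t totalBytes = estimatedUniqueKmers * bytesPerKmer;
    const uint64_t bytesPerGB = 1ull << 30;

    // Round up to whole gigabytes, then clamp to [minGB, maxGB].
    const uint64_t neededGB = (totalBytes + bytesPerGB - 1) / bytesPerGB;
    return static_cast<uint32_t>(
        std::min<uint64_t>(std::max<uint64_t>(neededGB, minGB), maxGB));
}
```

The returned value could then be passed to SetMaxRamGB on the Stage2Params object before RunStage2 is called.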
Runner is the main class for k-mer counting. It is a default-constructible type. It defines the following methods:
- Stage1Results RunStage1(const Stage1Params& params) - runs stage 1 of KMC
- Stage2Results RunStage2(const Stage2Params& params) - runs stage 2 of KMC
RunStage2 must be called after RunStage1. RunStage2 may be omitted when it is not needed (for example, if one is interested only in k-mer abundance histogram estimation).
Stage1Params allows setting all the parameters needed for the first stage of KMC. All the parameters may be set and read using the appropriate setters and getters. The following setters are available:
- Stage1Params& SetInputFiles(const std::vector<std::string>& inputFiles) - sets the list of input files (all must be in the same format; allowed formats are: fasta, fastq, multi-line fasta, bam, kmc)
- Stage1Params& SetTmpPath(const std::string& tmpPath) - sets the path for intermediate (temporary) files (default: ".")
- Stage1Params& SetKmerLen(uint32_t kmerLen) - sets the k-mer length
- Stage1Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)
- Stage1Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM, in GB, that KMC is allowed to consume (default: 12 GB)
- Stage1Params& SetSignatureLen(uint32_t signatureLen) - sets the signature length (allowed range: [5;11], default: 9)
- Stage1Params& SetHomopolymerCompressed(bool homopolymerCompressed) - enables (true) or disables (false) counting of homopolymer-compressed k-mers (approximate and experimental; default: disabled)
- Stage1Params& SetInputFileType(InputFileType inputFileType) - sets the input file type; available values are: InputFileType::FASTQ, InputFileType::FASTA, InputFileType::MULTILINE_FASTA, InputFileType::BAM, InputFileType::KMC (default: InputFileType::FASTQ)
- Stage1Params& SetCanonicalKmers(bool canonicalKmers) - counts canonical k-mers (true) or not (false) (default: canonical)
- Stage1Params& SetRamOnlyMode(bool ramOnlyMode) - turns RAM-only mode on (true) or off (false) (default: off)
- Stage1Params& SetNBins(uint32_t nBins) - sets the number of intermediate files (allowed range: [64, 2000], default: 512)
- Stage1Params& SetNReaders(uint32_t nReaders) - sets the number of reader threads (Warning: only for experienced users)
- Stage1Params& SetNSplitters(uint32_t nSplitters) - sets the number of splitting threads (Warning: only for experienced users)
- Stage1Params& SetVerboseLogger(ILogger* verboseLogger) - sets the verbose logger, see the ILogger interface description (default: verbose logs are ignored)
- Stage1Params& SetPercentProgressObserver(IPercentProgressObserver* percentProgressObserver) - sets the observer of percent progress, see the IPercentProgressObserver interface description (default: percent progress is printed to std::cerr)
- Stage1Params& SetWarningsLogger(ILogger* warningsLogger) - sets the warnings logger, see the ILogger interface description (default: logs are printed to std::cerr)
- Stage1Params& SetEstimateHistogramCfg(EstimateHistogramCfg estimateHistogramCfg) - sets the configuration of k-mer abundance histogram estimation; available values are EstimateHistogramCfg::DONT_ESTIMATE, EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS and EstimateHistogramCfg::ONLY_ESTIMATE (default: EstimateHistogramCfg::DONT_ESTIMATE). The estimation of the k-mer abundance histogram is performed with our implementation of the ntCard algorithm. Detailed explanation of each setting:
  - With EstimateHistogramCfg::DONT_ESTIMATE the k-mer abundance histogram is not estimated. This is the normal mode, because additional histogram estimation may affect performance (although in preliminary experiments the impact was negligible).
  - With EstimateHistogramCfg::ONLY_ESTIMATE only histogram estimation is performed. The second stage does nothing in this case and the intermediate files are not created. It should be used when only histogram estimation is needed.
  - With EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS the histogram is estimated and the rest of the computations are done as usual. This may be used to determine some of the stage 2 parameters based on the histogram.
- Stage1Params& SetProgressObserver(IProgressObserver* progressObserver) - sets the observer of progress other than percentage, see the IProgressObserver interface description (default: progress is printed to std::cerr)
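The documented ranges for the signature length and the number of intermediate files can be checked before the values are handed to the setters. The following is a minimal sketch of such a check. The struct Stage1Numbers and the function ValidateStage1Numbers are names invented here and are not part of KMC, which may perform its own validation and report errors differently.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain struct holding values intended for the
// SetSignatureLen and SetNBins setters (defaults as documented).
struct Stage1Numbers
{
    uint32_t signatureLen = 9;
    uint32_t nBins = 512;
};

// Hypothetical helper (not part of KMC): returns one message per value
// that falls outside its documented range; empty when all values are fine.
inline std::vector<std::string> ValidateStage1Numbers(const Stage1Numbers& p)
{
    std::vector<std::string> problems;
    if (p.signatureLen < 5 || p.signatureLen > 11)
        problems.push_back("signature length must be in [5;11]");
    if (p.nBins < 64 || p.nBins > 2000)
        problems.push_back("number of intermediate files must be in [64, 2000]");
    return problems;
}
```

When the returned list is empty, the values can be forwarded to SetSignatureLen and SetNBins.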
Each parameter may be read using one of the following getters:
const std::vector<std::string>& GetInputFiles() const noexcept
const std::string& GetTmpPath() const noexcept
uint32_t GetKmerLen() const noexcept
uint32_t GetNThreads() const noexcept
uint32_t GetMaxRamGB() const noexcept
uint32_t GetSignatureLen() const noexcept
bool GetHomopolymerCompressed() const noexcept
InputFileType GetInputFileType() const noexcept
bool GetCanonicalKmers() const noexcept
bool GetRamOnlyMode() const noexcept
uint32_t GetNBins() const noexcept
uint32_t GetNReaders() const noexcept
uint32_t GetNSplitters() const noexcept
ILogger* GetVerboseLogger() const noexcept
IPercentProgressObserver* GetPercentProgressObserver() const noexcept
ILogger* GetWarningsLogger() const noexcept
EstimateHistogramCfg GetEstimateHistogramCfg() const noexcept
IProgressObserver* GetProgressObserver() const noexcept
Stage2Params allows setting all the parameters needed for the second stage of KMC. All the parameters may be set and read using the appropriate setters and getters. The following setters are available:
- Stage2Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM, in GB, that KMC is allowed to consume; KMC may in fact use more memory if that is needed to process the data. If the limit must be strict, use the SetStrictMemoryMode method
- Stage2Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)
- Stage2Params& SetStrictMemoryMode(bool strictMemoryMode) - enables (true) or disables (false) strict memory mode; if enabled, KMC will not consume more RAM than specified with the SetMaxRamGB method, but the computation may take longer (default: disabled)
- Stage2Params& SetCutoffMin(uint64_t cutoffMin) - excludes k-mers occurring fewer than cutoffMin times (default: 2)
- Stage2Params& SetCounterMax(uint64_t counterMax) - sets the maximal value of a counter (default: 255)
- Stage2Params& SetCutoffMax(uint64_t cutoffMax) - excludes k-mers occurring more than cutoffMax times (default: 1e9)
- Stage2Params& SetOutputFileName(const std::string& outputFileName) - sets the path of the output file
- Stage2Params& SetOutputFileType(OutputFileType outputFileType) - sets the format of the output; available values: OutputFileType::KMC and OutputFileType::KFF (default: KMC)
- Stage2Params& SetWithoutOutput(bool withoutOutput) - if true, no output file is produced (default: false)
- Stage2Params& SetStrictMemoryNSortingThreadsPerSorters(uint32_t strictMemoryNSortingThreadsPerSorters) - sets the number of sorting threads per sorter in strict memory mode (Warning: only for experienced users)
- Stage2Params& SetStrictMemoryNUncompactors(uint32_t strictMemoryNUncompactors) - sets the number of uncompactors in strict memory mode (Warning: only for experienced users)
- Stage2Params& SetStrictMemoryNMergers(uint32_t strictMemoryNMergers) - sets the number of mergers in strict memory mode (Warning: only for experienced users)
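To make the interplay of SetCutoffMin, SetCutoffMax and SetCounterMax more concrete, the sketch below applies the three settings to a single raw occurrence count. It is an illustration only and not KMC code: FilteredCount and FilterCount are names invented here, and the sketch assumes that counts above the counter maximum are clamped to it rather than dropped.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical result type: whether the k-mer is kept, and the stored counter.
struct FilteredCount
{
    bool kept;
    uint64_t value;
};

// Hypothetical illustration (not part of KMC) of the documented cutoffs,
// using the documented defaults (2, 1e9 and 255).
inline FilteredCount FilterCount(uint64_t rawCount,
                                 uint64_t cutoffMin = 2,
                                 uint64_t cutoffMax = 1000000000,
                                 uint64_t counterMax = 255)
{
    // k-mers occurring fewer than cutoffMin or more than cutoffMax times
    // are excluded.
    if (rawCount < cutoffMin || rawCount > cutoffMax)
        return FilteredCount{false, 0};

    // Assumption: the stored counter saturates at counterMax.
    return FilteredCount{true, std::min(rawCount, counterMax)};
}
```

Under these assumptions and the default settings, a k-mer seen once is excluded, while a k-mer seen 300 times is kept with a stored counter of 255.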
Each parameter may be read using one of the following getters:
uint32_t GetMaxRamGB() const noexcept
uint32_t GetNThreads() const noexcept
bool GetStrictMemoryMode() const noexcept
uint64_t GetCutoffMin() const noexcept
uint64_t GetCounterMax() const noexcept
uint64_t GetCutoffMax() const noexcept
const std::string& GetOutputFileName() const noexcept
OutputFileType GetOutputFileType() const noexcept
bool GetWithoutOutput() const noexcept
uint32_t GetStrictMemoryNSortingThreadsPerSorters() const noexcept
uint32_t GetStrictMemoryNUncompactors() const noexcept
uint32_t GetStrictMemoryNMergers() const noexcept