Use the KMC directly from code through the API

marekkokot edited this page Dec 7, 2021 · 13 revisions

Introduction

Besides the command-line interface (CLI), KMC can also be used directly from C++ code. To use the API, include the kmc_runner.h header file and link the application against libkmc_core.a. KMC depends on zlib and bz2, so these libraries must also be linked.

Simple setup

The simplest way to prepare all necessary files is to run:

git clone https://github.com/refresh-bio/KMC/
cd KMC
make bin/libkmc_core.a # -j<n_jobs> recommended for faster compilation

As a result, the files needed to use KMC are in the following locations:

  • include/kmc_runner.h
  • bin/libkmc_core.a

Simple working example

Here is a small working example of how to use KMC from C++ code:

#include "include/kmc_runner.h"
#include <iostream>
int main()
{
    try
    {       
        KMC::Runner runner;

        KMC::Stage1Params stage1Params;
        stage1Params
            .SetKmerLen(31)
            .SetInputFiles({"test1.fq", "test2.fq"});
        
        auto stage1Result = runner.RunStage1(stage1Params);
        
        KMC::Stage2Params stage2Params;

        stage2Params
            .SetOutputFileName("31mers");

        auto stage2Result = runner.RunStage2(stage2Params);

        //print some stats
        std::cout << "total k-mers: " << stage2Result.nTotalKmers << "\n";
        std::cout << "total unique k-mers: " << stage2Result.nUniqueKmers << "\n";
    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }
}

Assuming the above code is in the file example.cpp, the following command may be used to compile it:

g++ -O3 example.cpp bin/libkmc_core.a -o example -lbz2 -lz -lpthread

The philosophy of an API

As the simple example above shows, a KMC run is divided into stages 1 and 2. Each stage has its own parameter set (wrapped in the Stage1Params and Stage2Params types) and its own result set (wrapped in the Stage1Results and Stage2Results types). Everything is contained in the KMC namespace. The execution is split into two parts so that the API user can set some of the Stage2Params parameters based on the Stage1Results values. For example, it is possible to determine the memory limit for the second stage based on the estimated number of unique counted k-mers.
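The memory-limit idea can be sketched as a small helper. This is a hypothetical heuristic, not part of the KMC API: the name SuggestStage2RamGB, its parameters, and the cost model (2 bits per symbol plus a 4-byte counter per unique k-mer) are assumptions made only for illustration, and real stage 2 memory use may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of KMC): guess a stage 2 RAM limit in GB
// from an estimated number of unique k-mers.
// ASSUMPTION: each unique k-mer costs 2 bits per symbol (rounded up to
// whole bytes) plus a 4-byte counter. This is a rough model only.
uint32_t SuggestStage2RamGB(uint64_t estimatedUniqueKmers, uint32_t kmerLen,
                            uint32_t minGB, uint32_t maxGB)
{
    assert(minGB <= maxGB);
    const uint64_t bytesPerKmer = (2ull * kmerLen + 7) / 8 + 4;
    const uint64_t totalBytes = estimatedUniqueKmers * bytesPerKmer;
    const uint64_t bytesPerGB = 1ull << 30;
    // Round up to whole gigabytes.
    const uint64_t neededGB = (totalBytes + bytesPerGB - 1) / bytesPerGB;
    // Keep the suggestion inside the caller's bounds.
    const uint64_t bounded =
        std::min<uint64_t>(std::max<uint64_t>(neededGB, minGB), maxGB);
    return static_cast<uint32_t>(bounded);
}
```

The returned value could be passed to Stage2Params::SetMaxRamGB between the RunStage1 and RunStage2 calls, with the k-mer estimate derived from the stage 1 results.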

API documentation

Runner class

This is the main class for k-mer counting. It is a default-constructible type. It defines the following methods:

  • Stage1Results RunStage1(const Stage1Params& params) - run stage 1 of KMC
  • Stage2Results RunStage2(const Stage2Params& params) - run stage 2 of KMC

RunStage2 must be called after RunStage1. RunStage2 may be omitted when it is not needed (for example, if one is interested only in k-mer abundance histogram estimation).

Stage1Params class

This class allows setting all the parameters needed for the first stage of KMC. All the parameters may be set and read using appropriate setters and getters. There are the following setters:

  • Stage1Params& SetInputFiles(const std::vector<std::string>& inputFiles) - sets the list of input files (all must be in the same format, allowed formats are: fasta, fastq, multi-line fasta, bam, kmc)

  • Stage1Params& SetTmpPath(const std::string& tmpPath) - sets a tmp path for intermediate files (default path is ".")

  • Stage1Params& SetKmerLen(uint32_t kmerLen) - sets the k-mer length

  • Stage1Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)

  • Stage1Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM that KMC is allowed to consume in GB (default: 12 GB)

  • Stage1Params& SetSignatureLen(uint32_t signatureLen) - sets the signature length (allowed range: [5;11], default: 9)

  • Stage1Params& SetHomopolymerCompressed(bool homopolymerCompressed) - enable (true)/disable (false) homopolymer compressed k-mers counts (approximate and experimental, default: disabled)

  • Stage1Params& SetInputFileType(InputFileType inputFileType) - sets the input file type, available values are: InputFileType::FASTQ, InputFileType::FASTA, InputFileType::MULTILINE_FASTA, InputFileType::BAM, InputFileType::KMC (default: InputFileType::FASTQ)

  • Stage1Params& SetCanonicalKmers(bool canonicalKmers) - count canonical (true) k-mers or not (false) (default: canonical)

  • Stage1Params& SetRamOnlyMode(bool ramOnlyMode) - turn RAM-only mode on (true)/off (false) (default: off)

  • Stage1Params& SetNBins(uint32_t nBins) - sets the number of intermediate files (in range: [64, 2000], default: 512)

  • Stage1Params& SetNReaders(uint32_t nReaders) - sets the number of reader threads (Warning: only for experienced users)

  • Stage1Params& SetNSplitters(uint32_t nSplitters) - sets the number of splitting threads (Warning: only for experienced users)

  • Stage1Params& SetVerboseLogger(ILogger* verboseLogger) - sets the verbose logger, check the ILogger interface description (default: ignoring verbose logs)

  • Stage1Params& SetPercentProgressObserver(IPercentProgressObserver* percentProgressObserver) - sets the observer of a percent progress, check the IPercentProgressObserver interface description (default: print percent progress on std::cerr)

  • Stage1Params& SetWarningsLogger(ILogger* warningsLogger) - sets the warning logger, check the ILogger interface description (default: print logs on std::cerr)

  • Stage1Params& SetEstimateHistogramCfg(EstimateHistogramCfg estimateHistogramCfg) - sets the k-mer abundance histogram estimation mode, available values are EstimateHistogramCfg::DONT_ESTIMATE, EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS and EstimateHistogramCfg::ONLY_ESTIMATE (default: EstimateHistogramCfg::DONT_ESTIMATE). The estimation of the k-mer abundance histogram is performed with our implementation of the ntCard algorithm. Detailed explanation of each setting:

    • With EstimateHistogramCfg::DONT_ESTIMATE the k-mer abundance histogram is not estimated. This is the normal mode, because additional histogram estimation may affect performance (although in preliminary experiments the impact was negligible).
    • With EstimateHistogramCfg::ONLY_ESTIMATE only histogram estimation is performed: the intermediate files are not created and the second stage does nothing. Use it when only the histogram estimate is needed.
    • With EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS the histogram is estimated and the rest of the computation is done as usual. This may be used to determine some of the stage 2 parameters based on the histogram.
  • Stage1Params& SetProgressObserver(IProgressObserver* progressObserver) - sets the observer of a progress other than percentage, check the IProgressObserver interface description (default: print progress on std::cerr)
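Two of the setters above document allowed ranges (SetSignatureLen and SetNBins). A caller may want to check user-supplied values before handing them to Stage1Params. The sketch below is a hypothetical helper, not part of the KMC API; the name CheckStage1Ranges is made up, and how KMC itself reacts to out-of-range values is not covered here.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of KMC): report values that fall outside
// the ranges documented for SetSignatureLen ([5;11]) and SetNBins ([64, 2000]).
// Returns one message per problem; an empty vector means both values are fine.
std::vector<std::string> CheckStage1Ranges(uint32_t signatureLen, uint32_t nBins)
{
    std::vector<std::string> problems;
    if (signatureLen < 5 || signatureLen > 11)
        problems.push_back("signature length " + std::to_string(signatureLen) +
                           " is outside [5;11]");
    if (nBins < 64 || nBins > 2000)
        problems.push_back("number of bins " + std::to_string(nBins) +
                           " is outside [64, 2000]");
    return problems;
}
```

If the returned vector is empty, the values can be forwarded to SetSignatureLen and SetNBins; otherwise the messages can be shown to the user.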

Each parameter may be read using one of the following getters:

  • const std::vector<std::string>& GetInputFiles() const noexcept
  • const std::string& GetTmpPath() const noexcept
  • uint32_t GetKmerLen() const noexcept
  • uint32_t GetNThreads() const noexcept
  • uint32_t GetMaxRamGB() const noexcept
  • uint32_t GetSignatureLen() const noexcept
  • bool GetHomopolymerCompressed() const noexcept
  • InputFileType GetInputFileType() const noexcept
  • bool GetCanonicalKmers() const noexcept
  • bool GetRamOnlyMode() const noexcept
  • uint32_t GetNBins() const noexcept
  • uint32_t GetNReaders() const noexcept
  • uint32_t GetNSplitters() const noexcept
  • ILogger* GetVerboseLogger() const noexcept
  • IPercentProgressObserver* GetPercentProgressObserver() const noexcept
  • ILogger* GetWarningsLogger() const noexcept
  • EstimateHistogramCfg GetEstimateHistogramCfg() const noexcept
  • IProgressObserver* GetProgressObserver() const noexcept

Stage2Params class

This class allows setting all the parameters needed for the second stage of KMC. All the parameters may be set and read using appropriate setters and getters. There are the following setters:

  • Stage2Params& SetMaxRamGB(uint32_t maxRamGB) - sets the maximum amount of RAM that KMC is allowed to consume. KMC may still use more memory if it is needed to process the data; if the limit should be strict, use the SetStrictMemoryMode method
  • Stage2Params& SetNThreads(uint32_t nThreads) - sets the number of threads (default: std::thread::hardware_concurrency)
  • Stage2Params& SetStrictMemoryMode(bool strictMemoryMode) - enable (true)/ disable (false) strict memory mode, if enabled KMC will not consume more RAM than specified with SetMaxRamGB method, but the computation may take longer (default: disabled)
  • Stage2Params& SetCutoffMin(uint64_t cutoffMin) - exclude k-mers occurring less than cutoffMin times (default: 2)
  • Stage2Params& SetCounterMax(uint64_t counterMax) - sets maximal value of a counter (default: 255)
  • Stage2Params& SetCutoffMax(uint64_t cutoffMax) - exclude k-mers occurring more than cutoffMax times (default: 1e9)
  • Stage2Params& SetOutputFileName(const std::string& outputFileName) - sets the path of the output file
  • Stage2Params& SetOutputFileType(OutputFileType outputFileType) - sets the format of the output, available values: OutputFileType::KMC and OutputFileType::KFF (default: OutputFileType::KMC)
  • Stage2Params& SetWithoutOutput(bool withoutOutput) - do not produce (true) output file (default: false)
  • Stage2Params& SetStrictMemoryNSortingThreadsPerSorters(uint32_t strictMemoryNSortingThreadsPerSorters) - sets the number of sorting threads per sorter in strict memory mode (Warning: only for experienced users)
  • Stage2Params& SetStrictMemoryNUncompactors(uint32_t strictMemoryNUncompactors) - sets the number of uncompactors in strict memory mode (Warning: only for experienced users)
  • Stage2Params& SetStrictMemoryNMergers(uint32_t strictMemoryNMergers) - sets the number of mergers in strict memory mode (Warning: only for experienced users)
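The interplay of SetCutoffMin, SetCutoffMax and SetCounterMax can be illustrated with two small functions. They are not KMC code and the names are made up; they only model the documented behavior on a single occurrence count. The clamping in StoredCounter assumes that counts above counterMax are stored as counterMax, which is how "maximal value of a counter" is read here.

```cpp
#include <algorithm>
#include <cstdint>

// Illustration only (not KMC code): a k-mer is kept when it occurs at least
// cutoffMin times and at most cutoffMax times.
bool PassesCutoffs(uint64_t occurrences, uint64_t cutoffMin, uint64_t cutoffMax)
{
    return occurrences >= cutoffMin && occurrences <= cutoffMax;
}

// Illustration only (not KMC code).
// ASSUMPTION: a count larger than counterMax is stored as counterMax.
uint64_t StoredCounter(uint64_t occurrences, uint64_t counterMax)
{
    return std::min(occurrences, counterMax);
}
```

With the documented defaults (cutoffMin 2, cutoffMax 1e9, counterMax 255), a k-mer seen once would be dropped, and under the stated assumption a k-mer seen 300 times would be kept with a stored count of 255.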

Each parameter may be read using one of the following getters:

  • uint32_t GetMaxRamGB() const noexcept
  • uint32_t GetNThreads() const noexcept
  • bool GetStrictMemoryMode() const noexcept
  • uint64_t GetCutoffMin() const noexcept
  • uint64_t GetCounterMax() const noexcept
  • uint64_t GetCutoffMax() const noexcept
  • const std::string& GetOutputFileName() const noexcept
  • OutputFileType GetOutputFileType() const noexcept
  • bool GetWithoutOutput() const noexcept
  • uint32_t GetStrictMemoryNSortingThreadsPerSorters() const noexcept
  • uint32_t GetStrictMemoryNUncompactors() const noexcept
  • uint32_t GetStrictMemoryNMergers() const noexcept

Stage1Results struct

This structure stores the results of the stage 1 run. It contains the following values:

  • double time - time spent by KMC to execute stage 1
  • uint64_t nSeqences - total number of input sequences (usually reads)
  • bool wasSmallKOptUsed - true if small k optimization was used
  • uint64_t nTotalSuperKmers - total number of super-k-mers
  • uint64_t tmpSize - total amount of disk space used by KMC for intermediate files
  • std::vector<uint64_t> estimatedHistogram - estimated histogram, non-empty only if EstimateHistogramCfg::ONLY_ESTIMATE or EstimateHistogramCfg::ESTIMATE_AND_COUNT_KMERS was set with the SetEstimateHistogramCfg method. The i-th element of this vector stores the estimated number of k-mers occurring i times.
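As a sketch of how estimatedHistogram might be consumed, the two helpers below work on a plain std::vector<uint64_t> with the layout described above (element i holds the number of k-mers occurring i times). Both are hypothetical and not part of the KMC API. FirstValley applies a common heuristic, not something KMC prescribes: it walks down the initial low-abundance slope and returns the first index where the histogram stops decreasing, as a candidate for a minimum cutoff.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of KMC): estimated number of distinct k-mers
// occurring at least minCount times, i.e. the sum of hist[i] for i >= minCount.
uint64_t EstimatedKmersAtLeast(const std::vector<uint64_t>& hist, uint64_t minCount)
{
    uint64_t sum = 0;
    for (std::size_t i = static_cast<std::size_t>(minCount); i < hist.size(); ++i)
        sum += hist[i];
    return sum;
}

// Hypothetical helper (not part of KMC): index of the first local minimum of
// the histogram, starting at index 1. Returns fallback when the histogram is
// too short or never stops decreasing.
uint64_t FirstValley(const std::vector<uint64_t>& hist, uint64_t fallback)
{
    if (hist.size() < 3)
        return fallback;
    std::size_t i = 1;
    while (i + 1 < hist.size() && hist[i + 1] < hist[i])
        ++i;
    if (i + 1 == hist.size())
        return fallback; // strictly decreasing to the end: no valley found
    return i;
}
```

A value returned by FirstValley could serve as an argument to Stage2Params::SetCutoffMin, and EstimatedKmersAtLeast gives a rough idea of how many distinct k-mers would remain after applying that cutoff.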