Here are the implementations of "FMtree: A fast locating algorithm of FM-indexes for genomic data". This project consists of three locating algorithms: FMtree, Original_s and Original_v. We also provide a program named "preprocess" to help users to preprocess the input text (genome).
For FMtree, Original_s and Original_v:
-
For each method, first make it by giving the command 'make' to generate binary.
-
After that, build the index for the input text. For example, consider a text "human.fasta", its index consists of "human.fasta.index", "human.fasta.index.bwt", "human.fasta.index.sa" and "human.fasta.index.occ".
Please note that the input text can only include the characters which belong to {a, c, g, t, A, C, G, T}
. -
When searching, the pattern must be saved in "patterns.txt".
Like the text, patterns cannot includes any character which does not belong to {a, c, g, t, A, C, G, T}
. If some patterns consist of such characters, the results of FMtree would be incorrect. Users can randomly generate patterns using our programs (FMtree/Original_v/Original_s), or generate by themselves. -
Example: If users want to run FMtree, they should first type "./FMtree". Then FMtree will report the following information:
Users can input 1, 2 and 3 to build index, search patterns and generate random patterns, respectively.
For the program named "preprocess":
-
This program is used to preprocess the text (genome). By utilizing this program, any character which does not belong {a, c, g, t, A, C, G, T} is randomly converted to one of {a, c, g, t, A, C, G, T}. The output text will only consists of the sequence of genome. Other information, like chrome names and chrome length will be removed. Note that the input text (genome) must be formated using .fa or .fasta format.
-
Usage: ./preprocess --index input_text (note that the output text will be saved in input_text.not_N).
-
Example: ./preprocess --index human.fasta (the output text is saved in human.fasta.not_N).
Example data:
-
Mouse genome including 2.73 billion characters can be found in: ftp://ftp.ensembl.org/pub/release-89/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz. Please note that before building the index for mouse genome, users should preprocess it using the program named "preprocess". By utilizing this program, any character which does not belong {a, c, g, t, A, C, G, T} is randomly converted to one of {a, c, g, t, A, C, G, T}.
-
Human genome including 3.16 billion characters can be found in: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz. Please note that before building the index for human genome, users should preprocess it using the program named "preprocess". By utilizing this program, any character which does not belong {a, c, g, t, A, C, G, T} is randomly converted to one of {a, c, g, t, A, C, G, T}.
-
Dna.200MB consists of 209.72 million characters from Pizza&Chili corpus can be found in: http://pizzachili.dcc.uchile.cl/texts/dna/dna.200MB.gz. Note that since this file is not formated by fasta format, users should manually convert the characters which not belong to {a, c, g, t, A, C, G, T}.
To use the FMtree library, you should first include "FMtree/bwt.h" in your project. After that, compile it with " -O3 -mpopcnt" options and link it with "FMtree/saca-k.cpp" and "FMtree/bwt.cpp".
There are several functions you can call when using FMtree:
- build: Create FM-index with sampling distance
compress_sa
from (*input_refer
)[0..text_length
-1]. The index will be saved in "filename
.index", "filename
.index.bwt", "filename
.index.sa" and "filename
.index.occ":
unsigned int indenpendent_creadte_index(unsigned int text_length, char** input_refer, unsigned int compress_sa, char* filename);
- load: Load the index from FM-index with prefix
fileName
:
unsigned int load_index(char* filename_prefix);
- count (when the sampling distance is at least 4): Get the result of count query of
pattern
[0..length
-1]. The return value is the number of occurrences ofpattern
[0..length
-1]. Ifpattern
[0..length
-1] is exist, writesp
andep
such that [sp
,ep
) is the SA range ofpattern
[0..length
-1]. Besides, ifpattern
[0..length
-1] is exist, this function also writessp1
andep1
such that [sp1
,ep1
) is the SA range ofpattern
[1..length
-1].
inline unsigned int count(char* pattern, unsigned int length, unsigned int* sp, unsigned int* ep, unsigned int* sp1, unsigned int* ep1)
- locate (when the sampling distance is at least 4): Get the result of locate query of
pattern
[0..length_read
-1]. [sp
,ep
) is the SA range ofpattern
[0..length
-1], and [sp1
,ep1
) is the SA range ofpattern
[1..length
-1]. The obtained positions ofpattern
[0..length_read
-1] will be written intolocates
, and the number of obtained positions will be written intooccurrences
. Note thatlocates
should be allocated and freed by users.
unsigned int locate(char* pattern, unsigned int sp, unsigned int ep,
unsigned int sp1, unsigned int ep1, unsigned int* locates, unsigned int length_read, unsigned int* occurrences);
- count (when the sampling distance is not more than 3): Get the result of count query of
pattern
[0..length
-1]. The return value is the number of occurrences ofpattern
[0..length
-1]. Ifpattern
[0..length
-1] is exist, writesp
andep
such that [sp
,ep
) is the SA range ofpattern
[0..length
-1].
inline unsigned int count_less_than_4(char* pattern, unsigned int length, unsigned int* sp, unsigned int* ep)
- locate (when the sampling distance is not more than 3): Get the result of locate query of
pattern
[0..length_read
-1]. [sp
,ep
) is the SA range ofpattern
[0..length
-1], and [sp1
,ep1
) is the SA range ofpattern
[1..length
-1]. The obtained positions ofpattern
[0..length_read
-1] will be written intolocates
, and the number of obtained positions will be written intooccurrences
. Note thatlocates
should be allocated and freed by users.
unsigned int locate_less_than_4(char* pattern, unsigned int sp, unsigned int ep, unsigned int sp1, unsigned int ep1, unsigned int* locates, unsigned int length_read, unsigned int* occurrences);
A complete example can be found in "FMtree/main.cpp".
-
We adopt the SACA-K algorithm [1] to build the suffix array, and build BWT from suffix array. As such when building the index, the memory requirement of FMtree, Original_s and Original_v is about 5 times larger than that of the input text.
-
Please note that FMtree, Original_s and Original_v do not share a same index. For each method, users should build its own index.
[1] Nong G. Practical linear-time O (1)-workspace suffix sorting for constant alphabets[J]. ACM Transactions on Information Systems (TOIS), 2013, 31(3): 15.