Converting to the fmlrc RLE BWT format
The Run-Length Encoded (RLE) format primarily reduces the disk space required to store the MSBWT. As a side effect, it also reduces computation, especially for high-coverage genomic datasets.
Many tools for constructing the MSBWT are not part of this package. As of fmlrc v1.0.0, we provide a pre-processing program, "fmlrc-convert", that takes an MSBWT string as input and saves it in the RLE format we use. For example, the MSBWT of the string "ACAT$" is "T$CAA", so the following command converts that string to the RLE format and stores it on disk:
echo -e "T\$CAA" | fmlrc-convert /path/to/output/comp_msbwt.npy
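To see where "T$CAA" comes from, the BWT of a single short string can be reproduced by listing every rotation, sorting them, and reading off the last character of each. The following is a naive sketch for illustration only, not how fmlrc or ropebwt2 construct the MSBWT. The variable names are ours, and it assumes byte-wise sorting (`LC_ALL=C`) so that "$" orders before the letters.

```shell
# Naive BWT of a single string: sort all rotations, keep the last column.
# Illustration only; unsuitable for real read sets.
input='ACAT$'

bwt=$(
  printf '%s\n' "$input" |
    # Emit every rotation of the input, one per line.
    awk '{ n = length($0); for (i = 1; i <= n; i++) print substr($0, i) substr($0, 1, i - 1) }' |
    # Byte-wise sort so that "$" < "A" < "C" < "T".
    LC_ALL=C sort |
    # Concatenate the final character of each sorted rotation.
    awk '{ printf "%s", substr($0, length($0), 1) } END { print "" }'
)

echo "$bwt"   # prints T$CAA
```

The sorted rotations are "$ACAT", "ACAT$", "AT$AC", "CAT$A", and "T$ACA", whose last characters spell "T$CAA".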
We note that this tool does not perform any sanity checks on its input; it only performs the compression and conversion. Giving the conversion tool an MSBWT that does not follow our definition of an MSBWT may have unexpected consequences in downstream queries.
We recommend using ropebwt2 to build the MSBWT:
- Tool - https://github.com/lh3/ropebwt2
- Paper - Fast construction of FM-index for long sequence reads.
Due to differences in the BWT encoding, the data going into and coming out of ropebwt2 must be manipulated. The following commands first create a sorted, plain-text read file, then run that data through ropebwt2 and fmlrc-convert to create the RLE-BWT that fmlrc expects.
gunzip -c reads.fq.gz | awk 'NR % 4 == 2' | sort | gzip > reads.sorted.txt.gz
gunzip -c reads.sorted.txt.gz | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc-convert /path/to/output/comp_msbwt.npy
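The read-extraction step can be tried on a tiny, made-up FASTQ record set without any of the BWT tools installed. This sketch assumes the usual four-line FASTQ layout (header, sequence, "+", qualities) with no wrapped sequences; the read names and sequences below are invented for illustration.

```shell
# Two hypothetical FASTQ records, deliberately out of sorted order.
fastq='@read1
TTGA
+
IIII
@read2
ACGT
+
IIII'

# Keep only line 2 of every 4-line record (the sequence), then sort.
seqs=$(printf '%s\n' "$fastq" | awk 'NR % 4 == 2' | sort)

printf '%s\n' "$seqs"
```

This prints "ACGT" followed by "TTGA", which is the kind of sorted, one-read-per-line text that the second command feeds into ropebwt2.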
If an intermediate file is unnecessary, then these two lines can be combined into one command:
gunzip -c reads.fq.gz | awk 'NR % 4 == 2' | sort | tr NT TN | ropebwt2 -LR | tr NT TN | fmlrc-convert /path/to/output/comp_msbwt.npy
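The two `tr NT TN` calls in the pipelines above swap the symbols N and T before ropebwt2 runs and swap them back afterwards. Our understanding, stated here as an assumption rather than something this page spells out, is that this compensates for ropebwt2 ordering N after T while the format expected here orders the symbols alphabetically. The effect of the swap itself is easy to check in isolation:

```shell
# One pass of the swap: every N becomes T and every T becomes N.
swapped=$(printf 'ACGTN\n' | tr NT TN)
echo "$swapped"    # prints ACGNT

# A second pass undoes the first, restoring the original symbols.
restored=$(printf '%s\n' "$swapped" | tr NT TN)
echo "$restored"   # prints ACGTN
```

Because the swap is its own inverse, applying it on both sides of ropebwt2 changes only the order in which N and T are sorted, not the symbols that end up in the output.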
For more details on what these commands do, or on the RLE-BWT format specification, please refer to "Converting to Run-Length Encoded (RLE) format" from the msbwt Python package.