Rayhan Shikder edited this page Mar 27, 2019 · 27 revisions

Installation

OpenMP Installation

The gcc compilers usually come with built-in support for OpenMP. On Mac OS, however, you may need to change some configuration to get OpenMP working.

Mac OS X

If running the make command for this tool on your Mac fails with fatal error: 'omp.h' file not found, the installed version of clang does not support OpenMP. If you do not see any error, you are good to go. If you get the error, you may have to reinstall gcc using brew reinstall gcc
Then link to the installed path by modifying your $PATH variable, or force creation of the symbolic links (if asked) while reinstalling gcc. To do so, run brew link --overwrite gcc

Now, to use OpenMP, you have to specify a gcc compiler that supports it (e.g., gcc, gcc-7, or gcc-8). In the makefile inside the Tool directory, change the CC value (the name of the compiler) to the name of that compiler.
Reference: Installing OpenMP on Mac OS X 10.11

Linux, Windows

The gcc compilers on Linux come with built-in support for OpenMP, so you do not need to do anything there.
On Windows, you may need to install a suitable version of gcc to use OpenMP. See the following reference: Getting started with openMP. install on windows
To use OpenMP in Visual Studio, see OpenMP in Visual C++

Datasets

The Data directory contains the datasets used to evaluate our tool. Every data file consists of three sequences. The first two are the sequences to compare, and the third lists the unique characters of the sequences (ATCG in this case). The first three lines of every data file give the lengths of these three sequences, and the next three lines give the sequence strings themselves.
Two different datasets were used for this purpose.

  1. The first is a simulated dataset (inside the simulated directory), downloaded from Here (the GC content was changed for each downloaded sequence). The files are described below.
| File name | Length of 1st sequence (bp) | Length of 2nd sequence (bp) |
|---|---|---|
| 1.txt | 128 | 127 |
| 2.txt | 256 | 255 |
| 3.txt | 512 | 511 |
| 4.txt | 1024 | 1023 |
| 5.txt | 2048 | 2047 |
| 6.txt | 4096 | 4095 |
| 7.txt | 8192 | 8191 |
| 8.txt | 16384 | 16383 |
| 9.txt | 32768 | 32767 |
| 10.txt | 65536 | 65535 |
| 11.txt | 131072 | 131071 |
  2. The second dataset (inside the real_data directory) consists of real DNA sequences of viruses and eukaryotes (source: NCBI). The first 8 files contain virus genomes, whereas the last 2 files contain entire chromosomes of two eukaryotes.
| File name | Sequence 1 | Sequence 2 |
|---|---|---|
| 1.txt | Potato spindle tuber viroid (360 bp) | Tomato apical stunt viroid (359 bp) |
| 2.txt | Rottboellia yellow mottle virus (4194 bp) | Carrot mottle virus (4193 bp) |
| 3.txt | Rehmannia mosaic virus (6395 bp) | Tobacco mosaic virus (6395 bp) |
| 4.txt | Potato virus A (9588 bp) | Soybean mosaic virus N (9585 bp) |
| 5.txt | Chicken megrivirus (9566 bp) | Chicken picornavirus 4 (9564 bp) |
| 6.txt | Microbacterium phage VitulaEligans (17534 bp) | Rhizoctonia cerealis alphaendornavirus 1 (17486 bp) |
| 7.txt | Lucheng Rn rat coronavirus (28763 bp) | Helicobacter phage Pt1918U (28760 bp) |
| 8.txt | Lactococcus phage ASCC368 (32276 bp) | Uncultured Mediterranean phage uvMED (32133 bp) |
| 9.txt | Athene cunicularia (chromosome 25, 1505370 bp) | Bombus terrestris (chromosome LG B18, 3078061 bp) |
| 10.txt | Athene cunicularia (chromosome 25, 1505370 bp) | Bombus terrestris (chromosome LG B01, 16199981 bp) |
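
The data file layout described at the start of this section (three lengths, then three strings) can be sanity-checked before running the tool. The following is a minimal sketch, not part of the tool itself; the names LcsInput and parse_lcs_input are made up for illustration, and the check that every base belongs to the listed characters is an assumption about what counts as a valid file.

```python
from typing import NamedTuple


class LcsInput(NamedTuple):
    """Hypothetical container for the three strings of one data file."""
    seq1: str
    seq2: str
    alphabet: str


def parse_lcs_input(text: str) -> LcsInput:
    """Parse the six-line layout: three lengths followed by three strings."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 6:
        raise ValueError(f"expected 6 non-empty lines, found {len(lines)}")

    declared_lengths = [int(value) for value in lines[:3]]
    strings = lines[3:]

    # Each declared length must match the string it describes.
    for declared, string in zip(declared_lengths, strings):
        if declared != len(string):
            raise ValueError(
                f"declared length {declared}, but the string has {len(string)} characters"
            )

    seq1, seq2, alphabet = strings

    # Assumption: both sequences may only use the listed unique characters.
    stray = (set(seq1) | set(seq2)) - set(alphabet)
    if stray:
        raise ValueError(f"characters outside the alphabet: {sorted(stray)}")

    return LcsInput(seq1, seq2, alphabet)
```

Reading a file's contents and passing them to parse_lcs_input either returns the three strings or raises a ValueError that names the mismatch.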

How to Run?

First, clone or download this repository. Then go to the Tool directory and run the make command from Terminal/Command Prompt. This creates an executable file named find_lcs.

Prepare the Input

First, get the sequence files to compare (e.g., in FASTA format) from a convenient source (e.g., NCBI, Simulated Data from UCR, etc.). Make sure the files contain only nucleotide bases (A, T, C, or G). You may therefore need to remove the description lines (lines starting with a >) and any other characters (newline characters, whitespace, etc.) from each file.
After collecting the sequence files and removing the unnecessary characters from them, create the input text file in the following manner.

  1. Put the lengths of the two sequences (as integer values) on the first two lines.
  2. Then put the number of unique characters (4) on the third line.
  3. After that, put the sequence strings on three consecutive lines. The first two lines are for the sequences to compare, and the third line is for the unique characters (in this case ATCG).

Following is an example of a small input file (dummy_input.txt) where the first sequence consists of 16 base pairs, and the second sequence consists of 15 base pairs.

16
15
4
ATATTTCCAAGGACCC
ATTTCCCCCAAGGCA
ATCG
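
The preparation steps above can be scripted. The sketch below is one possible way to do it and is not shipped with the tool: clean_fasta and build_lcs_input are hypothetical names, and converting bases to upper case is an assumption made here because some FASTA files use lower-case letters.

```python
def clean_fasta(fasta_text: str) -> str:
    """Drop description lines (starting with '>') and all whitespace."""
    kept = [
        line
        for line in fasta_text.splitlines()
        if not line.lstrip().startswith(">")
    ]
    # "".join(line.split()) removes spaces, tabs, and stray carriage returns.
    sequence = "".join("".join(line.split()) for line in kept)
    # Assumption: upper-case the bases so they match the ATCG alphabet line.
    return sequence.upper()


def build_lcs_input(seq1: str, seq2: str, alphabet: str = "ATCG") -> str:
    """Return the six-line input text: three lengths, then three strings."""
    for name, seq in (("first", seq1), ("second", seq2)):
        stray = set(seq) - set(alphabet)
        if stray:
            raise ValueError(f"{name} sequence contains {sorted(stray)}")
    lines = [
        str(len(seq1)),
        str(len(seq2)),
        str(len(alphabet)),
        seq1,
        seq2,
        alphabet,
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file gives an input in the same layout as dummy_input.txt. Sequences that contain other symbols (for example N) are rejected rather than silently altered, so you can decide how to handle them.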

Execution

To get the longest common subsequence (LCS) of two sequences, find the path to your desired data file. Then run the following command:
./find_lcs dummy_input.txt > output.txt
This uses dummy_input.txt as the input and writes the output to the output.txt file.
N.B. This will use the maximum number of threads available on your machine.
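
For a small input such as dummy_input.txt, the LCS length reported by find_lcs can be cross-checked against a textbook serial dynamic-programming implementation. The sketch below is such a reference and is not the tool's parallel algorithm; the function name lcs_length is made up for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic row-by-row dynamic programming that keeps only two rows,
    so memory grows with len(b) while time grows with len(a) * len(b).
    """
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                # Extend the best subsequence that ends before both characters.
                current.append(previous[j - 1] + 1)
            else:
                # Skip one character from either a or b, whichever is better.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # The two sequences from dummy_input.txt.
    print(lcs_length("ATATTTCCAAGGACCC", "ATTTCCCCCAAGGCA"))
```

Because the running time is proportional to the product of the two lengths, this check is only practical for short sequences, not for the chromosome-sized files.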

Understanding the output

After the execution completes, the output can be found in the output.txt file. The first part of the output summarizes the input sequences. The second part reports the number of threads, the LCS length, the percentage of the first sequence that matches the second, and the total time taken by the program (in seconds). The following is the output produced with the ../data/simulated/11.txt file as input and 4 threads for parallelization.

Your input file: ../data/simulated/11.txt
Length of sequence 1: 131072 bp
Length of sequence 2: 131071 bp

######## Results ########
Number of threads used: 4
Length of the LCS is: 127963
97.63% of the first sequence matches with second one
Total time taken: 112.497263 seconds
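
If you collect results from many runs, the report can be read programmatically. The sketch below assumes the output always uses exactly the wording shown in the sample above; parse_lcs_output and the dictionary keys are hypothetical names. In the sample, the reported percentage agrees with the LCS length divided by the length of the first sequence (127963 / 131072, about 97.63%).

```python
import re

# Field name -> (pattern taken from the sample output, converter).
FIELDS = {
    "length_1": (r"Length of sequence 1:\s*(\d+)", int),
    "length_2": (r"Length of sequence 2:\s*(\d+)", int),
    "threads": (r"Number of threads used:\s*(\d+)", int),
    "lcs_length": (r"Length of the LCS is:\s*(\d+)", int),
    "match_percent": (r"(\d+(?:\.\d+)?)% of the first sequence", float),
    "seconds": (r"Total time taken:\s*(\d+(?:\.\d+)?)", float),
}


def parse_lcs_output(text: str) -> dict:
    """Extract the numeric fields from one find_lcs report."""
    result = {}
    for key, (pattern, convert) in FIELDS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find {key} in the output")
        result[key] = convert(match.group(1))
    return result


def match_percent(lcs_length: int, first_length: int) -> float:
    """Percentage of the first sequence covered by the LCS."""
    return 100.0 * lcs_length / first_length
```

Calling parse_lcs_output on the contents of output.txt returns a dictionary such as {"threads": 4, "lcs_length": 127963, ...}, which is convenient for tabulating timings across thread counts.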

N.B. As the sequence lengths and the number of threads increase, the timings of the parallel program improve significantly.
A video tutorial on how to use this tool can be found here: Video Tutorial