Home

Installation

OpenMP Installation

The gcc compilers usually come with built-in support for OpenMP. However, on Mac OS you may need to change some configuration to get OpenMP working.

Mac OS X

If running the make command of this tool on your Mac fails with fatal error: 'omp.h' file not found, the version of clang on your machine does not support OpenMP. If you do not see any error, you are good to go. If you do get the error, you may have to reinstall gcc using brew reinstall gcc
Then link to the installed path by modifying your $PATH variable. Alternatively, you can force the creation of the symbolic links (if asked) while reinstalling gcc. To do so, run brew link --overwrite gcc

Now, to use OpenMP, you have to specify a gcc compiler that supports OpenMP (e.g., gcc, gcc-7, or gcc-8). In the makefile inside the Tool directory, change the CC value (the name of the compiler) to the name of that compiler.
Reference: Installing OpenMP on Mac OS X 10.11

Linux, Windows

The gcc compilers on Linux come with built-in support for OpenMP, so you do not need to do anything there.
On Windows, you may need to install a suitable version of gcc to use OpenMP. See the following reference: Getting started with openMP. install on windows
To use OpenMP in Visual Studio, see: OpenMP in Visual C++

Datasets

The Data directory contains the datasets used to evaluate our tool. Every data file consists of three sequences. The first two are the sequences to compare, and the third lists the unique characters that can appear in a sequence (ATCG in this case). The first three lines of every data file give the lengths of these three sequences, and the next three lines give the sequence strings themselves.
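
The six-line layout described above can be read back with a few lines of Python. This is only a minimal sketch for checking a data file, not part of the tool: the names LcsInput and parse_input are hypothetical, and the check that each declared length matches its string is an assumption about what a well-formed file should satisfy.

```python
from typing import NamedTuple


class LcsInput(NamedTuple):
    """Hypothetical container for the three strings in a data file."""
    seq1: str
    seq2: str
    alphabet: str


def parse_input(text: str) -> LcsInput:
    """Parse three length lines followed by three string lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 6:
        raise ValueError("expected 3 length lines followed by 3 string lines")
    lengths = [int(value) for value in lines[:3]]
    strings = lines[3:6]
    # Assumption: every declared length must equal the length of its string.
    for declared, string in zip(lengths, strings):
        if declared != len(string):
            raise ValueError(
                f"declared length {declared} does not match actual length {len(string)}"
            )
    return LcsInput(*strings)


if __name__ == "__main__":
    # Tiny made-up file content, used only to exercise the parser.
    example = "3\n2\n4\nATC\nAT\nATCG\n"
    print(parse_input(example))
```

Running a check like this before a long job may catch a wrong length line early.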
Two different datasets were used for this purpose.

The first one is a simulated dataset (inside the simulated directory) downloaded from Here (the GC content was changed for each sequence downloaded). The dataset descriptions are given below.

File name	Length of 1st sequence (bp)	Length of 2nd sequence (bp)
1.txt	128	127
2.txt	256	255
3.txt	512	511
4.txt	1024	1023
5.txt	2048	2047
6.txt	4096	4095
7.txt	8192	8191
8.txt	16384	16383
9.txt	32768	32767
10.txt	65536	65535
11.txt	131072	131071

The second dataset (inside the real_data directory) consists of real DNA sequences of viruses and eukaryotes (source: NCBI). The first 8 files contain virus genomes, whereas the last 2 files contain entire chromosomes of two eukaryotes.

File name	Sequence 1	Sequence 2
1.txt	Potato spindle tuber viroid (360 bp)	Tomato apical stunt viroid (359 bp)
2.txt	Rottboellia yellow mottle virus (4194 bp)	Carrot mottle virus (4193 bp)
3.txt	Rehmannia mosaic virus (6395 bp)	Tobacco mosaic virus (6395 bp)
4.txt	Potato virus A (9588 bp)	Soybean mosaic virus N (9585 bp)
5.txt	Chicken megrivirus (9566 bp)	Chicken picornavirus 4 (9564 bp)
6.txt	Microbacterium phage VitulaEligans (17534 bp)	Rhizoctonia cerealis alphaendornavirus 1 (17486 bp)
7.txt	Lucheng Rn rat coronavirus (28763 bp)	Helicobacter phage Pt1918U (28760 bp)
8.txt	Lactococcus phage ASCC368 (32276 bp)	Uncultured Mediterranean phage uvMED (32133 bp)
9.txt	Athene cunicularia (chromosome 25, 1505370 bp)	Bombus terrestris (chromosome LG B18, 3078061 bp)
10.txt	Athene cunicularia (chromosome 25, 1505370 bp)	Bombus terrestris (chromosome LG B01, 16199981 bp)

How to Run?

First, clone or download this repository. Then go to the Tool directory and run the make command from Terminal/Command Prompt. This creates an executable file named find_lcs.

Prepare the Input

First, get the sequence files to compare (e.g., in FASTA format) from a convenient source (e.g., NCBI, Simulated Data from UCR, etc.). Make sure the files contain only nucleotide bases (A, T, C, or G). You may therefore need to remove the description lines (lines starting with a >) and any other characters (newline characters, whitespace, etc.) from the files.
After collecting the sequence files and removing the unnecessary characters from them, create the input text file as follows.

Put the lengths of the two sequences (as integer values) on the first two lines.
Then put the number of unique characters (4) on the third line.
After that, put the sequence strings on three consecutive lines. The first two lines are for the sequences to compare, and the third line is for the unique characters (in this case ATCG).

The following is an example of a small input file (dummy_input.txt) in which the first sequence consists of 16 base pairs and the second sequence consists of 15 base pairs.

16
15
4
ATATTTCCAAGGACCC
ATTTCCCCCAAGGCA
ATCG
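
The preparation steps above can be automated. The following is a minimal Python sketch, not part of the tool: the function names fasta_to_sequence and build_input are hypothetical, converting bases to upper case is an assumption (the steps above only say the file must contain A, T, C, or G), and a FASTA file with several records would be concatenated into one sequence.

```python
def fasta_to_sequence(fasta_text: str) -> str:
    """Drop '>' description lines and all whitespace from FASTA text."""
    parts = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and description lines
        # Assumption: lower-case bases are equivalent to upper-case ones.
        parts.append("".join(line.split()).upper())
    sequence = "".join(parts)
    invalid = set(sequence) - set("ATCG")
    if invalid:
        raise ValueError(f"unexpected characters in sequence: {sorted(invalid)}")
    return sequence


def build_input(seq1: str, seq2: str, alphabet: str = "ATCG") -> str:
    """Return the six-line input text: three lengths, then three strings."""
    lines = [str(len(seq1)), str(len(seq2)), str(len(alphabet)), seq1, seq2, alphabet]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up FASTA records, used only for illustration.
    first = fasta_to_sequence(">first demo record\nATATTTCC\nAAGGACCC\n")
    second = fasta_to_sequence(">second demo record\nATTTCCCCCAAGGCA\n")
    print(build_input(first, second), end="")
```

Writing the returned text to a file gives an input in the layout the tool expects. Sequences containing ambiguity codes such as N are rejected by this sketch rather than silently altered.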

Execution

Now, to get the longest common subsequence (LCS) of two sequences, find the path to your desired data file. Then run the following command:
./find_lcs dummy_input.txt > output.txt
This uses the dummy_input.txt file as the input and writes the output to the output.txt file.
N.B. This uses the maximum number of threads available on your machine.
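
To sanity-check the reported LCS length on a small input, a textbook dynamic-programming routine can serve as a reference. This sketch is not the tool's parallel algorithm; it is the standard quadratic-time recurrence, the function name lcs_length is hypothetical, and it is only practical for short sequences.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard O(len(a) * len(b)) dynamic program that keeps only two rows.
    """
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                # Extend the best subsequence that ends before both characters.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise skip one character from either sequence.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))  # prints 4
```

Comparing this value with the length reported by find_lcs on the same small input is a quick way to confirm that the input file was prepared correctly.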

Understanding the output

After execution completes, the output can be found in the output.txt file. The output first gives a summary of the input sequences, and then reports the number of threads, the LCS length, the percentage of match between the two sequences, and the total time taken by the program (in seconds). The following is the output produced with the ../data/simulated/11.txt file as input and 4 threads for parallelization.

Your input file: ../data/simulated/11.txt
Length of sequence 1: 131072 bp
Length of sequence 2: 131071 bp

######## Results ########
Number of threads used: 4
Length of the LCS is: 127963
97.63% of the first sequence matches with second one
Total time taken: 112.497263 seconds
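
The percentage in the sample output is consistent with dividing the LCS length by the length of the first sequence. The sketch below reproduces that arithmetic. The formula is inferred from the sample numbers rather than taken from the tool's source, and the function name match_percentage is hypothetical.

```python
def match_percentage(lcs_len: int, first_len: int) -> float:
    """Percentage of the first sequence covered by the LCS (assumed formula)."""
    if first_len <= 0:
        raise ValueError("the first sequence must not be empty")
    return 100.0 * lcs_len / first_len


if __name__ == "__main__":
    # Values taken from the sample output above.
    print(f"{match_percentage(127963, 131072):.2f}%")  # prints 97.63%
```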

N.B. The timings of the parallel program improve significantly as the sequence lengths and the number of threads increase.
A video tutorial on how to use this tool can be found here: Video Tutorial

If you have any questions, please contact the author by email at shikderr@myumanitoba.ca
