Skip to content

Semester project for Computational Genomics: Sequences, Fall 2019

License

Notifications You must be signed in to change notification settings

dr-irani/MinHash-Genomics-Fingerprinting

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CS447 Final Project:

Repository Organization

cs447-final-project/
  |- README.md
  |- minhash_combined.py --> source code for all MinHash implementations/modifications
  |- edit_distance_DP.py --> Needleman-Wunsch implementation (Provided in lecture)
  |- generate_data.py --> For generating synthetic reads and inserting edits to given sequence
  |- make_synth_data.py --> Script that calls generate_data.py and generates random samples used in our analysis
  |- process_fastq.py --> For extracting only the sequences from the PACBIO .fastq file
  |- profiler.py --> code for running time benchmarks for each MinHash implementation
  |- run_DP.py --> Runs edit_distance_DP.py for the all sequence pairs
  |- run_experiment.py --> Calculates estimated edit distance using the computed Jaccard similarities
  |- tune_parameters.py --> Iterates through combinations of k and L pairs to compute Jaccard similarity
  |- data/
     |- README.md
     |- ...
  |- notebooks/
     |- ...
  |- output/
     |- ...
  |- profile_out/
     |- ...
  |- synth_data/
     |- ...
  |- synth_out/
     |- ...
  |- tuning/
     |- ecoli/
        |- ...
     |- synth/
        |- ...
     |- ...

Source for Needleman-Wunsch Implementation: Jupyter Notebook Link

MinHash Genomic Fingerprinting to Estimate Edit Distance

Data Generation

Example commands of generating synthetic reads can be found in /data/READMEmd.

MinHash

All of our implementations for MinHash are contained in minhash_combined.py The following parameters are used as arguments when calling the program.

-f1: "Path of sequence1 .txt file"
-f2: "Path of sequence2 .txt file"
-n: "Length of fingerprint"
-k: "Length of kmer"
-s: "Length of stride"
-m: "Toggle multi hash layers" (optional)

Here is an example command to compare files file1 and file2 using 128 hash functions, k-mer size 16, and stride length of 1.

python minhash_combined.py -f1=file1 -f2=file2 -method=khash -n=128 -k=16 -s=1

Needleman-Wunsch

run_DP.py is used to compute the true edit distance. The script calls edit_distance_DP.py.

  • For the synthetic data, all the sequences were generated as separate .txt files. The input sequence files are in /data/synth_*.txt(10 pairs) and /synth_data/synth_*.txt(1 test and 5 edited sequences each, yields a total of 50 pairs).
  • For the genomic data, all the sequences are in /data/ecoli_realdata.txt. The first 600 sequences were used.
  • Note that this code takes a long time to run! We ran our code for 12 hours.

Tuning Parameters

tune_parameters.py is used to iterate through multiple combinations of k (k-mer length) and L (size of fingerpring).

  • main is used to run the functions tune and get_jaccard. This will calculate the true Jaccard simlarity and the estimates of the four different MinHash implementations for all the combinations of k and L.
  • main_summary is used to run summary for each k, L pair and uses the files generated by main to compute summary metrics. It returns the average error of the Jaccard similarity estimate for the MinHash algorithms, and returns a list of the true edit distance (this file already exists and must be passed in as an input) and the true Jaccard similarity.

Estimating Edit Distance

The run_experiment.py is used to compute the estimated edit distance using the computed Jaccard simlilarities. The output file will have the following:

  • Length of the 2 sequences (len_x, len_y)
  • True Jaccard and estimated Jaccard from the 4 algorithms: TJ, J1, J2, J3, J4
  • True edit distance computed from Needleman-Wunsch: trueED
  • 3 Estimated edit distances for each of the Jaccard simlarity values (lower, upper bound from IEEE paper, our point estimate): TED (from TJ), ED1, ED2, ED3, ED4 (from J1, J2, J3, J4 respectively).

Benchmarking

The profiler.py is used to benchmark time on the four MinHash implementations. This profiling is done using a common set of parameters: 16 k-mer length, 128 hash functions, stride length 1. They are provided two 8001-bp sequences with an true edit distance of 98. Output of the script should be piped into a new file, which gives time and function call stats sorted by total time to run.

Inside minhash_combined.py there are commented out @profile blocks. If you uncomment from memory_profiler import profile and these @profile sections, then you'll output the memory requirements for certain functions in the file.

About

Semester project for Computational Genomics: Sequences, Fall 2019

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •