CS447 Final Project:

Repository Organization

cs447-final-project/
  |- README.md
  |- minhash_combined.py --> source code for all MinHash implementations/modifications
  |- edit_distance_DP.py --> Needleman-Wunsch implementation (Provided in lecture)
  |- generate_data.py --> For generating synthetic reads and inserting edits to given sequence
  |- make_synth_data.py --> Script that calls generate_data.py and generates random samples used in our analysis
  |- process_fastq.py --> For extracting only the sequences from the PACBIO .fastq file
  |- profiler.py --> code for running time benchmarks for each MinHash implementation
  |- run_DP.py --> Runs edit_distance_DP.py for the all sequence pairs
  |- run_experiment.py --> Calculates estimated edit distance using the computed Jaccard similarities
  |- tune_parameters.py --> Iterates through combinations of k and L pairs to compute Jaccard similarity
  |- data/
     |- README.md
     |- ...
  |- notebooks/
     |- ...
  |- output/
     |- ...
  |- profile_out/
     |- ...
  |- synth_data/
     |- ...
  |- synth_out/
     |- ...
  |- tuning/
     |- ecoli/
        |- ...
     |- synth/
        |- ...
     |- ...

Source for Needleman-Wunsch Implementation: Jupyter Notebook Link

MinHash Genomic Fingerprinting to Estimate Edit Distance

Data Generation

Example commands of generating synthetic reads can be found in /data/READMEmd.

MinHash

All of our implementations for MinHash are contained in minhash_combined.py The following parameters are used as arguments when calling the program.

-f1: "Path of sequence1 .txt file"
-f2: "Path of sequence2 .txt file"
-n: "Length of fingerprint"
-k: "Length of kmer"
-s: "Length of stride"
-m: "Toggle multi hash layers" (optional)

Here is an example command to compare files file1 and file2 using 128 hash functions, k-mer size 16, and stride length of 1.

python minhash_combined.py -f1=file1 -f2=file2 -method=khash -n=128 -k=16 -s=1

Needleman-Wunsch

run_DP.py is used to compute the true edit distance. The script calls edit_distance_DP.py.

For the synthetic data, all the sequences were generated as separate .txt files. The input sequence files are in /data/synth_*.txt(10 pairs) and /synth_data/synth_*.txt(1 test and 5 edited sequences each, yields a total of 50 pairs).
For the genomic data, all the sequences are in /data/ecoli_realdata.txt. The first 600 sequences were used.
Note that this code takes a long time to run! We ran our code for 12 hours.

Tuning Parameters

tune_parameters.py is used to iterate through multiple combinations of k (k-mer length) and L (size of fingerpring).

main is used to run the functions tune and get_jaccard. This will calculate the true Jaccard simlarity and the estimates of the four different MinHash implementations for all the combinations of k and L.
main_summary is used to run summary for each k, L pair and uses the files generated by main to compute summary metrics. It returns the average error of the Jaccard similarity estimate for the MinHash algorithms, and returns a list of the true edit distance (this file already exists and must be passed in as an input) and the true Jaccard similarity.

Estimating Edit Distance

The run_experiment.py is used to compute the estimated edit distance using the computed Jaccard simlilarities. The output file will have the following:

Length of the 2 sequences (len_x, len_y)
True Jaccard and estimated Jaccard from the 4 algorithms: TJ, J1, J2, J3, J4
True edit distance computed from Needleman-Wunsch: trueED
3 Estimated edit distances for each of the Jaccard simlarity values (lower, upper bound from IEEE paper, our point estimate): TED (from TJ), ED1, ED2, ED3, ED4 (from J1, J2, J3, J4 respectively).

Benchmarking

The profiler.py is used to benchmark time on the four MinHash implementations. This profiling is done using a common set of parameters: 16 k-mer length, 128 hash functions, stride length 1. They are provided two 8001-bp sequences with an true edit distance of 98. Output of the script should be piped into a new file, which gives time and function call stats sorted by total time to run.

Inside minhash_combined.py there are commented out @profile blocks. If you uncomment from memory_profiler import profile and these @profile sections, then you'll output the memory requirements for certain functions in the file.

Name		Name	Last commit message	Last commit date
Latest commit History 118 Commits
__pycache__		__pycache__
data		data
notebooks		notebooks
old_code_and_data		old_code_and_data
output		output
profile_out		profile_out
synth_data		synth_data
synth_output		synth_output
tuning		tuning
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
combine_synthdata_temp.py		combine_synthdata_temp.py
edit_distance_DP.py		edit_distance_DP.py
generate_data.py		generate_data.py
make_synth_data.py		make_synth_data.py
minhash_combined.py		minhash_combined.py
process_fastq.py		process_fastq.py
profiler.py		profiler.py
requirements.txt		requirements.txt
run_DP_updated.py		run_DP_updated.py
run_experiment_updated.py		run_experiment_updated.py
synthdata_8001bp.txt		synthdata_8001bp.txt
tune_parameters_updated.py		tune_parameters_updated.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CS447 Final Project:

Repository Organization

MinHash Genomic Fingerprinting to Estimate Edit Distance

Data Generation

MinHash

Needleman-Wunsch

Tuning Parameters

Estimating Edit Distance

Benchmarking

About

Releases

Packages

Contributors 4

Languages

License

dr-irani/MinHash-Genomics-Fingerprinting

Folders and files

Latest commit

History

Repository files navigation

CS447 Final Project:

Repository Organization

MinHash Genomic Fingerprinting to Estimate Edit Distance

Data Generation

MinHash

Needleman-Wunsch

Tuning Parameters

Estimating Edit Distance

Benchmarking

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages