```
cs447-final-project/
|- README.md
|- minhash_combined.py  --> source code for all MinHash implementations/modifications
|- edit_distance_DP.py  --> Needleman-Wunsch implementation (provided in lecture)
|- generate_data.py     --> generates synthetic reads and inserts edits into a given sequence
|- make_synth_data.py   --> calls generate_data.py to generate the random samples used in our analysis
|- process_fastq.py     --> extracts only the sequences from the PacBio .fastq file
|- profiler.py          --> runs time benchmarks for each MinHash implementation
|- run_DP.py            --> runs edit_distance_DP.py for all sequence pairs
|- run_experiment.py    --> calculates estimated edit distances from the computed Jaccard similarities
|- tune_parameters.py   --> iterates through combinations of k and L to compute Jaccard similarity
|- data/
|  |- README.md
|  |- ...
|- notebooks/
|  |- ...
|- output/
|  |- ...
|- profile_out/
|  |- ...
|- synth_data/
|  |- ...
|- synth_out/
|  |- ...
|- tuning/
   |- ecoli/
   |  |- ...
   |- synth/
   |  |- ...
   |- ...
```
Source for Needleman-Wunsch Implementation: Jupyter Notebook Link
Example commands for generating synthetic reads can be found in `/data/README.md`.
All of our MinHash implementations are contained in `minhash_combined.py`. The following parameters are passed as arguments when calling the program:
- `-f1`: path of the sequence 1 `.txt` file
- `-f2`: path of the sequence 2 `.txt` file
- `-method`: which MinHash implementation to use (e.g., `khash`, as in the example below)
- `-n`: length of the fingerprint
- `-k`: length of the k-mer
- `-s`: length of the stride
- `-m`: toggle multi-hash layers (optional)
Here is an example command comparing files `file1` and `file2` with the `khash` method, using 128 hash functions, k-mer size 16, and stride length 1:

```
python minhash_combined.py -f1=file1 -f2=file2 -method=khash -n=128 -k=16 -s=1
```
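For intuition, here is a minimal sketch of the classic k-hash MinHash scheme; the helper names are illustrative and this is not the code in `minhash_combined.py`:

```python
import random

def kmers(seq, k, stride=1):
    """Collect the set of k-mers from a sequence with the given stride."""
    return {seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)}

def minhash_signature(shingles, num_hashes=128, seed=447):
    """One minimum per 'hash function'; each hash function is simulated by
    XOR-ing the built-in hash with a fixed random 64-bit mask. Python's
    string hash is stable within a single run, which is all we need here."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(s) ^ m for s in shingles) for m in masks]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of signature slots where the two sequences collide;
    this estimates the Jaccard similarity of the two k-mer sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In terms of the flags above, `-n=128 -k=16 -s=1` corresponds to `num_hashes=128`, `k=16`, `stride=1`.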
`run_DP.py` is used to compute the true edit distance; it calls `edit_distance_DP.py` (a sketch of the underlying recurrence appears after the list below).
- For the synthetic data, all sequences were generated as separate `.txt` files. The input sequence files are in `/data/synth_*.txt` (10 pairs) and `/synth_data/synth_*.txt` (1 test sequence and 5 edited sequences each, yielding a total of 50 pairs).
- For the genomic data, all sequences are in `/data/ecoli_realdata.txt`. The first 600 sequences were used.
- Note that this code takes a long time to run! We ran our code for 12 hours.
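As referenced above, the recurrence behind `edit_distance_DP.py` is the standard unit-cost edit-distance dynamic program; a minimal sketch (not the lecture code):

```python
def edit_distance(x, y):
    """Standard O(len(x) * len(y)) unit-cost edit-distance DP."""
    m, n = len(x), len(y)
    # dp[i][j] = edit distance between x[:i] and y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all of x[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n]
```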
`tune_parameters.py` is used to iterate through multiple combinations of k (k-mer length) and L (size of the fingerprint). `main` runs the functions `tune` and `get_jaccard`, which calculate the true Jaccard similarity and the estimates from the four MinHash implementations for every combination of k and L. `main_summary` runs `summary` for each (k, L) pair, using the files generated by `main` to compute summary metrics: it returns the average error of the Jaccard similarity estimate for each MinHash algorithm, along with a list of the true edit distances (this file already exists and must be passed in as an input) and the true Jaccard similarities.
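The sweep itself is conceptually simple; a hypothetical sketch, reusing the helper functions from the MinHash example above (the grids below are illustrative, not the values we tuned over):

```python
from itertools import product

K_GRID = [8, 12, 16, 20]  # illustrative k-mer lengths
L_GRID = [64, 128, 256]   # illustrative fingerprint sizes

def tune(seq_pairs):
    """For every (k, L) combination, estimate Jaccard for each sequence pair."""
    for k, L in product(K_GRID, L_GRID):
        for x, y in seq_pairs:
            sig_x = minhash_signature(kmers(x, k), num_hashes=L)
            sig_y = minhash_signature(kmers(y, k), num_hashes=L)
            yield k, L, jaccard_estimate(sig_x, sig_y)
```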
`run_experiment.py` is used to compute the estimated edit distance from the computed Jaccard similarities. The output file contains the following:
- Length of the 2 sequences (len_x, len_y)
- True Jaccard and estimated Jaccard from the 4 algorithms: TJ, J1, J2, J3, J4
- True edit distance computed from Needleman-Wunsch: trueED
- 3 estimated edit distances for each Jaccard similarity value (the lower and upper bounds from the IEEE paper, plus our point estimate): TED (from TJ) and ED1, ED2, ED3, ED4 (from J1, J2, J3, J4 respectively).
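We do not restate the exact bounds from the IEEE paper here. For intuition only, one well-known way to turn a k-mer Jaccard estimate into a per-base distance is the Mash distance (Ondov et al., 2016); this is a sketch, not the conversion used in `run_experiment.py`:

```python
import math

def mash_distance(jaccard, k):
    """Approximate per-base mutation rate from a k-mer Jaccard estimate."""
    if jaccard <= 0.0:
        return 1.0
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
```

Scaling such a rate by sequence length, e.g. `mash_distance(j, k) * max(len_x, len_y)`, gives a rough edit-distance estimate; the bounds in our output come from the IEEE paper instead.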
`profiler.py` is used to benchmark the running time of the four MinHash implementations. Profiling uses a common set of parameters: k-mer length 16, 128 hash functions, stride length 1. Each implementation is given two 8001-bp sequences with a true edit distance of 98. The output of the script should be piped into a new file; it gives timing and function-call stats sorted by total running time.
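For example (the output filename is arbitrary):

```
python profiler.py > profile_out/minhash_profile.txt
```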
Inside `minhash_combined.py` there are commented-out `@profile` blocks. If you uncomment `from memory_profiler import profile` and these `@profile` sections, the script will report the memory requirements of certain functions in the file.
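For reference, the `memory_profiler` pattern looks like this (the function below is illustrative, not one of ours):

```python
from memory_profiler import profile

@profile
def build_signature(seq):
    # When the script runs, memory_profiler prints per-line memory
    # usage for every function decorated with @profile.
    shingles = {seq[i:i + 16] for i in range(len(seq) - 15)}
    return sorted(hash(s) for s in shingles)[:128]
```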