The LSH algorithm consists of 3 parts:
- Shingling: Shingling.py generates the shingles and saves them as pickle files, which the later steps need
- Minhashing: Minhashing.py generates the signature matrix from the shingles produced by Shingling.py
- Bucketing: LSH.py creates the buckets from the signature matrix
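The three stages can be sketched roughly as follows. This is a minimal illustration, not the code in Shingling.py, Minhashing.py or LSH.py: the function names, the use of character shingles, and the seeded MD5 hash are all assumptions made for the example, and the sketch reads r as rows per band and b as number of bands.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Return the set of character k-shingles of text (assumed shingle type)."""
    if len(text) < k:
        return {text}  # keep short texts as a single shingle
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle, seed):
    """Illustrative seeded hash; the real MinHash function lives in Parameters.py."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes):
    """One signature column: the minimum hash value per hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def build_buckets(signatures, r, b):
    """Split each signature into b bands of r rows and bucket documents by band."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(b):
            key = (band, tuple(sig[band * r:(band + 1) * r]))
            buckets[key].add(doc_id)
    return buckets


def candidate_pairs(buckets):
    """Documents sharing at least one bucket become candidate pairs."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Signatures passed to build_buckets must have length r * b, so num_hashes should be chosen to match.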
Documents can then be queried with Main_Retrieval.py, which returns the similar documents found through LSH.
sudo nano Parameters.py
Opens Parameters.py for editing. Use it to check or change the parameters k, r, b, etc. and the MinHash function.
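When choosing r and b, it can help to see how they shape which documents end up as candidates. The sketch below assumes the usual banding interpretation (r rows per band, b bands), which this README does not spell out; the function names are hypothetical and not part of Parameters.py.

```python
def candidate_probability(s, r, b):
    """Probability that two documents with Jaccard similarity s share
    at least one bucket, given b bands of r rows each."""
    return 1.0 - (1.0 - s ** r) ** b


def approximate_threshold(r, b):
    """Rough similarity at which the probability curve rises steeply."""
    return (1.0 / b) ** (1.0 / r)


if __name__ == "__main__":
    r, b = 5, 20  # example values only
    print("threshold ~", round(approximate_threshold(r, b), 3))
    for s in (0.2, 0.4, 0.6, 0.8):
        print(s, round(candidate_probability(s, r, b), 4))
```

Raising r makes each band stricter and pushes the threshold up; raising b gives more chances to collide and pulls it down.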
python3 Shingling.py
python3 Minhashing.py
python3 LSH.py
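These three scripts run the LSH pipeline over the entire data set and should be run in this order, since each step uses the output of the previous one. Run them once, and again whenever the code, parameters, or data are updated.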
python3 Main_Retrieval.py
Starts the retrieval system, which takes a user query and returns the matching sequences.
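A query against the buckets might look roughly like the sketch below. It is an assumption about the general approach, not a description of Main_Retrieval.py: all names are hypothetical, signatures are taken as already computed, and candidates are ranked by the fraction of matching signature positions (an estimate of Jaccard similarity).

```python
from collections import defaultdict


def band_keys(signature, r, b):
    """Bucket keys for a signature: one (band index, band values) pair per band."""
    return [(band, tuple(signature[band * r:(band + 1) * r])) for band in range(b)]


def build_index(signatures, r, b):
    """Map every bucket key to the set of documents that fall into it."""
    index = defaultdict(set)
    for doc_id, sig in signatures.items():
        for key in band_keys(sig, r, b):
            index[key].add(doc_id)
    return index


def query(index, signatures, query_sig, r, b):
    """Return documents sharing a bucket with the query, most similar first."""
    candidates = set()
    for key in band_keys(query_sig, r, b):
        candidates |= index.get(key, set())

    def similarity(doc_id):
        sig = signatures[doc_id]
        return sum(x == y for x, y in zip(sig, query_sig)) / len(query_sig)

    return sorted(candidates, key=lambda d: (-similarity(d), d))
```

Only documents that collide with the query in at least one band are scored, which is what avoids comparing the query against the whole data set.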