This program uses the MinHash algorithm to estimate the Jaccard similarity between files.
- Files are shingled first to save space.
- An upper triangular matrix is used instead of a full matrix to save space.
- A write buffer is implemented: small writes are first stored in memory and are written to disk only when the buffer becomes full. This reduces the number of writes and therefore the running time.
- Locality-Sensitive Hashing (LSH) is used to further simplify comparisons on the signature matrix.
- min_hash.c, min_hash.h: the MinHash algorithm; generates files automatically and prints the result.
- CRC32.cpp, CRC32.h: the CRC32 hash algorithm used by MinHash.
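
The upper triangular storage mentioned above can be sketched as follows. This is a minimal illustration, not the code in min_hash.c: it assumes the matrix holds pairwise similarities, which are symmetric, so only the pairs with i < j need to be kept. The names tri_size, tri_index, sim_set and sim_get are hypothetical and chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of entries needed to store every
 * pair (i, j) with i < j among n files, i.e. n * (n - 1) / 2
 * instead of the n * n entries of a full matrix. */
static size_t tri_size(size_t n)
{
    return n * (n - 1) / 2;
}

/* Hypothetical helper: map a pair of distinct file indices to its
 * position in the packed array. Rows are stored one after another,
 * row i holding the entries for j = i + 1 .. n - 1. */
static size_t tri_index(size_t n, size_t i, size_t j)
{
    assert(i != j && i < n && j < n);
    if (i > j) {            /* similarity is symmetric: order the pair */
        size_t tmp = i;
        i = j;
        j = tmp;
    }
    return i * n - i * (i + 1) / 2 + (j - i - 1);
}

/* Store the similarity of files i and j in the packed array. */
static void sim_set(double *packed, size_t n, size_t i, size_t j, double s)
{
    packed[tri_index(n, i, j)] = s;
}

/* Read the similarity of files i and j; a file compared with
 * itself is assumed to have similarity 1 and is never stored. */
static double sim_get(const double *packed, size_t n, size_t i, size_t j)
{
    if (i == j) {
        return 1.0;
    }
    return packed[tri_index(n, i, j)];
}
```

With this layout, an array of tri_size(n) doubles is enough for n files. For example, 4 files need 6 entries instead of 16, and sim_get returns the same value for (i, j) and (j, i).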