Simple and efficient SimHash implementation written in Go. SimHash is a locality-sensitive hashing algorithm used to measure text similarity.
Install the package:

```bash
go get github.com/ayhanozemre/simhash
```

Basic usage:

```go
package main

import (
	"fmt"

	"github.com/ayhanozemre/simhash"
)

func main() {
	text := "The cat sat on the mat"
	hash := simhash.NewSimHash(text)
	fmt.Printf("Hash: %d\n", hash)
}
```

Exclude stop words:

```go
hash := simhash.NewSimHash(text, simhash.WithStopWords("the", "on", "a"))
```

Set the n-gram size:

```go
hash := simhash.NewSimHash(text, simhash.WithNgram(4))
```

Set a minimum token length:

```go
hash := simhash.NewSimHash(text, simhash.WithMinTokenLength(3))
```

Combine several options:

```go
hash := simhash.NewSimHash(text,
	simhash.WithStopWords("the", "on"),
	simhash.WithMinTokenLength(3),
)
```

Compare two texts:

```go
text1 := "The cat sat on the mat"
text2 := "The cat sat on the rug"

hash1 := simhash.NewSimHash(text1)
hash2 := simhash.NewSimHash(text2)

distance := simhash.HammingDistance(hash1, hash2)
fmt.Printf("Hamming distance: %d\n", distance)
```

A smaller Hamming distance means the two texts are more similar.
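To make the distance concrete, here is a minimal standard-library sketch of how a Hamming distance between two fingerprints can be computed and turned into a near-duplicate check. It assumes fingerprints are 64-bit unsigned integers, which this README does not state explicitly. The names `hammingDistance` and `nearDuplicate` and the threshold of 3 are hypothetical choices for illustration, not part of this package's API.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance counts the bit positions in which a and b differ.
// Assumption: fingerprints are 64-bit values.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// nearDuplicate reports whether two fingerprints differ in at most
// maxBits positions. The threshold is application-specific.
func nearDuplicate(a, b uint64, maxBits int) bool {
	return hammingDistance(a, b) <= maxBits
}

func main() {
	a := uint64(0b10110100)
	b := uint64(0b10010110)

	fmt.Println(hammingDistance(a, b))  // prints 2
	fmt.Println(nearDuplicate(a, b, 3)) // prints true
}
```

The right threshold depends on the fingerprint width and on how aggressive the deduplication should be, so it is worth tuning on a sample of your own data.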
Run the tests:

```bash
go test -v
```

MIT License. See the LICENSE file for details.
Reference: Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma. Detecting Near-Duplicates for Web Crawling. WWW 2007.
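For readers new to the technique, the following is a rough standard-library sketch of the general SimHash idea: hash each token, let every token vote on each of the 64 bit positions, and keep the majority. It is not this package's implementation. The function name `fingerprint`, the whitespace tokenizer, the uniform token weights, and the choice of FNV-1a as the token hash are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/bits"
	"strings"
)

// fingerprint builds a 64-bit SimHash-style value from lowercased,
// whitespace-separated tokens. Every token carries weight 1.
func fingerprint(text string) uint64 {
	var votes [64]int

	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		sum := h.Sum64()

		// Each token votes +1 for bits set in its hash, -1 otherwise.
		for i := 0; i < 64; i++ {
			if sum&(uint64(1)<<uint(i)) != 0 {
				votes[i]++
			} else {
				votes[i]--
			}
		}
	}

	// A bit is set in the fingerprint when its vote total is positive.
	var fp uint64
	for i, v := range votes {
		if v > 0 {
			fp |= uint64(1) << uint(i)
		}
	}
	return fp
}

func main() {
	a := fingerprint("The cat sat on the mat")
	b := fingerprint("The cat sat on the rug")

	fmt.Printf("a: %064b\n", a)
	fmt.Printf("b: %064b\n", b)
	fmt.Println("differing bits:", bits.OnesCount64(a^b))
}
```

Because each token only nudges the per-bit totals, texts that share most tokens tend to produce fingerprints that differ in few bits. In this simplified version, token order does not affect the result. Options such as stop words, minimum token length, or n-grams would change which tokens are fed into the loop.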