This is a mirror implementation of the Python SetSimilaritySearch library in Go, with better performance.
Run AllPairs
algorithm on 3.5 GHz Intel Core i7,
using similarity function jaccard
and similarity threshold 0.5.
Dataset | Input Sets | Avg. Size | go-set-similarity-search Runtime |
SetSimilaritySearch Runtime |
---|---|---|---|---|
Pokec social network (relationships): from-nodes are set IDs; to-nodes are elements | 1432693 | 27.31 | 1m25s | 10m49s |
LiveJournal: from-nodes are set IDs; to-nodes are elements | 4308452 | 16.01 | 4m11s | 28m51s |
For All-Pairs, it takes an input of a list of sets, and output pairs that meet the similarity threshold.
import (
"fmt"
"go-set-similarity-search"
)
func main() {
// Each raw set must be a slice of unique string tokens.
rawSets := [][]string{
[]string{"a"},
[]string{"a", "b"},
[]string{"a", "b", "c"},
[]string{"a", "b", "c", "d"},
[]string{"a", "b", "c", "d", "e"},
}
// Use frequency order transformation to replace the string tokens
// with integers.
sets, _ := SetSimilaritySearch.FrequencyOrderTransform(rawSets)
// Run all-pairs algorithm, get a channel of pairs.
pairs, _ := SetSimilaritySearch.AllPairs(sets,
/*similarityFunctionName=*/"jaccard",
/*similarityThreshold=*/0.1)
for pair := range pairs {
// X and Y are indexes to the original rawSets and sets slices.
fmt.Println(pair.X, pair.Y, pair.Similarity)
}
}
For Query, it takes an input of a list of sets, and builds a search index that can compute any number of queries. Currently the search index only supports a static collection of sets with no updates.
import (
"fmt"
"go-set-similarity-search"
)
func main() {
// Each raw set must be a slice of unique string tokens.
rawSets := [][]string{
[]string{"a"},
[]string{"a", "b"},
[]string{"a", "b", "c"},
[]string{"a", "b", "c", "d"},
[]string{"a", "b", "c", "d", "e"},
}
// Use frequency order transformation to replace the string tokens
// with integers.
sets, dict := SetSimilaritySearch.FrequencyOrderTransform(rawSets)
// Build a search index.
searchIndex, err := SetSimilaritySearch.NewSearchIndex(sets,
/*similarityFunctionName=*/"jaccard",
/*similarityThreshold=*/0.1)
// Use dictionary to transform a query set.
querySet := dict.Transform([]string{"a", "c", "d"})
// Query the search index.
searchResults := searchIndex.Query(querySet)
for _, result := range searchResults {
// X is the index to the original rawSets and sets slices.
fmt.Println(result.X, result.Similarity)
}
}
Supported similarity functions (more to come):
- Jaccard: intersection size divided by union size; set
similarityFunctionName="jaccard"
. - Cosine: intersection size divided by square root of the product of sizes; set
similarityFunctionName="cosine"
. - Containment: intersection size divided by the size of the first set (or query set); set
similarityFunctionName="containment"
.