It may become very useful to some to provide approximate substring matching. This reports the smallest edit distance of the needle in all possible substrings of haystack. Here are some examples:
tests := []struct {
needle, haystack string
want int
}{
{"Machu Picchu", "Machu Picchu", 0},
{"Machu Picchu", "Where is Machu Picchu?", 0},
{"Machu Picchu", "Where is Machu Pichu?", 1},
{"Machu Picchu", "Where is Macchu Picchu?", 1},
{"Stonehenge", "How do I get to Stone Henge?", 2},
{"", "", 0},
{"", "Machu Picchu", 0},
{"u", "Machu Picchu", 0},
{"z", "Machu Picchu", 1},
{"Uluru", "", 5},
}
All that is necessary is to initialize the row with all zeroes, run the main algorithm, and then return the smallest value in the row. I have developed some code and was wondering about your thoughts about including it (or some more optimized variant) in this package:
func ComputeSubstringDistance(needle, haystack string) int {
if len(needle) == 0 {
return 0
}
if len(haystack) == 0 {
return utf8.RuneCountInString(needle)
}
if needle == haystack {
return 0
}
// We need to convert to []rune if the strings are non-ascii.
// This could be avoided by using utf8.RuneCountInString
// and then doing some juggling with rune indices.
// The primary challenge is keeping track of the previous rune.
// With needle range loop, its not that easy. And with needle for-loop
// we need to keep track of the inter-rune width using utf8.DecodeRuneInString
s1 := []rune(haystack)
s2 := []rune(needle)
lenS1 := len(s1)
lenS2 := len(s2)
// init the row
x := make([]int, lenS1+1)
// make needle dummy bounds check to prevent the 2 bounds check down below.
// The one inside the loop is particularly costly.
_ = x[lenS1]
// fill in the rest
for i := 1; i <= lenS2; i++ {
prev := i
var current int
for j := 1; j <= lenS1; j++ {
if s2[i-1] == s1[j-1] {
current = x[j-1] // match
} else {
current = min(min(x[j-1]+1, prev+1), x[j]+1)
}
x[j-1] = prev
prev = current
}
x[lenS1] = prev
}
min := x[0]
for i := 1; i <= lenS1; i++ {
if x[i] < min {
min = x[i]
}
}
return min
}
It may become very useful to some to provide approximate substring matching. This reports the smallest edit distance of the
needlein all possible substrings ofhaystack. Here are some examples:All that is necessary is to initialize the row with all zeroes, run the main algorithm, and then return the smallest value in the row. I have developed some code and was wondering about your thoughts about including it (or some more optimized variant) in this package: