strutil

参考文档: https://github.com/adrg/strutil

strutil provides a collection of string metrics for calculating string similarity as well as other string utility functions.

String metrics

The package defines the StringMetric interface, which is implemented by all the string metrics. The interface is used with the Similarity function, which calculates the similarity between the specified strings, using the provided string metric.

type StringMetric interface {
    Compare(a, b string) float64
}

func Similarity(a, b string, metric StringMetric) float64 {
}

All defined string metrics can be found in the metrics package.

Hamming

Calculate similarity.

similarity := strutil.Similarity("text", "test", metrics.NewHamming())
fmt.Printf("%.2f\n", similarity) // Output: 0.75

Calculate distance.

ham := metrics.NewHamming()
fmt.Printf("%d\n", ham.Distance("one", "once")) // Output: 2

Levenshtein(编辑距离)

Calculate similarity using default options.

similarity := strutil.Similarity("graph", "giraffe", metrics.NewLevenshtein())
fmt.Printf("%.2f\n", similarity) // Output: 0.43

Configure edit operation costs.

lev := metrics.NewLevenshtein()
lev.CaseSensitive = false
lev.InsertCost = 1
lev.ReplaceCost = 2
lev.DeleteCost = 1

similarity := strutil.Similarity("make", "Cake", lev)
fmt.Printf("%.2f\n", similarity) // Output: 0.50

Calculate distance.

lev := metrics.NewLevenshtein()
fmt.Printf("%d\n", lev.Distance("graph", "giraffe")) // Output: 4

Jaro

similarity := strutil.Similarity("think", "tank", metrics.NewJaro())
fmt.Printf("%.2f\n", similarity) // Output: 0.78

Jaro-Winkler

similarity := strutil.Similarity("think", "tank", metrics.NewJaroWinkler())
fmt.Printf("%.2f\n", similarity) // Output: 0.80

Smith-Waterman-Gotoh

Calculate similarity using default options.

swg := metrics.NewSmithWatermanGotoh()
similarity := strutil.Similarity("times roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.82

Customize gap penalty and substitution function.

swg := metrics.NewSmithWatermanGotoh()
swg.CaseSensitive = false
swg.GapPenalty = -0.1
swg.Substitution = metrics.MatchMismatch {
    Match:    1,
    Mismatch: -0.5,
}

similarity := strutil.Similarity("Times Roman", "times new roman", swg)
fmt.Printf("%.2f\n", similarity) // Output: 0.96

Sorensen-Dice

Calculate similarity using default options.

sd := metrics.NewSorensenDice()
similarity := strutil.Similarity("time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.62

Customize n-gram size.

sd := metrics.NewSorensenDice()
sd.CaseSensitive = false
sd.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", sd)
fmt.Printf("%.2f\n", similarity) // Output: 0.53

Jaccard

Calculate similarity using default options.

j := metrics.NewJaccard()
similarity := strutil.Similarity("time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.45

Customize n-gram size.

j := metrics.NewJaccard()
j.CaseSensitive = false
j.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", j)
fmt.Printf("%.2f\n", similarity) // Output: 0.36

The input of the Sorensen-Dice example is the same as the one of Jaccard because the metrics bear a resemblance to each other. In fact, each of the coefficients can be used to calculate the other one.

Sorensen-Dice to Jaccard.

J = SD/(2-SD)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.

Jaccard to Sorensen-Dice.

SD = 2*J/(1+J)

where SD is the Sorensen-Dice coefficient and J is the Jaccard index.

Overlap Coefficient

Calculate similarity using default options.

oc := metrics.NewOverlapCoefficient()
similarity := strutil.Similarity("time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.67

Customize n-gram size.

oc := metrics.NewOverlapCoefficient()
oc.CaseSensitive = false
oc.NgramSize = 3

similarity := strutil.Similarity("Time to make haste", "no time to waste", oc)
fmt.Printf("%.2f\n", similarity) // Output: 0.57

示例

电影名计算相似度

代码:

package main

import (
        "fmt"

        "github.com/wanglu119/me-deps/strutil"
        "github.com/wanglu119/me-deps/strutil/metrics"
)

func main() {
        s1 := "名侦探柯南:百万美元的五棱星"
        s2 := "-~名侦探柯南 百万美元的五棱星 -"
        similarity := strutil.Similarity(s1, s2, metrics.NewHamming())
        fmt.Printf("Hamming: %.2f\n", similarity)

        similarity = strutil.Similarity(s1, s2, metrics.NewLevenshtein())
        fmt.Printf("Levenshtein: %.2f\n", similarity)

        similarity = strutil.Similarity(s1, s2, metrics.NewJaro())
        fmt.Printf("Jaro: %.2f\n", similarity)

        similarity = strutil.Similarity(s1, s2, metrics.NewJaroWinkler())
        fmt.Printf("JaroWinkler: %.2f\n", similarity)

        similarity = strutil.Similarity(s1, s2, metrics.NewSmithWatermanGotoh())
        fmt.Printf("SmithWatermanGotoh: %.2f\n", similarity)

        similarity = strutil.Similarity(s1, s2, metrics.NewSorensenDice())
        fmt.Printf("SorensenDice: %.2f\n", similarity)

        similarity = strutil.Similarity(s1, s2, metrics.NewJaccard())
        fmt.Printf("Jaccard: %.2f\n", similarity)

        similarity = strutil.Similarity(s1, s2, metrics.NewOverlapCoefficient())
        fmt.Printf("OverlapCoefficient: %.2f\n", similarity)
}

输出:

Hamming: 0.00
Levenshtein: 0.72
Jaro: 0.88
JaroWinkler: 0.88
SmithWatermanGotoh: 0.86
SorensenDice: 0.73
Jaccard: 0.58
OverlapCoefficient: 0.85

字符串:

s1 := "名侦探柯南:百万美元的五棱星"
s2 := "-~名侦探柯南 百万"

结果:

Hamming: 0.00
Levenshtein: 0.36
Jaro: 0.73
JaroWinkler: 0.73
SmithWatermanGotoh: 0.60
SorensenDice: 0.45
Jaccard: 0.29
OverlapCoefficient: 0.56

字符串:

s1 := "名侦探柯南:百万美元的五棱星"
s2 := "-~名侦探柯南 混淆 百万美元的五棱星 -ssss测试"

结果:

Hamming: 0.00
Levenshtein: 0.48
Jaro: 0.80
JaroWinkler: 0.80
SmithWatermanGotoh: 0.75
SorensenDice: 0.56
Jaccard: 0.39
OverlapCoefficient: 0.85

字符串:

s1 := "功夫"
s2 := "功夫瑜伽"

结果:

Hamming: 0.50
Levenshtein: 0.50
Jaro: 0.83
JaroWinkler: 0.87
SmithWatermanGotoh: 1.00
SorensenDice: 0.50
Jaccard: 0.33
OverlapCoefficient: 1.00

字符串:

s1 := "功夫熊猫"
s2 := "功夫瑜伽"

结果:

Hamming: 0.50
Levenshtein: 0.50
Jaro: 0.67
JaroWinkler: 0.73
SmithWatermanGotoh: 0.50
SorensenDice: 0.33
Jaccard: 0.20
OverlapCoefficient: 0.33

字符串:

s1 := "功夫熊猫1"
s2 := "功夫熊猫2"

结果:

Hamming: 0.80
Levenshtein: 0.80
Jaro: 0.87
JaroWinkler: 0.92
SmithWatermanGotoh: 0.80
SorensenDice: 0.75
Jaccard: 0.60
OverlapCoefficient: 0.75

总结

测试使用OverlapCoefficient效果更好大于0.8的结果比较可信,但OverlapCoefficient结果比较高时,需要对比一下编辑距离,例如:

s1 := "功夫"
s2 := "功夫瑜伽"

Hamming: 0.50
Levenshtein: 0.50
Jaro: 0.83
JaroWinkler: 0.87
SmithWatermanGotoh: 1.00
SorensenDice: 0.50
Jaccard: 0.33
OverlapCoefficient: 1.00