Cosine similarity formula
[Image: cosine similarity formula]
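Written out, the standard cosine similarity between two term-frequency vectors is as follows. The symbols A, B, and n are chosen here for illustration: A_i and B_i are the counts of the i-th token in each text, and n is the number of distinct tokens across both texts.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \, \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```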
1. Tokenization (for now, the text is only split into single-character strings)
[你,好]
[你,好,不,好]
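A minimal sketch of this step, assuming a hypothetical helper named tokenize (the name is not from the original code). It relies on split(""), which splits by UTF-16 code unit; that is fine for the characters used here, but characters outside the Basic Multilingual Plane (many emoji, for example) would be cut into two halves.

```typescript
// Hypothetical helper: split a text into single-character tokens.
// split("") works per UTF-16 code unit, which is enough for these examples.
function tokenize(text: string): string[] {
  return text.split("");
}

console.log(tokenize("你好"));     // two tokens: 你, 好
console.log(tokenize("你好不好")); // four tokens: 你, 好, 不, 好
```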
2. Term frequency calculation
"你好" [你1, 好1]
"你好不好" [你1, 好2, 不1]
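3. Union
Both frequency vectors are aligned over the union of tokens [你, 好, 不]; a token missing from a text gets a count of 0.
[你1, 好1, 不0]
[你1, 好2, 不1]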
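Steps 2 and 3 can be sketched together as follows. The helper names countTokens and toVectors are hypothetical and used only for illustration; the sketch counts each token with a Map and then reads the counts back in the order of the union.

```typescript
// Hypothetical helper: count how often each token occurs in a token list.
function countTokens(tokens: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of tokens) {
    counts.set(t, (counts.get(t) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical helper: align two token lists over the union of their tokens.
function toVectors(a: string[], b: string[]): { vocab: string[]; va: number[]; vb: number[] } {
  const ca = countTokens(a);
  const cb = countTokens(b);
  // A Set keeps first-seen order, so the union is stable across runs.
  const vocab = Array.from(new Set(a.concat(b)));
  return {
    vocab,
    va: vocab.map((t) => ca.get(t) ?? 0), // 0 when the token is absent
    vb: vocab.map((t) => cb.get(t) ?? 0),
  };
}

const v = toVectors(["你", "好"], ["你", "好", "不", "好"]);
console.log(v.vocab, v.va, v.vb);
// vocab: 你, 好, 不   va: 1, 1, 0   vb: 1, 2, 1
```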
4. Compute the cosine
["你","好"]
["你","好","不","好"]
Similarity 1: 0.8660254037844387
["你","好","不"]
["你","好","不","好"]
Similarity 2: 0.9428090415820635
["你","好","不","好"]
["你","好","不","好"]
Similarity 3: 1
["你","你","你","你"]
["你","好","不","好"]
Similarity 4: 0.4082482904638631
["你"]
["你","好","不","好"]
Similarity 5: 0.4082482904638631
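The five results above can be checked by hand from the count vectors over [你, 好, 不]. A minimal sketch, assuming a hypothetical helper cosineOfCounts that takes two already-aligned count vectors of equal length. Similarities 4 and 5 come out equal because cosine similarity ignores vector length: [4, 0, 0] and [1, 0, 0] point in the same direction.

```typescript
// Hypothetical helper: cosine of two aligned count vectors of equal length.
function cosineOfCounts(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// Counts of [你, 好, 不] in "你好不好"
const targetCounts = [1, 2, 1];

console.log(cosineOfCounts([1, 1, 0], targetCounts)); // 3 / sqrt(12), about 0.8660
console.log(cosineOfCounts([1, 1, 1], targetCounts)); // 4 / sqrt(18), about 0.9428
console.log(cosineOfCounts([1, 2, 1], targetCounts)); // 6 / sqrt(36) = 1
console.log(cosineOfCounts([4, 0, 0], targetCounts)); // 4 / sqrt(96), about 0.4082
console.log(cosineOfCounts([1, 0, 0], targetCounts)); // 1 / sqrt(6),  about 0.4082
```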
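The ArkTS code is attached below and can be run directly. Writing it took some effort, so please credit the source if you repost it.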
export function cosTextSimilarity(simple: string, target: string): number {
  // Tokenize: for now, each text is split into single-character strings
  let simples: string[] = simple.split("")
  let targets: string[] = target.split("")

  // Count and store the term frequencies of each text
  let simpleMap: Map<string, number> = new Map<string, number>()
  simples.forEach((c: string) => {
    let value: number = simpleMap.get(c) ?? 0
    simpleMap.set(c, value + 1)
  })
  let targetMap: Map<string, number> = new Map<string, number>()
  targets.forEach((c: string) => {
    let value: number = targetMap.get(c) ?? 0
    targetMap.set(c, value + 1)
  })

  // Union of the tokens from both texts
  let collectionSet: Set<string> = new Set<string>(simples.concat(targets))

  // Accumulate the dot product and the two squared norms over the union
  let dot = 0
  let normS = 0
  let normT = 0
  collectionSet.forEach((c: string) => {
    let frequencyS: number = simpleMap.get(c) ?? 0
    let frequencyT: number = targetMap.get(c) ?? 0
    dot += frequencyS * frequencyT
    normS += frequencyS * frequencyS
    normT += frequencyT * frequencyT
  })

  // An empty input has a zero norm, which would give 0 / 0 = NaN.
  // Returning 0 in that case is a choice made here; adjust it if needed.
  let denominator = Math.sqrt(normS * normT)
  if (denominator === 0) {
    return 0
  }
  return dot / denominator
}