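1. Term Similarity
Elasticsearch supports spelling correction, and producing its suggestions requires computing how similar two terms are. This post works through term similarity using several distance measures.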
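2. Data Preparation
Before term similarity can be computed, each term has to be turned into a vector. The method used here is character vectorization: every character is mapped to a unique number, and the character's own code point (the value returned by Python's ord) can serve directly as that number.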
    import numpy as np


    def vectorize_words(words):
        # Lowercase each word, then map every character to its code point.
        lower_words = [word.lower() for word in words]
        return [np.array([ord(c) for c in word]) for word in lower_words]


    valn = 'valn'
    vlna = 'vlna'
    vlan233 = 'vlan233'
    http = 'http'
    valn_vector, vlna_vector, vlan233_vector, http_vector = vectorize_words([valn, vlna, vlan233, http])
    print(f'valn_vector: {valn_vector}')
    print(f'vlna_vector: {vlna_vector}')
    print(f'vlan233_vector: {vlan233_vector}')
    print(f'http_vector: {http_vector}')

Output:

    valn_vector: [118  97 108 110]
    vlna_vector: [118 108 110  97]
    vlan233_vector: [118 108  97 110  50  51  51]
    http_vector: [104 116 116 112]
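Because the mapping is simply each character's code point, it can be reversed with chr. The sketch below round-trips a word to show that the only information lost is letter case, since vectorize_words lowercases its input. The helper name devectorize is an assumption introduced for illustration and is not part of the original post.

```python
import numpy as np


def vectorize_words(words):
    """Map each lowercased word to an array of character code points."""
    return [np.array([ord(c) for c in word.lower()]) for word in words]


def devectorize(vector):
    """Hypothetical helper: turn an array of code points back into a string."""
    return ''.join(chr(int(code)) for code in vector)


vector = vectorize_words(['VLAN'])[0]
print(vector.tolist())      # [118, 108, 97, 110]
print(devectorize(vector))  # vlan (the original upper case is not recoverable)
```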
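3. Hamming Distance
Hamming distance is a widely used distance measure, particularly in information theory and communications. It is the number of positions at which two strings of equal length hold different characters or symbols. For two terms u and v of length n, it is defined as:
\[hd(u,v)=\sum_{i=1}^{n}(u_{i}\ne v_{i}) \]
where each comparison contributes 1 if the two characters differ and 0 otherwise. Dividing by the length of the terms gives the normalized Hamming distance:
\[norm\_hd(u,v) = \frac {\sum_{i=1}^{n}(u_{i}\ne v_{i})} {n} \]
The hamming_distance function below computes the Hamming distance; its norm parameter controls whether the result is normalized.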
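    def hamming_distance(u, v, norm=True):
        # Hamming distance is only defined for vectors of equal length.
        if u.shape != v.shape:
            raise ValueError('the vectors must have equal lengths.')
        # mean() of the mismatch mask is the normalized distance, sum() the raw count.
        return (u != v).mean() if norm else (u != v).sum()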
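To make the return expression less opaque, the following minimal sketch shows the intermediate boolean mask that u != v produces for valn and vlan. It only uses NumPy element-wise comparison; the variable name mismatch is introduced here for illustration.

```python
import numpy as np

u = np.array([ord(c) for c in 'valn'])
v = np.array([ord(c) for c in 'vlan'])

# Element-wise comparison yields one boolean per position.
mismatch = (u != v)
print(mismatch.tolist())       # [False, True, True, False]

# True counts as 1, so sum() is the number of differing positions
# and mean() is that count divided by the length.
print(int(mismatch.sum()))     # 2
print(float(mismatch.mean()))  # 0.5
```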
The code below computes the Hamming distance between valn and three other words.
As the output shows, the smallest Hamming distance is 2 (between valn and vlan), which is 0.5 after normalization.

    vlan = 'vlan'
    vlna = 'vlna'
    http = 'http'
    words = [vlan, vlna, http]

    input_word = 'valn'
    input_vector = vectorize_words([input_word]).pop()
    word_vectors = vectorize_words(words)

    # Raw distances
    for word, vector in zip(words, word_vectors):
        print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector, norm=False)}')
    print()
    # Normalized distances
    for word, vector in zip(words, word_vectors):
        print(f'{input_word} and {word} hamming distance is {hamming_distance(input_vector, vector)}')

Output:

    valn and vlan hamming distance is 2
    valn and vlna hamming distance is 3
    valn and http hamming distance is 4

    valn and vlan hamming distance is 0.5
    valn and vlna hamming distance is 0.75
    valn and http hamming distance is 1.0
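Since the definition only covers strings of equal length, a candidate such as vlan233 from the data preparation step cannot be compared with valn by this function. The sketch below, which restates the two functions so that it runs on its own, shows the ValueError being raised; how to handle words of different lengths is left open here.

```python
import numpy as np


def vectorize_words(words):
    return [np.array([ord(c) for c in word.lower()]) for word in words]


def hamming_distance(u, v, norm=True):
    if u.shape != v.shape:
        raise ValueError('the vectors must have equal lengths.')
    return (u != v).mean() if norm else (u != v).sum()


valn_vector, vlan233_vector = vectorize_words(['valn', 'vlan233'])

try:
    hamming_distance(valn_vector, vlan233_vector)
    result = 'no error'
except ValueError as err:
    # Shapes (4,) and (7,) differ, so the guard clause fires.
    result = f'ValueError: {err}'

print(result)  # ValueError: the vectors must have equal lengths.
```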
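4. Manhattan Distance
Manhattan distance is the sum of the absolute differences between the characters at each position of two strings. It is also known as city block distance, the L1 norm, or the taxicab metric.
Again for two words u and v of length n, the Manhattan distance is defined as:
\[md(u,v)=\|u - v\|_{1} = \sum_{i=1}^{n}|u_{i} - v_{i}| \]
Dividing by the length of the words gives the normalized Manhattan distance:
\[norm\_md(u,v)=\frac {\|u - v\|_{1}} {n} = \frac {\sum_{i=1}^{n}|u_{i} - v_{i}|} {n} \]
The function below computes the Manhattan distance, again using norm to control normalization.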
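    def manhattan_distance(u, v, norm=True):
        # The element-wise difference requires vectors of equal length.
        if u.shape != v.shape:
            raise ValueError('the vectors must have equal lengths.')
        # mean() of the absolute differences is the normalized distance, sum() the raw one.
        return abs(u - v).mean() if norm else abs(u - v).sum()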
With the same words, the code below computes the Manhattan distance.
Here too the closest pair is valn and vlan, with a Manhattan distance of 22, or 5.5 after normalization.

    vlan = 'vlan'
    vlna = 'vlna'
    http = 'http'
    words = [vlan, vlna, http]

    input_word = 'valn'
    input_vector = vectorize_words([input_word]).pop()
    word_vectors = vectorize_words(words)

    # Raw distances
    for word, vector in zip(words, word_vectors):
        print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector, norm=False)}')
    print()
    # Normalized distances
    for word, vector in zip(words, word_vectors):
        print(f'{input_word} and {word} manhattan distance is {manhattan_distance(input_vector, vector)}')

Output:

    valn and vlan manhattan distance is 22
    valn and vlna manhattan distance is 26
    valn and http manhattan distance is 43

    valn and vlan manhattan distance is 5.5
    valn and vlna manhattan distance is 6.5
    valn and http manhattan distance is 10.75
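One consequence of working on raw code points is worth seeing once: Manhattan distance depends on how far apart two characters sit in the character table, not only on whether they differ. The sketch below uses the made-up words cat, bat and zat (not from the original example) to contrast this with Hamming distance.

```python
import numpy as np


def vectorize(word):
    """Map a lowercased word to an array of character code points."""
    return np.array([ord(c) for c in word.lower()])


cat, bat, zat = vectorize('cat'), vectorize('bat'), vectorize('zat')

# Hamming distance: each pair differs in exactly one position.
hamming_bat = int((cat != bat).sum())
hamming_zat = int((cat != zat).sum())
print(hamming_bat, hamming_zat)      # 1 1

# Manhattan distance: the same single substitution costs |99 - 98| = 1
# for 'b' but |99 - 122| = 23 for 'z'.
manhattan_bat = int(np.abs(cat - bat).sum())
manhattan_zat = int(np.abs(cat - zat).sum())
print(manhattan_bat, manhattan_zat)  # 1 23
```

So with this vectorization, Manhattan distance mixes the number of edits with the numeric gap between characters, which is something to keep in mind when reading the values above.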
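5. Euclidean Distance
Euclidean distance is the length of the straight line between two points, which is the shortest path between them. It is also called the Euclidean norm, the L2 norm, or L2 distance.
Again for two words u and v of length n, the Euclidean distance is defined as:
\[ed(u,v)=\|u - v\|_{2} = \sqrt{\sum_{i=1}^{n}(u_{i} - v_{i})^2} \]
The function below computes the Euclidean distance.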
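    def euclidean_distance(u, v):
        # The element-wise difference requires vectors of equal length.
        if u.shape != v.shape:
            raise ValueError('the vectors must have equal lengths.')
        # Square root of the sum of squared differences.
        return np.sqrt(np.sum(np.square(u - v)))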
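Because Manhattan and Euclidean distance are the L1 and L2 norms of u - v, both can be cross-checked against NumPy's np.linalg.norm. This is a minimal sketch for verification, not a replacement for the functions in this post; the variable names l1 and l2 are introduced here.

```python
import numpy as np

u = np.array([ord(c) for c in 'valn'])
v = np.array([ord(c) for c in 'vlan'])
diff = u - v

# ord=1 on a 1-D array is the sum of absolute values (L1 norm).
l1 = np.linalg.norm(diff, ord=1)
# With no ord given, a 1-D array gets the 2-norm (L2 norm).
l2 = np.linalg.norm(diff)

print(l1)  # 22.0
print(np.isclose(l2, np.sqrt(np.sum(np.square(diff)))))  # True
```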
With the same words, the code below computes the Euclidean distance.

    vlan = 'vlan'
    vlna = 'vlna'
    http = 'http'
    words = [vlan, vlna, http]

    input_word = 'valn'
    input_vector = vectorize_words([input_word]).pop()
    word_vectors = vectorize_words(words)

    for word, vector in zip(words, word_vectors):
        print(f'{input_word} and {word} euclidean distance is {euclidean_distance(input_vector, vector)}')

Output:

    valn and vlan euclidean distance is 15.556349186104045
    valn and vlna euclidean distance is 17.146428199482248
    valn and http euclidean distance is 25.0
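To tie the three measures back to the idea of picking a suggestion, the sketch below ranks the same candidates under each distance and returns the closest one. The helper closest and the shortened function names are assumptions made for this illustration; this is not a description of how Elasticsearch itself ranks suggestions.

```python
import numpy as np


def vectorize(word):
    """Map a lowercased word to an array of character code points."""
    return np.array([ord(c) for c in word.lower()])


def hamming(u, v):
    return float((u != v).sum())


def manhattan(u, v):
    return float(np.abs(u - v).sum())


def euclidean(u, v):
    return float(np.sqrt(np.sum(np.square(u - v))))


def closest(input_word, candidates, metric):
    """Hypothetical helper: return the equal-length candidate with the smallest distance."""
    target = vectorize(input_word)
    return min(candidates, key=lambda word: metric(target, vectorize(word)))


candidates = ['vlan', 'vlna', 'http']
for metric in (hamming, manhattan, euclidean):
    print(metric.__name__, closest('valn', candidates, metric))
# hamming vlan
# manhattan vlan
# euclidean vlan
```

For this input all three measures agree on vlan, which matches the smallest values in each of the outputs above.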