KMP算法是用来做字符串匹配的,比如目标字符串:'acabacacd’是否能在源字符串:'acfacabacabacacdkacfacabacabacacdk’找到匹配。传统的暴力解法复杂度是O(nk)的,k是目标字符串长度,n是源字符串长度。所以有优化的KMP算法复杂度为O(n)
这种算法的核心在于如何处理某个字符不match的情况。对于传统暴力算法,当某个位置字符不match的时候,就会往下顺移一位,这也是为什么暴力算法的复杂度是O(nk)。而KMP在某个位置字符不match的时候,会跳过某些不需要的比较,牵涉到的核心概念是“Longest Proper Prefix which is Suffix” (lps数组)。lps数组是一个整数数组,长度与目标字符串一致,每个位置的整数值代表“以这个位置index为结尾的后缀传和这个位置的数字作为index的之前的前缀串相同”。这个确实很不好理解,但是总体的意思就是,有了这个数组,当某个位置字符不match的时候,我们就无需一位一位顺移,而是可以跳 (具体的原理非常难以理解,所以建议暂时先理解这个high level的idea)
这个数据计算具体的implementation也是非常的难以理解,虽然只有寥寥几行代码,但是想要完全理解为什么是这样依旧非常困难。如果想要完全搞清楚这个意义,建议看下面两个详细教程:教程1,教程2
下面是根据我浅显的理解写的implementation。更详细的解释在上面两个教程里面也有。想不通了就回去看看
def create_LPS(w): # initialize the array with all 0s LPS = [0] * len(w) # tp means the pointer at the prefix position tp = 0 # bp means the pointer at the suffix position bp = 1 # iterate through the whole word while bp < len(w): # if two pointers points to the same char, we give a value in the lps array and increment two pointers if w[tp] == w[bp]: LPS[bp] = tp + 1 tp += 1 bp += 1 # if two pointers points to different char, there are two posibilities: else: # 1. if tp is 0, we keep tp and put zero value in lps, and only increment bp by one (we don't increment tp because no match char found) if tp == 0: LPS[bp] = 0 bp += 1 # 2. else, comes the part that very difficult to understand, we update the tp using the value in lps else: tp = LPS[tp-1] return LPS
有了这个lps数组,一旦遇到不match的字符,直接根据lps数组跳过无需比较的位置
LPS = create_LPS(word) pt = 0 pw = 0 while pt < len(text): # if current place match, keep going until found the complete match or a mismatch if text[pt] == word[pw]: pt += 1 pw += 1 if pw == len(word): print('found',pt-len(word),pt) # we can jump again if we want to search for multiple matching once one match is already found pw = LPS[pw-1] # if current place does not match, two conditions else: # if pointer in word is 0, means the beginning does not match, we only need to increment the pointer in the text if pw == 0: pt += 1 # if pointer in word is not 0, means there is some place we can jump according to the LPS array else: pw = LPS[pw-1]
在具体的详细教程后面再写吧,确实比较难解释。。。
教程1,教程2