## 字符串匹配技术

### 最小编辑距离(Levenshtein distance)

‘Lvensshtain’ –>(插入e) ‘Levensstain’
‘Levensshtain’ –>(删除s) ‘Levenshtain’
‘Levenshtain’ –>(替换a 为 e) ‘Levenshtein’

import difflib
def edit_distance(strA, strB):
""" Cal Levenshtein distance """
leven_cost = 0
s = difflib.SequenceMatcher(None, strA, strB)
for tag, i1, i2, j1, j2 in s.get_opcodes():
if tag == 'replace':
leven_cost += max(i2-i1, j2-j1)
elif tag == 'insert':
leven_cost += (j2-j1)
elif tag == 'delete':
leven_cost += (i2-i1)
return leven_cost

### Dice 系数

Dice系数用于度量两个集合的相似性，因为可以把字符串理解为一种集合，因此Dice距离也会用于度量字符串的相似性，Dice系数定义如下：

def dice(strA, strB):
""" Cal Dice coefficient """
if len(strA) + len(strB) == 0:
return 1
overlap = intersection(strA, strB)
return 2.0*overlap/(len(strA) + len(strB))
def intersection(strA, strB):
"""  Calculate the number of same strings between two strings  """
if not strA or not strB:
return 0
strA_list = list(strA)
strB_list = list(strB)
overlap = 0
for sa in strA_list:
if sa in strB_list:
overlap += 1
strB_list.pop(strB_list.index(sa))
return overlap

### Jaccard系数

Jaccard 系数适合处理短文本的相似度，定义如下：

def Jaccard(strA, strB):
""" Cal Jaccard coefficient """
if len(strA) + len(strB) == 0:
return 1
overlap = intersection(strA, strB)
return overlap/(len(strA) + len(strB) - overlap)
def intersection(strA, strB):
"""  Calculate the number of same strings between two strings  """
if not strA or not strB:
return 0
strA_list = list(strA)
strB_list = list(strB)
overlap = 0
for sa in strA_list:
if sa in strB_list:
overlap += 1
strB_list.pop(strB_list.index(sa))
return overlap

### TF-IDF

TF-IDF 主要用来评估某个字或者用某个词对一个文档的重要程度。其中：