Word Segmentation with the Rule-Based Bidirectional Maximum Matching Algorithm

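Bidirectional maximum matching (Bi-directional Maximum Matching) segments the text twice, once with forward maximum matching (MM) and once with reverse maximum matching (RMM), and then applies a small set of rules to choose between the two results. The rules used here are: prefer the result with fewer tokens; if both have the same number of tokens and are identical, return either one; otherwise prefer the result with fewer single-character tokens.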

#-*- coding:utf-8 -*-
'''
@project: exuding-nlp-all
@author: texuding
@time: 2019-05-27 15:45:46 
'''
# Forward maximum matching (MM)
class MM(object):
    def __init__(self):
        self.window_size = 3  # maximum word length, in characters

    def cut(self, text, dictionary):
        result = []
        index = 0
        text_length = len(text)
        while index < text_length:
            # Try the longest candidate first and shrink it from the right.
            for size in range(min(index + self.window_size, text_length), index, -1):
                piece = text[index:size]
                if piece in dictionary:
                    break
            # With no match, piece has shrunk to one character and is emitted as is.
            result.append(piece)
            index += len(piece)
        return result
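# Reverse maximum matching (RMM)
class RMM(object):
    def __init__(self):
        self.window_size = 3  # maximum word length, in characters

    def cut(self, text, dictionary):
        result = []
        index = len(text)
        while index > 0:
            # Clamp the start at 0: a negative start would make the slice wrap around.
            for size in range(max(index - self.window_size, 0), index):
                piece = text[size:index]
                if piece in dictionary:
                    break
            # With no match, piece has shrunk to one character and is emitted as is.
            result.append(piece)
            index -= len(piece)
        result.reverse()
        return result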


if __name__ == '__main__':
    text = '我在北京师范大学学习,一样找不到女朋友'
    # Toy dictionary for the demo
    dictionary = ['我', '在', '北京师范大学', '北京', '师范大学', '大学', '学习', '北京师范', '一样', '找', '找不到', '不到', '女朋友', '朋友']

    tokenizer_rmm = RMM()
    res_rmm = tokenizer_rmm.cut(text, dictionary)
    tokenizer_mm = MM()
    res_mm = tokenizer_mm.cut(text, dictionary)
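    # Bidirectional maximum matching rules
    if len(res_mm) != len(res_rmm):
        # Different token counts: keep the segmentation with fewer tokens.
        res = res_rmm if len(res_rmm) < len(res_mm) else res_mm
    elif res_mm == res_rmm:
        # Identical segmentations: either one will do.
        res = res_rmm
    else:
        # Same token count, different tokens: keep the one with fewer single-character tokens.
        single_rmm = [w for w in res_rmm if len(w) == 1]
        single_mm = [w for w in res_mm if len(w) == 1]
        res = res_rmm if len(single_rmm) < len(single_mm) else res_mm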

    print(res)

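Output:
['我', '在', '北京', '师', '范', '大学', '学习', ',', '一样', '找不到', '女朋友']
As the output shows, 北京师范大学 is not segmented well. This is a weakness of rule-based segmentation: the word has six characters, so a window of 3 can never match it. When I set self.window_size to 5, the result is:
['我', '在', '北京师范', '大学', '学习', ',', '一样', '找不到', '女朋友']
So when a word, especially a longer entity name, is longer than the window size, the segmentation suffers; 北京师范大学 is still out of reach at a window size of 5. Next I will look into statistical segmentation based on the HMM model, to compare it against these weaknesses of rule-based segmentation.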
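
One possible way to avoid choosing the window by hand is to derive it from the longest dictionary entry. The sketch below is only an illustration of that idea: the function names forward_cut, reverse_cut and bidirectional_cut, and the window=None default, are assumptions made for this example and are not part of the script above. The selection rules are the same as in the script.

```python
def forward_cut(text, words, window):
    """Forward maximum matching with a maximum word length of `window`."""
    result, index = [], 0
    while index < len(text):
        # Longest candidate first, shrinking from the right.
        for end in range(min(index + window, len(text)), index, -1):
            piece = text[index:end]
            if piece in words:
                break
        result.append(piece)  # falls back to a single character
        index += len(piece)
    return result


def reverse_cut(text, words, window):
    """Reverse maximum matching with a maximum word length of `window`."""
    result, index = [], len(text)
    while index > 0:
        # Longest candidate first, shrinking from the left.
        for start in range(max(index - window, 0), index):
            piece = text[start:index]
            if piece in words:
                break
        result.append(piece)  # falls back to a single character
        index -= len(piece)
    result.reverse()
    return result


def bidirectional_cut(text, words, window=None):
    """Choose between the forward and reverse results."""
    words = set(words)  # set membership is faster than scanning a list
    if window is None:
        # Assumption of this sketch: window = length of the longest entry.
        window = max((len(w) for w in words), default=1)
    fwd = forward_cut(text, words, window)
    rev = reverse_cut(text, words, window)
    if len(fwd) != len(rev):
        return rev if len(rev) < len(fwd) else fwd
    if fwd == rev:
        return rev
    single_fwd = sum(len(w) == 1 for w in fwd)
    single_rev = sum(len(w) == 1 for w in rev)
    return rev if single_rev < single_fwd else fwd


words = ['我', '在', '北京师范大学', '北京', '师范大学', '大学', '学习',
         '北京师范', '一样', '找', '找不到', '不到', '女朋友', '朋友']
print(bidirectional_cut('我在北京师范大学学习,一样找不到女朋友', words))
# ['我', '在', '北京师范大学', '学习', ',', '一样', '找不到', '女朋友']
```

With the window taken from the dictionary (6 here), 北京师范大学 comes out as a single token. The cost is more candidate lookups per position, and a word that is missing from the dictionary is still split into single characters.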
