
ElasticSearch Tutorial: Installing the IK Analyzer Plugin


Introduction

IK Analyzer is an open-source, lightweight Chinese word-segmentation toolkit written in Java. It started out as a component built around the open-source Lucene project, combining dictionary-based segmentation with grammar-analysis algorithms. Since version 3.0, IK has evolved into a general-purpose Java segmentation component, independent of the Lucene project, while still providing an optimized default integration with Lucene. IK also implements a simple disambiguation algorithm, marking its evolution from pure dictionary-based segmentation toward simulated semantic segmentation.

Prerequisites

1. The environment builds on the previous two posts. Make sure the IK version matches your Elasticsearch version exactly, otherwise the plugin will fail with errors.

2. Install Maven.

Download

git clone https://github.com/medcl/elasticsearch-analysis-ik.git

There are two ways to get the source: clone it directly with the git command, or download it on Windows, upload it to the server, and extract it there.
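
Since the plugin version must match Elasticsearch exactly, you will typically check out the matching release tag after cloning. A minimal sketch, assuming the repository tags its releases as vX.Y.Z:

cd elasticsearch-analysis-ik
git checkout v6.4.0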


Build and Package

After downloading the source as described above, package the project.

Run the following command (Maven must be installed first; that is not covered here, as there are plenty of guides online):

mvn package

Once the build completes, change into the project's target/releases directory and find the corresponding zip package; in my case it is elasticsearch-analysis-ik-6.4.0.zip.

Copy the zip file into /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik (I created the ik directory myself) and extract it:

unzip elasticsearch-analysis-ik-6.4.0.zip
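
Alternatively, if you would rather not build from source, recent versions of the plugin can be installed directly from a release zip with the elasticsearch-plugin tool. A sketch, assuming a release exists for your ES version:

/usr/elasticsearch/elasticsearch-6.4.0/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.4.0/elasticsearch-analysis-ik-6.4.0.zip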

Restart ElasticSearch

systemctl restart elasticsearch.service
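
To confirm the plugin was picked up after the restart, you can list the installed plugins (assuming ES listens on localhost:9200); the output should include analysis-ik:

curl http://localhost:9200/_cat/plugins?v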


Test the IK Analyzer

IK ships with two analyzers:
ik_max_word: splits the text at the finest granularity, emitting as many terms as possible;
ik_smart: performs the coarsest-grained split; text already claimed by one term is not reused by another.

For more details, see the official test examples in the project's README.

Note: newer versions require the request format to be declared in the Content-Type header, otherwise the request fails with:

"error" : "Content-Type header [application/x-www-form-urlencoded] is not supported"

Also, newer versions no longer support the string type; use text instead. Declaring a field as string raises the error below:

org.elasticsearch.index.mapper.MapperParsingException: No handler for type [string] declared on field [content]
	at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseProperties(ObjectMapper.java:274) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.ObjectMapper$TypeParser.parseObjectOrDocumentTypeProperties(ObjectMapper.java:199) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.RootObjectMapper$TypeParser.parse(RootObjectMapper.java:131) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:112) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.DocumentMapperParser.parse(DocumentMapperParser.java:92) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.index.mapper.MapperService.parse(MapperService.java:626) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:263) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:229) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:639) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:268) ~[elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:198) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:133) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.4.0.jar:6.4.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.4.0.jar:6.4.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
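
To avoid this error, declare the field as text and give it the IK analyzers. A minimal mapping sketch, mirroring the official README example (the index and fulltext names are the same ones used in the highlighting example below):

curl -XPOST http://localhost:9200/index/fulltext/_mapping -H 'Content-Type:application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'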


The correct ik_smart request is as follows (you can paste it straight into Xshell and press Enter):

curl -H "Content-Type: application/json" 'http://XXX.xx.xx.xx:9200/index/_analyze?pretty=true' -d '
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国万岁万岁万万岁"
}'

Response:

{
  "tokens" : [
    {
      "token" : "*",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "万岁",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "万岁",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "万万岁",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 3
    }
  ]
}


The correct ik_max_word request is as follows:

curl -H "Content-Type: application/json" 'http://XXX.XXX.xxx:9200/index/_analyze?pretty=true' -d '
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国万岁万岁万万岁"
}'

Response:

{
  "tokens" : [
    {
      "token" : "*",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民*",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "*",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "万岁",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "万",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "TYPE_CNUM",
      "position" : 10
    },
    {
      "token" : "岁",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "COUNT",
      "position" : 11
    },
    {
      "token" : "万岁",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "万",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "TYPE_CNUM",
      "position" : 13
    },
    {
      "token" : "岁",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "COUNT",
      "position" : 14
    },
    {
      "token" : "万万岁",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "万万",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "TYPE_CNUM",
      "position" : 16
    },
    {
      "token" : "万岁",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "CN_WORD",
      "position" : 17
    },
    {
      "token" : "岁",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "COUNT",
      "position" : 18
    }
  ]
}

Result Highlighting

There are also examples in the official GitHub repository; see the link above.

curl -XPOST http://localhost:9200/index/fulltext/1 -H 'Content-Type:application/json' -d'
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}'
curl -XPOST http://localhost:9200/index/fulltext/_search  -H 'Content-Type:application/json' -d'
{
    "query" : { "match" : { "content" : "中国" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

Response:

{
	"took": 8,
	"timed_out": false,
	"_shards": {
		"total": 5,
		"successful": 5,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": 1,
		"max_score": 0.5753642,
		"hits": [{
			"_index": "index",
			"_type": "fulltext",
			"_id": "1",
			"_score": 0.5753642,
			"_source": {
				"content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
			},
			"highlight": {
				"content": ["<tag1>中</tag1><tag1>国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"]
			}
		}]
	}
}
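
Note: each character of 中国 is highlighted as a separate term here, which suggests the content field was indexed with the default standard analyzer (it splits CJK text character by character). With an IK mapping like the sketch above, 中国 would be indexed and highlighted as a single token.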

Extension Configuration File

IKAnalyzer.cfg.xml can be located at {conf}/analysis-ik/config/IKAnalyzer.cfg.xml or {plugins}/elasticsearch-analysis-ik-*/config/IKAnalyzer.cfg.xml, meaning the file may live in either location. (In my case it ships with the plugin, under /usr/elasticsearch/elasticsearch-6.4.0/plugins/ik/config, together with the bundled extension dictionaries.) Its contents are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- users can configure their own extension dictionaries here -->
	<entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
	<!-- users can configure their own extension stopword dictionaries here -->
	<entry key="ext_stopwords">custom/ext_stopword.dic</entry>
	<!-- users can configure a remote extension dictionary here -->
	<entry key="remote_ext_dict">location</entry>
	<!-- users can configure a remote extension stopword dictionary here -->
	<entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>
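
The local dictionaries referenced above are plain UTF-8 text files with one term per line. A hypothetical custom/mydict.dic might look like this (the terms are purely illustrative):

区块链
自媒体
网红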


Hot-Reloading the IK Dictionary

The plugin currently supports hot updates to the IK dictionary, via the following entries in the IK configuration file shown above:

<!-- users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- users can configure a remote extension stopword dictionary here -->
<entry key="remote_ext_stopwords">location</entry>

where location is a URL, for example http://yoursite.com/getCustomDict. The request only needs to satisfy the following two conditions for hot updates to work:

  1. The HTTP response must return two headers: Last-Modified and ETag. Both are strings; whenever either of them changes, the plugin fetches the file again and updates the dictionary.

  2. The response body must contain one term per line, with \n as the line separator.

Once both requirements are met, dictionary updates take effect without restarting the ES instance.

You can put the hot terms in a UTF-8 encoded .txt file served by nginx or any other simple HTTP server; when the .txt file changes, the server automatically returns the updated Last-Modified and ETag headers on the next client request. A separate tool can extract the relevant terms from your business system and keep this .txt file up to date. A minimal server sketch follows.
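
As one way to satisfy the two requirements above, here is a minimal nginx sketch (the server name, path, and file name are illustrative; nginx sends Last-Modified and ETag for static files by default):

server {
    listen 80;
    server_name yoursite.com;
    # static files get Last-Modified and ETag automatically;
    # charset ensures the dictionary is served as UTF-8 text
    location /hot_words.txt {
        root /usr/share/nginx/html;
        charset utf-8;
    }
}

The remote_ext_dict entry in IKAnalyzer.cfg.xml would then point at http://yoursite.com/hot_words.txt.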