Elasticsearch Chinese Analyzer Plugin: es-ik

大数据和人工智能躺过的坑     2022-08-23

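Prerequisites

What is an inverted index?

What an analyzer does in Elasticsearch

The analyzer workflow in Elasticsearch

Stop words in Elasticsearch

Chinese analyzers for Elasticsearch

Several important Elasticsearch analyzers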

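The default analyzer that ships with Elasticsearch

  1. The analyzer that Elasticsearch uses by default handles Chinese text poorly.

  The concrete example below shows why. The request names no analyzer, so the default one is applied, and the result for Chinese is poor: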

[hadoop@HadoopMaster elasticsearch-2.4.3]$ jps
2044 Jps
1979 Elasticsearch
[hadoop@HadoopMaster elasticsearch-2.4.3]$ pwd
/home/hadoop/app/elasticsearch-2.4.3
[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl 'http://192.168.80.10:9200/zhouls/_analyze?pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'
{
"tokens" : [ {
"token" : "这",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
}, {
"token" : "里",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
}, {
"token" : "是",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
}, {
"token" : "好",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
}, {
"token" : "记",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}, {
"token" : "性",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
}, {
"token" : "不",
"start_offset" : 6,
"end_offset" : 7,
"type" : "<IDEOGRAPHIC>",
"position" : 6
}, {
"token" : "如",
"start_offset" : 7,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 7
}, {
"token" : "烂",
"start_offset" : 8,
"end_offset" : 9,
"type" : "<IDEOGRAPHIC>",
"position" : 8
}, {
"token" : "笔",
"start_offset" : 9,
"end_offset" : 10,
"type" : "<IDEOGRAPHIC>",
"position" : 9
}, {
"token" : "头",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<IDEOGRAPHIC>",
"position" : 10
}, {
"token" : "感",
"start_offset" : 11,
"end_offset" : 12,
"type" : "<IDEOGRAPHIC>",
"position" : 11
}, {
"token" : "叹",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<IDEOGRAPHIC>",
"position" : 12
}, {
"token" : "号",
"start_offset" : 13,
"end_offset" : 14,
"type" : "<IDEOGRAPHIC>",
"position" : 13
}, {
"token" : "的",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<IDEOGRAPHIC>",
"position" : 14
}, {
"token" : "博",
"start_offset" : 15,
"end_offset" : 16,
"type" : "<IDEOGRAPHIC>",
"position" : 15
}, {
"token" : "客",
"start_offset" : 16,
"end_offset" : 17,
"type" : "<IDEOGRAPHIC>",
"position" : 16
}, {
"token" : "园",
"start_offset" : 17,
"end_offset" : 18,
"type" : "<IDEOGRAPHIC>",
"position" : 17
} ]
}
[hadoop@HadoopMaster elasticsearch-2.4.3]$

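Summary

     If you use Elasticsearch as-is to search Chinese content, you will run into an awkward problem: Chinese words are split into individual characters. When you then build a chart in Kibana that groups by term, every single character ends up as its own group.

     This happens because the request falls back to Elasticsearch's default standard analyzer, which splits Chinese text into single characters. Installing the Chinese analyzer plugin es-ik solves the problem.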
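The single-character behaviour can be imitated in a few lines. The Python sketch below is a deliberate simplification and not the real tokenizer: it only assumes that every CJK ideograph becomes its own token, which is what the output earlier in this post shows. The names `is_cjk` and `split_like_standard` are made up for this illustration.

```python
def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def split_like_standard(text):
    """Simplified imitation of the default behaviour on Chinese text.

    Every CJK ideograph becomes its own token. Runs of other
    alphanumeric characters are kept together and lowercased.
    Everything else (spaces, punctuation) only separates tokens.
    """
    tokens = []
    buf = []  # collects a run of non-CJK alphanumeric characters

    def flush():
        if buf:
            tokens.append("".join(buf).lower())
            buf.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)
        elif ch.isalnum():
            buf.append(ch)
        else:
            flush()
    flush()
    return tokens


tokens = split_like_standard("这里是好记性不如烂笔头感叹号的博客园")
print(len(tokens))  # 18, one token per character
print(tokens)
```

Grouping these tokens by term, as a Kibana chart does, therefore gives one group per character rather than one per word.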

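How to integrate the IK analyzer

   The overall procedure is:

Step 1: Download the IK plugin for ES from https://github.com/medcl/elasticsearch-analysis-ik/tree/2.x

Step 2: Build the downloaded es-ik source with Maven (mvn clean package -DskipTests)

Step 3: Copy elasticsearch-analysis-ik-1.10.3.zip from target/releases into the ES_HOME/plugins/ik directory, then extract it with unzip

    If unzip is not installed, install it first: yum install -y unzip

Step 4: Restart the ES service

Step 5: Test the analyzer: curl 'http://YOUR_IP:9200/zhouls/_analyze?analyzer=ik_max_word&pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'

   Note: on a single-node ES cluster you only need to install es-ik on that one machine. On a multi-node cluster, such as the three nodes used here, install es-ik on every node, with identical configuration on each.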
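For readers who prefer to script the test in step 5, here is a minimal Python sketch that builds the same request as the curl command, using only the standard library. It mirrors the ES 2.4.3 call used in this post, where the analyzer is passed as a query parameter; other major versions may expect a different request shape. The function name `build_analyze_request` is hypothetical, and the host and index values are the ones from this post.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_analyze_request(host, index, analyzer, text, port=9200):
    """Build (but do not send) a POST to the index-level _analyze endpoint."""
    query = urlencode({"analyzer": analyzer, "pretty": "true"})
    url = f"http://{host}:{port}/{index}/_analyze?{query}"
    # Keep the Chinese text readable in the body instead of \u escapes.
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


req = build_analyze_request(
    "192.168.80.10", "zhouls", "ik_max_word",
    "这里是好记性不如烂笔头感叹号的博客园",
)
print(req.full_url)
# Against a running node you would send it with:
#   from urllib.request import urlopen
#   print(urlopen(req).read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to check the URL and body before pointing the script at a real cluster.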

elasticsearch-analysis-ik-1.10.0.zip  corresponds to  elasticsearch-2.4.0

elasticsearch-analysis-ik-1.10.3.zip  corresponds to  elasticsearch-2.4.3
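Because a mismatched plugin version is an easy mistake to make, the pairing above can be written down as a small lookup. This Python sketch contains only the two pairs listed in this post; the names `IK_FOR_ES` and `ik_zip_for` are made up for the illustration, and any other ES version is rejected rather than guessed.

```python
# ES version -> IK plugin version, exactly the two pairs named in this post.
IK_FOR_ES = {
    "2.4.0": "1.10.0",
    "2.4.3": "1.10.3",
}


def ik_zip_for(es_version):
    """Return the IK zip file name that matches the given ES version."""
    try:
        ik_version = IK_FOR_ES[es_version]
    except KeyError:
        raise ValueError(
            f"no IK version recorded for Elasticsearch {es_version}; "
            "check the plugin's README for the matching release"
        ) from None
    return f"elasticsearch-analysis-ik-{ik_version}.zip"


print(ik_zip_for("2.4.3"))
```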

  I have already built both packages and uploaded them to my CSDN account, so you can download them directly:

http://download.csdn.net/detail/u010106732/9890897

http://download.csdn.net/detail/u010106732/9890918

https://github.com/medcl/elasticsearch-analysis-ik/tree/v1.10.0

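  Step 1: Open https://github.com/ in a browser.

  Step 2: Search for elasticsearch-ik: https://github.com/search?utf8=%E2%9C%93&q=elasticsearch-ik

  Step 3: Open https://github.com/medcl/elasticsearch-analysis-ik and select the 2.x branch. Some people run ES 2.4.0; the 2.x branch works for that as well. If you run ES 5.x, pick the matching branch in the same way.

  Step 4: This brings you to https://github.com/medcl/elasticsearch-analysis-ik/tree/2.x

  Step 5: Download the source archive from that page. This walkthrough uses the offline installation route.

  Step 6: Extract the es-ik archive and get familiar with its directory layout (here it is extracted to drive D:). This also prepares for the Maven build that follows.

  Step 7: Build it with a locally installed Maven.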

Microsoft Windows [版本 6.1.7601]
版权所有 (c) 2009 Microsoft Corporation。保留所有权利。

C:\Users\Administrator>cd D:\elasticsearch-analysis-ik-2.x

C:\Users\Administrator>d:

D:\elasticsearch-analysis-ik-2.x>mvn

 

 

   Then run the actual build:

D:\elasticsearch-analysis-ik-2.x>mvn clean package -DskipTests
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building elasticsearch-analysis-ik 1.10.4
[INFO] ------------------------------------------------------------------------
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.pom (7 KB at
2.5 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/enforcer/enforcer/1.0/enforcer-1.0.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/enforcer/enforcer/1.0/enforcer-1.0.pom (12 KB at 19.5 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/maven-parent/17/maven-parent-17.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-parent/17/maven-parent-17.pom (25 KB at 41.9 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.jar
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-enforcer-plugin/1.0/maven-enforcer-plugin-1.0.jar (22 KB a
t 44.2 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-compiler-plugin/3.5.1/maven-compiler-plugin-3.5.1.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-compiler-plugin/3.5.1/maven-compiler-plugin-3.5.1.pom (10
KB at 35.3 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/plugins/maven-plugins/28/maven-plugins-28.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/plugins/maven-plugins/28/maven-plugins-28.pom (12 KB at 42.1 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/maven-parent/27/maven-parent-27.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-parent/27/maven-parent-27.pom (40 KB at 94.0 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/apache/17/apache-17.pom
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach

 

 

 

   This takes a while, depending on your network speed.

 

Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-archiver/2.4/maven-archiver-2.4.jar (20 KB at 19.8 KB/sec)
Downloading: http://maven.aliyun.com/nexus/content/repositories/central/org/apac
he/maven/shared/maven-repository-builder/1.0-alpha-2/maven-repository-builder-1.
0-alpha-2.jar
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-project/2.0.4/maven-project-2.0.4.jar (107 KB at 84.7 KB/sec)
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/codeh
aus/plexus/plexus-utils/2.0.1/plexus-utils-2.0.1.jar (217 KB at 158.7 KB/sec)
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/shared/maven-repository-builder/1.0-alpha-2/maven-repository-builder-1.0
-alpha-2.jar (23 KB at 16.4 KB/sec)
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-model/2.0.4/maven-model-2.0.4.jar (79 KB at 54.3 KB/sec)
Downloaded: http://maven.aliyun.com/nexus/content/repositories/central/org/apach
e/maven/maven-artifact/2.0.4/maven-artifact-2.0.4.jar (79 KB at 52.9 KB/sec)
[INFO] Reading assembly descriptor: D:\elasticsearch-analysis-ik-2.x/src/main/as
semblies/plugin.xml
[INFO] Building zip: D:\elasticsearch-analysis-ik-2.x\target\releases\elasticsea
rch-analysis-ik-1.10.4.zip
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:22 min
[INFO] Finished at: 2017-02-25T14:48:40+08:00
[INFO] Final Memory: 35M/609M
[INFO] ------------------------------------------------------------------------

D:\elasticsearch-analysis-ik-2.x>

   The build succeeds and writes the plugin zip to target\releases.
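If the build is scripted, the zip that Maven produced can be located programmatically instead of by eye. The following Python sketch assumes the layout seen in the build log above (a zip under target/releases inside the project directory); the function name `find_release_zip` is hypothetical.

```python
from pathlib import Path


def find_release_zip(project_dir):
    """Return the IK plugin zip under <project_dir>/target/releases.

    If several zips match, the last one in name order is returned.
    """
    releases = Path(project_dir) / "target" / "releases"
    zips = sorted(releases.glob("elasticsearch-analysis-ik-*.zip"))
    if not zips:
        raise FileNotFoundError(
            f"no elasticsearch-analysis-ik-*.zip in {releases}; "
            "did 'mvn clean package -DskipTests' finish successfully?"
        )
    return zips[-1]


# Example call, using the directory from this post:
#   find_release_zip(r"D:\elasticsearch-analysis-ik-2.x")
```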

 

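  This build requires Maven to be installed beforehand on the local (Windows) machine. If you have not installed it yet, see:

Creating Maven projects in Eclipse with automatic dependency jars (plain and Web projects)

  Step 8: Upload the built zip to each of the three machines, into the $ES_HOME/plugins/ik directory. Note that the ik directory has to be created first.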

[hadoop@HadoopSlave1 elasticsearch-2.4.3]$ pwd
/home/hadoop/app/elasticsearch-2.4.3
[hadoop@HadoopSlave1 elasticsearch-2.4.3]$ ll
total 56
drwxrwxr-x. 2 hadoop hadoop 4096 Feb 22 01:37 bin
drwxrwxr-x. 3 hadoop hadoop 4096 Feb 22 22:43 config
drwxrwxr-x. 3 hadoop hadoop 4096 Feb 22 07:07 data
drwxrwxr-x. 2 hadoop hadoop 4096 Feb 22 01:37 lib
-rw-rw-r--. 1 hadoop hadoop 11358 Aug 24 2016 LICENSE.txt
drwxrwxr-x. 2 hadoop hadoop 4096 Feb 25 05:15 logs
drwxrwxr-x. 5 hadoop hadoop 4096 Dec 8 00:41 modules
-rw-rw-r--. 1 hadoop hadoop 150 Aug 24 2016 NOTICE.txt
drwxrwxr-x. 4 hadoop hadoop 4096 Feb 22 06:02 plugins
-rw-rw-r--. 1 hadoop hadoop 8700 Aug 24 2016 README.textile
[hadoop@HadoopSlave1 elasticsearch-2.4.3]$ cd plugins/
[hadoop@HadoopSlave1 plugins]$ ll
total 8
drwxrwxr-x. 5 hadoop hadoop 4096 Feb 22 06:02 head
drwxrwxr-x. 8 hadoop hadoop 4096 Feb 22 06:02 kopf
[hadoop@HadoopSlave1 plugins]$ mkdir ik
[hadoop@HadoopSlave1 plugins]$ pwd
/home/hadoop/app/elasticsearch-2.4.3/plugins
[hadoop@HadoopSlave1 plugins]$ ll
total 12
drwxrwxr-x. 5 hadoop hadoop 4096 Feb 22 06:02 head
drwxrwxr-x. 2 hadoop hadoop 4096 Feb 25 06:18 ik
drwxrwxr-x. 8 hadoop hadoop 4096 Feb 22 06:02 kopf

[hadoop@HadoopSlave1 plugins]$ cd ik/
[hadoop@HadoopSlave1 ik]$ pwd
/home/hadoop/app/elasticsearch-2.4.3/plugins/ik
[hadoop@HadoopSlave1 ik]$ rz

[hadoop@HadoopSlave1 ik]$ ll
total 4400
-rw-r--r--. 1 hadoop hadoop 4505518 Jan 15 08:59 elasticsearch-analysis-ik-1.10.3.zip
[hadoop@HadoopSlave1 ik]$

 

 

 

  Step 9: Stop the ES process

[hadoop@HadoopSlave1 ik]$ jps
1874 Elasticsearch
2078 Jps
[hadoop@HadoopSlave1 ik]$ kill -9 1874
[hadoop@HadoopSlave1 ik]$ jps
2089 Jps
[hadoop@HadoopSlave1 ik]$

 

 

  Step 10: Extract the zip with unzip. If unzip is not installed, install it first: yum install -y unzip.

 

[hadoop@HadoopSlave1 ik]$ unzip elasticsearch-analysis-ik-1.10.3.zip
Archive: elasticsearch-analysis-ik-1.10.3.zip
inflating: elasticsearch-analysis-ik-1.10.3.jar
inflating: httpclient-4.5.2.jar
inflating: httpcore-4.4.4.jar
inflating: commons-logging-1.2.jar
inflating: commons-codec-1.9.jar
inflating: plugin-descriptor.properties
creating: config/
creating: config/custom/
inflating: config/custom/ext_stopword.dic
inflating: config/custom/mydict.dic
inflating: config/custom/single_word.dic
inflating: config/custom/single_word_full.dic
inflating: config/custom/single_word_low_freq.dic
inflating: config/custom/sougou.dic
inflating: config/IKAnalyzer.cfg.xml
inflating: config/main.dic
inflating: config/preposition.dic
inflating: config/quantifier.dic
inflating: config/stopword.dic
inflating: config/suffix.dic
inflating: config/surname.dic
[hadoop@HadoopSlave1 ik]$ ll
total 5828
-rw-r--r--. 1 hadoop hadoop 263965 Dec 1 2015 commons-codec-1.9.jar
-rw-r--r--. 1 hadoop hadoop 61829 Dec 1 2015 commons-logging-1.2.jar
drwxr-xr-x. 3 hadoop hadoop 4096 Jan 1 12:46 config
-rw-r--r--. 1 hadoop hadoop 55998 Jan 1 13:27 elasticsearch-analysis-ik-1.10.3.jar
-rw-r--r--. 1 hadoop hadoop 4505518 Jan 15 08:59 elasticsearch-analysis-ik-1.10.3.zip
-rw-r--r--. 1 hadoop hadoop 736658 Jan 1 13:26 httpclient-4.5.2.jar
-rw-r--r--. 1 hadoop hadoop 326724 Jan 1 13:07 httpcore-4.4.4.jar
-rw-r--r--. 1 hadoop hadoop 2667 Jan 1 13:27 plugin-descriptor.properties

[hadoop@HadoopSlave1 ik]$ 

   

  Repeat the same steps on the other two machines.
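Since the same extraction has to be repeated on every node, it can help to script it. The Python sketch below does what steps 8 and 10 do by hand: it creates plugins/ik under the ES home directory and extracts the zip there. It is an illustration under the directory layout shown in this post, not an official installer, and the name `install_ik` is made up. The check for plugin-descriptor.properties relies on that file appearing in the unzip listing above.

```python
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Extract the IK plugin zip into <es_home>/plugins/ik and return that path."""
    plugin_dir = Path(es_home) / "plugins" / "ik"
    # Same effect as 'mkdir ik' inside the plugins directory.
    plugin_dir.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(plugin_dir)

    # The unzip listing in this post includes this descriptor file,
    # so its absence suggests the wrong archive was used.
    descriptor = plugin_dir / "plugin-descriptor.properties"
    if not descriptor.is_file():
        raise RuntimeError(f"{descriptor} not found after extraction")
    return plugin_dir


# Example call, using the paths from this post:
#   install_ik("elasticsearch-analysis-ik-1.10.3.zip",
#              "/home/hadoop/app/elasticsearch-2.4.3")
```

ES still has to be restarted afterwards, as described in the next step.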

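  Step 11: Restart the ES process on all three machines

   To see in more detail what changes once es-ik is installed, run bin/elasticsearch in the foreground from $ES_HOME and watch the startup log. That is only for demonstration here; for normal use, start it in the background with bin/elasticsearch -d.

   Step 12: Test how Chinese text is analyzed now that es-ik is installed

  Test with ik_max_word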

[hadoop@HadoopMaster elasticsearch-2.4.3]$ pwd
/home/hadoop/app/elasticsearch-2.4.3
[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl 'http://192.168.80.10:9200/zhouls/_analyze?analyzer=ik_max_word&pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'
{
"tokens" : [ {
"token" : "这里是",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "这里",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "里",
"start_offset" : 1,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "好记",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "记性",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",

"position" : 4
}, {
"token" : "不如",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "烂",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 6
}, {
"token" : "笔头",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 7
}, {
"token" : "笔",
"start_offset" : 9,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 8
}, {
"token" : "头",
"start_offset" : 10,
"end_offset" : 11,
"type" : "CN_CHAR",

"position" : 9
}, {
"token" : "感叹号",
"start_offset" : 11,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 10
}, {
"token" : "感叹",
"start_offset" : 11,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 11
}, {
"token" : "叹号",
"start_offset" : 12,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 12
}, {
"token" : "叹",
"start_offset" : 12,
"end_offset" : 13,
"type" : "CN_WORD",
"position" : 13
}, {
"token" : "号",
"start_offset" : 13,
"end_offset" : 14,
"type" : "CN_CHAR",

"position" : 14
}, {
"token" : "博客园",
"start_offset" : 15,
"end_offset" : 18,
"type" : "CN_WORD",
"position" : 15
}, {
"token" : "博客",
"start_offset" : 15,
"end_offset" : 17,
"type" : "CN_WORD",
"position" : 16
}, {
"token" : "园",
"start_offset" : 17,
"end_offset" : 18,
"type" : "CN_CHAR",
"position" : 17
} ]
}
[hadoop@HadoopMaster elasticsearch-2.4.3]$

[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl 'http://192.168.80.10:9200/zhouls/_analyze?analyzer=ik_max_word&pretty=true' -d '{"text":"我们是大数据开发技术人员"}'
{
"tokens" : [ {
"token" : "我们",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "大数",
"start_offset" : 3,
"end_offset" : 5,
"type" : "CN_WORD",
"position" : 1
}, {
"token" : "数据",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "开发",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "发",
"start_offset" : 7,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 4
}, {

"token" : "技术人员",
"start_offset" : 8,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "技术",
"start_offset" : 8,
"end_offset" : 10,
"type" : "CN_WORD",
"position" : 6
}, {
"token" : "人员",
"start_offset" : 10,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 7
} ]
}
[hadoop@HadoopMaster elasticsearch-2.4.3]$

 

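    As you can see, the text is now split into real words, which is a clear improvement over the default analyzer.

   Why does “是” not appear as a token of its own in the second result (and likewise “的” in the first)? Because the es-ik plugin filters out stop words. For details, see:

Elasticsearch: stop-word filtering in IKAnalyzer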
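One way to confirm which characters were dropped is to compare the token offsets with the input text. The Python sketch below does that for the second ik_max_word result above, and also shows the general idea of a stop-word filter. Both function names are hypothetical, the stop-word set is an assumption chosen for the demonstration, and this is not how the plugin itself is implemented.

```python
def uncovered_chars(text, spans):
    """Return (index, char) for every character that no token span covers."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    return [(i, ch) for i, ch in enumerate(text) if i not in covered]


def drop_stop_words(tokens, stop_words):
    """The idea behind a stop-word filter: discard tokens found in the set."""
    return [t for t in tokens if t not in stop_words]


TEXT = "我们是大数据开发技术人员"
# (start_offset, end_offset) pairs copied from the ik_max_word response above.
SPANS = [(0, 2), (3, 5), (4, 6), (6, 8), (7, 8), (8, 12), (8, 10), (10, 12)]

print(uncovered_chars(TEXT, SPANS))  # [(2, '是')]

# Assumed stop-word set, for illustration only.
print(drop_stop_words(["我们", "是", "数据"], {"是", "的"}))
```

The only character without a covering token is the one at offset 2, which matches the missing “是” discussed above.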

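ik_max_word and ik_smart as explained in the plugin's README

      https://github.com/medcl/elasticsearch-analysis-ik/tree/2.x

   In short, ik_max_word splits the text at the finest granularity and emits every word it can find, while ik_smart splits at the coarsest granularity.

Test with ik_smart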

[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl 'http://192.168.80.10:9200/zhouls/_analyze?analyzer=ik_smart&pretty=true' -d '{"text":"这里是好记性不如烂笔头感叹号的博客园"}'
{
"tokens" : [ {
"token" : "这里是",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "好",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 1
}, {
"token" : "记性",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "不如",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "烂",
"start_offset" : 8,
"end_offset" : 9,
"type" : "CN_CHAR",
"position" : 4
}, {

 

"token" : "笔头",
"start_offset" : 9,
"end_offset" : 11,
"type" : "CN_WORD",
"position" : 5
}, {
"token" : "感叹号",
"start_offset" : 11,
"end_offset" : 14,
"type" : "CN_WORD",
"position" : 6
}, {
"token" : "博客园",
"start_offset" : 15,
"end_offset" : 18,
"type" : "CN_WORD",
"position" : 7
} ]
}
[hadoop@HadoopMaster elasticsearch-2.4.3]$

[hadoop@HadoopMaster elasticsearch-2.4.3]$ curl 'http://192.168.80.10:9200/zhouls/_analyze?analyzer=ik_smart&pretty=true' -d '{"text":"我们是大数据开发技术人员"}'
{
"tokens" : [ {
"token" : "我们",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
}, {
"token" : "大",
"start_offset" : 3,
"end_offset" : 4,
"type" : "CN_CHAR",
"position" : 1
}, {
"token" : "数据",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
}, {
"token" : "开发",
"start_offset" : 6,
"end_offset" : 8,
"type" : "CN_WORD",
"position" : 3
}, {
"token" : "技术人员",
"start_offset" : 8,
"end_offset" : 12,
"type" : "CN_WORD",
"position" : 4
} ]

}
[hadoop@HadoopMaster elasticsearch-2.4.3]$
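The difference between the two modes is easiest to see side by side. The Python sketch below rebuilds the tokens of the first test sentence from the offsets in the two responses above and checks whether any tokens overlap. The offset lists are copied from this post; the function names are made up for the illustration.

```python
TEXT = "这里是好记性不如烂笔头感叹号的博客园"

# (start_offset, end_offset) pairs copied from the responses in this post.
IK_MAX_WORD_SPANS = [
    (0, 3), (0, 2), (1, 2), (3, 5), (4, 6), (6, 8), (8, 9), (9, 11), (9, 10),
    (10, 11), (11, 14), (11, 13), (12, 14), (12, 13), (13, 14), (15, 18),
    (15, 17), (17, 18),
]
IK_SMART_SPANS = [
    (0, 3), (3, 4), (4, 6), (6, 8), (8, 9), (9, 11), (11, 14), (15, 18),
]


def tokens_from_spans(text, spans):
    """Recover token strings by slicing the input at each span."""
    return [text[start:end] for start, end in spans]


def has_overlap(spans):
    """True if any two spans share at least one character position."""
    ordered = sorted(spans)
    # After sorting by start, any overlap shows up between neighbours.
    return any(nxt[0] < cur[1] for cur, nxt in zip(ordered, ordered[1:]))


for name, spans in (("ik_max_word", IK_MAX_WORD_SPANS),
                    ("ik_smart", IK_SMART_SPANS)):
    tokens = tokens_from_spans(TEXT, spans)
    print(name, len(tokens), "tokens, overlapping:", has_overlap(spans))
    print("  ", " / ".join(tokens))
```

For this sentence ik_max_word emits 18 overlapping tokens and ik_smart emits 8 non-overlapping ones, which is the finest-grained versus coarsest-grained split in practice.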
