Elasticsearch Analyzers

Reader submission, 2022-11-17

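Elasticsearch splits the data in a document into keywords that each carry a complete meaning, and maps every keyword to the documents that contain it, so that documents can be looked up by keyword. To tokenize text correctly, you need to choose a suitable analyzer.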
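The keyword-to-document mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch stores its index; the function name, the sample documents, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs, tokenize):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical sample documents; a lowercase whitespace split stands in for an analyzer.
docs = {1: "iphone13 is the better", 2: "the better phone"}
index = build_inverted_index(docs, lambda text: text.lower().split())

print(index["better"])    # ids of the documents that contain "better"
print(index["iphone13"])  # ids of the documents that contain "iphone13"
```

The quality of the lookup depends entirely on the tokenizer passed in, which is why the choice of analyzer matters.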

1. Default analyzer

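standard analyzer: the default Elasticsearch analyzer. It splits English text on whitespace and punctuation and converts every word to lowercase. The default analyzer is geared towards English; Chinese text is split into one token per character. To view the result of an analyzer: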

GET /_analyze
{
  "text": "<text to analyze>",
  "analyzer": "<analyzer name>"
}
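Outside of a console, the same request can be prepared from Python's standard library. The sketch below only builds the request and does not send it; the node address http://localhost:9200 and the function name are assumptions, and POST is used because the _analyze endpoint accepts both GET and POST.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer, base_url="http://localhost:9200"):
    """Build (but do not send) a request for the _analyze API."""
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_analyze_request("iphone13 is the better", "standard")
print(req.get_method(), req.full_url)

# To send it against a running node (assumed to listen on base_url):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["tokens"])
```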

1. English text

GET /_analyze
{
  "text": "iphone13 is the better",
  "analyzer": "standard"
}

Result:

{
  "tokens" : [
    { "token" : "iphone13", "start_offset" : 0, "end_offset" : 8, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "is", "start_offset" : 9, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "the", "start_offset" : 12, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "better", "start_offset" : 16, "end_offset" : 22, "type" : "<ALPHANUM>", "position" : 3 }
  ]
}

2. Chinese text

GET /_analyze
{
  "text": "科比是NBA最伟大的运动员",
  "analyzer": "standard"
}

Result:

{
  "tokens" : [
    { "token" : "科", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 0 },
    { "token" : "比", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "是", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "nba", "start_offset" : 3, "end_offset" : 6, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "最", "start_offset" : 6, "end_offset" : 7, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "伟", "start_offset" : 7, "end_offset" : 8, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "大", "start_offset" : 8, "end_offset" : 9, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "的", "start_offset" : 9, "end_offset" : 10, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "运", "start_offset" : 10, "end_offset" : 11, "type" : "<IDEOGRAPHIC>", "position" : 8 },
    { "token" : "动", "start_offset" : 11, "end_offset" : 12, "type" : "<IDEOGRAPHIC>", "position" : 9 },
    { "token" : "员", "start_offset" : 12, "end_offset" : 13, "type" : "<IDEOGRAPHIC>", "position" : 10 }
  ]
}
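The behaviour seen in these two results can be approximated with a regular expression. This is a rough sketch, not the standard analyzer itself (which follows Unicode text segmentation rules); it assumes input made only of ASCII letters, digits, and common CJK ideographs, and the function name is hypothetical.

```python
import re

# One token per run of ASCII letters/digits, or per single CJK ideograph.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def standard_like_tokens(text):
    """Approximate the standard analyzer: split, lowercase, record offsets."""
    tokens = []
    for position, match in enumerate(TOKEN_RE.finditer(text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for tok in standard_like_tokens("科比是NBA最伟大的运动员"):
    print(tok)
```

Every Chinese character becomes its own token, which is why a query for a word such as 运动员 cannot be matched as a unit with this analyzer.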

2. IK analyzer

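Concept: IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. It provides two segmentation modes:

1. ik_smart: the coarsest segmentation (fewest splits)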

GET /_analyze
{
  "text": "科比是NBA最伟大的运动员",
  "analyzer": "ik_smart"
}

This produces 8 tokens:

{
  "tokens" : [
    { "token" : "科", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "比", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "是", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "nba", "start_offset" : 3, "end_offset" : 6, "type" : "ENGLISH", "position" : 3 },
    { "token" : "最", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "伟大", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 5 },
    { "token" : "的", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "运动员", "start_offset" : 10, "end_offset" : 13, "type" : "CN_WORD", "position" : 7 }
  ]
}

2. ik_max_word: the finest-grained segmentation

GET /_analyze
{
  "text": "科比是NBA最伟大的运动员",
  "analyzer": "ik_max_word"
}

This produces 10 tokens:

{
  "tokens" : [
    { "token" : "科", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "比", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "是", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "nba", "start_offset" : 3, "end_offset" : 6, "type" : "ENGLISH", "position" : 3 },
    { "token" : "最", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "伟大", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 5 },
    { "token" : "的", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 6 },
    { "token" : "运动员", "start_offset" : 10, "end_offset" : 13, "type" : "CN_WORD", "position" : 7 },
    { "token" : "运动", "start_offset" : 10, "end_offset" : 12, "type" : "CN_WORD", "position" : 8 },
    { "token" : "动员", "start_offset" : 11, "end_offset" : 13, "type" : "CN_WORD", "position" : 9 }
  ]
}
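The difference between the two modes can be illustrated with a toy dictionary segmenter. This is not IK's actual algorithm: the coarse mode below is plain forward maximum matching, the fine mode simply lists every dictionary word found at any offset, and the four-word dictionary is a hypothetical stand-in for main.dic.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Coarse mode: at each position take the longest dictionary word, else one character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def all_dictionary_matches(text, dictionary, max_len=4):
    """Fine mode: every dictionary word at any offset, plus characters no word covers."""
    found = []
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(start + 2, min(len(text), start + max_len) + 1):
            if text[start:end] in dictionary:
                found.append((start, end, text[start:end]))
                for i in range(start, end):
                    covered[i] = True
    for i, ch in enumerate(text):
        if not covered[i]:
            found.append((i, i + 1, ch))
    # Order by start offset, longer words first when they start at the same offset.
    found.sort(key=lambda m: (m[0], -m[1]))
    return [word for _, _, word in found]


# Hypothetical miniature dictionary.
dictionary = {"伟大", "运动员", "运动", "动员"}
print(forward_max_match("最伟大的运动员", dictionary))
print(all_dictionary_matches("最伟大的运动员", dictionary))
```

The fine mode returns the overlapping words 运动员, 运动, and 动员, which mirrors why ik_max_word reports more tokens than ik_smart for the same sentence.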

Installation: unzip elasticsearch-analysis-ik and copy the extracted folder into the plugins directory of Elasticsearch.

[root@node0 plugins]# ls
elasticsearch-analysis-ik-7.12.1

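Note: the version of the IK analyzer must match the version of Elasticsearch. Restart Elasticsearch after installing the plugin.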
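A quick sanity check for the version rule might look like the following. It assumes the plugin folder name ends with the plugin version, as in the listing above; the function name is hypothetical, and the Elasticsearch version would have to be supplied by you.

```python
def plugin_matches_es(plugin_dir_name, es_version):
    """Compare the version suffix of the plugin folder name with the Elasticsearch version."""
    plugin_version = plugin_dir_name.rsplit("-", 1)[-1]
    return plugin_version == es_version


print(plugin_matches_es("elasticsearch-analysis-ik-7.12.1", "7.12.1"))
print(plugin_matches_es("elasticsearch-analysis-ik-7.12.1", "7.13.0"))
```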

Dictionaries

[root@node0 config]# ll
total 8268
-rw-r--r-- 1 root root       7 Nov 29 17:57 ext_dict.dic
-rw-r--r-- 1 root root 5225922 Apr 25  2021 extra_main.dic
-rw-r--r-- 1 root root   63188 Apr 25  2021 extra_single_word.dic
-rw-r--r-- 1 root root   63188 Apr 25  2021 extra_single_word_full.dic
-rw-r--r-- 1 root root   10855 Apr 25  2021 extra_single_word_low_freq.dic
-rw-r--r-- 1 root root     156 Apr 25  2021 extra_stopword.dic
-rw-r--r-- 1 root root       7 Nov 29 18:07 ext_stopwords.dic
-rw-r--r-- 1 root root     654 Nov 29 17:56 IKAnalyzer.cfg.xml
-rw-r--r-- 1 root root 3058510 Apr 25  2021 main.dic
-rw-r--r-- 1 root root     123 Apr 25  2021 preposition.dic
-rw-r--r-- 1 root root    1824 Apr 25  2021 quantifier.dic
-rw-r--r-- 1 root root     164 Apr 25  2021 stopword.dic
-rw-r--r-- 1 root root     192 Apr 25  2021 suffix.dic
-rw-r--r-- 1 root root     752 Apr 25  2021 surname.dic

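In IKAnalyzer.cfg.xml (the IK Analyzer extension configuration), the ext_dict entry is set to ext_dict.dic and the ext_stopwords entry is set to ext_stopwords.dic.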

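The IK analyzer segments text against dictionaries. The dictionary files are in the config directory of the IK plugin.

1. main.dic: the dictionary built into IK. It records the Chinese words that IK has collected.
2. IKAnalyzer.cfg.xml: used to configure custom dictionaries.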

ext_dict: a custom extension dictionary that extends main.dic.
ext_stopwords: custom stop words.
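To check which custom files a configuration points to, the entries can be read with Python's standard XML parser. The sample content below is an assumption: it follows the Java properties-XML layout and contains only the two entries this article mentions, so a real IKAnalyzer.cfg.xml may hold more.

```python
import xml.etree.ElementTree as ET

# Assumed file content with the two values used in this article.
SAMPLE_CFG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">ext_dict.dic</entry>
    <entry key="ext_stopwords">ext_stopwords.dic</entry>
</properties>
"""


def read_ik_entries(xml_text):
    """Return a dict of entry key -> configured file name."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return {e.get("key"): (e.text or "").strip() for e in root.iter("entry")}


print(read_ik_entries(SAMPLE_CFG))
```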

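All of IK's .dic dictionary files must use the UTF-8 character set. Editing them with Notepad is not recommended, because Notepad on a Chinese-locale Windows system may save the file as GBK.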
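One way to avoid the encoding problem is to write the dictionary from a script and verify the bytes afterwards. The sketch below works in a temporary directory; the helper names are hypothetical, and the one-word-per-line layout is assumed from the 7-byte ext_dict.dic in the listing above.

```python
import os
import tempfile


def write_dictionary(path, words):
    """Write one word per line, explicitly encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("\n".join(words) + "\n")


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


tmp_dir = tempfile.mkdtemp()

good = os.path.join(tmp_dir, "ext_dict.dic")
write_dictionary(good, ["科比"])

# Simulate a file saved by an editor that used GBK.
bad = os.path.join(tmp_dir, "gbk.dic")
with open(bad, "w", encoding="gbk", newline="\n") as f:
    f.write("科比\n")

print(is_utf8(good), is_utf8(bad))
```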

Testing the analyzer: add “科比” to the ext_dict.dic file.

GET /_analyze
{
  "text": "科比是NBA最伟大的运动员",
  "analyzer": "ik_smart"
}

The result now treats “科比” as a single token:

{
  "tokens" : [
    { "token" : "科比", "start_offset" : 0, "end_offset" : 2, "type" : "CN_WORD", "position" : 0 },
    { "token" : "是", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 1 },
    { "token" : "nba", "start_offset" : 3, "end_offset" : 6, "type" : "ENGLISH", "position" : 2 },
    { "token" : "最", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 3 },
    { "token" : "伟大", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 4 },
    { "token" : "的", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 5 },
    { "token" : "运动员", "start_offset" : 10, "end_offset" : 13, "type" : "CN_WORD", "position" : 6 }
  ]
}

Next, add “科比” to the ext_stopwords.dic file and test the analyzer again.

GET /_analyze
{
  "text": "科比是NBA最伟大的运动员",
  "analyzer": "ik_smart"
}

“科比” is now filtered out of the tokens:

{
  "tokens" : [
    { "token" : "是", "start_offset" : 2, "end_offset" : 3, "type" : "CN_CHAR", "position" : 0 },
    { "token" : "nba", "start_offset" : 3, "end_offset" : 6, "type" : "ENGLISH", "position" : 1 },
    { "token" : "最", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 2 },
    { "token" : "伟大", "start_offset" : 7, "end_offset" : 9, "type" : "CN_WORD", "position" : 3 },
    { "token" : "的", "start_offset" : 9, "end_offset" : 10, "type" : "CN_CHAR", "position" : 4 },
    { "token" : "运动员", "start_offset" : 10, "end_offset" : 13, "type" : "CN_WORD", "position" : 5 }
  ]
}
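The effect of the stop word file on this result can be mimicked with a small filter. This is only an illustration (IK applies stop words internally); the function name is hypothetical, and the input is the token list from the previous ik_smart result.

```python
def drop_stop_words(tokens, stop_words):
    """Remove stop words and renumber positions from zero, as in the result above."""
    kept = [t for t in tokens if t not in stop_words]
    return [{"token": t, "position": i} for i, t in enumerate(kept)]


tokens = ["科比", "是", "nba", "最", "伟大", "的", "运动员"]
for tok in drop_stop_words(tokens, {"科比"}):
    print(tok)
```

Because the stop word never reaches the index, a search for 科比 would no longer match this sentence.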

3. Pinyin analyzer

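Concept: the pinyin analyzer converts Chinese text into the corresponding full pinyin, the initials of the full pinyin, and so on.

Installation: unzip elasticsearch-analysis-pinyin and copy the extracted folder into the plugins directory of Elasticsearch.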

[root@node0 plugins]# ll
total 8
drwxr-xr-x 3 root root 4096 Nov 29 17:38 elasticsearch-analysis-ik-7.12.1
drwxr-xr-x 2 root root 4096 Nov 29 18:36 elasticsearch-analysis-pinyin-7.12.1

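Note: the version of the pinyin analyzer must match the version of Elasticsearch. Restart Elasticsearch after installing the plugin.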

Test the analysis result

GET /_analyze
{
  "text": "科比是NBA最伟大的运动员",
  "analyzer": "pinyin"
}

The tokens are pinyin:

{
  "tokens" : [
    { "token" : "ke", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "kbsnbazwddydy", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 0 },
    { "token" : "bi", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 1 },
    { "token" : "shi", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 2 },
    { "token" : "n", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 3 },
    { "token" : "ba", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 4 },
    { "token" : "zui", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 5 },
    { "token" : "wei", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 6 },
    { "token" : "da", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 7 },
    { "token" : "de", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 8 },
    { "token" : "yun", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 9 },
    { "token" : "dong", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 10 },
    { "token" : "yuan", "start_offset" : 0, "end_offset" : 0, "type" : "word", "position" : 11 }
  ]
}
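How the full-pinyin tokens and the combined initials token relate to the input can be sketched with a lookup table. The table below is hand-written and covers only the characters in the example sentence; it is an assumption for illustration, not the plugin's data or algorithm, and it does not reproduce how the plugin splits the Latin letters "NBA".

```python
# Hypothetical lookup table covering only the characters of the example sentence.
PINYIN = {
    "科": "ke", "比": "bi", "是": "shi", "最": "zui", "伟": "wei",
    "大": "da", "的": "de", "运": "yun", "动": "dong", "员": "yuan",
}


def full_pinyin(text):
    """Full pinyin of every Chinese character found in the table."""
    return [PINYIN[ch] for ch in text if ch in PINYIN]


def first_letters(text):
    """Initial of each Chinese character; ASCII letters and digits are kept, lowercased."""
    parts = []
    for ch in text:
        if ch in PINYIN:
            parts.append(PINYIN[ch][0])
        elif ch.isascii() and ch.isalnum():
            parts.append(ch.lower())
    return "".join(parts)


sentence = "科比是NBA最伟大的运动员"
print(full_pinyin(sentence))
print(first_letters(sentence))
```

The combined initials token is what lets a user type a short abbreviation and still match the original Chinese text.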
