An analyzer is the component responsible for text analysis (tokenization). It is made up of three parts: character filters, a tokenizer, and token filters.
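To make the three parts concrete, here is a minimal sketch of a custom analyzer that chains a character filter (html_strip), a tokenizer (standard), and token filters (lowercase, stop). The index name my_index and analyzer name my_analyzer are made-up names for illustration:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>2 running Quick brown-foxes</p>"
}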
Analyzer examples:
GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

=================== Result ===================
{
  "tokens" : [
    { "token" : "2",       "start_offset" : 0,  "end_offset" : 1,  "type" : "<NUM>",      "position" : 0 },
    { "token" : "running", "start_offset" : 2,  "end_offset" : 9,  "type" : "<ALPHANUM>", "position" : 1 },
    ......
    { "token" : "evening", "start_offset" : 62, "end_offset" : 69, "type" : "<ALPHANUM>", "position" : 12 }
  ]
}
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

=================== Result ===================
{
  "tokens" : [
    { "token" : "running", "start_offset" : 2,  "end_offset" : 9,  "type" : "word", "position" : 0 },
    { "token" : "quick",   "start_offset" : 10, "end_offset" : 15, "type" : "word", "position" : 1 },
    ......
    { "token" : "evening", "start_offset" : 62, "end_offset" : 69, "type" : "word", "position" : 11 }
  ]
}
#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "他说的确实在理"
}

POST _analyze
{
  "analyzer": "standard",
  "text": "他说的确实在理"
}

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个苹果不大好吃"
}
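The _analyze calls above only test how an analyzer segments text. To actually use one at index time, set it in the field mapping. A minimal sketch, assuming the analysis-icu plugin is installed; the index name my_cn_index and field name content are made up:

PUT my_cn_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "icu_analyzer"
      }
    }
  }
}

PUT my_cn_index/_doc/1
{
  "content": "这个苹果不大好吃"
}

The same analyzer is then applied to query text at search time unless a separate search_analyzer is configured on the field.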
Note that the icu_analyzer (and likewise the ik analyzer) is not bundled with Elasticsearch 7.8.0. You need to install it yourself by running ./bin/elasticsearch-plugin install analysis-icu and then restart Elasticsearch before it can be used.
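The ik plugin is installed the same way, but from a release zip rather than a plugin name. The URL below is an assumption based on the project's usual release naming for version 7.8.0; check the project page for the asset matching your Elasticsearch version:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.8.0/elasticsearch-analysis-ik-7.8.0.zip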
IK analyzer: supports custom dictionaries and hot updating of the dictionary.
https://gitee.com/mirrors/elasticsearch-analysis-ik?_from=gitee_search
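Once installed, IK exposes two analyzers: ik_max_word (fine-grained, exhaustive segmentation) and ik_smart (coarse-grained segmentation). A quick comparison in the same style as the examples above:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "这个苹果不大好吃"
}

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "这个苹果不大好吃"
}

Custom dictionaries and hot updates are configured in IK's IKAnalyzer.cfg.xml (the ext_dict and remote_ext_dict entries); see the project README linked above for the exact setup.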
THULAC: a Chinese word segmentation toolkit from the Natural Language Processing and Social Humanities Computing Lab at Tsinghua University.
https://gitee.com/puremilk/THULAC-Python?_from=gitee_search