一款运行于Elasticsearch之上的中文拼音智能分词插件,支持全拼、首字母、中文混合搜索
LC version |
ES version |
---|---|
master | 5.3.0 -> master |
5.3.0.1 | 5.3.0 |
5.2.2.1 | 5.2.2 |
5.2.0.1 | 5.2.0 |
5.1.2.1 | 5.1.2 |
5.1.1.1 | 5.1.1 |
5.0.2.1 | 5.0.2 |
5.0.1.2 | 5.0.1 |
5.0.1.1 | 5.0.1 |
2.4.2.1 | 2.4.2 |
2.2.2.1 | 2.2.2 |
1.4.5.2 | 1.4.5 |
1.4.5.1 | 1.4.5 |
elasticsearch-analysis-lc-pinyin是一款elasticsearch拼音分词插件,可以支持按照全拼、首字母,中文混合搜索。 例如我们在某宝搜索框中输入“jianpan” 可以搜索到关键字包含“键盘”的商品。不仅仅输入全拼,有时候我们输入首字母、拼音和首字母、中文和首字母的混合输入,比如:“键pan”、“j盘”、“jianp”、“jpan”、“jianp”、“jp” 等等,都应该匹配到键盘。通过elasticsearch-analysis-lc-pinyin这个插件就能做到类似的搜索
- 此拼音插件主要用在
短文档的搜索上,如文章的标题、作者,商品的品牌等,不建议用在长文档中
lc_index: 该分词器用于索引数据时指定,将中文转换为全拼和首字,同时保留中文 lc_search: 该分词器用于拼音搜索时指定,按最小拼音分词个数拆分拼音,优先拆分全拼
lc_index:参数 mode: full_pinyin,first_letter,chinese_char lc_search:参数 mode: smart_pinyin,single_letter
lc_full_pinyin:将中文Token转全拼(支持多音字) lc_first_letter:将中文Token转首字母(支持多音字)
以下是使用ik分词结合拼音过滤器使用例子: 1. 创建一个索引,并定义分析器
bash PUT /index { "settings": { "analysis": { "analyzer": { "ik_letter_smart": { "type": "custom", "tokenizer": "ik_max_word", "filter": [ "lc_first_letter" ] }, "ik_py_smart": { "type": "custom", "tokenizer": "ik_max_word", "filter": [ "lc_full_pinyin" ] } } } } }2. 测试分词 ```bash curl -XGET 'http://192.168.0.109:9200/index/analyze?analyzer=ikpy_smart&pretty&text=英雄难过美人关'
{ "tokens" : [ { "token" : "yingxiongnanguomeirenguan", "startoffset" : 0, "endoffset" : 7, "type" : "CNWORD", "position" : 0 }, { "token" : "yingxiongnanguo", "startoffset" : 0, "endoffset" : 4, "type" : "CNWORD", "position" : 1 }, { "token" : "yingxiong", "startoffset" : 0, "endoffset" : 2, "type" : "CNWORD", "position" : 2 }, { "token" : "nanguo", "startoffset" : 2, "endoffset" : 4, "type" : "CNWORD", "position" : 3 }, { "token" : "meirenguan", "startoffset" : 4, "endoffset" : 7, "type" : "CNWORD", "position" : 4 }, { "token" : "meiren", "startoffset" : 4, "endoffset" : 6, "type" : "CNWORD", "position" : 5 }, { "token" : "guan", "startoffset" : 6, "endoffset" : 7, "type" : "CN_CHAR", "position" : 6 } ] } ```
1.创建一个索引
index
curl -XPUT http://localhost:9200/index
2.创建类型
brand的mapping
curl -XPOST http://localhost:9200/index/_mapping/brand -d' { "brand": { "properties": { "name": { "type": "text", "analyzer": "lc_index", "search_analyzer": "lc_search", "term_vector": "with_positions_offsets" } } } }'
3.索引一些互联网品牌
curl -XPOST http://localhost:9200/index/brand/1 -d'{"name":"百度"}' curl -XPOST http://localhost:9200/index/brand/8 -d'{"name":"百度糯米"}' curl -XPOST http://localhost:9200/index/brand/2 -d'{"name":"阿里巴巴"}' curl -XPOST http://localhost:9200/index/brand/3 -d'{"name":"腾讯科技"}' curl -XPOST http://localhost:9200/index/brand/4 -d'{"name":"网易游戏"}' curl -XPOST http://localhost:9200/index/brand/9 -d'{"name":"大众点评"}' curl -XPOST http://localhost:9200/index/brand/10 -d'{"name":"携程旅行网"}'
4.编写高亮查询DSL
此示例通过
lc_search分词器配合
match_phrase查询实现品牌的
全拼搜索 搜索全拼关键字
baidu,请求DSL如下:
bash curl -XPOST http://localhost:9200/index/brand/_search -d' { "query": { "match": { "name": { "query": "baidu", "analyzer": "lc_search", "type": "phrase" } } }, "highlight" : { "pre_tags" : [""], "post_tags" : [""], "fields" : { "name" : {} } } }'
匹配到
百度、
百度糯米两个品牌 tip:
百度排在
百度糯米的前面,因为name字段长度更短 查询结果:
bash { "took": 18, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 2, "max_score": 2.5751648, "hits": [ { "_index": "index", "_type": "brand", "_id": "1", "_score": 2.5751648, "_source": { "name": "百度" }, "highlight": { "name": [ "百度" ] } } , { "_index": "index", "_type": "brand", "_id": "8", "_score": 2.0601318, "_source": { "name": "百度糯米" }, "highlight": { "name": [ "百度糯米" ] } } ] } }
此示例通过
lc_search分词器配合
match_phrase查询实现品牌的
中文&全拼搜索 搜索全拼关键字
xie程lu行wang,请求DSL如下:
bash curl -XPOST http://localhost:9200/index/brand/_search -d' { "query": { "match": { "name": { "query": "xie程lu行", "analyzer": "lc_search", "type": "phrase" } } }, "highlight" : { "pre_tags" : [""], "post_tags" : [""], "fields" : { "name" : {} } } }'
匹配到
携程旅行网结果如下: ```bash
携程旅行网结果如下:
{ "took": 4, "timedout": false, "shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 1, "maxscore": 4.5665164, "hits": [ { "index": "index", "type": "brand", "id": "10", "score": 4.5665164, "source": { "name": "携程旅行网" }, "highlight": { "name": [ "携程旅行网" ] } } ] } }
```bash # 此示例通过`lc_search`分词器配合`match_phrase`查询实现品牌的`首字母`搜索 # 此示例中也可以通过`lc_first_letter`分词器搜索,结果和`lc_search`一样 # # 这两个分词器的主要区别: # lc_first_letterl 会把所有输入的字母拆分成单字母用于首字母匹配 # lc_search 会优先把输入的字母串拆分成全拼,并找到一个最优拆分结果 # # 搜索全拼关键字`albb`,请求DSL如下: curl -XPOST http://localhost:9200/index/brand/_search -d' { "query": { "match": { "name": { "query": "albb", "analyzer": "lc_search", "type": "phrase" } } }, "highlight" : { "pre_tags" : [""], "post_tags" : [""], "fields" : { "name" : {} } } }'
匹配到
阿里巴巴,结果如下:
bash { "took": 4, "timed_out": false, "_shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 1, "max_score": 3.9560113, "hits": [ { "_index": "index", "_type": "brand", "_id": "2", "_score": 3.9560113, "_source": { "name": "阿里巴巴" }, "highlight": { "name": [ "阿里巴巴" ] } } ] } }
//java api实现 //该查询会匹配`阿里巴巴`这条数据 QueryBuilder pinyinQueryBuilder = QueryBuilders.matchPhraseQuery("name", "ali巴b").analyzer("lc_search"); SearchRequestBuilder requestBuilder = client.prepareSearch("index").setTypes("brand"); requestBuilder.setQuery(pinyinQueryBuilder) .setHighlighterPreTags("") .setHighlighterPostTags("") .addHighlightedField("name") .execute().actionGet();//该查询会匹配
大众点评
这条数据 QueryBuilder pinyinQueryBuilder = QueryBuilders.matchPhraseQuery("name", "dzdp").analyzer("lc_first_letter"); SearchRequestBuilder requestBuilder = client.prepareSearch("index").setTypes("brand"); requestBuilder.setQuery(pinyinQueryBuilder) .setHighlighterPreTags("") .setHighlighterPostTags("") .addHighlightedField("name") .execute().actionGet();//该查询也会匹配
大众点评
这条数据 QueryBuilder pinyinQueryBuilder = QueryBuilders.matchPhraseQuery("name", "dzdp").analyzer("lc_search"); SearchRequestBuilder requestBuilder = client.prepareSearch("index").setTypes("brand"); requestBuilder.setQuery(pinyinQueryBuilder) .setHighlighterPreTags("") .setHighlighterPostTags("") .addHighlightedField("name") .execute().actionGet();
作者: @陈楠 Email: [email protected]
<完>