elasticsearch-analysis-lc-pinyin

by gitchennan

一款运行于Elasticsearch之上的中文拼音智能分词插件,支持全拼、首字母、中文混合搜索

127 Stars 47 Forks Last release: Not found Artistic License 2.0 46 Commits 11 Releases

Available items

No Items, yet!

The developer of this repository has not created any items for sale yet. Need a bug fixed? Help with integration? A different license? Create a request here:

LC Pinyin Analysis for Elasticsearch

Lc Pinyin版本

LC version

ES version
master 5.3.0 -> master
5.3.0.1 5.3.0
5.2.2.1 5.2.2
5.2.0.1 5.2.0
5.1.2.1 5.1.2
5.1.1.1 5.1.1
5.0.2.1 5.0.2
5.0.1.2 5.0.1
5.0.1.1 5.0.1
2.4.2.1 2.4.2
2.2.2.1 2.2.2
1.4.5.2 1.4.5
1.4.5.1 1.4.5

Lc Pinyin介绍

elasticsearch-analysis-lc-pinyin
是一款
elasticsearch
拼音分词插件,可以支持按照全拼、首字母,中文混合搜索。 例如我们在某宝搜索框中输入“jianpan” 可以搜索到关键字包含“键盘”的商品。不仅仅输入全拼,有时候我们输入首字母、拼音和首字母、中文和首字母的混合输入,比如:“键pan”、“j盘”、“jianp”、“jpan”、“jianp”、“jp” 等等,都应该匹配到键盘。通过elasticsearch-analysis-lc-pinyin这个插件就能做到类似的搜索
  • 此拼音插件主要用在
    短文档
    的搜索上,如文章的标题、作者,商品的品牌等,不建议用在
    长文档

分析器 - Analyzer

  • lc_index
    : 该分词器用于索引数据时指定,将中文转换为全拼和首字,同时保留中文
  • lc_search
    : 该分词器用于拼音搜索时指定,按最小拼音分词个数拆分拼音,优先拆分全拼

分词器 - Tokenizer

  • lc_index
    :参数 mode: full_pinyinfirst_letterchinese_char
  • lc_search
    :参数 mode: smart_pinyinsingle_letter

过滤器 - TokenFilter

  • lc_full_pinyin
    :将中文Token转全拼(支持多音字)
  • lc_first_letter
    :将中文Token转首字母(支持多音字)

过滤器使用示例

以下是使用ik分词结合拼音过滤器使用例子: 1. 创建一个索引,并定义分析器

bash
PUT /index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_letter_smart": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "lc_first_letter"
          ]
        },
        "ik_py_smart": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "lc_full_pinyin"
          ]
        }
      }
    }
  }
}
2. 测试分词 ```bash curl -XGET 'http://192.168.0.109:9200/index/analyze?analyzer=ikpy_smart&pretty&text=英雄难过美人关'

返回词条:

{ "tokens" : [ { "token" : "yingxiongnanguomeirenguan", "startoffset" : 0, "endoffset" : 7, "type" : "CNWORD", "position" : 0 }, { "token" : "yingxiongnanguo", "startoffset" : 0, "endoffset" : 4, "type" : "CNWORD", "position" : 1 }, { "token" : "yingxiong", "startoffset" : 0, "endoffset" : 2, "type" : "CNWORD", "position" : 2 }, { "token" : "nanguo", "startoffset" : 2, "endoffset" : 4, "type" : "CNWORD", "position" : 3 }, { "token" : "meirenguan", "startoffset" : 4, "endoffset" : 7, "type" : "CNWORD", "position" : 4 }, { "token" : "meiren", "startoffset" : 4, "endoffset" : 6, "type" : "CNWORD", "position" : 5 }, { "token" : "guan", "startoffset" : 6, "endoffset" : 7, "type" : "CN_CHAR", "position" : 6 } ] } ```

分析器使用示例

1.创建一个索引

index
curl -XPUT http://localhost:9200/index

2.创建类型

brand
的mapping
curl -XPOST http://localhost:9200/index/_mapping/brand -d'
{
  "brand": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "lc_index",
        "search_analyzer": "lc_search",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}'

3.索引一些互联网品牌

curl -XPOST http://localhost:9200/index/brand/1 -d'{"name":"百度"}'
curl -XPOST http://localhost:9200/index/brand/8 -d'{"name":"百度糯米"}'
curl -XPOST http://localhost:9200/index/brand/2 -d'{"name":"阿里巴巴"}'
curl -XPOST http://localhost:9200/index/brand/3 -d'{"name":"腾讯科技"}'
curl -XPOST http://localhost:9200/index/brand/4 -d'{"name":"网易游戏"}'
curl -XPOST http://localhost:9200/index/brand/9 -d'{"name":"大众点评"}'
curl -XPOST http://localhost:9200/index/brand/10 -d'{"name":"携程旅行网"}'

4.编写高亮查询DSL

此示例通过

lc_search
分词器配合
match_phrase
查询实现品牌的
全拼
搜索 搜索全拼关键字
baidu
,请求DSL如下:
bash
curl -XPOST http://localhost:9200/index/brand/_search  -d'
{
    "query": {
        "match": {
          "name": {
            "query": "baidu",
            "analyzer": "lc_search",
            "type": "phrase"
          }
        }
    },
    "highlight" : {
        "pre_tags" : [""],
        "post_tags" : [""],
        "fields" : {
            "name" : {}
        }
    }
}'

匹配到

百度
百度糯米
两个品牌 tip:
百度
排在
百度糯米
的前面,因为name字段长度更短 查询结果:
bash
{
    "took": 18,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 2.5751648,
        "hits": [
            {
                "_index": "index",
                "_type": "brand",
                "_id": "1",
                "_score": 2.5751648,
                "_source": {
                    "name": "百度"
                },
                "highlight": {
                    "name": [
                        "百度"
                    ]
                }
            }
            ,
            {
                "_index": "index",
                "_type": "brand",
                "_id": "8",
                "_score": 2.0601318,
                "_source": {
                    "name": "百度糯米"
                },
                "highlight": {
                    "name": [
                        "百度糯米"
                    ]
                }
            }
        ]
    }
}

此示例通过

lc_search
分词器配合
match_phrase
查询实现品牌的
中文&全拼
搜索 搜索全拼关键字
xie程lu行wang
,请求DSL如下:
bash
curl -XPOST http://localhost:9200/index/brand/_search  -d'
{
    "query": {
        "match": {
          "name": {
            "query": "xie程lu行",
            "analyzer": "lc_search",
            "type": "phrase"
          }
        }
    },
    "highlight" : {
        "pre_tags" : [""],
        "post_tags" : [""],
        "fields" : {
            "name" : {}
        }
    }
}'

匹配到

携程旅行网
结果如下: ```bash

匹配到
携程旅行网
结果如下:

{ "took": 4, "timedout": false, "shards": { "total": 1, "successful": 1, "failed": 0 }, "hits": { "total": 1, "maxscore": 4.5665164, "hits": [ { "index": "index", "type": "brand", "id": "10", "score": 4.5665164, "source": { "name": "携程旅行网" }, "highlight": { "name": [ "携程旅行网" ] } } ] } }

```bash
# 此示例通过`lc_search`分词器配合`match_phrase`查询实现品牌的`首字母`搜索
# 此示例中也可以通过`lc_first_letter`分词器搜索,结果和`lc_search`一样
#
# 这两个分词器的主要区别:
#     lc_first_letterl 会把所有输入的字母拆分成单字母用于首字母匹配
#     lc_search        会优先把输入的字母串拆分成全拼,并找到一个最优拆分结果
#
# 搜索全拼关键字`albb`,请求DSL如下:
curl -XPOST http://localhost:9200/index/brand/_search  -d'
{
    "query": {
        "match": {
          "name": {
            "query": "albb",
            "analyzer": "lc_search",
            "type": "phrase"
          }
        }
    },
    "highlight" : {
        "pre_tags" : [""],
        "post_tags" : [""],
        "fields" : {
            "name" : {}
        }
    }
}'

匹配到

阿里巴巴
,结果如下:
bash
{
    "took": 4,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 3.9560113,
        "hits": [
            {
                "_index": "index",
                "_type": "brand",
                "_id": "2",
                "_score": 3.9560113,
                "_source": {
                    "name": "阿里巴巴"
                },
                "highlight": {
                    "name": [
                        "阿里巴巴"
                    ]
                }
            }
        ]
    }
}
//java api实现
//该查询会匹配`阿里巴巴`这条数据
QueryBuilder pinyinQueryBuilder =  QueryBuilders.matchPhraseQuery("name", "ali巴b").analyzer("lc_search");
        SearchRequestBuilder requestBuilder = client.prepareSearch("index").setTypes("brand");
        requestBuilder.setQuery(pinyinQueryBuilder)
                .setHighlighterPreTags("")
                .setHighlighterPostTags("")
                .addHighlightedField("name")
                .execute().actionGet();


//该查询会匹配大众点评这条数据 QueryBuilder pinyinQueryBuilder = QueryBuilders.matchPhraseQuery("name", "dzdp").analyzer("lc_first_letter"); SearchRequestBuilder requestBuilder = client.prepareSearch("index").setTypes("brand"); requestBuilder.setQuery(pinyinQueryBuilder) .setHighlighterPreTags("") .setHighlighterPostTags("") .addHighlightedField("name") .execute().actionGet();

//该查询也会匹配大众点评这条数据 QueryBuilder pinyinQueryBuilder = QueryBuilders.matchPhraseQuery("name", "dzdp").analyzer("lc_search"); SearchRequestBuilder requestBuilder = client.prepareSearch("index").setTypes("brand"); requestBuilder.setQuery(pinyinQueryBuilder) .setHighlighterPreTags("") .setHighlighterPostTags("") .addHighlightedField("name") .execute().actionGet();

作者: @陈楠 Email: [email protected]

<完>

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.