
STConvert Analysis for Elasticsearch

STConvert is an analyzer that converts Chinese characters between Traditional and Simplified Chinese. [Chinese Simplified/Traditional conversion][Simplified to Traditional][Traditional to Simplified][Simplified/Traditional query expansion]

You can download the pre-built package from the release page.
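Alternatively, the plugin can be installed straight from a release URL (a sketch: the URL pattern and the v7.9.2 version below are assumptions based on the release page, and the plugin version must match your Elasticsearch version):

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-stconvert/releases/download/v7.9.2/elasticsearch-analysis-stconvert-7.9.2.zip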

The plugin provides an analyzer, a tokenizer, a token filter, and a char filter, all named stconvert.

Supported config:

  • convert_type: default s2t. Optional values:
    1. s2t: convert characters from Simplified Chinese to Traditional Chinese
    2. t2s: convert characters from Traditional Chinese to Simplified Chinese
  • keep_both: default false. When set to true, the output keeps both the original and the converted text, joined by the delimiter (see the example below).
  • delimiter: default ",". The separator placed between the original and the converted text when keep_both is enabled.

Custom example:

PUT /stconvert/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "tsconvert" : {
                    "tokenizer" : "tsconvert"
                }
            },
            "tokenizer" : {
                "tsconvert" : {
                    "type" : "stconvert",
                    "delimiter" : "#",
                    "keep_both" : false,
                    "convert_type" : "t2s"
                }
            },
            "filter" : {
                "tsconvert" : {
                    "type" : "stconvert",
                    "delimiter" : "#",
                    "keep_both" : false,
                    "convert_type" : "t2s"
                }
            },
            "char_filter" : {
                "tsconvert" : {
                    "type" : "stconvert",
                    "convert_type" : "t2s"
                }
            }
        }
    }
}

Analyze tests

GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["tsconvert"],
  "text" : "国际國際"
}

Output:

{
  "tokens" : [
    {
      "token" : "国际国际",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    }
  ]
}
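The analyzer defined in the custom example can be tested the same way (a sketch, assuming the stconvert index created above):

GET stconvert/_analyze
{
  "analyzer" : "tsconvert",
  "text" : "國際"
}

Since the tsconvert tokenizer uses convert_type t2s, this should return a single token 国际.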

Normalizer usage

DELETE index
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [
            "tsconvert"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

PUT index/_doc/1
{
  "foo": "國際"
}

PUT index/_doc/2
{
  "foo": "国际"
}

GET index/_search
{
  "query": {
    "term": {
      "foo": "国际"
    }
  }
}

GET index/_search
{
  "query": {
    "term": {
      "foo": "國際"
    }
  }
}
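Because the normalizer applies the t2s char filter at both index and query time, both term queries above should match both documents, regardless of whether the stored or queried text is Traditional or Simplified.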
