Kuromoji is a self-contained, easy-to-use Japanese morphological analyzer designed for search.
Several other features are supported. Please consult each dictionary's `Token` class for details.
The example below shows how to use the Kuromoji morphological analyzer in its simplest form: segmenting text into tokens and printing the features of each token.
```java
package com.atilika.kuromoji.example;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

import java.util.List;

public class KuromojiExample {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        // Segment the sentence into tokens
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            // Print the surface form and all features, separated by a tab
            System.out.println(token.getSurface() + "\t" + token.getAllFeatures());
        }
    }
}
```
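Make sure you add the dependency below to your `pom.xml` before building your project.

```xml
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>
```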
When running the above program, you will get the following output:
```
お	接頭詞,名詞接続,*,*,*,*,お,オ,オ
寿司	名詞,一般,*,*,*,*,寿司,スシ,スシ
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ	動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。	記号,句点,*,*,*,*,。,。,。
```
See the documentation for the `com.atilika.kuromoji.ipadic.Token` class for more information on the per-token features available.
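Each line of the output above consists of a surface form, a tab, and a comma-separated feature string. As a rough sketch that uses only the Java standard library (no Kuromoji classes), such a line can be split back into its parts. `FeatureLine` is a hypothetical helper, and treating index 6 as the base form is an assumption read off the IPADIC sample output; other dictionaries lay out their features differently. The sketch also assumes that no feature value contains a comma.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Kuromoji: parses one "surface<TAB>features" line.
final class FeatureLine {
    final String surface;
    final List<String> features;

    private FeatureLine(String surface, List<String> features) {
        this.surface = surface;
        this.features = features;
    }

    static FeatureLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Expected surface<TAB>features: " + line);
        }
        String surface = line.substring(0, tab);
        // A limit of -1 keeps trailing empty fields instead of dropping them
        List<String> features = Arrays.asList(line.substring(tab + 1).split(",", -1));
        return new FeatureLine(surface, features);
    }

    // Assumption: in the IPADIC sample output the seventh field is the base form
    String baseForm() {
        return features.get(6);
    }

    public static void main(String[] args) {
        FeatureLine line = parse("食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ");
        System.out.println(line.surface + " -> " + line.baseForm());
    }
}
```

When the tokens are still available as objects, the getters on the `Token` class are the more robust choice; string parsing like this is mainly useful for output that has already been written to a file or log.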
Kuromoji currently supports the following dictionaries:

- IPADIC (2.7.0-20070801)
- IPADIC NEologd (2.7.0-20070801-neologd-20171113)
- JUMANDIC (7.0-20130310)
- NAIST jdic (0.6.3b-20111013)
- UniDic (2.1.2)
- UniDic Kana Accent (2.1.2)
- UniDic NEologd (2.1.2-neologd-20171002)
Question: So which of these dictionaries should I use?
Answer: That depends on your application. Yes, we know - it's a boring answer... :)
If you are not sure about which dictionary you should use, `kuromoji-ipadic` is a good starting point for many applications.
See the getters in the per-dictionary `Token` classes for more information on the available token features, or consult the technical dictionary documentation elsewhere. (We plan on adding better guidance on choosing a dictionary.)
Each dictionary has its own Maven coordinates, as well as a `Tokenizer` and a `Token` class similar to those in the above example. These classes live in a designated package namespace indicated by the dictionary type.
The sections below list fully qualified class names and the Maven coordinates for each dictionary supported.
com.atilika.kuromoji.ipadic.Tokenizer
com.atilika.kuromoji.ipadic.Token
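```xml
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-ipadic</artifactId>
  <version>0.9.0</version>
</dependency>
```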
com.atilika.kuromoji.ipadic.neologd.Tokenizer
com.atilika.kuromoji.ipadic.neologd.Token
This dictionary will be available from Maven Central in a future version.
com.atilika.kuromoji.jumandic.Tokenizer
com.atilika.kuromoji.jumandic.Token
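```xml
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-jumandic</artifactId>
  <version>0.9.0</version>
</dependency>
```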
com.atilika.kuromoji.naist.jdic.Tokenizer
com.atilika.kuromoji.naist.jdic.Token
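```xml
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-naist-jdic</artifactId>
  <version>0.9.0</version>
</dependency>
```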
com.atilika.kuromoji.unidic.Tokenizer
com.atilika.kuromoji.unidic.Token
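```xml
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic</artifactId>
  <version>0.9.0</version>
</dependency>
```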
com.atilika.kuromoji.unidic.kanaaccent.Tokenizer
com.atilika.kuromoji.unidic.kanaaccent.Token
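```xml
<dependency>
  <groupId>com.atilika.kuromoji</groupId>
  <artifactId>kuromoji-unidic-kanaaccent</artifactId>
  <version>0.9.0</version>
</dependency>
```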
com.atilika.kuromoji.unidic.neologd.Tokenizer
com.atilika.kuromoji.unidic.neologd.Token
This dictionary will be available from Maven Central in a future version.
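The listings above appear to follow a simple pattern: the package name is the artifact ID with the `kuromoji-` prefix removed and each hyphen replaced by a dot. The sketch below encodes that observation using only the Java standard library. `DictionaryNames` is a hypothetical helper, and the pattern is an inference from the listings here, not a documented guarantee, so check the listing for the dictionary you actually use.

```java
// Hypothetical helper, not part of Kuromoji: derives class names from an artifact ID.
final class DictionaryNames {
    private static final String ARTIFACT_PREFIX = "kuromoji-";
    private static final String BASE_PACKAGE = "com.atilika.kuromoji";

    // Assumption: "kuromoji-naist-jdic" maps to "com.atilika.kuromoji.naist.jdic"
    static String packageName(String artifactId) {
        if (!artifactId.startsWith(ARTIFACT_PREFIX)
                || artifactId.length() == ARTIFACT_PREFIX.length()) {
            throw new IllegalArgumentException("Not a dictionary artifact ID: " + artifactId);
        }
        String suffix = artifactId.substring(ARTIFACT_PREFIX.length());
        return BASE_PACKAGE + "." + suffix.replace('-', '.');
    }

    static String tokenizerClassName(String artifactId) {
        return packageName(artifactId) + ".Tokenizer";
    }

    static String tokenClassName(String artifactId) {
        return packageName(artifactId) + ".Token";
    }

    public static void main(String[] args) {
        String[] artifactIds = {"kuromoji-ipadic", "kuromoji-naist-jdic", "kuromoji-unidic-kanaaccent"};
        for (String artifactId : artifactIds) {
            System.out.println(artifactId + " -> " + tokenizerClassName(artifactId));
        }
    }
}
```

In practice, switching dictionaries means changing both the Maven dependency and the two imports for `Tokenizer` and `Token`.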
Released versions of Kuromoji are available from Maven Central.
If you want to build Kuromoji from source code, run the following command:
$ mvn clean package
This will download all source dictionary data and build Kuromoji with all dictionaries. The following jars will then be available:
```
kuromoji-core/target/kuromoji-core-1.0-SNAPSHOT.jar
kuromoji-ipadic/target/kuromoji-ipadic-1.0-SNAPSHOT.jar
kuromoji-ipadic-neologd/target/kuromoji-ipadic-neologd-1.0-SNAPSHOT.jar
kuromoji-jumandic/target/kuromoji-jumandic-1.0-SNAPSHOT.jar
kuromoji-naist-jdic/target/kuromoji-naist-jdic-1.0-SNAPSHOT.jar
kuromoji-unidic/target/kuromoji-unidic-1.0-SNAPSHOT.jar
kuromoji-unidic-kanaaccent/target/kuromoji-unidic-kanaaccent-1.0-SNAPSHOT.jar
kuromoji-unidic-neologd/target/kuromoji-unidic-neologd-1.0-SNAPSHOT.jar
```
The following additional build options are available:
- `-DskipCompileDictionary`: Do not recompile the dictionaries
- `-DskipDownloadDictionary`: Do not download source dictionaries
- `-DbenchmarkTokenizers`: Profile each tokenizer during the package phase using content from Japanese Wikipedia
- `-DskipDownloadWikipedia`: Prevent the compressed version of the Japanese Wikipedia (~765 MB) from being downloaded during profiling, for example if it has already been downloaded
Kuromoji is licensed under the Apache License, Version 2.0. See `LICENSE.md` for details.
This software also includes a binary and/or source version of data from various third-party dictionaries. See `NOTICE.md` for these details.
Please open an issue if you have a feature request. We also welcome contributions through pull requests.
You will retain copyright to your own contributions, but you need to license them using the Apache License, Version 2.0. All contributors will be mentioned in the `CONTRIBUTORS.md` file.
We are a small team of experienced software engineers based in Tokyo who offer technologies and good advice in the fields of search, natural language processing and big data analytics.
Please feel free to contact us at [email protected] if you have any questions or need help.