
About the developer

Moonshile — 446 Stars, 126 Forks, MIT License, 17 Commits, 3 open issues

Description

Chinese word segmentation algorithm without a corpus (无需语料库的中文分词)


ChineseWordSegmentation

Chinese word segmentation algorithm without a corpus

Usage

from wordseg import WordSegment
doc = u'十四是十四四十是四十,十四不是四十,四十不是十四'
ws = WordSegment(doc, max_word_len=2, min_aggregation=1, min_entropy=0.5)
ws.segSentence(doc)

This will produce the segmented words:

十四 是 十四 四十 是 四十 , 十四 不是 四十 , 四十 不是 十四

In fact, doc should be a long enough document string for better results. In that case, min_aggregation should be set far greater than 1 (for example, 50), and min_entropy should also be set greater than 0.5 (for example, 1.5).
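These two thresholds follow the corpus-free criteria from Matrix67's article: a candidate word should have high internal aggregation (its characters co-occur far more often than chance) and high boundary entropy (the characters adjacent to it are unpredictable). A minimal, self-contained sketch of the right-neighbor entropy idea — the function name is illustrative, not part of this library's API — shows why a longer document gives more reliable estimates:

```python
import math
from collections import Counter

def right_neighbor_entropy(text, word):
    """Shannon entropy of the characters that follow `word` in `text`.

    Higher entropy means the word's right boundary is less predictable,
    which is evidence that `word` is a free-standing word. Short texts
    yield few occurrences, so the estimate is noisy; longer documents
    make it meaningful.
    """
    neighbors = Counter()
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if end < len(text):
            neighbors[text[end]] += 1
        start = text.find(word, start + 1)  # allow overlapping matches
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbors.values())

doc = u'十四是十四四十是四十,十四不是四十,四十不是十四'
print(right_neighbor_entropy(doc, u'十四'))  # 3 distinct followers -> log(3)
```

A word whose neighbor entropy falls below min_entropy is rejected as a word boundary candidate; the aggregation test is applied analogously inside the word.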

Besides, both the input and output of this function should be unicode strings.

WordSegment.segSentence has an optional argument method, whose value is one of WordSegment.L, WordSegment.S and WordSegment.ALL, meaning:
  • WordSegment.L: if a long word that is a combination of several shorter words is found, output only the long word.
  • WordSegment.S: output only the several shorter words.
  • WordSegment.ALL: output both the long word and the shorter ones.
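The three modes can be illustrated with a toy sketch of the selection logic (the emit function and the example compound word are illustrative, not the library's internals):

```python
# Toy illustration of the three method modes.
L, S, ALL = 'L', 'S', 'ALL'

def emit(long_word, short_words, method):
    """Return what a segmenter would report for one compound word."""
    if method == L:
        return [long_word]                      # only the long word
    if method == S:
        return list(short_words)                # only the shorter components
    return [long_word] + list(short_words)      # ALL: both

# Suppose the long word 中华人民 is found, composed of 中华 and 人民:
print(emit(u'中华人民', [u'中华', u'人民'], L))    # ['中华人民']
print(emit(u'中华人民', [u'中华', u'人民'], S))    # ['中华', '人民']
print(emit(u'中华人民', [u'中华', u'人民'], ALL))  # ['中华人民', '中华', '人民']
```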

Reference

Thanks to Matrix67's article.
