Chinese word segmentation algorithm without corpus (无需语料库的中文分词)
```python
from wordseg import WordSegment

doc = u'十四是十四四十是四十，十四不是四十，四十不是十四'
ws = WordSegment(doc, max_word_len=2, min_aggregation=1, min_entropy=0.5)
ws.segSentence(doc)
```
This will generate the segmented words:
```
十四 是 十四 四十 是 四十 ， 十四 不是 四十 ， 四十 不是 十四
```
`doc` should be a long enough document string for better results. In that case, `min_aggregation` should be set far greater than 1, such as 50, and `min_entropy` should also be set greater than 0.5, such as 1.5.
Besides, both the input and the output of this function should be decoded as unicode.
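To make the two thresholds concrete, here is a minimal, self-contained sketch of the statistics they bound: the aggregation (cohesion) of a candidate word and the entropy of its boundary characters. This is not the library's actual code, only an illustration of the idea behind the parameters (boundary entropy is computed on the right side only, for brevity):

```python
from collections import Counter
from math import log

def word_stats(text, word):
    """Cohesion and right-boundary entropy of `word` in `text` (illustrative only)."""
    n = len(text)

    def p(s):
        # Probability of substring s over all start positions in text.
        hits = sum(1 for i in range(n - len(s) + 1) if text[i:i + len(s)] == s)
        return hits / float(n - len(s) + 1)

    # Aggregation (cohesion): how much more frequent the word is than the
    # product of its parts' frequencies, taking the weakest split point.
    cohesion = min(p(word) / (p(word[:i]) * p(word[i:]))
                   for i in range(1, len(word)))

    # Boundary entropy: how unpredictable the character following the word
    # is. High entropy suggests a real word boundary rather than a fragment
    # of a longer word.
    right = Counter(text[i + len(word)]
                    for i in range(n - len(word))
                    if text[i:i + len(word)] == word)
    total = sum(right.values())
    entropy = -sum(c / float(total) * log(c / float(total))
                   for c in right.values())
    return cohesion, entropy
```

On the demo sentence above, `word_stats(doc, u'十四')` gives a cohesion around 1.57 and an entropy of ln 3 ≈ 1.10, since three different characters follow '十四'; both clear the thresholds passed to `WordSegment` in the example.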
`WordSegment.segSentence` has an optional argument `method`, with the following values:

- `WordSegment.L`: if a long word that is a combination of several shorter words is found, return only the long word.
- `WordSegment.S`: return the several shorter words instead.
- `WordSegment.ALL`: return both the long word and the shorter ones.
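The three modes can be pictured with a toy function (purely illustrative, not the library's implementation): given a long word the segmenter found plus the shorter words it is made of, each mode selects a different subset of the results.

```python
def pick(long_word, short_words, method):
    # 'L': keep only the long word; 'S': keep only its parts;
    # 'ALL': keep both. Mirrors the three modes described above.
    if method == 'L':
        return [long_word]
    if method == 'S':
        return list(short_words)
    return [long_word] + list(short_words)

# e.g. pick(u'四十四', [u'四十', u'十四'], 'ALL')
# returns the long word followed by its overlapping parts.
```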
Thanks to Matrix67's article for the idea behind this algorithm.