Jiayan (甲言) is an NLP toolkit designed for Classical Chinese (古汉语/古文/文言文/文言). It supports lexicon construction, tokenizing, POS tagging, sentence segmentation and punctuation.
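Jiayan (甲言) takes its name from 「甲骨文言」 (oracle bone script and Classical Chinese) and is an NLP toolkit focused on processing Classical Chinese.
General-purpose Chinese NLP tools are built mainly on modern Chinese corpora and perform poorly on Classical Chinese (see Tokenizing). This project aims to support Classical Chinese information processing, and to help scholars and enthusiasts who want to mine the riches of ancient culture analyze and use Classical Chinese texts more effectively, creating new cultural works out of cultural heritage.
The current version supports five functions: lexicon construction, automatic tokenizing, POS tagging, sentence segmentation and punctuation. More features are under development.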
```
$ pip install jiayan
$ pip install https://github.com/kpu/kenlm/archive/master.zip
```
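The usage examples for each module below all come from examples.py.

1. Download the models and unzip them: Baidu Netdisk, extraction code: `p0sc`
   * jiayan.klm: the language model, used mainly for tokenizing and for feature extraction in the sentence segmentation and punctuation tasks;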
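2. Lexicon Construction

```
from jiayan import PMIEntropyLexiconConstructor

# Build a lexicon from the raw text and save it as CSV
constructor = PMIEntropyLexiconConstructor()
lexicon = constructor.construct_lexicon('庄子.txt')
constructor.save(lexicon, '庄子词库.csv')
```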
Result:

```
Word,Frequency,PMI,R_Entropy,L_Entropy
之,2999,80,7.944909328101839,8.279435615456894
而,2089,80,7.354575005231323,8.615211168836439
不,1941,80,7.244331150611089,6.362131306822925
...
天下,280,195.23602384978196,5.158574399464853,5.24731990592901
圣人,111,150.0620531154239,4.622606551534004,4.6853474419338585
万物,94,377.59805590304126,4.5959107835319895,4.538837960294887
天地,92,186.73504238078462,3.1492586603863617,4.894533538722486
孔子,80,176.2550051738876,4.284638190120882,2.4056390622295662
庄子,76,169.26227942514097,2.328252899085616,2.1920058354921066
仁义,58,882.3468468468468,3.501609497059026,4.96900162987599
老聃,45,2281.2228260869565,2.384853500510039,2.4331958387289765
...
```

3. Tokenizing
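Character-level HMM tokenizing, which requires the language model `jiayan.klm`:

```
from jiayan import load_lm
from jiayan import CharHMMTokenizer

text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'

# Load the language model, then tokenize with the character-level HMM tokenizer
lm = load_lm('jiayan.klm')
tokenizer = CharHMMTokenizer(lm)
print(list(tokenizer.tokenize(text)))
```

Result:

`['是', '故', '内圣外王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']`

Because no public tokenization dataset exists for Classical Chinese, a quantitative evaluation is not possible. The advantage of this project can still be seen by comparing how different NLP tools handle the same sentence.

Compare the tokenizing result of [LTP](https://github.com/HIT-SCIR/ltp) (3.4.0):

`['是', '故内', '圣外王', '之', '道', ',', '暗而不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉以自为方', '。']`

And the tokenizing result of [HanLP](http://hanlp.com):

`['是故', '内', '圣', '外', '王之道', ',', '暗', '而', '不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各为其所欲焉', '以', '自为', '方', '。']`

On this sentence, Jiayan segments Classical Chinese noticeably better than the general-purpose Chinese NLP tools.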
Word-level maximum probability path tokenizing. The output is mostly single characters, so the segmentation is fairly rough:

```
from jiayan import WordNgramTokenizer

text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'
tokenizer = WordNgramTokenizer()
print(list(tokenizer.tokenize(text)))
```

Result:

`['是', '故', '内', '圣', '外', '王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']`
4. POS Tagging

```
from jiayan import CRFPOSTagger

# The tagger expects text that has already been tokenized
words = ['天下', '大乱', ',', '贤圣', '不', '明', ',', '道德', '不', '一', ',', '天下', '多', '得', '一', '察', '焉', '以', '自', '好', '。']

postagger = CRFPOSTagger()
postagger.load('pos_model')
print(postagger.postag(words))
```

Result:

`['n', 'a', 'wp', 'n', 'd', 'a', 'wp', 'n', 'd', 'm', 'wp', 'n', 'a', 'u', 'm', 'v', 'r', 'p', 'r', 'a', 'wp']`
5. Sentence Segmentation

```
from jiayan import load_lm
from jiayan import CRFSentencizer

text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'

# The sentencizer uses the language model for feature extraction
lm = load_lm('jiayan.klm')
sentencizer = CRFSentencizer(lm)
sentencizer.load('cut_model')
print(sentencizer.sentencize(text))
```

Result:

`['天下大乱', '贤圣不明', '道德不一', '天下多得一察焉以自好', '譬如耳目', '皆有所明', '不能相通', '犹百家众技也', '皆有所长', '时有所用', '虽然', '不该不遍', '一之士也', '判天地之美', '析万物之理', '察古人之全', '寡能备于天地之美', '称神之容', '是故内圣外王之道', '暗而不明', '郁而不发', '天下之人各为其所欲焉以自为方', '悲夫', '百家往而不反', '必不合矣', '后世之学者', '不幸不见天地之纯', '古之大体', '道术将为天下裂']`
6. Punctuation

```
from jiayan import load_lm
from jiayan import CRFPunctuator

text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'

# The punctuator takes the language model and the sentence segmentation model
lm = load_lm('jiayan.klm')
punctuator = CRFPunctuator(lm, 'cut_model')
punctuator.load('punc_model')
print(punctuator.punctuate(text))
```

Result:

`天下大乱,贤圣不明,道德不一,天下多得一察焉以自好,譬如耳目,皆有所明,不能相通,犹百家众技也,皆有所长,时有所用,虽然,不该不遍,一之士也,判天地之美,析万物之理,察古人之全,寡能备于天地之美,称神之容,是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方,悲夫!百家往而不反,必不合矣,后世之学者,不幸不见天地之纯,古之大体,道术将为天下裂。`
Jiayan, whose name refers to Chinese characters engraved on oracle bones, is a professional Python NLP tool for Classical Chinese.
Prevailing Chinese NLP tools are trained mainly on modern Chinese data, which leads to poor performance on Classical Chinese (see Tokenizing). The purpose of this project is to assist Classical Chinese information processing.
The current version supports lexicon construction, tokenizing, POS tagging, sentence segmentation and automatic punctuation. More features are in development.
```
$ pip install jiayan
$ pip install https://github.com/kpu/kenlm/archive/master.zip
```
The usage examples below all come from examples.py.

1. Download the models and unzip them: Google Drive
   * jiayan.klm: the language model used for tokenizing, and for feature extraction in sentence segmentation and punctuation;
   * pos_model: the CRF model for POS tagging;
   * cut_model: the CRF model for sentence segmentation;
   * punc_model: the CRF model for punctuation;
   * 庄子.txt: the full text of *Zhuangzi*, used for testing lexicon construction.
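2. Lexicon Construction

```
from jiayan import PMIEntropyLexiconConstructor

# Build a lexicon from the raw text and save it as CSV
constructor = PMIEntropyLexiconConstructor()
lexicon = constructor.construct_lexicon('庄子.txt')
constructor.save(lexicon, 'ZhuangziLexicon.csv')
```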
Result:

```
Word,Frequency,PMI,R_Entropy,L_Entropy
之,2999,80,7.944909328101839,8.279435615456894
而,2089,80,7.354575005231323,8.615211168836439
不,1941,80,7.244331150611089,6.362131306822925
...
天下,280,195.23602384978196,5.158574399464853,5.24731990592901
圣人,111,150.0620531154239,4.622606551534004,4.6853474419338585
万物,94,377.59805590304126,4.5959107835319895,4.538837960294887
天地,92,186.73504238078462,3.1492586603863617,4.894533538722486
孔子,80,176.2550051738876,4.284638190120882,2.4056390622295662
庄子,76,169.26227942514097,2.328252899085616,2.1920058354921066
仁义,58,882.3468468468468,3.501609497059026,4.96900162987599
老聃,45,2281.2228260869565,2.384853500510039,2.4331958387289765
...
```

3. Tokenizing
Character-level HMM tokenizing, which requires the language model `jiayan.klm`:

```
from jiayan import load_lm
from jiayan import CharHMMTokenizer

text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'

# Load the language model, then tokenize with the character-level HMM tokenizer
lm = load_lm('jiayan.klm')
tokenizer = CharHMMTokenizer(lm)
print(list(tokenizer.tokenize(text)))
```

Result:

`['是', '故', '内圣外王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']`

Since there is no public tokenizing dataset for Classical Chinese, performance is hard to evaluate directly. However, the results can be compared with those of popular modern Chinese NLP tools on the same sentence.

Compare the tokenizing result of [LTP](https://github.com/HIT-SCIR/ltp) (3.4.0):

`['是', '故内', '圣外王', '之', '道', ',', '暗而不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉以自为方', '。']`

Also, compare the tokenizing result of [HanLP](http://hanlp.com):

`['是故', '内', '圣', '外', '王之道', ',', '暗', '而', '不明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各为其所欲焉', '以', '自为', '方', '。']`

On this sentence, Jiayan tokenizes Classical Chinese noticeably better than the general-purpose Chinese NLP tools.
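The side-by-side comparison above can be made slightly more concrete by counting how many token boundaries two segmentations share. The sketch below is only an illustration: the helper names `boundaries` and `boundary_agreement` are hypothetical and not part of Jiayan, and the score measures agreement between two outputs, not accuracy against a gold standard (none is available).

```python
def boundaries(tokens):
    """Return the set of character offsets at which a token ends."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def boundary_agreement(tokens_a, tokens_b):
    """Dice coefficient of the two boundary sets (1.0 means identical cuts)."""
    a, b = boundaries(tokens_a), boundaries(tokens_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Token lists copied from the results shown above
jiayan = ['是', '故', '内圣外王', '之', '道', ',', '暗', '而', '不', '明', ',',
          '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其',
          '所', '欲', '焉', '以', '自', '为', '方', '。']
ltp = ['是', '故内', '圣外王', '之', '道', ',', '暗而不明', ',', '郁', '而',
       '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲',
       '焉以自为方', '。']
hanlp = ['是故', '内', '圣', '外', '王之道', ',', '暗', '而', '不明', ',',
         '郁', '而', '不', '发', ',', '天下', '之', '人', '各为其所欲焉',
         '以', '自为', '方', '。']

for name, tokens in [('LTP', ltp), ('HanLP', hanlp)]:
    print(name, round(boundary_agreement(jiayan, tokens), 3))
```

A lower score only says that a tool cuts the sentence differently from Jiayan; judging which cut is better still needs a reader who knows Classical Chinese.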
Word-level maximum probability path tokenizing:

```
from jiayan import WordNgramTokenizer

text = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方。'
tokenizer = WordNgramTokenizer()
print(list(tokenizer.tokenize(text)))
```

Result:

`['是', '故', '内', '圣', '外', '王', '之', '道', ',', '暗', '而', '不', '明', ',', '郁', '而', '不', '发', ',', '天下', '之', '人', '各', '为', '其', '所', '欲', '焉', '以', '自', '为', '方', '。']`
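To give a feel for what a maximum probability path approach means, here is a minimal, self-contained sketch using dynamic programming over a toy unigram dictionary. It is not Jiayan's implementation: the function `max_prob_segment`, the probabilities in `toy_probs`, and the `max_len` and `unk_prob` parameters are all assumptions made up for illustration.

```python
import math


def max_prob_segment(text, word_probs, max_len=4, unk_prob=1e-8):
    """Split text into the word sequence with the highest product of
    unigram probabilities, found by dynamic programming."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start offset of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            prob = word_probs.get(word)
            if prob is None:
                if len(word) > 1:
                    continue        # unknown multi-character spans are skipped
                prob = unk_prob     # unknown single characters stay possible
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the path
    tokens, end = [], n
    while end > 0:
        start = back[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Toy probabilities, invented for this example only
toy_probs = {'天下': 0.02, '天': 0.01, '下': 0.005, '之': 0.05, '人': 0.01}
print(max_prob_segment('天下之人', toy_probs))  # ['天下', '之', '人']
```

With a dictionary this small, anything outside it falls back to single characters, which is one plausible reason a word-level approach can produce mostly single-character output on Classical Chinese.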
4. POS Tagging

```
from jiayan import CRFPOSTagger

# The tagger expects text that has already been tokenized
words = ['天下', '大乱', ',', '贤圣', '不', '明', ',', '道德', '不', '一', ',', '天下', '多', '得', '一', '察', '焉', '以', '自', '好', '。']

postagger = CRFPOSTagger()
postagger.load('pos_model')
print(postagger.postag(words))
```

Result:

`['n', 'a', 'wp', 'n', 'd', 'a', 'wp', 'n', 'd', 'm', 'wp', 'n', 'a', 'u', 'm', 'v', 'r', 'p', 'r', 'a', 'wp']`
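The tagger returns one tag per input word, so the two lists can be paired up directly. The snippet below is a small usage sketch in plain Python that reuses the words and tags shown above; it does not call Jiayan, and the variable names are illustrative.

```python
from collections import Counter

# Input words and output tags copied from the example above
words = ['天下', '大乱', ',', '贤圣', '不', '明', ',', '道德', '不', '一', ',',
         '天下', '多', '得', '一', '察', '焉', '以', '自', '好', '。']
tags = ['n', 'a', 'wp', 'n', 'd', 'a', 'wp', 'n', 'd', 'm', 'wp',
        'n', 'a', 'u', 'm', 'v', 'r', 'p', 'r', 'a', 'wp']

# Pair each word with its tag
tagged = list(zip(words, tags))
print(tagged[:4])

# Count how often each tag occurs in the sentence
tag_counts = Counter(tags)
print(tag_counts.most_common())
```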
5. Sentence Segmentation

```
from jiayan import load_lm
from jiayan import CRFSentencizer

text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'

# The sentencizer uses the language model for feature extraction
lm = load_lm('jiayan.klm')
sentencizer = CRFSentencizer(lm)
sentencizer.load('cut_model')
print(sentencizer.sentencize(text))
```

Result:

`['天下大乱', '贤圣不明', '道德不一', '天下多得一察焉以自好', '譬如耳目', '皆有所明', '不能相通', '犹百家众技也', '皆有所长', '时有所用', '虽然', '不该不遍', '一之士也', '判天地之美', '析万物之理', '察古人之全', '寡能备于天地之美', '称神之容', '是故内圣外王之道', '暗而不明', '郁而不发', '天下之人各为其所欲焉以自为方', '悲夫', '百家往而不反', '必不合矣', '后世之学者', '不幸不见天地之纯', '古之大体', '道术将为天下裂']`
6. Punctuation

```
from jiayan import load_lm
from jiayan import CRFPunctuator

text = '天下大乱贤圣不明道德不一天下多得一察焉以自好譬如耳目皆有所明不能相通犹百家众技也皆有所长时有所用虽然不该不遍一之士也判天地之美析万物之理察古人之全寡能备于天地之美称神之容是故内圣外王之道暗而不明郁而不发天下之人各为其所欲焉以自为方悲夫百家往而不反必不合矣后世之学者不幸不见天地之纯古之大体道术将为天下裂'

# The punctuator takes the language model and the sentence segmentation model
lm = load_lm('jiayan.klm')
punctuator = CRFPunctuator(lm, 'cut_model')
punctuator.load('punc_model')
print(punctuator.punctuate(text))
```

Result:

`天下大乱,贤圣不明,道德不一,天下多得一察焉以自好,譬如耳目,皆有所明,不能相通,犹百家众技也,皆有所长,时有所用,虽然,不该不遍,一之士也,判天地之美,析万物之理,察古人之全,寡能备于天地之美,称神之容,是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方,悲夫!百家往而不反,必不合矣,后世之学者,不幸不见天地之纯,古之大体,道术将为天下裂。`
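Because the punctuator is constructed with the sentence segmentation model, its output is expected to break at the same places as the sentence segmentation result. A quick way to check this is to strip the punctuation again and compare the pieces. The sketch below uses plain Python on an excerpt of the outputs shown above; the `PUNCT` set and the helper `strip_punctuation` are assumptions for illustration, and the set of marks the model can emit may be larger.

```python
import re

# Assumed set of full-width punctuation marks; the model may emit others
PUNCT = ',。!?;:、'


def strip_punctuation(punctuated):
    """Split punctuated text at punctuation marks and drop empty pieces."""
    return [piece for piece in re.split('[' + PUNCT + ']', punctuated) if piece]


# Excerpts copied from the punctuation and sentence segmentation results above
punctuated = '是故内圣外王之道,暗而不明,郁而不发,天下之人各为其所欲焉以自为方,悲夫!百家往而不反,必不合矣,'
segments = ['是故内圣外王之道', '暗而不明', '郁而不发',
            '天下之人各为其所欲焉以自为方', '悲夫', '百家往而不反', '必不合矣']

print(strip_punctuation(punctuated) == segments)
```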