
PyKoSpacing

Python package for automatic Korean word spacing.

An R version can be found here.

License: GPL v3

Introduction

Word spacing is an important part of preprocessing for Korean text analysis: accurate spacing greatly affects the accuracy of subsequent analysis steps.

PyKoSpacing has fairly accurate automatic word spacing performance, and is especially good for informal online text originating from SNS or SMS.

For example, "아버지가방에들어가신다." can be spaced in either of the following ways:

  1. "아버지가 방에 들어가신다." means "My father enters the room."
  2. "아버지 가방에 들어가신다." means "My father goes into the bag."

By common sense, the first is the correct spacing.
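
Using the Spacing API introduced below, PyKoSpacing produces the first, sensible spacing for this sentence (the expected output matches the command-line example later in this README):

```python
from pykospacing import Spacing

spacing = Spacing()
print(spacing("아버지가방에들어가신다."))
# 아버지가 방에 들어가신다.
```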

PyKoSpacing is based on a deep learning model trained on a large corpus (more than 100 million news articles, provided by Chan-Yub Park).

Performance

| Test Set | Accuracy |
|---|---|
| Sejong (colloquial style) Corpus (1M) | 97.1% |
| OOOO (literary style) Corpus (3M) | 94.3% |

  • Accuracy = (# correctly spaced characters) / (# characters in the test data).
    • Performance might increase if compound words are normalized.
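
As an illustration of this metric (a minimal sketch, not the official evaluation script), the snippet below labels each non-space character with whether a space follows it and compares a prediction against the gold spacing:

```python
def space_labels(text):
    """For each non-space character, True if a space follows it."""
    labels = []
    for i, ch in enumerate(text):
        if ch == " ":
            continue
        labels.append(i + 1 < len(text) and text[i + 1] == " ")
    return labels

def spacing_accuracy(gold, pred):
    """Fraction of characters whose spacing decision matches the gold text."""
    g, p = space_labels(gold), space_labels(pred)
    assert len(g) == len(p), "texts must contain the same non-space characters"
    return sum(a == b for a, b in zip(g, p)) / len(g)

print(spacing_accuracy("아버지가 방에 들어가신다.",
                       "아버지 가방에 들어가신다."))  # 10/12 ≈ 0.83
```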

Install

PyPI Install

Pre-requisites:

```bash
# proper installation of python3
# proper installation of pip
pip install tensorflow
pip install keras
```

Windows-Ubuntu case, on the following error:

```
/usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.22' not found
```

run:

```bash
sudo apt-get install libstdc++6
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get upgrade
sudo apt-get dist-upgrade  # this takes a long time
```

To install from GitHub, use:

```bash
pip install git+https://github.com/haven-jeon/PyKoSpacing.git
```
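
A quick smoke test to check that the installation works (it simply performs the import used in the examples below):

```bash
python -c "from pykospacing import Spacing"
```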

Example

```python
>>> from pykospacing import Spacing
>>> spacing = Spacing()
>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
"김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."
>>> # Apply a list of words that must remain unspaced
>>> spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')
'귀 밑에서 턱까지 잇따라 난 수염을 구레나 룻이라고 한다.'
>>> spacing = Spacing(rules=['구레나룻'])
>>> spacing('귀밑에서턱까지잇따라난수염을구레나룻이라고한다.')
'귀 밑에서 턱까지 잇따라 난 수염을 구레나룻이라고 한다.'
```

Rules can also be set from a CSV file using the set_rules_by_csv() method. The columns here are 인덱스 ("index") and 단어 ("word"); the second argument names the word column:

```bash
$ cat test.csv
인덱스,단어
1,네이버영화
2,언급된단어
```

```python
>>> from pykospacing import Spacing
>>> spacing = Spacing(rules=[''])
>>> spacing.set_rules_by_csv('./test.csv', '단어')
>>> spacing("김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.")
"김형호 영화시장 분석가는 '1987'의 네이버영화 정보 네티즌 10점 평에서 언급된단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다."
```

Run on the command line (thanks to lqez):

```bash
$ cat test_in.txt
김형호영화시장분석가는'1987'의네이버영화정보네티즌10점평에서언급된단어들을지난해12월27일부터올해1월10일까지통계프로그램R과KoNLP패키지로텍스트마이닝하여분석했다.
아버지가방에들어가신다.
$ python -m pykospacing.pykos test_in.txt
김형호 영화시장 분석가는 '1987'의 네이버 영화 정보 네티즌 10점 평에서 언급된 단어들을 지난해 12월 27일부터 올해 1월 10일까지 통계 프로그램 R과 KoNLP 패키지로 텍스트마이닝하여 분석했다.
아버지가 방에 들어가신다.
```

Model Architecture

For Training

  • The training code uses a more advanced architecture than the one shipped with PyKoSpacing, but it also contains PyKoSpacing's learning logic (a toy sketch of the general approach follows below).
    • https://github.com/haven-jeon/Train_KoSpacing
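
For intuition only, here is a minimal Keras sketch of a character-level spacing model (embedding → convolution → bidirectional GRU → per-character space probability). The layer choices and sizes are illustrative assumptions, not the actual PyKoSpacing architecture; see the Train_KoSpacing repository for the real training code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative hyperparameters (assumptions, not PyKoSpacing's real values)
VOCAB_SIZE = 5000   # number of distinct characters
MAX_LEN = 200       # characters per input window

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")           # character ids
x = layers.Embedding(VOCAB_SIZE, 128)(inputs)                    # char embeddings
x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)  # local context
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)  # sentence context
# For each character: probability that a space should follow it
outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

Framing spacing as binary per-character tagging ("space after this character or not") matches the accuracy metric described in the Performance section above.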

Citation

```bibtex
@misc{heewon2018,
  author = {Heewon Jeon},
  title = {KoSpacing: Automatic Korean word spacing},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/haven-jeon/KoSpacing}}
}
```
