PyThaiNLP: Thai Natural Language Processing in Python


PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK, with a focus on the Thai language.

For details in Thai, see README_TH.MD.

| Version | Description | Status |
|:------:|:--:|:------:|
| 2.3.1 | Stable | Change Log |
| dev | Release Candidate for 2.4 | Change Log |

Getting Started

Capabilities

PyThaiNLP provides standard NLP functions for Thai, for example, part-of-speech tagging and linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via a command-line interface.

List of Features
  • Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords), comparable to constants like string.ascii_letters, string.digits, and string.punctuation
  • Thai linguistic unit segmentation/tokenization, including sentence (sent_tokenize), word (word_tokenize), and subword segmentations based on Thai Character Cluster (subword_tokenize)
  • Thai part-of-speech tagging (pos_tag)
  • Thai spelling suggestion and correction (spell and correct)
  • Thai transliteration (transliterate)
  • Thai soundex (soundex) with three engines (lk82, udom83, metasound)
  • Thai collation (sort by dictionary order) (collate)
  • Reading numbers out as Thai words (bahttext, num_to_thaiword)
  • Thai datetime formatting (thai_strftime)
  • Fix for text typed with the wrong Thai/English keyboard layout (eng_to_thai, thai_to_eng)
  • Command-line interface for basic functions, like tokenization and POS tagging (run thainlp in your shell)
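
As a rough illustration of two of the features listed above, the following sketch tokenizes a sentence and counts Thai consonants. It assumes `word_tokenize` can be imported from `pythainlp.tokenize` (the list above names the function but not its module); the helper names `pythainlp_installed`, `tokenize_words`, and `count_thai_consonants` are hypothetical and not part of the library.

```python
import importlib.util
from typing import List


def pythainlp_installed() -> bool:
    """Report whether the pythainlp package can be imported here."""
    return importlib.util.find_spec("pythainlp") is not None


def tokenize_words(text: str) -> List[str]:
    """Segment Thai text into words with the library's default engine."""
    # Imported lazily so this module still loads when pythainlp is absent.
    # Assumption: the function lives in pythainlp.tokenize.
    from pythainlp.tokenize import word_tokenize
    return word_tokenize(text)


def count_thai_consonants(text: str) -> int:
    """Count characters that appear in pythainlp.thai_consonants."""
    from pythainlp import thai_consonants
    return sum(1 for ch in text if ch in thai_consonants)


if __name__ == "__main__":
    sample = "ผมรักภาษาไทย"
    if pythainlp_installed():
        print(tokenize_words(sample))
        print(count_thai_consonants(sample))
    else:
        print("pythainlp is not installed; run: pip install --upgrade pythainlp")
```

The exact token boundaries depend on the tokenizer engine and dictionary in the installed release, so no expected output is shown.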

Installation

pip install --upgrade pythainlp

This will install the latest stable release of PyThaiNLP.

Install different releases:

  • Stable release:
    pip install --upgrade pythainlp
  • Pre-release (near ready):
    pip install --upgrade --pre pythainlp
  • Development (likely to break things):
    pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Installation Options

Some functionalities, like Thai WordNet, may require extra packages. To install those requirements, specify a set of `[name]` immediately after `pythainlp`:

pip install pythainlp[extra1,extra2,...]

List of possible `extras`:
  • full (install everything)
  • attacut (to support attacut, a fast and accurate tokenizer)
  • benchmarks (for word tokenization benchmarking)
  • icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
  • ipa (for IPA, International Phonetic Alphabet, support in transliteration)
  • ml (to support ULMFiT models for classification)
  • thai2fit (for Thai word vector)
  • thai2rom (for machine-learnt romanization)
  • wordnet (for Thai WordNet API)

For dependency details, look at the `extras` variable in `setup.py`.
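
When installation is scripted, the requirement string with extras can be assembled programmatically. The sketch below is only an illustration: `KNOWN_EXTRAS` copies the names from the list above (other releases may differ), and the helpers `requirement_with_extras` and `pip_install_command` are hypothetical names, not part of PyThaiNLP or pip.

```python
import sys
from typing import Iterable, List

# Extras named in the list above; other releases may offer a different set.
KNOWN_EXTRAS = {
    "full", "attacut", "benchmarks", "icu", "ipa",
    "ml", "thai2fit", "thai2rom", "wordnet",
}


def requirement_with_extras(extras: Iterable[str]) -> str:
    """Build a requirement string such as 'pythainlp[attacut,wordnet]'."""
    names = sorted(set(extras))
    unknown = [name for name in names if name not in KNOWN_EXTRAS]
    if unknown:
        raise ValueError("unknown extras: " + ", ".join(unknown))
    if not names:
        return "pythainlp"
    return "pythainlp[" + ",".join(names) + "]"


def pip_install_command(extras: Iterable[str]) -> List[str]:
    """Return an argument list that installs pythainlp with the given extras."""
    # Using the current interpreter's pip keeps the install in this environment.
    return [sys.executable, "-m", "pip", "install", "--upgrade",
            requirement_with_extras(extras)]


if __name__ == "__main__":
    print(" ".join(pip_install_command(["wordnet", "attacut"])))
```

Passing the result to `subprocess.run` would perform the install; printing it first makes the command easy to review.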

Data directory

  • Some additional data, like word lists and language models, may be downloaded automatically at runtime.
  • PyThaiNLP caches this data under the directory `~/pythainlp-data` by default.
  • The data directory can be changed by setting the environment variable `PYTHAINLP_DATA_DIR`.
  • See the data catalog (`db.json`) at https://github.com/PyThaiNLP/pythainlp-corpus
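
The lookup rule described above (environment variable first, then `~/pythainlp-data`) can be sketched with the standard library alone. This mirrors the documented behavior only and is not the library's own code; `resolve_data_dir` is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "PYTHAINLP_DATA_DIR"
DEFAULT_DIR_NAME = "pythainlp-data"


def resolve_data_dir(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the data directory implied by the rule described above."""
    env = os.environ if environ is None else environ
    override = env.get(ENV_VAR)
    if override:
        # An explicit setting wins over the default location.
        return Path(override).expanduser()
    return Path.home() / DEFAULT_DIR_NAME


if __name__ == "__main__":
    print(resolve_data_dir())
```

To redirect the cache from a script, one option is to set `os.environ["PYTHAINLP_DATA_DIR"]` before importing `pythainlp`, on the assumption that the variable is read when the data path is first needed.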

Command-Line Interface

Some of PyThaiNLP's functionalities can be used at the command line, using the `thainlp` command.

For example, displaying a catalog of datasets:

thainlp data catalog

Showing how to use:

thainlp help
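
The same commands can be driven from a Python script. The sketch below assumes only that a `thainlp` executable is on `PATH` after installation; `thainlp_command` and `run_thainlp` are hypothetical helper names, and the output format of the CLI is not assumed.

```python
import shutil
import subprocess
from typing import List, Optional


def thainlp_command(*args: str) -> List[str]:
    """Build the argument list for a thainlp invocation."""
    return ["thainlp", *args]


def run_thainlp(*args: str) -> Optional[str]:
    """Run thainlp and return its standard output, or None if it is not installed."""
    if shutil.which("thainlp") is None:
        return None
    result = subprocess.run(
        thainlp_command(*args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text
        check=False,              # leave error handling to the caller
    )
    return result.stdout


if __name__ == "__main__":
    output = run_thainlp("data", "catalog")
    print(output if output is not None else "thainlp not found on PATH")
```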

Licenses

| | License |
|:---|:----|
| PyThaiNLP Source Code and Notebooks | Apache Software License 2.0 |
| Corpora, datasets, and documentation created by PyThaiNLP | Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0) |
| Language models created by PyThaiNLP | Creative Commons Attribution 4.0 International Public License (CC-by) |
| Other corpora and models that may be included with PyThaiNLP | See Corpus License |

Contribute to PyThaiNLP

  • Please do fork and create a pull request :)
  • For the style guide and other information, including references to the algorithms we use, please refer to our contributing page.

Citations

If you use `PyThaiNLP` in your project or publication, please cite the library as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354

or BibTeX entry:

@misc{pythainlp,
    author       = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
    title        = {{PyThaiNLP: Thai Natural Language Processing in Python}},
    month        = Jun,
    year         = 2016,
    doi          = {10.5281/zenodo.3519354},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3519354}
}

Sponsors

VISTEC-depa Thailand Artificial Intelligence Research Institute

Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute.


Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭

We have only one official repository at https://github.com/PyThaiNLP/pythainlp and another mirror at https://gitlab.com/pythainlp/pythainlp
Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.
