Developer: sagorbrur
License: MIT

bnlp

Bengali Natural Language Processing (BNLP)


BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, run Bengali POS tagging and named entity recognition, and construct neural models for Bengali NLP tasks.

Installation

Install with pip (tested on Python 3.5, 3.6, 3.7, and 3.8; tested on Linux and Windows):

  pip install bnlp_toolkit

Or upgrade:

  pip install -U bnlp_toolkit

Pretrained Model

Download Link

Training Details

  • The SentencePiece, Word2Vec, FastText, and GloVe models were trained on the Bengali Wikipedia dump dataset.
  • SentencePiece training vocab size = 50000
  • FastText was trained with total words = 20M, vocab size = 1171011, epochs = 50, embedding dimension = 300, and training loss = 0.318668.
  • Word2Vec word embedding dimension = 300
  • For details on the Bengali GloVe word vectors and their training process, see this repository.
  • Bengali CRF POS tagging was trained on the nltr dataset and reaches 80% accuracy.
  • Bengali CRF NER tagging was trained on this data and reaches 90% accuracy.
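
The accuracy figures above are given without a definition. As a rough sketch only, here is one common way to compute such a number: token-level accuracy, where gold and predicted tags are compared position by position. That this is how the reported figures were measured is an assumption, and `token_accuracy` is a hypothetical helper, not part of bnlp.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences must align token by token")
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += gold_tag == pred_tag  # True counts as 1
            total += 1
    return correct / total if total else 0.0


# Toy data: one sentence, one wrong tag out of four.
gold = [[("আমি", "PPR"), ("ভাত", "NC"), ("খাই", "VM"), ("।", "PU")]]
pred = [[("আমি", "PPR"), ("ভাত", "NC"), ("খাই", "NC"), ("।", "PU")]]
print(token_accuracy(gold, pred))  # 0.75
```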

Tokenization

  • Basic Tokenizer
  from bnlp import BasicTokenizer
  basic_tokenizer = BasicTokenizer()
  raw_text = "আমি বাংলায় গান গাই।"
  tokens = basic_tokenizer.tokenize(raw_text)
  print(tokens)

output: ["আমি", "বাংলায়", "গান", "গাই", "।"]
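
To give a feel for what a basic tokenizer does, here is a minimal standalone sketch that splits on whitespace and peels punctuation off as separate tokens. It is not the `BasicTokenizer` implementation; the punctuation set and the name `simple_tokenize` are assumptions made for illustration.

```python
import re

# Assumed punctuation set; a real tokenizer may treat more characters as punctuation.
# Explicit character classes are used instead of \w, because \w in Python's re
# does not cover Bengali vowel signs (they are combining marks, not letters).
TOKEN_PATTERN = re.compile(r"[।?!,;:]|[^\s।?!,;:]+")


def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("আমি ভাত খাই।")
print(tokens)  # four tokens: three words and the danda
```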

  • NLTK Tokenization
  from bnlp import NLTKTokenizer

  bnltk = NLTKTokenizer()
  text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
  word_tokens = bnltk.word_tokenize(text)
  sentence_tokens = bnltk.sentence_tokenize(text)
  print(word_tokens)
  print(sentence_tokens)

output

word_token: ["আমি", "ভাত", "খাই", "।", "সে", "বাজারে", "যায়", "।", "তিনি", "কি", "সত্যিই", "ভালো", "মানুষ", "?"]

sentence_token: ["আমি ভাত খাই।", "সে বাজারে যায়।", "তিনি কি সত্যিই ভালো মানুষ?"]
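
Sentence tokenization for Bengali mostly comes down to cutting after the danda (।), the question mark, and the exclamation mark. The sketch below shows that idea with the standard library only. It is not how `NLTKTokenizer` works internally, and `split_sentences` is a hypothetical name.

```python
import re

# A sentence is a run of non-terminator characters plus an optional terminator.
SENTENCE_PATTERN = re.compile(r"[^।?!]+[।?!]?")


def split_sentences(text):
    """Split text after each danda, question mark, or exclamation mark."""
    pieces = (match.strip() for match in SENTENCE_PATTERN.findall(text))
    return [piece for piece in pieces if piece]


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
sentences = split_sentences(text)
print(len(sentences))  # 3
```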

  • Bengali SentencePiece Tokenization

    • Tokenization using a trained model
    ```py
    from bnlp import SentencepieceTokenizer

    bsp = SentencepieceTokenizer()
    model_path = "./model/bnspm.model"
    input_text = "আমি ভাত খাই। সে বাজারে যায়।"
    tokens = bsp.tokenize(model_path, input_text)
    print(tokens)
    text2id = bsp.text2id(model_path, input_text)
    print(text2id)
    id2text = bsp.id2text(model_path, text2id)
    print(id2text)
    ```

    • Training SentencePiece
    ```py
    from bnlp import SentencepieceTokenizer

    bsp = SentencepieceTokenizer()
    data = "raw_text.txt"
    model_prefix = "test"
    vocab_size = 5
    bsp.train(data, model_prefix, vocab_size)
    ```
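
The text-to-id and id-to-text round trip can be pictured with a toy subword vocabulary. The sketch below uses greedy longest-match segmentation over a hand-made piece list, with "▁" marking a preceding space as SentencePiece does. It is an illustration only: real SentencePiece models use a learned unigram or BPE segmentation, and `encode`, `decode`, and the vocabulary here are all hypothetical.

```python
def encode(text, vocab):
    """Greedy longest-match segmentation of text into piece ids."""
    text = "▁" + text.replace(" ", "▁")
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No piece matches: emit the unknown id and skip one character.
            ids.append(vocab["<unk>"])
            i += 1
    return ids


def decode(ids, pieces):
    """Join pieces back together and turn the space markers into spaces."""
    return "".join(pieces[i] for i in ids).replace("▁", " ").strip()


# Hand-made toy vocabulary (an assumption, not taken from any trained model).
pieces = ["<unk>", "▁আমি", "▁ভাত", "▁খা", "ই", "।"]
vocab = {piece: index for index, piece in enumerate(pieces)}

text = "আমি ভাত খাই।"
ids = encode(text, vocab)
print(ids)                  # [1, 2, 3, 4, 5]
print(decode(ids, pieces))  # the original text
```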

Word Embedding

  • Bengali Word2Vec

    • Generate Vector Using Pretrained Model
    from bnlp import BengaliWord2Vec

    bwv = BengaliWord2Vec()
    model_path = "bengali_word2vec.model"
    word = 'আমার'
    vector = bwv.generate_word_vector(model_path, word)
    print(vector.shape)
    print(vector)

    • Find Most Similar Word Using Pretrained Model
    from bnlp import BengaliWord2Vec

    bwv = BengaliWord2Vec()
    model_path = "bengali_word2vec.model"
    word = 'গ্রাম'
    similar = bwv.most_similar(model_path, word)
    print(similar)

    • Train Bengali Word2Vec with your own data
    from bnlp import BengaliWord2Vec
    bwv = BengaliWord2Vec()
    data_file = "raw_text.txt"
    model_name = "test_model.model"
    vector_name = "test_vector.vector"
    bwv.train(data_file, model_name, vector_name)
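
Finding the most similar word in an embedding space is usually a cosine-similarity ranking. The sketch below shows that ranking on three made-up 3-dimensional vectors. The vectors, the helper names, and the assumption that `most_similar` ranks by cosine similarity are illustrative, not taken from bnlp.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(word, vectors, topn=2):
    """Return the topn other words, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


village, city, river = "গ্রাম", "শহর", "নদী"
# Toy vectors invented for this example.
toy_vectors = {
    village: [0.9, 0.1, 0.0],
    city: [0.8, 0.2, 0.1],
    river: [0.1, 0.9, 0.2],
}
ranked = rank_similar(village, toy_vectors)
print(ranked)  # city first, then river
```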
    
    

  • Bengali FastText

    To use fasttext, you need to install it manually:

    pip install fasttext==0.9.2

    NB: fasttext may not work on Windows; it will only work on Linux.

    • Generate Vector Using Pretrained Model
    from bnlp.embedding.fasttext import BengaliFasttext

    bft = BengaliFasttext()
    word = "গ্রাম"
    model_path = "bengali_fasttext_wiki.bin"
    word_vector = bft.generate_word_vector(model_path, word)
    print(word_vector.shape)
    print(word_vector)

    • Train Bengali FastText Model
    from bnlp.embedding.fasttext import BengaliFasttext

    bft = BengaliFasttext()
    data = "raw_text.txt"
    model_name = "saved_model.bin"
    epoch = 50
    bft.train(data, model_name, epoch)
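
FastText builds a word's vector from its character n-grams, which is why it can return a vector even for a word it did not see during training. The sketch below lists those n-grams for a word wrapped in boundary markers. The n-gram range of 3 to 6 and the helper name `char_ngrams` are assumptions for illustration, and the sketch works on Unicode code points, so a Bengali conjunct spans several positions.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in < and > markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("abc", min_n=3, max_n=3))  # ['<ab', 'abc', 'bc>']
print(len(char_ngrams("গ্রাম", min_n=3, max_n=3)))
```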

  • Bengali GloVe Word Vectors

We trained a GloVe model on Bengali data (Wikipedia and news articles) and published the resulting Bengali GloVe word vectors. You can download them and use them for your own machine learning purposes.

  from bnlp import BengaliGlove
  glove_path = "bn_glove.39M.100d.txt"
  word = "গ্রাম"
  bng = BengaliGlove()
  res = bng.closest_word(glove_path, word)
  print(res)
  vec = bng.word2vec(glove_path, word)
  print(vec)
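
If you want to read the vectors yourself, the sketch below parses the plain-text layout commonly used for GloVe files: one word per line followed by space-separated floats. That `bn_glove.39M.100d.txt` follows this layout is an assumption, and `load_glove` is a hypothetical helper rather than part of bnlp.

```python
import io


def load_glove(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Two made-up 3-dimensional entries standing in for a real vector file.
sample = io.StringIO("গ্রাম 0.9 0.1 0.0\nশহর 0.8 0.2 0.1\n\n")
vectors = load_glove(sample)
print(len(vectors))
```

With a real file you would pass an open file handle instead, for example `open(glove_path, encoding="utf-8")`.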

Bengali POS Tagging

  • Bengali CRF POS Tagging

    • Find POS Tag Using Pretrained Model
    from bnlp import POS
    bn_pos = POS()
    model_path = "model/bn_pos.pkl"
    text = "আমি ভাত খাই।"
    res = bn_pos.tag(model_path, text)
    print(res)
    # [('আমি', 'PPR'), ('ভাত', 'NC'), ('খাই', 'VM'), ('।', 'PU')]
    
    

    • Train POS Tag Model
    from bnlp import POS
    bn_pos = POS()
    model_name = "pos_model.pkl"
    tagged_sentences = [
        [('রপ্তানি', 'JJ'), ('দ্রব্য', 'NC'), ('-', 'PU'), ('তাজা', 'JJ'), ('ও', 'CCD'), ('শুকনা', 'JJ'), ('ফল', 'NC'), (',', 'PU'), ('আফিম', 'NC'), (',', 'PU'), ('পশুচর্ম', 'NC'), ('ও', 'CCD'), ('পশম', 'NC'), ('এবং', 'CCD'), ('কার্পেট', 'NC'), ('৷', 'PU')],
        [('মাটি', 'NC'), ('থেকে', 'PP'), ('বড়জোর', 'JQ'), ('চার', 'JQ'), ('পাঁচ', 'JQ'), ('ফুট', 'CCL'), ('উঁচু', 'JJ'), ('হবে', 'VM'), ('৷', 'PU')],
    ]
    bn_pos.train(model_name, tagged_sentences)
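
A CRF tagger does not see raw `(token, tag)` pairs; it is trained on per-token feature dictionaries and a parallel list of labels. The sketch below shows one plausible way to turn tagged sentences into that shape. The feature set and helper names are assumptions for illustration, and the features bnlp actually uses may differ.

```python
def token_features(tokens, i):
    """Build a small feature dict for the token at position i."""
    word = tokens[i]
    return {
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }


def to_crf_inputs(tagged_sentences):
    """Split tagged sentences into feature sequences X and label sequences y."""
    X, y = [], []
    for sentence in tagged_sentences:
        tokens = [token for token, _ in sentence]
        X.append([token_features(tokens, i) for i in range(len(tokens))])
        y.append([tag for _, tag in sentence])
    return X, y


sample = [[("আমি", "PPR"), ("ভাত", "NC"), ("খাই", "VM"), ("।", "PU")]]
X, y = to_crf_inputs(sample)
print(y)
print(X[0][0])
```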

Bengali NER

  • Bengali CRF NER

    • Find NER Tag Using Pretrained Model
    from bnlp import NER
    bn_ner = NER()
    model_path = "model/bn_ner.pkl"
    text = "সে ঢাকায় থাকে।"
    result = bn_ner.tag(model_path, text)
    print(result)
    # [('সে', 'O'), ('ঢাকায়', 'S-LOC'), ('থাকে', 'O')]
    
    

    • Train NER Tag Model
    from bnlp import NER
    bn_ner = NER()
    model_name = "ner_model.pkl"
    tagged_sentences = [
        [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')],
        [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')],
        [('ত্রাণ', 'O'), ('ও', 'O'), ('সমাজকল্যাণ', 'O'), ('সম্পাদক', 'S-PER'), ('সুজিত', 'B-PER'), ('রায়', 'I-PER'), ('নন্দী', 'E-PER'), ('প্রমুখ', 'O'), ('সংবাদ', 'O'), ('সম্মেলনে', 'O'), ('উপস্থিত', 'O'), ('ছিলেন', 'O')],
    ]
    bn_ner.train(model_name, tagged_sentences)
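
The tags in these examples (`S-`, `B-`, `I-`, `E-`, `O`) look like a BIOES-style scheme: `S` for a single-token entity, `B`/`I`/`E` for the beginning, inside, and end of a multi-token entity, and `O` for everything else. Under that reading, which is an assumption, the sketch below groups a tagger's output into entity spans. `extract_entities` is a hypothetical helper, not part of bnlp.

```python
def extract_entities(tagged_tokens):
    """Group BIOES-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            entities.append((token, entity_type))
            current_tokens, current_type = [], None
        elif prefix == "B":
            current_tokens, current_type = [token], entity_type
        elif prefix in ("I", "E") and current_type == entity_type:
            current_tokens.append(token)
            if prefix == "E":
                entities.append((" ".join(current_tokens), current_type))
                current_tokens, current_type = [], None
        else:
            # "O" or an inconsistent tag: drop any partial span.
            current_tokens, current_type = [], None
    return entities


tagged = [("সে", "O"), ("ঢাকায়", "S-LOC"), ("থাকে", "O")]
entities = extract_entities(tagged)
print(entities)  # one LOC entity
```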

Bengali Corpus Class

  • Stopwords and Punctuations

```py
from bnlp.corpus import stopwords, punctuations

stopwords = stopwords()
print(stopwords)
print(punctuations)
```

  • Remove stopwords from Text

```py
from bnlp.corpus import stopwords
from bnlp.corpus.util import remove_stopwords

stopwords = stopwords()
raw_text = 'আমি ভাত খাই।' 
result = remove_stopwords(raw_text, stopwords)
print(result)
# ['ভাত', 'খাই', '।']
```
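
Conceptually, stopword removal is tokenization followed by a membership filter. The standalone sketch below reproduces that idea with a two-word toy stopword list. The tokenizer pattern, the toy list, and the name `remove_stopwords_sketch` are assumptions for illustration, not the bnlp implementation.

```python
import re

# Assumed tokenizer: punctuation marks become their own tokens.
TOKEN_PATTERN = re.compile(r"[।?!,;:]|[^\s।?!,;:]+")


def remove_stopwords_sketch(text, stopword_list):
    """Tokenize text and drop every token found in stopword_list."""
    stopword_set = set(stopword_list)  # set lookup is O(1) per token
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in stopword_set]


toy_stopwords = ["আমি", "সে"]
result = remove_stopwords_sketch("আমি ভাত খাই।", toy_stopwords)
print(result)  # three tokens remain
```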

Contributor Guide

See the CONTRIBUTING.md page for details.

Thanks To

Contributor List

Extra Contributor
