by raghakot

raghakot / keras-text

Text Classification Library in Keras

423 Stars 101 Forks Last release: Not found MIT License 16 Commits 1 Releases


Keras Text Classification Library


keras-text is a one-stop text classification library that implements various state-of-the-art models and offers a clean, extensible interface for implementing custom architectures.

Quick start

Create a tokenizer to build your vocabulary

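  • To represent your dataset as (docs, words), use WordTokenizer.
  • To represent your dataset as (docs, sentences, words), use the corresponding sentence-level tokenizer.
  • To create arbitrary hierarchies, extend the tokenizer base class and implement the required method (see the API docs).

from keras_text.processing import WordTokenizer

tokenizer = WordTokenizer()
tokenizer.build_vocab(texts)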

Want to tokenize with character tokens to leverage character models? Use the library's character-level tokenizer instead.
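
To make the idea of "building a vocabulary" concrete, here is a minimal sketch in plain Python. It is not how keras-text does it (the library tokenizes with spaCy); the function names `build_token_index` and `encode`, the whitespace splitting, and the choice to reserve id 0 for padding are all assumptions made for illustration.

```python
from collections import Counter


def build_token_index(texts, min_count=1):
    """Map each token to an integer id; id 0 is left free for padding."""
    counts = Counter(token for text in texts for token in text.lower().split())
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(
        (token for token, count in counts.items() if count >= min_count),
        key=lambda token: (-counts[token], token),
    )
    return {token: i for i, token in enumerate(ordered, start=1)}


def encode(text, token_index):
    """Convert a text into a list of ids, skipping out-of-vocabulary tokens."""
    return [token_index[t] for t in text.lower().split() if t in token_index]


texts = ["the cat sat", "the dog sat down"]
token_index = build_token_index(texts)
encoded = [encode(text, token_index) for text in texts]
```

Each document becomes a list of integer ids, which is the (docs, words) layout described above. Splitting each document into sentences first would give the (docs, sentences, words) layout.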


Build a dataset

A dataset encapsulates tokenizer, X, y and the test set. This allows you to focus your efforts on trying various architectures/hyperparameters without having to worry about inconsistent evaluation. A dataset can be saved and loaded from the disk.

from keras_text.data import Dataset

ds = Dataset(X, y, tokenizer=tokenizer)
ds.update_test_indices(test_size=0.1)
ds.save('dataset')

The update_test_indices method automatically stratifies multi-class or multi-label data correctly.
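
For readers unfamiliar with stratification, the following sketch shows the idea for the single-label, multi-class case: the test set takes roughly the same fraction from every class. It is an illustrative stand-in, not the library's implementation; `stratified_test_indices` and its `seed` parameter are hypothetical names, and the multi-label case is not covered here.

```python
import random
from collections import defaultdict


def stratified_test_indices(labels, test_size=0.1, seed=0):
    """Pick test indices so each class contributes about `test_size` of its samples."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    test_indices = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one test sample per class.
        n_test = max(1, round(len(indices) * test_size))
        test_indices.extend(indices[:n_test])
    return sorted(test_indices)


labels = ["pos"] * 80 + ["neg"] * 20
test_idx = stratified_test_indices(labels, test_size=0.1)
```

With an 80/20 class balance and `test_size=0.1`, the test set holds 8 "pos" and 2 "neg" samples, so the evaluation set mirrors the class distribution of the full data.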

Build text classification models

See tests/ folder for usage.

Word based models

When the dataset is represented as (docs, words), word-based models can be created using TokenModelFactory.

from keras_text.models import TokenModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN

# RNN models can use max_tokens=None to indicate variable-length words per mini-batch.
factory = TokenModelFactory(1, tokenizer.token_index, max_tokens=100, embedding_type='glove.6B.100d')
word_encoder_model = YoonKimCNN()
model = factory.build_model(token_encoder_model=word_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

Currently supported models include:

  • Yoon Kim CNN
  • Stacked RNNs
  • Attention (with/without context) based RNN encoders.

TokenModelFactory.build_model uses the provided word encoder to encode the tokens; the encoded output is then classified.
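
The difference between a fixed max_tokens and max_tokens=None can be sketched as follows. This is a plain-Python illustration under assumed conventions (padding on the right with id 0); `pad_batch` is a hypothetical helper, and the library's own padding behaviour may differ.

```python
def pad_batch(sequences, max_tokens=None, pad_id=0):
    """Pad (or truncate) id sequences to a common length.

    With max_tokens=None the length is that of the longest sequence in this
    batch, so different mini-batches may end up with different widths.
    """
    if max_tokens is None:
        max_tokens = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = list(seq[:max_tokens])  # truncate sequences that are too long
        padded.append(kept + [pad_id] * (max_tokens - len(kept)))
    return padded


batch = [[5, 2, 9], [7], [4, 4, 4, 4, 1]]
fixed = pad_batch(batch, max_tokens=4)  # every row has exactly 4 ids
variable = pad_batch(batch)             # width follows the longest row (5)
```

A fixed width suits convolutional encoders that expect a constant input shape, while a per-batch width avoids wasted padding for recurrent encoders.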

Sentence based models

When the dataset is represented as (docs, sentences, words), sentence-based models can be created using SentenceModelFactory.

from keras_text.models import SentenceModelFactory
from keras_text.models import YoonKimCNN, AttentionRNN, StackedRNN, AveragingEncoder

# Pad max sentences per doc to 500 and max words per sentence to 200.
# Can also use max_sents=None to allow variable-sized max_sents per mini-batch.
factory = SentenceModelFactory(10, tokenizer.token_index, max_sents=500, max_tokens=200, embedding_type='glove.6B.100d')
word_encoder_model = AttentionRNN()
sentence_encoder_model = AttentionRNN()

# Allows you to compose arbitrary word encoders followed by a sentence encoder.
model = factory.build_model(word_encoder_model, sentence_encoder_model)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()

Currently supported models include:

  • Yoon Kim CNN
  • Stacked RNNs
  • Attention (with/without context) based RNN encoders.

SentenceModelFactory.build_model creates a tiered model in which the words within a sentence are first encoded using word_encoder_model. The resulting per-sentence encodings are then encoded using sentence_encoder_model.

  • Hierarchical attention networks (HANs) can be built by composing two attention-based RNN models. This is useful when a document is very large.
  • For smaller documents, a reasonable way to encode a sentence is to average the words within it. This can be done by using AveragingEncoder as the word encoder.
  • Mix and match encoders as you see fit for your problem.
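
The tiered encoding described above can be illustrated without any deep learning framework. The sketch below averages word vectors into sentence vectors and then averages those into a document vector. It only mirrors the structure of the approach: the toy `embeddings` table and the helper names are made up for the example, and averaging at the sentence level is a simplification of what a real sentence encoder would do.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def encode_document(doc, embeddings):
    """Encode a (sentences, words) document as a single vector.

    Tier 1: the words in each sentence are averaged into a sentence vector.
    Tier 2: the sentence vectors are averaged into a document vector.
    """
    sentence_vectors = [
        average_vectors([embeddings[word] for word in sentence])
        for sentence in doc
    ]
    return average_vectors(sentence_vectors)


# Toy 2-dimensional embeddings, invented for this example.
embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}
doc = [["good", "movie"], ["bad", "movie"]]
doc_vector = encode_document(doc, embeddings)
```

Swapping either tier for a learned encoder (for example an attention-based RNN) gives the hierarchical models listed above; the data flow from words to sentences to document stays the same.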


TODO: Update documentation and add notebook examples.

Stay tuned for better documentation and examples. Until then, the best resource is the API docs.


Installation

1) Install Keras with the Theano or TensorFlow backend. Note that this library requires Keras > 2.0.

2) Install keras-text

From sources

sudo python setup.py install

PyPI package

sudo pip install keras-text

3) Download target spacy model

keras-text uses the excellent spaCy library for tokenization. See the spaCy instructions on how to download the model for your target language.


Citation

Please cite keras-text in your publications if it helped your research. Here is an example BibTeX entry:

  author={Kotikalapudi, Raghavendra and contributors},
