
Cherry


Cherry - Text classification with no machine learning knowledge needed

| Cherry | Windson |
| ---- | ---- |
| Download | https://pypi.python.org/pypi/cherry |
| Source | https://github.com/Windsooon/cherry |
| Keywords | text classification |

Documentation

Features

Easy to use, no machine learning knowledge needed

Text classification in 5 minutes, no machine learning knowledge needed. We also provide extra features for users who want to improve their model.

Three built-in models to play with

Cherry ships with three built-in English models that you can try before using your own dataset. For more info, check out Built-in models.

Easy to debug and optimize performance

Cherry provides the performance() and display() APIs to help you debug and improve your model.

Requirements

- Python (3.6, 3.7, 3.8)

Installation

Install using pip:

pip install cherry
# Cherry uses nltk for text tokenizing
pip install nltk
# After installing nltk, you need to download punkt
>>> import nltk
>>> nltk.download('punkt')

or clone the project from GitHub:

git clone git@github.com:Windsooon/cherry.git

Built-in models

  • The 20 Newsgroups dataset

    This dataset contains 11,315 news articles organized into 20 different newsgroups, each corresponding to one of the topics below:

    - alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware
    - comp.sys.mac.hardware, comp.windows.x, misc.forsale, rec.autos
    - rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.crypt
    - sci.electronics, sci.med, sci.space, soc.religion.christian
    - talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc
    
  • Comics & Graphic book review

    This dataset contains 108,463 reviews from the Goodreads book review website. Every book review has a rating from 0 to 5 points.

  • SMS Spam Collection

    This dataset contains 5,578 SMS messages: spam manually extracted from the Grumbletext Web site, plus randomly chosen ham messages from the NUS SMS Corpus (NSC).
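
Any of these built-in models can be tried right away. Below is a minimal sketch (the sample sentence is made up) that trains the `newsgroups` model and classifies a short text with it, using the train(), classify(), and get_probability() calls described in the rest of this document:

>>> import cherry
>>> # The first call downloads the dataset and trains it; later calls reuse the saved model.
>>> cherry.train('newsgroups')
>>> res = cherry.classify('newsgroups', text='The team won the baseball game last night.')
>>> res.get_probability()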

Example

Use built-in model

Cherry has three built-in models: `email`, `review`, and `newsgroups`. For instance, in the Comics & Graphic book review dataset, every book review has a rating from 0 to 5 points. Suppose you want to predict a rating based on this book review:

This is an extremely entertaining and often insightful collection by Nobel physicist Richard Feynman drawn from slices of his life experiences. Some might believe that the telling of a physicist’s life would be droll fare for anyone other than a fellow scientist, but in this instance, nothing could be further from the truth.

After finishing the installation, run this in your project path:

# You only need to run this line of code the first time.
# It will:
# 1. Download the `review` dataset from the remote server (users in China may need a VPN)
# 2. Train on the dataset using the default settings ([CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html))
>>> cherry.train('review')

Then you can use `classify()` to predict the rating:
>>> res = cherry.classify('review', text='This is an extremely entertaining and often
insightful collection by Nobel physicist Richard Feynman drawn from slices of his life
experiences. Some might believe that the telling of a physicist’s life would be droll
fare for anyone other than a fellow scientist, but in this instance, nothing could be
further from the truth.')

The returned `res` is a Classify object with two built-in methods. `get_probability()` returns an array containing the probability of each category. The order of the returned array depends on the category names; in this case that is 0, 1, 2, 3, 4, 5.
# The probability that this review was rated 4 points is 99.6%
>>> res.get_probability()
array([[6.99908424e-11, 2.48677319e-11, 6.17978214e-06, 3.39472694e-03,
    9.96313288e-01, 2.85805135e-04]])

Another method, `get_word_list()`, returns a list containing the words Cherry used for classifying.
>>> res.get_word_list()
[[(2, 'physicist'), (2, 'life'), (1, 'truth'), (1, 'telling'), (1, 'slices'), (1, 'scientist'), (1, 'richard'), (1, 'nobel'), (1, 'instance'), (1, 'insightful'), (1, 'feynman'), (1, 'fellow'), (1, 'fare'), (1, 'extremely'), (1, 'experiences'), (1, 'entertaining'), (1, 'droll'), (1, 'drawn'), (1, 'collection'), (1, 'believe')]]

As you can see, some of the words in the review didn't show up here. There are two reasons for this: 1) the training data didn't contain the word (for instance, the words `Backend` and `Engineer` never show up in the training data); 2) the word is a stop word. 'you' and 'your' are stop words by default; you can find all the stop words Cherry uses here.
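
If the default stop word list does not fit your data, one option (a sketch, not the library default) is to pass your own scikit-learn vectorizer with a custom stop word list through the `vectorizer` parameter described in the API section below; `my_stop_words` here is a hypothetical list:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import cherry
>>> # Hypothetical custom stop word list; adjust it for your own dataset.
>>> my_stop_words = ['you', 'your', 'the', 'and']
>>> cherry.train('review', vectorizer=CountVectorizer(stop_words=my_stop_words))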

Use your own data

Create a folder `your_model_name` under `datasets` in your project path:
├── project path
│   ├── datasets
|   │   ├── your_model_name
|   │   │   ├── category1
|   |   │     ├── file_1
|   |   │     ├── file_2
|   |   │     ├── …
|   │   │   ├── category2
|   |   │     ├── file_10
|   |   │     ├── file_11
|   |   │     ├── …

Train your dataset:

# By default, the encoding is utf-8.
# You only need to run `train` the first time.
>>> cherry.train('your_model_name', encoding='your_encoding')
# Classify text; `text` can also be a list of texts.
>>> res = cherry.classify('your_model_name', text='text to be classified')
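
As the comment above notes, `text` can also be a list. A small sketch (with placeholder texts) of classifying several documents in one call:

>>> texts = ['first text to classify', 'second text to classify']
>>> res = cherry.classify('your_model_name', text=texts)
>>> res.get_probability()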

Quick Start

Let's build an email classifier from scratch; Cherry will use this model to predict whether an email is spam or not.

Project setup

mkdir tutorial
cd tutorial

Create a virtual environment to isolate our package dependencies locally

python3 -m venv env
source env/bin/activate  # On Windows use env\Scripts\activate

Install cherry and nltk

pip install cherry
pip install nltk
>>> import nltk
>>> nltk.download('punkt')

Create a new folder for email dataset

mkdir -p datasets/email_tutorial

Prepare dataset

  1. Download the dataset from SMS Spam Collection v. 1, then unzip it and put it inside the `tutorial/datasets/email_tutorial` folder. You now have a file named `SMSSpamCollection.txt` which contains lots of emails.
  2. Create folders named `ham` and `spam` inside the `email_tutorial` directory.
  3. Create a script `email.py` in the same folder using the code below to extract the message content and group the messages by category. Each output file will contain only the text of one message.

    import os

    ham_counter = 0
    spam_counter = 0

    # Split SMSSpamCollection.txt into one file per message,
    # grouped into the `ham` and `spam` folders by label.
    with open('SMSSpamCollection.txt', 'r') as f:
        for line in f.readlines():
            if line.startswith('ham'):
                ham_counter += 1
                with open(os.path.join('ham', str(ham_counter)), 'w') as nf:
                    _, text = line.split('ham', 1)
                    nf.write(text.strip())
            else:
                spam_counter += 1
                with open(os.path.join('spam', str(spam_counter)), 'w') as nf:
                    _, text = line.split('spam', 1)
                    nf.write(text.strip())

  4. Now your folder structure should look like this:

    tutorial
       ├── datasets
       │   ├── email_tutorial
       |   |   ├── email.py
       |   |   ├── SMSSpamCollection.txt
       │   │   ├── ham
       │   │   ├── spam
    
  5. Run `python email.py`.
  6. Delete `SMSSpamCollection.txt` and `email.py`.
  7. Go back to the `tutorial` path, e.g. `cd path_to/tutorial`.
  8. Train the email model:

    >>> import cherry
    >>> cherry.train('email_tutorial', encoding='latin1')
    
  9. Inside the `email_tutorial` folder you can find `clf.pkz`, `ve.pkz`, and `email_tutorial.pkz`, which Cherry will use for classification later.
     >>> res = cherry.classify('email_tutorial', 'Thank you for your interest in cherry! We wanted to let you '
          'know we received your application for Backend Engineer, and we are delighted that you '
          'would consider joining our team.')
     # 99.9% probability that this is a ham email
     >>> res.get_probability()
     array([[9.99985571e-01, 1.44288379e-05]])
     >>> res.get_word_list()
     [[(1, 'wanted'), (1, 'thank'), (1, 'team'), (1, 'received'), (1, 'let'),
     (1, 'joining'), (1, 'consider'), (1, 'application')]]
    
  10. If you want to know how well your model did, you can use performance(), which uses k-fold cross-validation (by default, k equals 10):

     >>> res = cherry.performance('email_tutorial', encoding='latin1', output='files')
     >>> res.get_score()
    

The report will be saved in the `report` file, where you can find the precision, recall, and f1-score.
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       485
           1       0.97      0.95      0.96        73

    accuracy                           0.99       558
   macro avg       0.98      0.97      0.98       558
weighted avg       0.99      0.99      0.99       558

If you want to know which texts have been classified incorrectly:

     >>> res = cherry.performance('email_tutorial', encoding='latin1')
     >>> res.get_score()
     Text: Dhoni have luck to win some big title.so we will win:) has been classified as: 1 should be: 0
     Text: Back 2 work 2morro half term over! Can U C me 2nite 4 some sexy passion B4 I have 2 go back? Chat NOW 09099726481 Luv DENA Calls £1/minMobsmoreLKPOBOX177HP51FL has been classified as: 0 should be: 1
     Text: Latest News! Police station toilet stolen, cops have nothing to go on! has been classified as: 0 should be: 1
     ...
  11. To display the graph, you can use:

    >>> res.display('email_tutorial', encoding='latin1')
    


  12. If you want to improve your model, you can use the search() method:

    >>> parameters = {'clf__alpha': [0.1, 0.5, 1],'clf__fit_prior': [True, False]}
    >>> cherry.search('email_tutorial', parameters)
    

API

def train(model, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None)

  • model (String)

    The name of the model. You can use the built-in models `email`, `review`, and `newsgroups`, or pass the folder name of your dataset.
  • language (String)

    The language of the training dataset. Cherry supports `English` and `Chinese`.
  • preprocessing (function)

    The function will be called once for every input document before training (see the sketch after this parameter list).

  • categories (List)

    Specify the training directories (categories) to use, for instance ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc'].

  • encoding (String)

    The encoding of the dataset.

  • vectorizer (Sklearn object)

    The feature extraction function used to convert the data into vectors; by default this is `CountVectorizer()`. You can pass a different feature extraction function from Sklearn: for long texts you can use `TfidfVectorizer()`, and if you need to save memory you can use `HashingVectorizer()` (the get_word_list() function won't work in that case). See the sketch after this parameter list.
  • vectorizer_method (String)

    A shortcut for setting up the feature extraction function when `vectorizer` is `None`. `Count` corresponds to `CountVectorizer(tokenizer=tokenizer, stop_words=get_stop_words(model))`, `Tfidf` corresponds to `TfidfVectorizer`, and `Hashing` corresponds to `HashingVectorizer`.
  • clf (Sklearn object)

    The classifier; by default this is `MultinomialNB()`. You can pass a different classifier from Sklearn (see the sketch after this parameter list).
  • clf_method (String)

    A shortcut for setting up the classifier when `clf` is `None`. `MNB` corresponds to `MultinomialNB(alpha=0.1)`, `SGD` corresponds to `SGDClassifier`, `RandomForest` corresponds to `RandomForestClassifier`, and `AdaBoost` corresponds to `AdaBoostClassifier`.
  • x_data (numpy array)

    The training text data. If `x_data` and `y_data` are None, Cherry will try to find the text files in the `model` folder.
  • y_data (numpy array)

    The corresponding label data. If `x_data` and `y_data` are None, Cherry will try to find the text files in the `model` folder.
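
As a rough illustration of the `preprocessing`, `vectorizer`, and `clf` parameters above, here is a sketch; it assumes the preprocessing callback returns the transformed text, and uses `your_model_name`, `TfidfVectorizer`, and `SGDClassifier` purely as examples, not as defaults:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> import cherry
>>> # Hypothetical preprocessing callback, called once per input document before training
>>> # (assumes the callback returns the transformed text).
>>> def lowercase(text):
...     return text.lower()
...
>>> cherry.train('your_model_name',
...              preprocessing=lowercase,
...              vectorizer=TfidfVectorizer(stop_words='english'),
...              clf=SGDClassifier())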

def classify(model, text)

  • model (String)

    The name of the model. You can use the built-in models `email`, `review`, and `newsgroups`, or pass the folder name of your dataset.
  • text (List / String)

    The text to be classified.

def performance(model, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None, n_splits=10, output='Stdout')

Same parameters as the train() API, plus:
  • n_splits (Integer)

    number of folds. Must be at least 2.

  • output ('Stdout' or 'Files')

    'Stdout' will print the scores to standard output and 'Files' will store the scores in a local file named 'report'.

def search(model, parameters, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None, method='RandomizedSearchCV', cv=3, n_jobs=-1):

TODO

def display(model, language='English', preprocessing=None, categories=None, encoding='utf-8', vectorizer=None, vectorizer_method='Count', clf=None, clf_method='MNB', x_data=None, y_data=None)

Same parameters as the train() API.

Tests

python runtests.py
