skift

by shaypal5

skift |skift_icon|

|PyPI-Status| |Downloads| |PyPI-Versions| |Build-Status| |Codecov| |Codefactor| |LICENCE|

.. |skift_icon| image:: https://github.com/shaypal5/skift/blob/be1f8e84d311f926fd39e8ea421525782b4cb39f/skift.png

``scikit-learn`` wrappers for Python ``fastText``.

.. code-block:: python

  >>> import pandas
  >>> from skift import FirstColFtClassifier
  >>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
  >>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
  >>> sk_clf.fit(df[['txt']], df['lbl'])
  >>> sk_clf.predict([['woof']])
  [0]

.. contents::

.. section-numbering::

Installation

Dependencies:

  • numpy
  • scipy
  • scikit-learn
  • The ``fasttext`` Python package

.. code-block:: bash

  pip install skift

Configuration

Because ``fasttext`` reads input data from files, ``skift`` has to dump the input data into temporary files for ``fasttext`` to use. A dedicated folder is created for those files on the filesystem. By default, this storage is allocated in the system temporary storage location (i.e. ``/tmp`` on \*nix systems). To override this default location, use the ``SKIFT_TEMP_DIR`` environment variable:

.. code-block:: bash

  export SKIFT_TEMP_DIR=/path/to/desired/temp/folder

NOTE: The directory will be created if it does not already exist.
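
The lookup described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual implementation in ``skift``: ``resolve_temp_dir`` is a hypothetical helper, and the ``skift`` sub-folder name used for the default location is made up for the example.

.. code-block:: python

  import os
  import tempfile


  def resolve_temp_dir(env_var="SKIFT_TEMP_DIR", default_name="skift"):
      """Hypothetical helper: return the folder temporary files would go in."""
      override = os.environ.get(env_var)
      if override:
          # The environment variable, when set, takes precedence.
          path = override
      else:
          # Otherwise fall back to the system temporary storage location.
          # The sub-folder name is an assumption made for this sketch.
          path = os.path.join(tempfile.gettempdir(), default_name)
      # Create the directory if it does not already exist.
      os.makedirs(path, exist_ok=True)
      return path

Calling ``resolve_temp_dir()`` after ``export SKIFT_TEMP_DIR=/path/to/desired/temp/folder`` would return that path, creating the folder on first use.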

Features

  • Adheres to the ``scikit-learn`` classifier API, including ``predict_proba``.
  • Also caters to the common use case of ``pandas.DataFrame`` inputs.
  • Enables easy stacking of ``fastText`` with other types of ``scikit-learn``-compliant classifiers.
  • Pickle-able classifier objects.
  • Built around the official ``fasttext`` Python package.
  • Pure Python.
  • Supports Python 3.5+.
  • Fully tested on Linux, OSX and Windows operating systems.

Wrappers

``fastText`` works only on text data, which means that it will only use a single column from a dataset which might contain many feature columns of different types. As such, a common use case is to have the ``fastText`` classifier use a single column as input, ignoring other columns. This is especially true when ``fastText`` is to be used as one of several classifiers in a stacking classifier, with other classifiers using non-textual features.

``skift`` includes several ``scikit-learn``-compatible wrappers (for the official ``fastText`` Python package) which cater to these use cases.

NOTICE: Any additional keyword arguments provided to the classifier constructor, besides those required, will be forwarded to the ``fastText.train_supervised`` method on every call to ``fit``.
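
The forwarding behaviour can be pictured with a small stand-in. This is a minimal sketch, not the code used by ``skift``: ``ForwardingClassifier`` and ``fake_train_supervised`` are hypothetical names introduced only for illustration, and the stand-in trainer merely records the keyword arguments it receives.

.. code-block:: python

  def fake_train_supervised(input, **kwargs):
      """Hypothetical stand-in for a training function; echoes its arguments."""
      return {"input": input, "params": dict(kwargs)}


  class ForwardingClassifier:
      """Illustrative only: keeps extra constructor kwargs and reuses them in fit."""

      def __init__(self, **kwargs):
          # Keyword arguments are stored as-is, not interpreted here.
          self.kwargs = kwargs
          self.model = None

      def fit(self, train_path):
          # The same stored kwargs are passed along on every call to fit.
          self.model = fake_train_supervised(input=train_path, **self.kwargs)
          return self


  clf = ForwardingClassifier(lr=0.3, epoch=10).fit("train.txt")
  print(clf.model["params"])  # {'lr': 0.3, 'epoch': 10}

Because the arguments are stored on the object rather than consumed in the constructor, refitting the same instance reuses the same training parameters.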

Standard wrappers

These wrappers do not make additional assumptions on input besides those commonly made by ``scikit-learn`` classifiers; i.e. that input is a 2d ``ndarray`` object and such.

  • ``FirstColFtClassifier`` - An sklearn classifier adapter for fasttext that takes the first column of input ``ndarray`` objects as input.

.. code-block:: python

  >>> import pandas
  >>> from skift import FirstColFtClassifier
  >>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
  >>> sk_clf = FirstColFtClassifier(lr=0.3, epoch=10)
  >>> sk_clf.fit(df[['txt']], df['lbl'])
  >>> sk_clf.predict([['woof']])
  [0]

  • ``IdxBasedFtClassifier`` - An sklearn classifier adapter for fasttext that takes input by column index. This is set on object construction by providing the ``input_ix`` parameter to the constructor.

.. code-block:: python

  >>> import pandas
  >>> from skift import IdxBasedFtClassifier
  >>> df = pandas.DataFrame([[5, 'woof', 0], [83, 'meow', 1]], columns=['count', 'txt', 'lbl'])
  >>> sk_clf = IdxBasedFtClassifier(input_ix=1, lr=0.4, epoch=6)
  >>> sk_clf.fit(df[['count', 'txt']], df['lbl'])
  >>> sk_clf.predict([[5, 'woof']])  # the text must sit at column index 1
  [0]
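
Selecting a text column by position, which is conceptually what both standard wrappers do, might look like the following. This is a minimal sketch under stated assumptions: ``select_column`` is a hypothetical helper, not part of ``skift``, and it works on any row-indexable 2d input such as nested lists or an ``ndarray``.

.. code-block:: python

  def select_column(X, input_ix=0):
      """Hypothetical helper: pull one column out of row-major 2d input."""
      # Each row is indexed by position; all other columns are ignored.
      return [str(row[input_ix]) for row in X]


  rows = [[5, "woof"], [83, "meow"]]
  print(select_column(rows, input_ix=1))  # ['woof', 'meow']
  print(select_column([["woof"]]))        # ['woof']

With the default ``input_ix=0`` the helper mirrors the first-column case; any other index mirrors the index-based case, and every row passed in must be wide enough to contain that index.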

pandas-dependent wrappers

These wrappers assume the ``X`` parameter given to ``fit``, ``predict``, and ``predict_proba`` methods is a ``pandas.DataFrame`` object:

  • ``FirstObjFtClassifier`` - An sklearn adapter for fasttext using the first column of ``dtype == object`` as input.

.. code-block:: python

  >>> import pandas
  >>> from skift import FirstObjFtClassifier
  >>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
  >>> sk_clf = FirstObjFtClassifier(lr=0.2)
  >>> sk_clf.fit(df[['txt']], df['lbl'])
  >>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))  # X is a DataFrame
  [0]

  • ``ColLblBasedFtClassifier`` - An sklearn adapter for fasttext taking input by column label. This is set on object construction by providing the ``input_col_lbl`` parameter to the constructor.

.. code-block:: python

  >>> import pandas
  >>> from skift import ColLblBasedFtClassifier
  >>> df = pandas.DataFrame([['woof', 0], ['meow', 1]], columns=['txt', 'lbl'])
  >>> sk_clf = ColLblBasedFtClassifier(input_col_lbl='txt', epoch=8)
  >>> sk_clf.fit(df[['txt']], df['lbl'])
  >>> sk_clf.predict(pandas.DataFrame([['woof']], columns=['txt']))  # X is a DataFrame
  [0]
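
The two selection rules used by the pandas-dependent wrappers can be illustrated with plain ``pandas``. This is a sketch only: ``first_object_column`` is a hypothetical helper rather than anything exported by ``skift``, and the text column is given ``dtype=object`` explicitly so the example does not depend on how a particular pandas version infers string columns.

.. code-block:: python

  import pandas


  def first_object_column(df):
      """Hypothetical helper: label of the first column whose dtype is object."""
      for label in df.columns:
          if df[label].dtype == object:
              return label
      raise ValueError("no column of dtype object found")


  df = pandas.DataFrame({
      "count": [5, 83],
      # dtype is set explicitly; this is an assumption made for the sketch
      "txt": pandas.Series(["woof", "meow"], dtype=object),
  })
  label = first_object_column(df)
  print(label)            # txt
  print(list(df[label]))  # ['woof', 'meow']
  print(list(df["txt"]))  # selecting by column label yields the same values

Selection by dtype is convenient when the text column's name is not known in advance, while selection by label is the more explicit choice when it is.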

Contributing

Package author and current maintainer is Shay Palachy; you are more than welcome to approach him for help. Contributions are very welcome.

Installing for development

Clone:

.. code-block:: bash

  git clone git@github.com:shaypal5/skift.git

Install in development mode, including test dependencies:

.. code-block:: bash

  cd skift
  pip install -e '.[test]'

To also install ``fasttext``, see instructions in the Installation section.

Running the tests

To run the tests use:

.. code-block:: bash

  cd skift
  pytest

Adding documentation

The project is documented using the `numpy docstring conventions`_, which were chosen as they are perhaps the most widespread conventions that are both supported by common tools such as Sphinx and result in human-readable docstrings. When documenting code you add to this project, follow `these conventions`_.

.. _numpy docstring conventions: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
.. _these conventions: https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

Additionally, if you update this ``README.rst`` file, use ``python setup.py checkdocs`` to validate it compiles.

Credits

Created by Shay Palachy.

Fixes: uniaz, crouffer, amirzamli and sgt.

.. |PyPI-Status| image:: https://img.shields.io/pypi/v/skift.svg
  :target: https://pypi.python.org/pypi/skift

.. |PyPI-Versions| image:: https://img.shields.io/pypi/pyversions/skift.svg
  :target: https://pypi.python.org/pypi/skift

.. |Build-Status| image:: https://travis-ci.org/shaypal5/skift.svg?branch=master
  :target: https://travis-ci.org/shaypal5/skift

.. |LICENCE| image:: https://github.com/shaypal5/skift/blob/master/mitlicensebadge.svg
  :target: https://github.com/shaypal5/skift/blob/master/LICENSE

.. https://img.shields.io/github/license/shaypal5/skift.svg

.. |Codecov| image:: https://codecov.io/github/shaypal5/skift/coverage.svg?branch=master
  :target: https://codecov.io/github/shaypal5/skift?branch=master

.. |Downloads| image:: https://pepy.tech/badge/skift
  :target: https://pepy.tech/project/skift
  :alt: PePy stats

.. |Codefactor| image:: https://www.codefactor.io/repository/github/shaypal5/skift/badge?style=plastic
  :target: https://www.codefactor.io/repository/github/shaypal5/skift
  :alt: Codefactor code quality

.. Trigerring Travis builds
