Extract keywords from sentences or replace keywords in sentences
==================================================================


.. image:: https://api.travis-ci.org/vi3k6i5/flashtext.svg?branch=master
   :target: https://travis-ci.org/vi3k6i5/flashtext
   :alt: Build Status

.. image:: https://readthedocs.org/projects/flashtext/badge/?version=latest
   :target: http://flashtext.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation Status

.. image:: https://badge.fury.io/py/flashtext.svg
   :target: https://badge.fury.io/py/flashtext
   :alt: Version

.. image:: https://coveralls.io/repos/github/vi3k6i5/flashtext/badge.svg?branch=master
   :target: https://coveralls.io/github/vi3k6i5/flashtext?branch=master
   :alt: Test coverage

.. image:: https://img.shields.io/github/license/mashape/apistatus.svg?maxAge=2592000
   :target: https://github.com/vi3k6i5/flashtext/blob/master/LICENSE
   :alt: license

This module can be used to replace keywords in sentences or extract keywords from sentences. It is based on the
`FlashText algorithm <https://arxiv.org/abs/1711.00046>`_.


::

    $ pip install flashtext

Documentation can be found at
`FlashText Read the Docs <http://flashtext.readthedocs.io/>`_.


Extract keywords
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
    >>> keyword_processor.add_keyword('Big Apple', 'New York')
    >>> keyword_processor.add_keyword('Bay Area')
    >>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
    >>> keywords_found
    >>> # ['New York', 'Bay Area']


Replace keywords
    >>> keyword_processor.add_keyword('New Delhi', 'NCR region')
    >>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
    >>> new_sentence
    >>> # 'I love New York and NCR region.'


Case Sensitive example
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor(case_sensitive=True)
    >>> keyword_processor.add_keyword('Big Apple', 'New York')
    >>> keyword_processor.add_keyword('Bay Area')
    >>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
    >>> keywords_found
    >>> # ['Bay Area']


Span of keywords extracted
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_processor.add_keyword('Big Apple', 'New York')
    >>> keyword_processor.add_keyword('Bay Area')
    >>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.', span_info=True)
    >>> keywords_found
    >>> # [('New York', 7, 16), ('Bay Area', 21, 29)]


Get Extra information with keywords extracted
    >>> from flashtext import KeywordProcessor
    >>> kp = KeywordProcessor()
    >>> kp.add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))
    >>> kp.add_keyword('Delhi', ('Location', 'Delhi'))
    >>> kp.extract_keywords('Taj Mahal is in Delhi.')
    >>> # [('Monument', 'Taj Mahal'), ('Location', 'Delhi')]
    >>> # NOTE: replace_keywords feature won't work with this.


No clean name for Keywords
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_processor.add_keyword('Big Apple')
    >>> keyword_processor.add_keyword('Bay Area')
    >>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
    >>> keywords_found
    >>> # ['Big Apple', 'Bay Area']


Add Multiple Keywords simultaneously
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_dict = {
    ...     "java": ["java_2e", "java programing"],
    ...     "product management": ["PM", "product manager"]
    ... }
    >>> # {'clean_name': ['list of unclean names']}
    >>> keyword_processor.add_keywords_from_dict(keyword_dict)
    >>> # Or add keywords from a list:
    >>> keyword_processor.add_keywords_from_list(["java", "python"])
    >>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
    >>> # output ['product management', 'java']


To Remove keywords
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_dict = {
    ...     "java": ["java_2e", "java programing"],
    ...     "product management": ["PM", "product manager"]
    ... }
    >>> keyword_processor.add_keywords_from_dict(keyword_dict)
    >>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
    >>> # output ['product management', 'java']
    >>> keyword_processor.remove_keyword('java_2e')
    >>> # you can also remove keywords from a list/dictionary
    >>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
    >>> keyword_processor.remove_keywords_from_list(["java programing"])
    >>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
    >>> # output ['product management']


To check Number of terms in KeywordProcessor
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_dict = {
    ...     "java": ["java_2e", "java programing"],
    ...     "product management": ["PM", "product manager"]
    ... }
    >>> keyword_processor.add_keywords_from_dict(keyword_dict)
    >>> print(len(keyword_processor))
    >>> # output 4


To check if term is present in KeywordProcessor
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_processor.add_keyword('j2ee', 'Java')
    >>> 'j2ee' in keyword_processor
    >>> # output: True
    >>> keyword_processor.get_keyword('j2ee')
    >>> # output: Java
    >>> keyword_processor['colour'] = 'color'
    >>> keyword_processor['colour']
    >>> # output: color


Get all keywords in dictionary
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_processor.add_keyword('j2ee', 'Java')
    >>> keyword_processor.add_keyword('colour', 'color')
    >>> keyword_processor.get_all_keywords()
    >>> # output: {'colour': 'color', 'j2ee': 'Java'}

For detecting word boundaries, any character other than ``\w`` (``[A-Za-z0-9_]``) is currently considered a word boundary.


To set or add characters as part of word characters
    >>> from flashtext import KeywordProcessor
    >>> keyword_processor = KeywordProcessor()
    >>> keyword_processor.add_keyword('Big Apple')
    >>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
    >>> # ['Big Apple']
    >>> keyword_processor.add_non_word_boundary('/')
    >>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
    >>> # []

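
The word boundary rule can be modelled in a few lines of plain Python. This is only a sketch of the idea and not FlashText's internals: ``WORD_CHARS`` and ``is_whole_word_match`` are hypothetical names, the search is case sensitive, and it uses plain string search.

.. code-block:: python

    import string

    # Characters treated as part of a word, mirroring [A-Za-z0-9_].
    WORD_CHARS = set(string.ascii_letters + string.digits + "_")


    def is_whole_word_match(text, keyword, word_chars=WORD_CHARS):
        """Return True if keyword occurs in text with a boundary on both sides."""
        start = text.find(keyword)
        while start != -1:
            end = start + len(keyword)
            before_ok = start == 0 or text[start - 1] not in word_chars
            after_ok = end == len(text) or text[end] not in word_chars
            if before_ok and after_ok:
                return True
            # Try the next occurrence, if any.
            start = text.find(keyword, start + 1)
        return False


    sentence = "I love Big Apple/Bay Area."
    # '/' is not a word character by default, so it ends the keyword cleanly.
    default_match = is_whole_word_match(sentence, "Big Apple")
    # Treating '/' as a word character makes "Apple/Bay" one unbroken token.
    extended_match = is_whole_word_match(sentence, "Big Apple", WORD_CHARS | {"/"})
    print(default_match, extended_match)

Passing a larger character set plays the same role here as registering an extra non-word-boundary character on the processor.
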

To run the tests::

    $ git clone https://github.com/vi3k6i5/flashtext
    $ cd flashtext
    $ pip install pytest
    $ python setup.py test


To build the documentation locally::

    $ git clone https://github.com/vi3k6i5/flashtext
    $ cd flashtext/docs
    $ pip install sphinx
    $ make html
    $ # open _build/html/index.html in browser to view it locally

FlashText is a custom algorithm based on the
Aho-Corasick algorithm_ and a
Trie Dictionary_.

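
To make the trie idea concrete, here is a minimal pure-Python sketch of trie-based keyword extraction. It is an illustration under simplifying assumptions (lower-cased matching, ASCII word characters, longest match wins) and not the library's actual implementation; ``build_trie``, ``extract`` and the ``_end_`` marker are hypothetical names.

.. code-block:: python

    import string

    WORD_CHARS = set(string.ascii_letters + string.digits + "_")
    END = "_end_"  # hypothetical marker key: "a keyword ends at this node"


    def build_trie(keyword_map):
        """Build a nested-dict trie from {keyword: clean_name}."""
        trie = {}
        for keyword, clean_name in keyword_map.items():
            node = trie
            for char in keyword.lower():
                node = node.setdefault(char, {})
            node[END] = clean_name
        return trie


    def extract(text, trie):
        """Return clean names of the longest keywords found at word starts."""
        found = []
        lowered = text.lower()
        n = len(lowered)
        i = 0
        while i < n:
            # Only start matching at the beginning of a word.
            if i > 0 and lowered[i - 1] in WORD_CHARS:
                i += 1
                continue
            node, j = trie, i
            match_name, match_end = None, i
            while j < n and lowered[j] in node:
                node = node[lowered[j]]
                j += 1
                # Record a match only if it ends on a word boundary.
                if END in node and (j == n or lowered[j] not in WORD_CHARS):
                    match_name, match_end = node[END], j
            if match_name is not None:
                found.append(match_name)
                i = match_end  # continue after the matched keyword
            else:
                i += 1
        return found


    trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area"})
    print(extract("I love Big Apple and Bay Area.", trie))

In this sketch the work done at each position is bounded by the length of the longest keyword, not by the number of keywords loaded into the trie.
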

.. image:: https://github.com/vi3k6i5/flashtext/raw/master/benchmark.png
   :target: https://twitter.com/RadimRehurek/status/904989624589803520
   :alt: Benchmark

Time taken by FlashText to find terms in comparison to Regex.

.. image:: https://thepracticaldev.s3.amazonaws.com/i/xruf50n6z1r37ti8rd89.png

Time taken by FlashText to replace terms in comparison to Regex.

.. image:: https://thepracticaldev.s3.amazonaws.com/i/k44ghwp8o712dm58debj.png

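
For context, the regex side of such a comparison is typically one compiled alternation of all the terms. The sketch below is an assumption about what that kind of baseline looks like and is not the benchmark code itself; ``compile_keywords`` is a hypothetical helper.

.. code-block:: python

    import re


    def compile_keywords(keywords):
        """Compile one case-insensitive pattern matching any keyword as a whole word."""
        # Longest first, so longer keywords win over their own prefixes.
        ordered = sorted(keywords, key=len, reverse=True)
        alternation = "|".join(re.escape(k) for k in ordered)
        return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


    pattern = compile_keywords(["Big Apple", "Bay Area"])
    matches = pattern.findall("I love Big Apple and Bay Area.")
    print(matches)

Every extra term lengthens the alternation the regex engine may have to try at each position, which is the kind of cost the charts above compare against.
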
Link to code for benchmarking the
Find Feature_ and
Replace Feature_.
The idea for this library came from the following
StackOverflow question_.
The original paper on the
`FlashText algorithm <https://arxiv.org/abs/1711.00046>`__ can be cited as follows.


::

    @ARTICLE{2017arXiv171100046S,
        author = {{Singh}, V.},
        title = "{Replace or Retrieve Keywords In Documents at Scale}",
        journal = {ArXiv e-prints},
        archivePrefix = "arXiv",
        eprint = {1711.00046},
        primaryClass = "cs.DS",
        keywords = {Computer Science - Data Structures and Algorithms},
        year = 2017,
        month = oct,
        adsurl = {http://adsabs.harvard.edu/abs/2017arXiv171100046S},
        adsnote = {Provided by the SAO/NASA Astrophysics Data System}
    }

An article about FlashText was also published on
Medium freeCodeCamp_.

The project is licensed under the MIT license.