PYthon Automated Term Extraction

Code style: black | Built with spaCy | License: MIT

Python implementation of term extraction algorithms such as C-Value, Basic, Combo Basic, Weirdness and Term Extractor using spaCy POS tagging.

If you have a suggestion for another ATE algorithm you would like implemented in this package, feel free to file it as an issue, along with the paper the algorithm is based on.

For ATE packages implemented in Scala and Java, see ATR4S and JATE, respectively.

:tada: Installation

Using pip:

```bash
pip install pyate https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz
```

Models

Though this package was originally intended for symbolic AI algorithms (non-machine learning), I realized that a spaCy model for term extraction can reach significantly higher performance, and thus decided to include the model here.

For a comparison with the symbolic AI algorithms, see Precision. Note that only the F-score, precision, and recall have been measured for the model so far, whereas AvP was measured for the algorithms, so comparing the metrics directly would not make much sense.

| URL | F-Score (%) | Precision (%) | Recall (%) |
| ------------- | ------------- | ------------- | ------------- |
| https://github.com/kevinlu1248/pyate/releases/download/v0.4.2/en_acl_terms_sm-2.0.4.tar.gz | 94.71 | 95.41 | 94.03 |

The model was trained and evaluated on the ACL dataset, a computer-science-oriented dataset in which the terms are manually picked. However, it has not yet been tested on other fields.

This model does not come with PyATE. To install, run

```bash
pip install https://github.com/kevinlu1248/pyate/releases/download/v0.4.2/en_acl_terms_sm-2.0.3.tar.gz
```

To extract terms,

```python3
import spacy

nlp = spacy.load("en_acl_terms_sm")
doc = nlp("Hello world, I am a term extraction algorithm.")
print(doc.ents)
"""
(term extraction, algorithm)
"""
```

:rocket: Quickstart

To get started, simply call one of the implemented algorithms. According to Astrakhantsev 2016, `combo_basic` is the most precise of the five algorithms, though `basic` and `cvalues` are not too far behind (see Precision). The same study shows that PU-ATR and KeyConceptRel have higher precision than `combo_basic`, but they are not implemented here, and PU-ATR takes significantly more time since it uses machine learning.

```python3
from pyate import combo_basic

# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/
string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors, are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the U connection between the inflammatory response and cancer."""

print(combo_basic(string).sort_values(ascending=False))
""" (Output)
dysfunctional tumor                1.443147
tumor suppressors                  1.443147
genetic changes                    1.386294
cancer cells                       1.386294
dysfunctional tumor suppressors    1.298612
logical framework                  0.693147
sufficient growth                  0.693147
death signals                      0.693147
many aspects                       0.693147
inflammatory response              0.693147
tumor promotion                    0.693147
ancillary processes                0.693147
tumor environment                  0.693147
reflexive relationship             0.693147
particular focus                   0.693147
physiologic processes              0.693147
tissue homeostasis                 0.693147
cancer development                 0.693147
dtype: float64
"""
```

If you would like to add this to a spaCy pipeline, simply use spaCy's `add_pipe` method.

```python3
import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(TermExtractionPipeline())
doc = nlp(string)
print(doc._.combo_basic.sort_values(ascending=False).head(5))
""" (Output)
dysfunctional tumor                1.443147
tumor suppressors                  1.443147
genetic changes                    1.386294
cancer cells                       1.386294
dysfunctional tumor suppressors    1.298612
dtype: float64
"""
```

Also, `TermExtractionPipeline.__init__` is defined as follows:

```python3
__init__(
    self,
    func: Callable[..., pd.Series] = combo_basic,
    *args,
    **kwargs
)
```

where `func` is essentially your term-extracting algorithm: it takes in a corpus (either a string or an iterator of strings) and outputs a Pandas Series of term-value pairs, mapping terms to their respective termhoods. `func` is `combo_basic` by default. `args` and `kwargs` let you override the default values of the function, which you can find by running `help` (might document later on).

Summary of functions

Each of `cvalues`, `basic`, `combo_basic`, `weirdness` and `term_extractor` takes in a string or an iterator of strings and outputs a Pandas Series of term-value pairs, where higher values indicate a higher chance of being a domain-specific term. Furthermore, `weirdness` and `term_extractor` take a `general_corpus` keyword argument, which must be an iterator of strings and defaults to the General Corpus described below.
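
To illustrate why a contrasting corpus matters, here is a minimal sketch of the weirdness idea: a word's relative frequency in the technical corpus divided by its relative frequency in a general corpus. It is not pyate's implementation. It scores single whitespace-separated words rather than POS-filtered candidates, and the name `weirdness_sketch`, the toy corpora, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter


def weirdness_sketch(technical_docs, general_docs):
    """Toy weirdness score per word (hypothetical helper, not part of pyate)."""
    tech = Counter(w for doc in technical_docs for w in doc.lower().split())
    gen = Counter(w for doc in general_docs for w in doc.lower().split())
    tech_total = sum(tech.values())
    gen_total = sum(gen.values())
    scores = {}
    for word, count in tech.items():
        tech_rel = count / tech_total
        # Add-one smoothing (an assumption) so unseen general words do not divide by zero.
        gen_rel = (gen[word] + 1) / (gen_total + 1)
        scores[word] = tech_rel / gen_rel
    return scores


technical = ["tumor cells grow", "tumor growth"]
general = ["the cells of the plant", "the the the"]
scores = weirdness_sketch(technical, general)

# Words that are frequent in the technical corpus but absent from the general one rank highest.
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.3f}")
```

In this toy run, "tumor" outranks "cells" because "cells" also appears in the general corpus.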

All functions take only one mandatory input, the string from which you would like to extract terms (the `technical_corpus`). Other tweakable settings include `general_corpus` (the contrasting corpus for `weirdness` and `term_extractor`), `general_corpus_size`, `verbose` (whether to print a progress bar), `weights`, `smoothing`, `have_single_word` (whether to have a single word count as a phrase) and `threshold`. If you have not read the papers and are unfamiliar with the algorithms, I recommend just using the default settings. Again, use `help` to find the details regarding each algorithm, since they are all different.

General Corpus

Under `path/to/site-packages/pyate/default_general_domain.en.csv`, there is a CSV file of a general corpus, specifically 3000 random sentences from Wikipedia. Its source can be found at https://www.kaggle.com/mikeortman/wikipedia-sentences. Access it using the following after installing `pyate`.

```python3
import os
import sysconfig

import pandas as pd

# distutils was removed in Python 3.12, so locate site-packages via sysconfig instead
site_packages = sysconfig.get_paths()["purelib"]
df = pd.read_csv(os.path.join(site_packages, "pyate", "default_general_domain.en.csv"))["SECTION_TEXT"]
print(df.head())
""" (Output)
0    '''Anarchism''' is a political philosophy that...
1    The term ''anarchism'' is a compound word comp...
2    ===Origins===\nWoodcut from a Diggers document...
3    Portrait of philosopher Pierre-Joseph Proudhon...
4    consistent with anarchist values is a controve...
Name: SECTION_TEXT, dtype: object
"""
```

Other Languages

To switch languages, simply run `Term_Extraction.set_language({language}, {model_name})`, where `model_name` defaults to `language`. For example, use `Term_Extraction.set_language("it", "it_core_news_sm")` for Italian. By default, the language is English. So far, only English (en) and Italian (it) are supported.

To add more languages, file an issue with a corpus of at least 3000 paragraphs of a general domain in the desired language (preferably from Wikipedia), named `default_general_domain.{lang}.csv`, replacing `lang` with the ISO-639-1 code of the language, or the ISO-639-2 code if the language does not have an ISO-639-1 code (the codes can be found at https://www.loc.gov/standards/iso639-2/php/codelist.php). The file should have the following form to be parsable by Pandas.

```
,SECTION_TEXT
0,"{paragraph0}"
1,"{paragraph1}"
...
```

Alternatively, place the file in `src/pyate` and file a pull request.
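
If your paragraphs are already in a Python list, a file in this shape can be produced with the standard library alone. The sketch below is only a suggestion: `write_general_corpus` is a hypothetical helper that is not part of pyate, and the two Italian paragraphs are placeholder data.

```python
import csv
import os
import tempfile


def write_general_corpus(paragraphs, lang, directory):
    """Write paragraphs to default_general_domain.{lang}.csv and return the path.

    Hypothetical helper for illustration; pyate does not ship this function.
    """
    path = os.path.join(directory, f"default_general_domain.{lang}.csv")
    with open(path, "w", newline="", encoding="utf-8") as f:
        # Header: an unnamed index column followed by the text column.
        f.write(",SECTION_TEXT\n")
        # QUOTE_NONNUMERIC leaves the integer index bare and quotes the paragraph text.
        writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
        writer.writerows(enumerate(paragraphs))
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_general_corpus(["Primo paragrafo.", "Secondo paragrafo."], "it", tmp)
    demo_name = os.path.basename(demo_path)
    with open(demo_path, encoding="utf-8") as f:
        demo_text = f.read()

print(demo_name)
print(demo_text)
```

Quoting every paragraph keeps commas and line breaks inside a paragraph from being read as column or row separators.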

:dart: Precision

Here is the average precision of some of the implemented algorithms, measured with the Average Precision (AvP) metric on seven distinct datasets, as tested in Astrakhantsev 2016.

[Figure: Evaluation]
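
For readers unfamiliar with the metric, average precision rewards rankings that place true terms near the top. The sketch below is a generic textbook formulation, not code from pyate or from the Astrakhantsev 2016 evaluation. The function name, the toy ranking, the gold set, and the choice to normalize by the number of gold terms found in the ranking are all assumptions.

```python
def average_precision(ranked_terms, gold_terms):
    """Mean of precision@k over the ranks k at which a gold term appears."""
    hits = 0
    precisions = []
    for k, term in enumerate(ranked_terms, start=1):
        if term in gold_terms:
            hits += 1
            precisions.append(hits / k)  # precision among the top k candidates
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking, best-scored candidate first; the gold set is made up for the example.
ranked = ["tumor suppressors", "many aspects", "cancer cells"]
gold = {"tumor suppressors", "cancer cells"}
avp = average_precision(ranked, gold)
print(round(avp, 4))
```

Here the gold terms sit at ranks 1 and 3, so the score is the mean of 1/1 and 2/3.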

:stars: Motivation

This project was planned as a tool to be connected to a Google Chrome extension that highlights and defines key terms the reader probably does not know. Furthermore, term extraction receives little focused research compared with other areas of NLP and, especially recently, is not viewed as very practical because of the more general tool of NER tagging. However, modern NER tagging usually incorporates some combination of memorized words and deep learning, which are heavy in both space and computation. Moreover, a list of memorized words will not do if an algorithm is to generalize to recognizing terms in the ever-growing areas of medical and AI research.

None of the five implemented algorithms is expensive; in fact, the bottleneck in both space allocation and computation is the spaCy model and spaCy POS tagging. This is because the algorithms mostly rely on POS patterns, word frequencies, and the existence of embedded term candidates. For example, the term candidate "breast cancer" implies that "malignant breast cancer" is probably not a term and simply a form of "breast cancer" that is "malignant" (implemented in C-Value).
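
To make the role of embedded candidates concrete, here is a minimal sketch of the C-value score as I understand it from the C-value/NC-value paper listed in Sources: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, then weighted by the log of its length. This is not pyate's `cvalues` code. The helper names and the frequencies are invented for illustration, and candidate extraction by POS patterns is skipped entirely.

```python
import math


def contains(longer, shorter):
    """True if the token tuple `shorter` occurs contiguously inside `longer`."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_value_sketch(freqs):
    """Toy C-value over a {candidate: frequency} dict (hypothetical helper)."""
    tokens = {term: tuple(term.split()) for term in freqs}
    scores = {}
    for term, freq in freqs.items():
        # Frequencies of the longer candidates in which this candidate is nested.
        parents = [freqs[other] for other in freqs if contains(tokens[other], tokens[term])]
        adjusted = freq - sum(parents) / len(parents) if parents else freq
        scores[term] = math.log2(len(tokens[term])) * adjusted
    return scores


# Invented counts: "breast cancer" occurs often, mostly outside the longer phrase.
toy_freqs = {"breast cancer": 10, "malignant breast cancer": 2, "tumor suppressors": 3}
scores = c_value_sketch(toy_freqs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.3f}")
```

With these counts, "breast cancer" keeps most of its score because it mainly occurs on its own, while the rarer "malignant breast cancer" ranks below it.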

:pushpin: Todo

  • Add other languages and data encapsulation for set language
  • Add automated tests and CI/CD
  • Add a brief CLI
  • Make NER version of this using the datasets from the sources
  • Add PU-ATR algorithm since its precision is a lot higher, though more computationally expensive
  • Page Rank algorithm
  • Add sources
  • Add voting algorithm and capabilities
  • Optimize, perhaps using Cython; however, the bottleneck is POS tagging by spaCy and word counting with Pandas and NumPy, which already run at C level, so this will not help much
  • Clearer documentation
  • Allow GPU acceleration with Cupy

:bookmark_tabs: Sources

I cannot seem to find the original Basic and Combo Basic papers, but I found papers that referenced them. "ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala" more or less summarizes everything and incorporates several algorithms not in this package.

  • Automatic Recognition of Multi-word Terms: The C-value/NC-value Method
  • Domain-independent term extraction through domain modelling
  • ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala
  • TermExtractor: a Web Application to Learn the Shared Terminology of Emergent Web Communities
  • Learning Domain Ontologies from Document Warehouses and Dedicated Web Sites
  • A Comparative Evaluation of Term Recognition Algorithms
  • SemRe-Rank: Improving Automatic Term Extraction By Incorporating Semantic Relatedness With Personalised PageRank
  • Term extraction: A Review Draft Version 091221
