python-stanford-corenlp

by stanfordnlp

Python interface to CoreNLP using a bidirectional server-client interface.

462 Stars 111 Forks Last release: Not found MIT License 58 Commits 0 Releases

Available items

No Items, yet!

The developer of this repository has not created any items for sale yet. Need a bug fixed? Help with integration? A different license? Create a request here:

Stanford CoreNLP Python Interface

NOTE: This package is now deprecated. Please use the

stanza 
_ package instead.

.. image:: https://travis-ci.org/stanfordnlp/python-stanford-corenlp.svg?branch=master :target: https://travis-ci.org/stanfordnlp/python-stanford-corenlp

This package contains a python interface for

Stanford CoreNLP
_ that contains a reference implementation to interface with the
Stanford CoreNLP server
_. The package also contains a base class to expose a python-based annotation provider (e.g. your favorite neural NER system) to the CoreNLP pipeline via a lightweight service.

To use the package, first download the

official java CoreNLP release 
, unzip it, and define an environment variable :code:`$CORENLPHOME` that points to the unzipped directory.

You can also install this package from

PyPI 
_ using :code:
pip install stanford-corenlp

Command Line Usage

Probably the easiest way to use this package is through the

annotate
command-line utility::
usage: annotate [-h] [-i INPUT] [-o OUTPUT] [-f {json}]
                [-a ANNOTATORS [ANNOTATORS ...]] [-s] [-v] [-m MEMORY]
                [-p PROPS [PROPS ...]]

Annotate data

optional arguments: -h, --help show this help message and exit -i INPUT, --input INPUT Input file to process; each line contains one document (default: stdin) -o OUTPUT, --output OUTPUT File to write annotations to (default: stdout) -f {json}, --format {json} Output format -a ANNOTATORS [ANNOTATORS ...], --annotators ANNOTATORS [ANNOTATORS ...] A list of annotators -s, --sentence-mode Assume each line of input is a sentence. -v, --verbose-server Server is made verbose -m MEMORY, --memory MEMORY Memory to use for the server -p PROPS [PROPS ...], --props PROPS [PROPS ...] Properties as a list of key=value pairs

We recommend using

annotate
in conjuction with the wonderful
jq
command to process the output. As an example, given a file with a sentence on each line, the following command produces an equivalent space-separated tokens::
cat file.txt | annotate -s -a tokenize | jq '[.tokens[].originalText]' > tokenized.txt

Annotation Server Usage

.. code-block:: python

import corenlp

text = "Chris wrote a simple sentence that he parsed with Stanford CoreNLP."

# We assume that you've downloaded Stanford CoreNLP and defined an environment # variable $CORENLP_HOME that points to the unzipped directory. # The code below will launch StanfordCoreNLPServer in the background # and communicate with the server to annotate the sentence. with corenlp.CoreNLPClient(annotators="tokenize ssplit pos lemma ner depparse".split()) as client: ann = client.annotate(text)

# You can access annotations using ann. sentence = ann.sentence[0]

# The corenlp.totext function is a helper function that # reconstructs a sentence from tokens. assert corenlp.totext(sentence) == text

# You can access any property within a sentence. print(sentence.text)

# Likewise for tokens token = sentence.token[0] print(token.lemma)

# Use tokensregex patterns to find who wrote a sentence. pattern = '([ner: PERSON]+) /wrote/ /an?/ []{0,3} /sentence|article/' matches = client.tokensregex(text, pattern) # sentences contains a list with matches for each sentence. assert len(matches["sentences"]) == 1 # length tells you whether or not there are any matches in this assert matches["sentences"][0]["length"] == 1 # You can access matches like most regex groups. matches["sentences"][1]["0"]["text"] == "Chris wrote a simple sentence" matches["sentences"][1]["0"]["1"]["text"] == "Chris"

# Use semgrex patterns to directly find who wrote what. pattern = '{word:wrote} >nsubj {}=subject >dobj {}=object' matches = client.semgrex(text, pattern) # sentences contains a list with matches for each sentence. assert len(matches["sentences"]) == 1 # length tells you whether or not there are any matches in this assert matches["sentences"][0]["length"] == 1 # You can access matches like most regex groups. matches["sentences"][1]["0"]["text"] == "wrote" matches["sentences"][1]["0"]["$subject"]["text"] == "Chris" matches["sentences"][1]["0"]["$object"]["text"] == "sentence"

See

test_client.py
and
test_protobuf.py
for more examples. Props to @dan-zheng for tokensregex/semgrex support.

Annotation Service Usage

NOTE: The annotation service allows users to provide a custom annotator to be used by the CoreNLP pipeline. Unfortunately, it relies on experimental code internal to the Stanford CoreNLP project is not yet available for public use.

.. code-block:: python

import corenlp
from .happyfuntokenizer import Tokenizer

class HappyFunTokenizer(Tokenizer, corenlp.Annotator): def init(self, preserve_case=False): Tokenizer.init(self, preserve_case) corenlp.Annotator.init(self)

@property
def name(self):
    """
    Name of the annotator (used by CoreNLP)
    """
    return "happyfun"

@property
def requires(self):
    """
    Requires has to specify all the annotations required before we
    are called.
    """
    return []

@property
def provides(self):
    """
    The set of annotations guaranteed to be provided when we are done.
    NOTE: that these annotations are either fully qualified Java
    class names or refer to nested classes of
    edu.stanford.nlp.ling.CoreAnnotations (as is the case below).
    """
    return ["TextAnnotation",
            "TokensAnnotation",
            "TokenBeginAnnotation",
            "TokenEndAnnotation",
            "CharacterOffsetBeginAnnotation",
            "CharacterOffsetEndAnnotation",
           ]

def annotate(self, ann):
    """
    @ann: is a protobuf annotation object.
    Actually populate @ann with tokens.
    """
    buf, beg_idx, end_idx = ann.text.lower(), 0, 0
    for i, word in enumerate(self.tokenize(ann.text)):
        token = ann.sentencelessToken.add()
        # These are the bare minimum required for the TokenAnnotation
        token.word = word
        token.tokenBeginIndex = i
        token.tokenEndIndex = i+1

        # Seek into the txt until you can find this word.
        try:
            # Try to update beginning index
            beg_idx = buf.index(word, beg_idx)
        except ValueError:
            # Give up -- this will be something random
            end_idx = beg_idx + len(word)

        token.beginChar = beg_idx
        token.endChar = end_idx

        beg_idx, end_idx = end_idx, end_idx

annotator = HappyFunTokenizer()

Calling .start() will launch the annotator as a service running on

port 8432 by default.

annotator.start()

annotator.properties contains all the right properties for

Stanford CoreNLP to use this annotator.

with corenlp.CoreNLPClient(properties=annotator.properties, annotators="happyfun ssplit pos".split()) as client: ann = client.annotate("RT @ #happyfuncoding: this is a typical Twitter tweet :-)")

tokens = [t.word for t in ann.sentence[0].token]
print(tokens)

See

test_annotator.py
for more examples.

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.