Transition-based statistical parser
This library is research code, and is in maintenance mode.
For my actively developed, commercially-focussed NLP library, see http://honnibal.github.io/spaCy/
Redshift is a natural-language syntactic dependency parser. The current release features fast and accurate parsing, but requires the text to be pre-processed. Future releases will integrate tokenisation and part-of-speech tagging, and have special features for parsing informal text.
If you don't know what a syntactic dependency is, read this: http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html
Here is an example of how the parser is called from Python, once you have a model trained:
    >>> import redshift.parser
    >>> from redshift.sentence import Input
    >>> parser = redshift.parser.Parser()
    >>> sentence = Input.from_untagged(['A', 'list', 'of', 'tokens', 'is', 'required', '.'])
    >>> parser.parse(sentence)
    >>> print sentence.to_conll()
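For readers new to dependency output, the sketch below shows one common way to represent a parse: each token stores the index of its head and a label for the relation. The parse is written by hand, and the labels, the column layout and the `to_conll_rows` and `roots` helpers are illustrative assumptions. They are not redshift's internal data structures, and the columns printed by `sentence.to_conll()` may differ.

```python
# Illustrative only: a hand-written parse, not output from redshift.
# Each token is (word, head_index, label); head 0 marks the root.
# The labels are assumptions chosen for the example.
TOKENS = [
    ("A", 2, "det"),
    ("list", 6, "nsubjpass"),
    ("of", 2, "prep"),
    ("tokens", 3, "pobj"),
    ("is", 6, "auxpass"),
    ("required", 0, "root"),
    (".", 6, "punct"),
]


def to_conll_rows(tokens):
    """Format tokens as tab-separated rows: id, word, head, label."""
    rows = []
    for i, (word, head, label) in enumerate(tokens, start=1):
        rows.append("\t".join([str(i), word, str(head), label]))
    return rows


def roots(tokens):
    """Return the words whose head index is 0 (the root of the tree)."""
    return [word for word, head, _ in tokens if head == 0]


if __name__ == "__main__":
    print("\n".join(to_conll_rows(TOKENS)))
```

Token ids are 1-based so that 0 is free to mark the root, which is the usual convention in CoNLL-style files.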
The command-line interfaces have a lot of probably-confusing options for my current research. The main scripts I use are scripts/train.py, scripts/parse.py, and scripts/evaluate.py. All three print usage information, and require the plac library.
From a Unix/OSX terminal, after compilation, and within the "redshift" directory:
    $ export PYTHONPATH=`pwd`
    $ ./scripts/train.py # Use -h or --help for more detailed info. Most of these are research flags.
    usage: train.py [-h] [-a static] [-i 15] [-k 1] [-f 10] [-r] [-d] [-u] [-n 0] [-s 0] train_loc model_loc
    train.py: error: too few arguments
    $ ./scripts/train.py -k 16
    $ ./scripts/parse.py
    $ ./scripts/evaluate.py output_dir/parses
In more detail:
The following commands will set up a virtualenv with Python 2.7.5, the parser, and its core dependencies from scratch::
    $ git clone https://github.com/syllog1sm/redshift.git
    $ cd redshift
    $ git checkout develop

    EITHER a)
    $ virtualenv .env
    OR b)
    $ ./make_virtualenv.sh # Downloads Python 2.7.5 and virtualenv

    $ source .env/bin/activate
    $ pip install distribute
    $ pip install cython
    $ pip install thinc
    $ pip install -r requirements.txt
    $ export PYTHONPATH=`pwd`:$PYTHONPATH # ...and set PYTHONPATH.
    $ fab make test
The make_virtualenv.sh script downloads and compiles Python 2.7.5, and uses it to create a virtualenv. This is one way to use a version of Python that isn't system-wide, or to control the compiler that Cython will use. You may not need to do this, or you may wish to do it manually --- it's up to you.
virtualenv is not a requirement, although it's useful. If a virtualenv is not active (i.e. if the $VIRTUAL_ENV environment variable is not set), you'll need to ensure that the setup.py file knows where to find the C headers that the murmurhash dependency installs.
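The check described above can be sketched roughly as follows. This is not the project's actual setup.py: the `find_include_dir` function and its `fallback` parameter are hypothetical, and the "include" subdirectory is an assumption about where the headers end up. What is certain is that virtualenv's activate script sets the `VIRTUAL_ENV` environment variable.

```python
import os


def find_include_dir(environ=None, fallback=None):
    """Guess where C headers were installed (hypothetical helper).

    If a virtualenv is active, its activate script has set VIRTUAL_ENV;
    we assume the headers live under its "include" directory.
    Otherwise the caller must supply a fallback path explicitly.
    """
    if environ is None:
        environ = os.environ
    venv = environ.get("VIRTUAL_ENV")
    if venv:
        return os.path.join(venv, "include")
    if fallback is None:
        raise RuntimeError(
            "No virtualenv is active; pass the header directory explicitly."
        )
    return fallback


if __name__ == "__main__":
    # The fallback path here is only an example value.
    print(find_include_dir(fallback="/usr/local/include"))
```

Failing loudly when neither a virtualenv nor a fallback is available makes a missing header path show up at configuration time rather than as a compiler error later.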
Installation requires a recent version of pip, which is provided by the version of virtualenv that the make_virtualenv.sh script downloads. If you don't use the make_virtualenv.sh script, ensure you're using a recent version of pip.
redshift is written almost entirely in Cython, a superset of the Python language that additionally supports calling C/C++ functions and declaring C/C++ types on variables and class attributes. This allows the compiler to generate very efficient C/C++ code from Cython code. Many popular Python packages, such as numpy, scipy and lxml, rely heavily on Cython code.
A Cython source file such as redshift/parser.pyx is compiled into redshift/parser.cpp and redshift/parser.so by the project's setup.py file. The module can then be imported by standard Python code, although only the functions visible to Python (declared with "def" or "cpdef", rather than "cdef") will be accessible.
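One way to see this from the Python side is to list what a compiled module exposes. The sketch below uses the standard library's `math` module as a stand-in, since in CPython it is also implemented in C; `public_callables` is a hypothetical helper, not part of redshift. Run against a compiled Cython module, functions declared with "cdef" would not appear in the list, while "def" and "cpdef" functions would.

```python
import math  # stand-in for a compiled module such as redshift.parser


def public_callables(module):
    """List the callable, non-underscore names a module exposes to Python."""
    names = []
    for name in dir(module):
        if name.startswith("_"):
            continue
        if callable(getattr(module, name)):
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    print(public_callables(math))
```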
Against Cython's recommendation, the parser currently requires Cython at install time instead of distributing the "compiled" .cpp files as part of the release. This could change in future, but currently it feels strange to have a "source" release that users wouldn't be able to modify.
This software is available for non-commercial use only. You may download, run and modify the code for research purposes, personal interest, education, teaching, etc. My commercial NLP suite is spaCy: http://spacy.io .
Copyright (C) 2014 Matthew Honnibal