by columbia-applied-data-science

columbia-applied-data-science / rosetta

Tools, wrappers, etc... for data science with a concentration on text processing

199 Stars 45 Forks Last release: about 5 years ago (v0.3.0) Other 273 Commits 8 Releases



Tools for data science with a focus on text processing.

  • Focuses on "medium data", i.e. data too big to fit into memory but too small to necessitate the use of a cluster.
  • Integrates with the existing scientific Python stack as well as select outside tools.


  • See the
  • The docs contain plots of example output.



  • Unix-like command line utilities. Filters (read from stdin/write to stdout) for files.
  • Focus on stream processing and csv files.
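As an illustration of the filter style described above (not one of rosetta's own utilities), here is a minimal sketch of a column-selecting CSV filter that reads stdin and writes stdout one row at a time. The function name `select_columns` and the command-line layout are assumptions made for this example.

```python
import csv
import sys


def select_columns(rows, keep):
    """Yield rows restricted to the columns named in `keep`.

    `rows` is any iterable of lists whose first item is the header row.
    Only one row is held in memory at a time.
    """
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return  # empty input: emit nothing
    # header.index raises ValueError if a requested column is missing
    indices = [header.index(name) for name in keep]
    yield list(keep)
    for row in rows:
        yield [row[i] for i in indices]


# Hypothetical usage: python select_columns.py name age < in.csv > out.csv
# The filter only runs when at least one column name is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    writer = csv.writer(sys.stdout, lineterminator="\n")
    writer.writerows(select_columns(csv.reader(sys.stdin), sys.argv[1:]))
```

Because the function is a generator over rows, it composes with other filters in a shell pipeline and never loads the whole file.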


  • Wrappers for Python multiprocessing that add ease of use
  • Memory-friendly multiprocessing
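One way to keep multiprocessing memory-friendly is to feed the worker pool in bounded batches so that neither the full input nor the full output is ever materialized. The sketch below shows that idea using only the standard library. It is not rosetta's wrapper, and the names `map_in_batches` and `square` are hypothetical.

```python
import itertools
import multiprocessing


def square(x):
    """Toy worker function; defined at module level so it can be pickled."""
    return x * x


def map_in_batches(func, iterable, map_func=map, batch_size=1000):
    """Lazily apply `func` to `iterable`, at most `batch_size` items at a time.

    `map_func` is any callable with the signature of the builtin `map`,
    for example the `map` method of a `multiprocessing.Pool`.
    Results are yielded in input order.
    """
    iterator = iter(iterable)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        for result in map_func(func, batch):
            yield result


if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        # Only 500 inputs and 500 results are held in memory at once.
        total = sum(map_in_batches(square, range(10_000), pool.map, batch_size=500))
    print(total)
```

The trade-off is that workers idle briefly at the end of each batch, so larger batches give better throughput at the cost of more memory.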


  • Stream text from disk to formats used in common ML processes
  • Write processed text to sparse formats
  • Helpers for ML tools (e.g. Vowpal Wabbit, Gensim)
  • Other general utilities
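To make "stream text to a sparse format" concrete, here is a minimal sketch that tokenizes documents one at a time and emits one `token:count` line per document. The line layout is an assumption for this example, loosely modeled on the sparse text inputs that tools such as Vowpal Wabbit accept. It is not rosetta's own file format, and the function names are hypothetical.

```python
import re
from collections import Counter

# Lowercase alphanumeric tokens only, so tokens never contain ':' or '|'
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def to_sparse_line(doc_id, text):
    """Return 'doc_id | token:count ...' with tokens in sorted order."""
    counts = Counter(tokenize(text))
    features = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    return f"{doc_id} | {features}"


def stream_sparse_lines(docs):
    """Lazily convert an iterable of (doc_id, text) pairs to sparse lines."""
    for doc_id, text in docs:
        yield to_sparse_line(doc_id, text)


if __name__ == "__main__":
    docs = [("doc1", "The cat saw the other cat."), ("doc2", "Hello, world!")]
    for line in stream_sparse_lines(docs):
        print(line)
```

Only non-zero counts are written, which is what keeps the output small for large vocabularies.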


  • High-level wrappers that have helped with our workflow and provide additional examples of code use


  • General ML modeling utilities


Check out the master branch from the rosetta repo. Then (so long as you have pip):

cd rosetta
make test

If you update the source, you can do

make reinstall
make test

The above make targets use pip, so you can of course run pip uninstall at any time.

Getting the source (above) is the preferred method since the code changes often, but if you don't use Git you can download a tagged release (tarball) here. Then

pip install rosetta-X.X.X.tar.gz



You can get the latest sources with

git clone git://


Feel free to contribute a bug report or a request by opening an issue.

The preferred method to contribute is to fork and send a pull request. Before doing this, read


  • Major dependencies on Pandas and numpy.
  • Minor dependencies on Gensim and statsmodels.
  • Some examples need scikit-learn.
  • Minor dependencies on docx
  • Minor dependencies on the unix utilities pdftotext and catdoc
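A quick way to see which of these dependencies are present is a small check script. The sketch below is not part of rosetta. The import names (for example `sklearn` for scikit-learn) are assumptions about how the listed packages are imported, and `check_dependencies` is a hypothetical helper.

```python
import importlib.util
import shutil

# Assumed import names for the Python packages listed above
PYTHON_MODULES = ["pandas", "numpy", "gensim", "statsmodels", "sklearn", "docx"]
# Unix utilities expected on the PATH
UNIX_TOOLS = ["pdftotext", "catdoc"]


def check_dependencies(modules=PYTHON_MODULES, tools=UNIX_TOOLS):
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for name in modules:
        # find_spec locates a top-level module without importing it
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```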


From the base repo directory, you can run all tests with

make test


Documentation for releases is hosted at PyPI. This does NOT auto-update.


Rosetta refers to the Rosetta Stone, the ancient Egyptian tablet discovered just over 200 years ago. The tablet contained fragmented text in three different scripts (hieroglyphic, Demotic, and Ancient Greek), and the uncovering of its meaning is considered an essential key to our understanding of Ancient Egyptian civilization. We would like this project to give individuals the tools they need to process and unearth insight in today's ever-growing volumes of textual data.
