yomu

by yomurb

yomurb /yomu

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)

458 Stars 101 Forks Last release: Not found MIT License 98 Commits 16 Releases

Available items

No Items, yet!

The developer of this repository has not created any items for sale yet. Need a bug fixed? Help with integration? A different license? Create a request here:

Travis Build Status Code Climate Score Gem Version

Yomu 読む

Gitter

Yomu is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit.

Here are some of the formats supported:

  • Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
  • OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
  • Apple iWorks Formats
  • Rich Text Format (.rtf)
  • Portable Document Format (.pdf)

For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

Usage

Text, metadata and MIME type information can be extracted by calling

Yomu.read
directly:
require 'yomu'

data = File.read 'sample.pages' text = Yomu.read :text, data metadata = Yomu.read :metadata, data mimetype = Yomu.read :mimetype, data

Reading text from a given filename

Create a new instance of Yomu and pass a filename.

yomu = Yomu.new 'sample.pages'
text = yomu.text

Reading text from a given URL

This is useful for reading remote files, like documents hosted on Amazon S3.

yomu = Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
text = yomu.text

Reading text from a stream

Yomu can also read from a stream or any object that responds to

read
, including file uploads from Ruby on Rails or Sinatra.
post '/:name/:filename' do
  yomu = Yomu.new params[:data][:tempfile]
  yomu.text
end

Reading metadata

Metadata is returned as a hash.

yomu = Yomu.new 'sample.pages'
yomu.metadata['Content-Type'] #=> "application/vnd.apple.pages"

Reading MIME types

MIME type is returned as a MIME::Type object.

yomu = Yomu.new 'sample.docx'
yomu.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
yomu.mimetype.extensions #=> ['docx']

Installation and Dependencies

Java Runtime

Yomu packages the Apache Tika application jar and requires a working JRE for it to work.

Gem

Add this line to your application's Gemfile:

gem 'yomu'

And then execute:

$ bundle

Or install it yourself as:

$ gem install yomu

Contributing

  1. Fork it
  2. Create your feature branch (
    git checkout -b my-new-feature
    )
  3. Create tests and make them pass (
    rake test
    )
  4. Commit your changes (
    git commit -am 'Added some feature'
    )
  5. Push to the branch (
    git push origin my-new-feature
    )
  6. Create a new Pull Request

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.