An implementation of RESTful web service for tesseract-OCR using tornado
A RESTful web service for tesseract-OCR. The HTTP server is implemented with tornado. A Docker container is also provided so that you can run the service without any installation effort.
As of version 3.02.02, tesseract-ocr provides a C API. When the "Fetch Image From URL" API is called, all operations are now performed in memory for better performance, with no file I/O required. The Python wrapper for the C API, implemented with ctypes, can be found in tesseractcapi.py. Bulk processing is planned for a future version.
A full list of the C APIs supported in tesseract-ocr version 3.02.02, with detailed signatures and comments, is available here.
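To illustrate what in-memory OCR through the C API can look like, here is a minimal ctypes sketch. It is not the code in tesseractcapi.py: the function name ocr_raw_pixels and its parameters are hypothetical, it is written for Python 3, and it assumes the caller already holds decoded raw pixel data (for example, 8-bit grayscale rows with no padding). The C functions it calls are the ones declared in the tesseract C API header.

```python
import ctypes


def ocr_raw_pixels(lib_file, tessdata_prefix, lang, pixels,
                   width, height, bytes_per_pixel):
    """Run OCR on a raw pixel buffer held in memory (hypothetical helper).

    lib_file is the full path of the libtesseract shared library;
    tessdata_prefix is the parent folder of the tessdata folder.
    """
    lib = ctypes.CDLL(lib_file)  # raises OSError if the library cannot be loaded

    # Declare signatures so that pointers are not truncated on 64-bit systems.
    lib.TessBaseAPICreate.restype = ctypes.c_void_p
    lib.TessBaseAPIInit3.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                     ctypes.c_char_p]
    lib.TessBaseAPIInit3.restype = ctypes.c_int
    lib.TessBaseAPISetImage.argtypes = [ctypes.c_void_p, ctypes.c_char_p,
                                        ctypes.c_int, ctypes.c_int,
                                        ctypes.c_int, ctypes.c_int]
    lib.TessBaseAPISetImage.restype = None
    lib.TessBaseAPIGetUTF8Text.argtypes = [ctypes.c_void_p]
    lib.TessBaseAPIGetUTF8Text.restype = ctypes.c_void_p
    lib.TessDeleteText.argtypes = [ctypes.c_void_p]
    lib.TessDeleteText.restype = None
    lib.TessBaseAPIEnd.argtypes = [ctypes.c_void_p]
    lib.TessBaseAPIEnd.restype = None
    lib.TessBaseAPIDelete.argtypes = [ctypes.c_void_p]
    lib.TessBaseAPIDelete.restype = None

    api = lib.TessBaseAPICreate()
    try:
        if lib.TessBaseAPIInit3(api, tessdata_prefix.encode("utf-8"),
                                lang.encode("utf-8")) != 0:
            raise RuntimeError("could not initialize tesseract")
        # Assumption: rows are tightly packed, so one row is width * bytes_per_pixel.
        lib.TessBaseAPISetImage(api, pixels, width, height,
                                bytes_per_pixel, width * bytes_per_pixel)
        text_ptr = lib.TessBaseAPIGetUTF8Text(api)
        if not text_ptr:
            return ""
        try:
            return ctypes.string_at(text_ptr).decode("utf-8")
        finally:
            lib.TessDeleteText(text_ptr)  # the text buffer is owned by tesseract
    finally:
        lib.TessBaseAPIEnd(api)
        lib.TessBaseAPIDelete(api)
```

Because the pixel buffer is passed straight to TessBaseAPISetImage, nothing has to be written to disk between downloading the image and reading the recognized text.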
Upload Image File: /upload
Fetch Image From URL: /fetchurl
Python Requirement
version >= 2.7
Install tornado, the PIL imaging library, and the other required packages with apt-get.
sudo apt-get update && sudo apt-get install -y \
  autoconf \
  automake \
  autotools-dev \
  build-essential \
  checkinstall \
  libjpeg-dev \
  libpng-dev \
  libtiff-dev \
  libtool \
  python \
  python-imaging \
  python-tornado \
  wget \
  zlib1g-dev
To get C API support, you need to compile and install leptonica and the latest version (3.02.02) of tesseract-ocr manually. More details can be found at this wiki. Here is an example on Ubuntu 12.04 LTS:
mkdir ~/temp \
  && cd ~/temp/ \
  && wget http://www.leptonica.org/source/leptonica-1.69.tar.gz \
  && tar -zxvf leptonica-1.69.tar.gz \
  && cd leptonica-1.69 \
  && ./configure \
  && make \
  && checkinstall \
  && ldconfig

cd ~/temp/ \
  && wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.02.tar.gz \
  && tar xvf tesseract-ocr-3.02.02.tar.gz \
  && cd tesseract-ocr \
  && ./autogen.sh \
  && mkdir ~/local \
  && ./configure --prefix=$HOME/local/ \
  && make \
  && make install
Only English letters and digits are supported by default. You can download more language packs, such as the Simplified/Traditional Chinese packs, from http://code.google.com/p/tesseract-ocr/downloads/list. Decompress the packs and put them under '~/local/share/' or another location of your choice.
mkdir -p ~/local/share
cd ~/local/share \
  && wget https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.eng.tar.gz \
  && tar xvf tesseract-ocr-3.02.eng.tar.gz

ls ~/local/share/tesseract-ocr/tessdata
configs eng.cube.params eng.traineddata.tmp eng.cube.bigrams eng.cube.size equ.traineddata eng.cube.fold eng.cube.word-freq osd.traineddata eng.cube.lm eng.tesseract_cube.nn tessconfigs eng.cube.nn eng.traineddata
Be sure to set the TESSDATA_PREFIX environment variable to the parent folder of the language packs, for instance:
export TESSDATA_PREFIX=/home/markpeng/local/share/tesseract-ocr/
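A quick way to confirm that the prefix is set correctly is to check that the expected language files exist beneath it. The sketch below assumes the layout shown in the listing above (a tessdata folder directly under the prefix, containing files named LANG.traineddata); the helper names are hypothetical and not part of this project.

```python
import os


def traineddata_path(tessdata_prefix, lang="eng"):
    """Return the path where the language pack for `lang` is expected."""
    return os.path.join(tessdata_prefix, "tessdata", lang + ".traineddata")


def missing_languages(tessdata_prefix, langs):
    """Return the languages whose .traineddata file is not found."""
    return [lang for lang in langs
            if not os.path.isfile(traineddata_path(tessdata_prefix, lang))]


if __name__ == "__main__":
    # Falls back to the current directory if TESSDATA_PREFIX is not set.
    prefix = os.environ.get("TESSDATA_PREFIX", ".")
    missing = missing_languages(prefix, ["eng"])
    if missing:
        print("Missing language packs under %s: %s" % (prefix, ", ".join(missing)))
    else:
        print("All requested language packs found under %s" % prefix)
```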
Create a folder named 'static' under the current folder (for instance, '/opt/ocr') to keep temporary files:
sudo mkdir /opt/ocr/static -p
Then copy all .py files to /opt/ocr and make them executable.
sudo cp ~/Share/tesseract-web-service/* /opt/ocr
sudo chmod 755 /opt/ocr/*.py
Note: the service must be started from the folder that contains the static folder.
cd /opt/ocr
Now, start tesseract-web-service by:
python tesseractserver.py -b "/home/markpeng/local/lib" -d "/home/markpeng/local/share/tesseract-ocr"
Type the following command to check the options.
python tesseractserver.py -h

Usage: tesseractserver.py [options]

Options:
  -h, --help            show this help message and exit
  -p PORT, --port=PORT  the listening port of RESTful tesseract web service. (default: 1688)
  -l LANG, --lang=LANG  the targe language. (default: eng)
  -b LIBPATH, --lib-path=LIBPATH
                        the absolute path of tesseract library.
  -d TESSDATA, --tessdata-folder=TESSDATA
                        the absolute path of tessdata folder containing language packs.
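For readers who want to see how options like these are typically declared, here is a small sketch using the standard optparse module. It only mirrors the flags and defaults listed in the help text; it is not the parser in tesseractserver.py, and details such as the destination names and the integer type for the port are assumptions.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = OptionParser(prog="tesseractserver.py")
    parser.add_option("-p", "--port", dest="port", type="int", default=1688,
                      help="the listening port of the web service (default: 1688)")
    parser.add_option("-l", "--lang", dest="lang", default="eng",
                      help="the target language (default: eng)")
    parser.add_option("-b", "--lib-path", dest="libpath",
                      help="the absolute path of the tesseract library")
    parser.add_option("-d", "--tessdata-folder", dest="tessdata",
                      help="the absolute path of the tessdata folder")
    return parser


if __name__ == "__main__":
    options, _ = build_parser().parse_args()
    print("port=%s lang=%s libpath=%s tessdata=%s"
          % (options.port, options.lang, options.libpath, options.tessdata))
```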
The default listening port is 1688; you can choose a different one at startup. Make sure that the firewall allows connections to the listening port.
For example, you can change the port to 8080 by:
python /opt/ocr/tesseractserver.py -p 8080 -b "/home/markpeng/local/lib" -d "/home/markpeng/local/share/tesseract-ocr"
To keep the service running after you log out of the terminal:
sudo nohup python /opt/ocr/tesseractserver.py -p 8080 -b "/home/markpeng/local/lib" -d "/home/markpeng/local/share/tesseract-ocr" &
Python Requirement
version >= 2.7
Install tornado, the PIL imaging library, and the other required packages with apt-get.
sudo apt-get update && sudo apt-get install -y \
  autoconf \
  automake \
  autotools-dev \
  build-essential \
  checkinstall \
  libjpeg-dev \
  libpng-dev \
  libtiff-dev \
  libtool \
  python \
  python-imaging \
  python-tornado \
  wget \
  zlib1g-dev
Install the tesseract library:
sudo apt-get install tesseract-ocr-dev
Correct the library filename (or use this repository):
class TesseactWrapper:
    def __init__(self, lang, libpath, tessdata):
        libname = libpath + "/libtesseract.so.3.0.3"
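The filename has to match the library version that is actually installed, which is why it needs correcting by hand. As an alternative, the name could be looked up at runtime. The sketch below is only an illustration of that idea: find_tesseract_library is a hypothetical helper, not part of this project, and it assumes the shared library follows the usual libtesseract.so* naming.

```python
import glob
import os


def find_tesseract_library(libpath):
    """Return the first file in libpath whose name starts with libtesseract.so."""
    candidates = sorted(glob.glob(os.path.join(libpath, "libtesseract.so*")))
    if not candidates:
        raise FileNotFoundError("no libtesseract.so* found in %s" % libpath)
    # Versioned names usually point at the same library, so any match will do.
    return candidates[0]


if __name__ == "__main__":
    try:
        print(find_tesseract_library("/usr/lib"))
    except FileNotFoundError as err:
        print(err)
```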
Check for the English training data (eng.traineddata in /usr/share/tesseract-ocr/tessdata/). If it does not exist, run:
wget https://tesseract-ocr.googlecode.com/files/eng.traineddata.gz
gunzip eng.traineddata.gz
sudo mv -v eng.traineddata /usr/share/tesseract-ocr/tessdata/
Create a static folder in the repository's main directory:
mkdir static
Now, start tesseract-web-service by:
python tesseractserver.py -p 1688 -b /usr/lib -d /usr/share/tesseract-ocr/tessdata/
The web service provides two HTTP GET pages for testing the API:
Upload Image File: http://localhost:1688/upload
Fetch Image From URL: http://localhost:1688/fetchurl
The results are returned in JSON format with OCR result strings.
If you would like to call the "Fetch Image From URL" API with POST, send an HTTP request similar to the following:
POST /fetchurl HTTP/1.1
Host: localhost:1688
Connection: keep-alive
Content-Length: 214
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Origin: http://localhost:1688
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.114 Safari/537.36
Content-Type: multipart/form-data; boundary=----WebKitFormBoundarylFMK6PAyVCzNCDAr
Referer: http://localhost:1688/fetchurl
Accept-Encoding: gzip,deflate,sdch
Accept-Language: zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4

POST data payload:
imageUrl = 'http://xxxxxxx'
If you send the POST data as JSON, you need to provide a 'url' key containing the target image URL.
Example POST data in JSON:
data: { 'url': 'http://price1.suning.cn/webapp/wcs/stores/prdprice/89218_9173_10000_9-1.png' }
You should then get a JSON response similar to the following:
data: { 'url': 'http://price1.suning.cn/webapp/wcs/stores/prdprice/89218_9173_10000_9-1.png', 'result': '2158.00' }
Note that for the /upload API, since no URL is provided, only the result string is returned in the JSON.
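The JSON exchange for /fetchurl can be sketched with the Python 3 standard library as follows. This is an illustration rather than the code in tesseractclient.py: the helper names are hypothetical, the Content-Type header is an assumption about what the server expects, and because the examples above show the payload under a "data" label, the parser accepts a 'result' key either at the top level or nested under 'data'.

```python
import json
import urllib.request


def build_fetchurl_request(api_url, image_url):
    """Build a POST request whose JSON body carries the image URL in 'url'."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def extract_result(response_text):
    """Return the OCR result string from a JSON response, or None if absent."""
    payload = json.loads(response_text)
    if isinstance(payload.get("data"), dict):
        payload = payload["data"]
    return payload.get("result")


if __name__ == "__main__":
    request = build_fetchurl_request(
        "http://localhost:1688/fetchurl",
        "http://www.greatdreams.com/666-magicsquare.gif",
    )
    # Sending the request requires a running service.
    with urllib.request.urlopen(request, timeout=30) as response:
        print(extract_result(response.read().decode("utf-8")))
```

Keeping request construction separate from sending makes it possible to inspect the body and headers without a running server.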
tesseractclient.py is a client for calling the "Fetch Image From URL" API.
Type the following command to check the options.
python /opt/ocr/tesseractclient.py --help

Usage: tesseractclient.py [options]

Options:
  -h, --help            show this help message and exit
  -a APIURL, --api-url=APIURL
                        the URL of RESTful tesseract web service
  -i IMAGEURL, --image-url=IMAGEURL
                        the URL of image to do OCR
For instance:
python /opt/ocr/tesseractclient.py -a "http://localhost:1688/fetchurl" -i "http://www.greatdreams.com/666-magicsquare.gif"
Both the API URL and the image source URL must be provided.
Install Docker on your host by following the official installation guide.
After that, run the following command to download the Docker image (built on Ubuntu 12.04 LTS):
docker pull guitarmind/tesseract-web-service
To run the web service in a container, type:
docker run --rm -d -p 1688:1688 guitarmind/tesseract-web-service
Note that the -p flag binds a local port to the container's port. By default the container port is set to 1688; you can change it by modifying the Dockerfile. The -d flag runs the container in daemon (detached) mode.
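Once the container is running, you may want to confirm that the bound port actually accepts connections, which also helps to spot a firewall problem. The following sketch does a plain TCP connection test with the standard socket module; port_is_open is a hypothetical helper, and a successful connection only shows that something is listening, not that OCR works.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 1688):
        print("Port 1688 is accepting connections")
    else:
        print("Port 1688 is not reachable")
```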
The container has been created as an Automated Build:
https://registry.hub.docker.com/u/guitarmind/tesseract-web-service/
Author: Mark Peng (markpeng.ntu at gmail)
All code is released under the Apache 2.0 license.