
About the developer

cseas
187 Stars 58 Forks MIT License 15 Commits 3 Opened issues

Description

Extract tables from scanned image PDFs using Optical Character Recognition.


ocr-table

This project aims to extract tables from scanned image PDFs using Optical Character Recognition.

Install Requirements

  1. Tesseract OCR

    sudo apt-get install tesseract-ocr
    
  2. ImageMagick

    sudo apt-get install imagemagick
    
  3. PDF Utilities

    sudo apt-get install poppler-utils
    
  4. Python packages

    sudo python3 -m pip install -r requirements.txt
    
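Before running the scripts, it can help to confirm that the external tools are actually on your `PATH`. The following is a minimal sketch, not part of this repository. The executable names (`tesseract`, `convert`, `pdftoppm`) are assumptions based on what the three packages above normally install; the scripts here may call different binaries.

```python
import shutil

# Assumed executable names: `tesseract` (tesseract-ocr), `convert`
# (ImageMagick 6; ImageMagick 7 ships `magick`), `pdftoppm` (poppler-utils).
REQUIRED_TOOLS = ("tesseract", "convert", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If a tool is reported missing, rerun the matching `apt-get install` command above.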

Usage

  1. Clear the pdf/ folder, then copy all the PDF files you want to scan into it.

  2. Run the OCR:

    python3 shellocr.py
    
  3. Once the process completes, the extracted text files are available in the txt/ folder.
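
The three steps above can be scripted. This is a rough sketch rather than code from this repository: `stage_pdfs` and `run_ocr` are hypothetical helper names, and it assumes the script is started from the repository root so that `pdf/`, `txt/` and `shellocr.py` resolve correctly.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def stage_pdfs(sources, pdf_dir="pdf"):
    """Step 1: empty `pdf_dir`, then copy each file in `sources` into it."""
    target = Path(pdf_dir)
    target.mkdir(exist_ok=True)
    for old in target.iterdir():
        if old.is_file():
            old.unlink()  # removes files only; subfolders are left alone
    return [Path(shutil.copy(src, target)) for src in sources]


def run_ocr(script="shellocr.py"):
    """Step 2: run the OCR script with the current Python interpreter."""
    subprocess.run([sys.executable, script], check=True)


# Only acts when PDF paths are passed on the command line.
if __name__ == "__main__" and sys.argv[1:]:
    staged = stage_pdfs(sys.argv[1:])
    print(f"Staged {len(staged)} file(s) in pdf/")
    run_ocr()
    # Step 3: list whatever the run left in txt/ (file naming is not assumed).
    for result in sorted(Path("txt").glob("*")):
        print(result)
```

Saved under a hypothetical name such as `run_batch.py`, it could be invoked as `python3 run_batch.py scan1.pdf scan2.pdf`. Note that it deletes every file already in `pdf/`.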

Alternate

  1. If the steps above don't work for you, try this alternate method.

  2. Save your file as input.pdf in the root directory.

  3. Run

    python3 pdf_miner.py
    
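The alternate method can be wrapped the same way. Again, this is only a sketch under the assumption that it runs from the repository root; `stage_input` is a hypothetical helper, and where `pdf_miner.py` writes its output is not covered here.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def stage_input(source, root="."):
    """Copy `source` to `input.pdf` inside `root` and return the new path."""
    destination = Path(root) / "input.pdf"
    # Skip the copy when the source already is input.pdf.
    if Path(source).resolve() != destination.resolve():
        shutil.copyfile(source, destination)
    return destination


# Only acts when a PDF path is passed on the command line.
if __name__ == "__main__" and sys.argv[1:]:
    stage_input(sys.argv[1])
    subprocess.run([sys.executable, "pdf_miner.py"], check=True)
```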
