Document Layout Analysis repository for development with PdfPig.
From Wikipedia: Document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis.
In this repository, we do not consider scanned documents, only classic PDF documents, and we leverage all available information (e.g. letter bounding boxes, fonts).
Related projects
Resources
Text extraction
Word segmentation
Page segmentation
Recursive XY Cut
The XY cut segmentation algorithm, also referred to as the recursive XY cuts (RXYC) algorithm, is a tree-based top-down algorithm.
The root of the tree represents the entire document page. All the leaf nodes together represent the final segmentation. The RXYC algorithm recursively splits the document into two or more smaller rectangular blocks which represent the nodes of the tree. At each step of the recursion, the horizontal and vertical projection profiles of each node are computed. Then, the valleys along the horizontal and vertical directions, VX and VY, are compared to corresponding predefined thresholds TX and TY. If the valley is larger than the threshold, the node is split at the midpoint of the wider of VX and VY into two children nodes. The process continues until no leaf node can be split further. Then, noise regions are removed using noise removal thresholds TnX and TnY.
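The recursion described above can be sketched on bounding boxes rather than on pixel projection profiles. This is a minimal illustration, not PdfPig's implementation: the names `widest_gap`, `xy_cut`, `t_x` and `t_y` are hypothetical, boxes are assumed to be `(left, bottom, right, top)` tuples, and the final noise-removal step (TnX, TnY) is left out.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top), assumed layout


def widest_gap(intervals: List[Tuple[float, float]]) -> Optional[Tuple[float, float]]:
    """Return (width, midpoint) of the widest empty valley between the
    projected intervals, or None when the projection has no valley."""
    ordered = sorted(intervals)
    best = None
    reach = ordered[0][1]  # right-most coordinate covered so far
    for start, end in ordered[1:]:
        if start > reach:  # an empty stretch of the projection profile
            width = start - reach
            if best is None or width > best[0]:
                best = (width, (reach + start) / 2.0)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], t_x: float, t_y: float) -> List[List[Box]]:
    """Recursively split the boxes; each returned list is one leaf block."""
    if len(boxes) < 2:
        return [boxes]
    gap_x = widest_gap([(b[0], b[2]) for b in boxes])  # valley along x
    gap_y = widest_gap([(b[1], b[3]) for b in boxes])  # valley along y
    cut_x = gap_x is not None and gap_x[0] > t_x
    cut_y = gap_y is not None and gap_y[0] > t_y
    if not cut_x and not cut_y:
        return [boxes]  # no valley exceeds its threshold: leaf node
    # Split at the midpoint of the wider admissible valley.
    if cut_x and (not cut_y or gap_x[0] >= gap_y[0]):
        axis, mid = 0, gap_x[1]
    else:
        axis, mid = 1, gap_y[1]
    first = [b for b in boxes if b[axis + 2] <= mid]
    second = [b for b in boxes if b[axis] >= mid]
    return xy_cut(first, t_x, t_y) + xy_cut(second, t_x, t_y)


# Two columns of two lines each.
page = [(0, 0, 10, 5), (0, 10, 10, 15), (30, 0, 40, 5), (30, 10, 40, 15)]
blocks = xy_cut(page, t_x=5, t_y=6)
```

With `t_y=6` the 5-unit gap between lines is not cut, so the four boxes come back as two column blocks; lowering `t_y` to 2 splits every line into its own leaf.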
source
 Recursive XY Cut using Bounding Boxes of Connected Components  Jaekyu Ha, Robert M. Haralick and Ihsin T. Phillips
Docstrum
The Docstrum algorithm by O'Gorman is a bottom-up approach based on nearest-neighborhood clustering of connected components extracted from the document image. After noise removal, the connected components are separated into two groups, one with dominant characters and another with characters in titles and section headings, using a character size ratio factor fd. Then, the K nearest neighbors are found for each connected component. Next, textlines are found by computing the transitive closure on within-line nearest-neighbor pairings using a threshold ft. Finally, textlines are merged to form text blocks using a parallel distance threshold fpa and a perpendicular distance threshold fpe.
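The textline step (K nearest neighbors, then transitive closure over within-line pairs) can be illustrated with a brute-force neighbor search and a union-find structure. This is a simplified sketch under several assumptions: components are reduced to their centroids, text is assumed horizontal (the original method estimates the dominant orientation from the neighbor-angle histogram), and `max_dist` and `max_angle_deg` are hypothetical absolute thresholds standing in for ft.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def k_nearest(points: List[Point], k: int) -> Dict[int, List[int]]:
    """Brute-force K nearest neighbours (as indices) of every point."""
    result = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        result[i] = others[:k]
    return result


def text_lines(points: List[Point], k: int,
               max_dist: float, max_angle_deg: float) -> List[List[int]]:
    """Group point indices into lines via transitive closure (union-find)
    over neighbour pairs that are close and roughly horizontal."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, neighbours in k_nearest(points, k).items():
        for j in neighbours:
            dx = points[j][0] - points[i][0]
            dy = points[j][1] - points[i][1]
            angle = abs(math.degrees(math.atan2(dy, dx)))
            angle = min(angle, 180.0 - angle)  # fold into [0, 90] degrees
            if math.hypot(dx, dy) <= max_dist and angle <= max_angle_deg:
                parent[find(i)] = find(j)  # merge the two groups

    groups: Dict[int, List[int]] = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Centroids of three glyphs on one line and two on the line above.
centroids = [(0, 0), (10, 0), (20, 0), (0, 20), (10, 20)]
lines = text_lines(centroids, k=2, max_dist=12, max_angle_deg=30)
```

Here the first three centroids end up in one group and the last two in another. Merging textlines into blocks with fpa and fpe would follow the same union-find pattern, using line-to-line distances instead.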
source
 The Document Spectrum for Page Layout Analysis  Lawrence O'Gorman
 Document Structure and Layout Analysis  Anoop M. Namboodiri and Anil K. Jain
 Document Layout Analysis  Garrett Hoch
Voronoi
The Voronoi-diagram based segmentation algorithm by Kise et al. is also a bottom-up algorithm. In the first step, it extracts sample points from the boundaries of the connected components using a sampling rate sr. Then, noise removal is done using a maximum noise zone size threshold nm, in addition to width, height, and aspect ratio thresholds. After that the Voronoi diagram is generated using sample points obtained from the borders of the connected components. Superfluous Voronoi edges are deleted using a criterion involving the area ratio threshold ta, and the inter-line spacing margin control factor fr. Since we evaluate all algorithms on document pages with Manhattan layouts, a modified version of the algorithm is used to generate rectangular zones.
source
 Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum features  Mudit Agrawal and David Doermann
Constrained textline detection
The layout analysis approach by Breuel finds textlines as a two-step process:
1. Find tall whitespace rectangles and evaluate them as candidates for gutters, column separators, etc. The algorithm for finding maximal empty whitespace is described in Breuel. The whitespace rectangles are returned in order of decreasing quality and are allowed a maximum overlap of Om.
2. The whitespace rectangles representing the columns are used as obstacles in a robust least-squares, globally optimal textline detection algorithm. Then, the bounding box of all the characters making up the textline is computed.
The method was merely intended by its author as a demonstration of the application of two geometric algorithms, and not as a complete layout analysis system; nevertheless, we included it in the comparison because it has already proven useful in some applications. It is also nearly parameter free and resolution independent.
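The whitespace search in step 1 is commonly presented as a branch-and-bound over sub-rectangles of the page. The sketch below follows that idea under simplifying assumptions: quality is plain area (the original favors tall rectangles as column separators), only the single best rectangle is returned, and the overlap limit Om is not modeled. The names `largest_whitespace`, `area` and `overlaps` are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def area(r: Rect) -> float:
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the interiors of a and b intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def largest_whitespace(bound: Rect, obstacles: List[Rect]) -> Optional[Rect]:
    """Branch-and-bound search for the largest obstacle-free rectangle."""
    # Max-heap on area (negated); area is an upper bound on any empty
    # rectangle contained in the candidate.
    heap = [(-area(bound), bound, [o for o in obstacles if overlaps(o, bound)])]
    while heap:
        _, r, obs = heapq.heappop(heap)
        if not obs:
            return r  # first empty candidate popped is the largest
        # Pivot: the obstacle whose centre is closest to the candidate's centre.
        cx, cy = (r[0] + r[2]) / 2.0, (r[1] + r[3]) / 2.0
        pivot = min(obs, key=lambda o: ((o[0] + o[2]) / 2.0 - cx) ** 2
                                       + ((o[1] + o[3]) / 2.0 - cy) ** 2)
        subs = [
            (r[0], r[1], pivot[0], r[3]),  # left of the pivot
            (pivot[2], r[1], r[2], r[3]),  # right of the pivot
            (r[0], r[1], r[2], pivot[1]),  # below the pivot
            (r[0], pivot[3], r[2], r[3]),  # above the pivot
        ]
        for s in subs:
            if area(s) > 0:
                inside = [o for o in obs if overlaps(o, s)]
                heapq.heappush(heap, (-area(s), s, inside))
    return None


# A page with two text columns; the gutter is the largest empty rectangle.
page = (0, 0, 100, 100)
columns = [(5, 5, 40, 95), (60, 5, 95, 95)]
gutter = largest_whitespace(page, columns)
```

For this page the search returns the gutter between the two columns, `(40, 0, 60, 100)`. Returning several rectangles in decreasing quality would mean continuing the loop after the first hit and adding each found rectangle to the obstacle list, subject to the overlap limit.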
source
 Two Geometric Algorithms for Layout Analysis  Thomas M. Breuel
 High precision text extraction from PDF documents  Øyvind Raddum Berg
 High Performance Document Layout Analysis  Thomas M. Breuel
PDF/A standard
PDF/A-1a compliant documents make the following information available:
1. Language specification
2. Hierarchical document structure
3. Tagged text spans and descriptive text for images and symbols
4. Character mappings to Unicode
Zone classification/extraction & Reading order

Page Segmentation and Zone Classification: The State of the Art  O. Okun, D. Doermann, M. Pietikainen

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers  C. Clark, S. Divvala

PDFFigures 2.0: Mining Figures from Research Papers  C. Clark, S. Divvala

Document image zone classification: A simple high-performance approach  D. Keysers, F. Shafait, T. M. Breuel

Document-Zone Classification using Partial Least Squares and Hybrid Classifiers  W. AbdAlmageed, M. Agrawal, W. Seo, D. Doermann

The Zonemap Metric for Page Segmentation and Area Classification in Scanned Documents  O. Galibert, J. Kahn and I. Oparin

Layout analysis and content classification in digitized books  A. Corbelli, L. Baraldi, F. Balducci, C. Grana, R. Cucchiara
Reading order
Table

A survey of table recognition  R. Zanibbi, D. Blostein, J.R. Cordy

Design of an end-to-end method to extract information from tables  A. Costa e Silva, A. Jorge, L. Torgo

A Table Detection Method for PDF Documents Based on Convolutional Neural Networks  L. Hao, L. Gao, X. Yi, Z. Tang

Extracting Tables from Documents using Conditional Generative Adversarial Networks and Genetic Algorithms  N. Le Vine, M. Zeigenfuse, M. Rowany

Detecting Table Region in PDF Documents Using Distant Supervision  Miao Fan and Doo Soon Kim
 Automatic Tabular Data Extraction and Understanding  R. Rastan

Algorithmic Extraction of Data in Tables in PDF Documents  A. Nurminen

A MultiLayered Approach to Information Extraction from Tables in Biomedical Documents  N. Milosevic

Integrating and querying similar tables from PDF documents using deep learning  Rahul Anand, Hyeyoung Paik and Chen Wang

Locating Tables in Scanned Documents for Reconstructing and Republishing  MAC Akmal Jahan, Roshan G Ragel

Recognition of Tables and Forms  Bertrand Coüasnon, Aurélie Lemaitre

TableBank: Table Benchmark for Image-based Table Detection and Recognition  M. Li, L. Cui, S. Huang, F. Wei, M. Zhou and Z. Li

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers  Christopher Clark and Santosh Divvala 
website

A Table Detection Method for Multipage PDF Documents via Visual Seperators and Tabular Structures  J. Fang, L. Gao, K. Bai, R. Qiu, X. Tao, Z. Tang

A Rectangle Mining Method for Understanding the Semantics of Financial Tables  X. Chen, L. Chiticariu, M. Danilevsky, A. Evfimievski and P. Sen

Table Header Detection and Classification  J. Fang, P. Mitra, Z. Tang, C. L. Giles

Configurable Table Structure Recognition in Untagged PDF Documents  A. Shigarov, A. Mikhailov, A. Altaev 
ppt

Complicated Table Structure Recognition  Z. Chi, H. Huang, H. Xu, H. Yu, W. Yin, X. Mao  github
Systems
Sparse line
Chart and diagram

FigureSeer: Parsing Result-Figures in Research Papers  N. Siegel, Z. Horvitz, R. Levin, S. Divvala, and A. Farhadi

Extraction, layout analysis and classification of diagrams in PDF documents  Robert P. Futrelle, Mingyan Shao, Chris Cieslik and Andrea Elaina Grimes

Graphics Recognition in PDF documents  Mingyan Shao and Robert P. Futrelle

A Study on the Document Zone Content Classification Problem  Yalin Wang, Ihsin T. Phillips, and Robert M. Haralick
 Text/Figure Separation in Document Images Using Docstrum Descriptor and Two-Level Clustering  Valery Anisimovskiy, Ilya Kurilin, Andrey Shcherbinin, Petr Pohl
 CHARTSynthetic

Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers  Christopher Clark and Santosh Divvala 
website

Metrics for Evaluating Data Extraction from Charts  Adobe Research  github
Mathematical expression

A Font Setting Based Bayesian Model to Extract Mathematical Expression in PDF Files  Xing Wang, Jyh-Charn Liu

Mathematical Formula Identification in PDF Documents  Xiaoyan Lin, Liangcai Gao, Zhi Tang, Xiaofan Lin

Faithful Mathematical Formula Recognition from PDF Documents  Josef B. Baker, Alan P. Sexton and Volker Sorge

Extracting Precise Data from PDF Documents for Mathematical Formula Recognition  Josef B. Baker, Alan P. Sexton and Volker Sorge

Mathematical formula identification and performance evaluation in PDF documents  Xiaoyan Lin, Liangcai Gao, Zhi Tang, Josef Baker, Volker Sorge
Margins recognition
NLP & ML

Chargrid: Towards Understanding 2D Documents  A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, J. B. Faddoul  medium

Chargrid-OCR: End-to-end trainable Optical Character Recognition through Semantic Segmentation and Object Detection  C. Reisswig, A. R. Katti, M. Spinaci, J. Höhne  slides

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding  Timo I. Denk, Christian Reisswig  slides

LayoutLM: Pre-Training of Text and Layout for Document Image Understanding  Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou  github

Detect2Rank: Combining Object Detectors Using Learning to Rank  S. Karaoglu, Y. Liu, T. Gevers

DocParser: Hierarchical Structure Parsing of Document Renderings  J. Rausch, O. Martinez, F. Bissig, C. Zhang, and S. Feuerriegel  github  medium
Pretrained models
Workshops
Related topics
Bounding boxes
Images
Shape detection

Polygon Detection from a Set of Lines  Alfredo Ferreira, Manuel J. Fonseca, Joaquim A. Jorge

A Simple Approach to Recognise Geometric Shapes Interactively  Joaquim A. Jorge and Manuel J. Fonseca

The Detection of Rectangular Shape Objects Using Matching Schema  Soo-Young Ye, Joon-Young Choi and Ki-Gon Nam

Edge Detection Based Shape Identification  Vivek Kumar, Sumit Pandey, Amrindra Pal, Sandeep Sharma

Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or its Caricature  David H. Douglas and Thomas K. Peucker

Shape description using cubic polynomial Bezier curves  L. Cinque, S. Levialdi, A. Malizia

New Algorithm for Medial Axis Transform of Plane Domain and details from stackoverflow  Choi, Choi, Moon and Wee
Character Recognition
Layout Similarity
Dehyphenation
Data structure
Datasets

DocBank: A Benchmark Dataset for Document Layout Analysis  M. Li, Y. Xu, L. Cui, S. Huang, F. Wei, Z. Li, M. Zhou 
github

PubLayNet: largest dataset ever for document layout analysis  Zhong, Tang and Yepes 
github
 ibm article

DocParser: Hierarchical Structure Parsing of Document Renderings  J. Rausch, O. Martinez, F. Bissig, C. Zhang, and S. Feuerriegel

TableBank: Table Benchmark for Image-based Table Detection and Recognition  M. Li, L. Cui, S. Huang, F. Wei, M. Zhou and Z. Li

Document Image Datasets  Jonathan DeGange
Output file format
Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)
Pdf page to image converter
A PDF page-to-image converter is available to help with the research process. It relies on the MuPDF library, available in the SumatraPDF reader.
Pdf layout analysis viewer
A PDF layout analysis viewer is also available; it relies on the MuPDF library as well.