Table structure recognition dataset from the paper *Complicated Table Structure Recognition*
SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.
The download link is here.
There are 15,000 examples in total, and we split them into 12,000 for training and 3,000 for test. We also provide a test set that contains only complicated tables, called SciTSR-COMP. The indices of SciTSR-COMP are stored in `SciTSR-COMP.list`.
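For example, to evaluate on the complicated subset only, the list can be read and used to filter the test set. A minimal sketch, assuming the file stores one table ID per line:

```python
# a minimal sketch, assuming SciTSR-COMP.list stores one table ID per line
with open("SciTSR-COMP.list") as fp:
    comp_ids = {line.strip() for line in fp if line.strip()}
print(len(comp_ids))  # number of complicated test tables
```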
The statistics of the SciTSR dataset are as follows:
|                      |  Train |  Test |
| -------------------- | -----: | ----: |
| # Tables             | 12,000 | 3,000 |
| # Complicated tables |  2,885 |   716 |
The directory tree structure is as follows:
```
SciTSR
├── SciTSR-COMP.list
├── test
│   ├── chunk
│   ├── img
│   ├── pdf
│   └── structure
└── train
    ├── chunk
    ├── img
    ├── pdf
    ├── rel
    └── structure
```
The input PDF files are stored in `pdf`, and the structure labels are stored in `structure`. For convenience, we provide the input in image format (stored in `img`), which is converted from the PDFs by `pdfcairo`. We also provide the extracted chunks (stored in `chunk`), which are pre-processed by Tabby. For the training data, we provide our constructed relation labels (stored in `rel`) for our GraphTSR model, which are generated by matching chunks with the texts of the structure labels.
Note that our pre-processed chunk and relation data may contain noise; the original input files are the PDFs.
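As a quick orientation, the files belonging to one example can be located by their shared ID across these directories. A minimal sketch (the root path and iteration logic are illustrative assumptions, not part of the repo):

```python
import os

root = "SciTSR/train"  # assumed extraction path
# each example shares one ID across the pdf/img/chunk/rel/structure directories
ids = sorted(os.path.splitext(name)[0]
             for name in os.listdir(os.path.join(root, "pdf")))
for table_id in ids[:3]:
    chunk_path = os.path.join(root, "chunk", table_id + ".chunk")
    rel_path = os.path.join(root, "rel", table_id + ".rel")
    structure_path = os.path.join(root, "structure", table_id + ".json")
    print(table_id, chunk_path, rel_path, structure_path)
```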
File: `chunk/[ID].chunk`
The `pos` array contains the `x1`, `x2`, `y1`, and `y2` coordinates (in PDF coordinates) of the chunk:
{"chunks": [ { "pos": [ 147.96600341796875, 205.49998474121094, 475.7929992675781, 480.4206237792969 ], "text": "Probability" }, { "pos": [ 217.45510864257812, 290.6802673339844, 475.7929992675781, 480.4206237792969 ], "text": "Generated Text" }, ... ]}
File: `rel/[ID].rel`
A line of the form `CHUNK_ID_1 CHUNK_ID_2 RELATION_ID:NUM_BLANK` means that the relation between the `CHUNK_ID_1`-th chunk and the `CHUNK_ID_2`-th chunk is `RELATION_ID`, with `NUM_BLANK` blank cells between them. For `RELATION_ID`, 1 and 2 represent horizontal and vertical adjacency, respectively.
```
0 1 1:0
1 2 1:0
0 9 2:0
...
```
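A `.rel` file can therefore be parsed line by line. A minimal sketch (the helper name is ours, not part of the repo):

```python
def parse_rel_line(line):
    """Parse "CHUNK_ID_1 CHUNK_ID_2 RELATION_ID:NUM_BLANK"."""
    c1, c2, rel = line.split()
    relation_id, num_blank = rel.split(":")
    return int(c1), int(c2), int(relation_id), int(num_blank)

# relation_id 1 = horizontal, 2 = vertical
assert parse_rel_line("0 9 2:0") == (0, 9, 2, 0)
```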
File: `structure/[ID].json`
A table is stored as a list of cells. For each cell, we provide its original TeX code, its content (split by spaces), and its position in the table (start/end row/column numbers, starting from 0).
{"cells": [ { "id": 21, "tex": "959", "content": [ "959" ], "start_row": 5, "end_row": 5, "start_col": 1, "end_col": 1 }, { "id": 1, "tex": "Training set", "content": [ "Training", "set" ], "start_row": 0, "end_row": 0, "start_col": 1, "end_col": 1 }, ... ]}
The code for vertex and edge features is in `./scitsr/graph.py`. You can get vertex features via `Vertex(vid, chunk, tab_h, tab_w).features` and edge features via `Edge(vertex1, vertex2).features`, where `tab_h` and `tab_w` denote the height (y-axis) and width (x-axis) of the table. See `./scitsr/graph.py` for more details.
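For example, features for a whole table might be collected as follows (a sketch only: the exact object that `Vertex` expects for `chunk`, and how `tab_h`/`tab_w` are computed, should be checked against `./scitsr/graph.py`; here we approximate them from the chunk bounding boxes):

```python
from scitsr.graph import Vertex, Edge

# chunks: the chunks of one table, loaded as in the chunk section above;
# tab_h / tab_w: table height and width, approximated here from the
# chunks' bounding boxes (an assumption of this sketch)
xs = [x for c in chunks for x in c["pos"][:2]]
ys = [y for c in chunks for y in c["pos"][2:]]
tab_w, tab_h = max(xs) - min(xs), max(ys) - min(ys)

vertices = [Vertex(vid, c, tab_h, tab_w) for vid, c in enumerate(chunks)]
vertex_features = [v.features for v in vertices]
edge_features = Edge(vertices[0], vertices[1]).features
```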
In the evaluation procedure, a table is first converted to a list of horizontally/vertically adjacent relations; the ground-truth relations are then compared with the output relations.
We release the evaluation scripts for comparing horizontally and vertically adjacent relations. The following example (`./examples/eval.py`) shows how to use the scripts to calculate precision/recall/F1 for an output table.
```python
import json

from scitsr.eval import json2Relations, eval_relations

# json_path: path to a structure/[ID].json label file
with open(json_path) as fp:
    json_obj = json.load(fp)
# convert the structure labels (a table in json format) to a list of relations
ground_truth_relations = json2Relations(json_obj, splitted_content=True)
# your_relations should be a List of Relation.
# Here we directly use the ground truth relations in the example.
your_relations = ground_truth_relations
precision, recall = eval_relations(
    gt=[ground_truth_relations], res=[your_relations], cmp_blank=True)
```
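The script returns precision and recall; F1 is then the usual harmonic mean of the two:

```python
# harmonic mean of precision and recall
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall > 0 else 0.0)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```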
Note: Your output tables should be represented as `List[Relation]`. You can also store a table as a `Table` object and then convert it to `List[Relation]` using `scitsr.eval.Table2Relations`.
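For instance (a sketch; how a `Table` object is constructed from your model's output is repo-specific and not shown here):

```python
from scitsr.eval import Table2Relations

# your_table: a Table object built from your model's output
# (see the scitsr source for the Table constructor)
your_relations = Table2Relations(your_table)
```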
Please cite the paper if you find the resources useful.
```
@article{chi2019complicated,
  title={Complicated Table Structure Recognition},
  author={Chi, Zewen and Huang, Heyan and Xu, Heng-Da and Yu, Houjin and Yin, Wanxuan and Mao, Xian-Ling},
  journal={arXiv preprint arXiv:1908.04729},
  year={2019}
}
```