About the developer

Academic-Hammer
218 Stars 47 Forks MIT License 19 Commits 23 Opened issues

Description

Table structure recognition dataset of the paper: Complicated Table Structure Recognition

SciTSR

Introduction

SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.

Download link is here.

There are 15,000 examples in total, which we split into 12,000 for training and 3,000 for testing. We also provide a test set that contains only complicated tables, called SciTSR-COMP. The indices of SciTSR-COMP are stored in `SciTSR-COMP.list`.
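
As a minimal sketch of selecting the SciTSR-COMP subset, the snippet below assumes that `SciTSR-COMP.list` is a plain-text file with one table ID per line. The file layout is not described here, so treat that as an assumption; the helper name `read_id_list` is illustrative and not part of the released code.

```python
from pathlib import Path
from typing import List


def read_id_list(list_path: str) -> List[str]:
    """Return the non-empty, stripped lines of an ID list file.

    Assumption: the file holds one table ID per line.
    """
    text = Path(list_path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

The returned IDs can then be used to pick the matching files under `test`.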

The statistics of the SciTSR dataset are as follows:

| | Train | Test |
| --------------------------- | -----: | ----: |
| # Tables | 12,000 | 3,000 |
| # Complicated tables | 2,885 | 716 |

Format and Example

The directory tree structure is as follows:

SciTSR
├── SciTSR-COMP.list
├── test
│   ├── chunk
│   ├── img
│   ├── pdf
│   └── structure
└── train
    ├── chunk
    ├── img
    ├── pdf
    ├── rel
    └── structure

The input PDF files are stored in `pdf`, and the structure labels are stored in the `structure` directory.

For convenience, we also provide the input as images in `img`, converted from the PDFs by `pdfcairo`.

We also provide the extracted chunks in `chunk`, which are pre-processed by Tabby.

For the training data, we provide the relation labels we constructed for our GraphTSR model in `rel`. They are generated by matching chunks against the text of the structure labels.

Note that our pre-processed chunk and relation data may contain noise. The original input files are in PDF.
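
To make the layout concrete, here is a small sketch that maps a table ID to its files. The `.chunk`, `.json` and `.rel` names follow the file patterns given in this README; the `.pdf` and `.png` extensions and the helper name `sample_paths` are assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict


def sample_paths(root: str, split: str, table_id: str) -> Dict[str, Path]:
    """Map each data kind to the file path for one table ID."""
    base = Path(root) / split
    paths = {
        "pdf": base / "pdf" / f"{table_id}.pdf",  # extension assumed
        "img": base / "img" / f"{table_id}.png",  # extension assumed
        "chunk": base / "chunk" / f"{table_id}.chunk",
        "structure": base / "structure" / f"{table_id}.json",
    }
    if split == "train":
        # Relation labels are only provided for the training split.
        paths["rel"] = base / "rel" / f"{table_id}.rel"
    return paths
```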

Text Chunks

File: chunk/[ID].chunk

The `pos` array contains the `x1`, `x2`, `y1` and `y2` coordinates (in PDF) of the chunk.

{"chunks": [
  {
    "pos": [
      147.96600341796875,
      205.49998474121094,
      475.7929992675781,
      480.4206237792969
    ],
    "text": "Probability"
  },
  {
    "pos": [
      217.45510864257812,
      290.6802673339844,
      475.7929992675781,
      480.4206237792969
    ],
    "text": "Generated Text"
  },
  ...
 ]}
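
A chunk file can be read with the standard `json` module. The sketch below parses the JSON text into named tuples; `ChunkBox` and `parse_chunks` are hypothetical names used for illustration, not the classes shipped in the repository.

```python
import json
from typing import List, NamedTuple


class ChunkBox(NamedTuple):
    x1: float
    x2: float
    y1: float
    y2: float
    text: str


def parse_chunks(raw: str) -> List[ChunkBox]:
    """Parse the JSON text of a .chunk file into ChunkBox tuples."""
    data = json.loads(raw)
    # "pos" is ordered x1, x2, y1, y2 (not x1, y1, x2, y2).
    return [ChunkBox(*c["pos"], c["text"]) for c in data["chunks"]]


sample = '{"chunks": [{"pos": [147.97, 205.5, 475.79, 480.42], "text": "Probability"}]}'
chunks = parse_chunks(sample)
print(chunks[0].text, chunks[0].x2 - chunks[0].x1)
```

Note the coordinate order: both x values come first, so unpacking the array as a conventional `x1, y1, x2, y2` box would silently swap fields.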

Relations

File: rel/[ID].rel

A line of `CHUNK_ID_1 CHUNK_ID_2 RELATION_ID:NUM_BLANK` means that the relation between the `CHUNK_ID_1`-th chunk and the `CHUNK_ID_2`-th chunk is `RELATION_ID`, and that there are `NUM_BLANK` blank cells between them. For `RELATION_ID`, 1 and 2 represent horizontal and vertical, respectively.

0 1 1:0
1 2 1:0
0 9 2:0
...
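
The line format above can be parsed as follows. This is a minimal sketch that assumes the three fields are separated by whitespace; `RelLine` and `parse_rel` are illustrative names, not part of the released code.

```python
from typing import List, NamedTuple


class RelLine(NamedTuple):
    chunk_id_1: int
    chunk_id_2: int
    relation_id: int  # 1 = horizontal, 2 = vertical
    num_blank: int    # blank cells between the two chunks


def parse_rel(text: str) -> List[RelLine]:
    """Parse the text of a .rel file, skipping blank or malformed lines."""
    relations = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        relation_id, num_blank = parts[2].split(":")
        relations.append(
            RelLine(int(parts[0]), int(parts[1]), int(relation_id), int(num_blank))
        )
    return relations


rels = parse_rel("0 1 1:0\n1 2 1:0\n0 9 2:0\n")
```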

Structure Labels

File: structure/[ID].json

A table is stored as a list of cells. For each cell, we provide its original TeX code, its content (split by space) and its position in the table (start/end row/column numbers, starting from 0).

{"cells": [
  {
    "id": 21,
    "tex": "959",
    "content": [
      "959"
    ],
    "start_row": 5,
    "end_row": 5,
    "start_col": 1,
    "end_col": 1
  },
  {
    "id": 1,
    "tex": "Training set",
    "content": [
      "Training",
      "set"
    ],
    "start_row": 0,
    "end_row": 0,
    "start_col": 1,
    "end_col": 1
  },
  ...
]}
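
One way to inspect a structure label is to lay the cells out on a 2-D grid. The sketch below does that, treating the end row and column as inclusive (as in the example above, where a single cell has equal start and end indices). The function name `cells_to_grid` and the sample table are made up for illustration.

```python
from typing import Dict, List


def cells_to_grid(structure: Dict) -> List[List[str]]:
    """Place each cell's text at every grid position the cell covers."""
    cells = structure["cells"]
    n_rows = max(c["end_row"] for c in cells) + 1
    n_cols = max(c["end_col"] for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        text = " ".join(c["content"])
        # A spanning cell fills all rows and columns in its range.
        for row in range(c["start_row"], c["end_row"] + 1):
            for col in range(c["start_col"], c["end_col"] + 1):
                grid[row][col] = text
    return grid


# Hypothetical 2x2 table whose header spans both columns.
sample = {"cells": [
    {"id": 0, "tex": "Head", "content": ["Head"],
     "start_row": 0, "end_row": 0, "start_col": 0, "end_col": 1},
    {"id": 1, "tex": "a", "content": ["a"],
     "start_row": 1, "end_row": 1, "start_col": 0, "end_col": 0},
    {"id": 2, "tex": "b", "content": ["b"],
     "start_row": 1, "end_row": 1, "start_col": 1, "end_col": 1},
]}
grid = cells_to_grid(sample)
```

Positions not covered by any cell stay as empty strings, which makes blank cells easy to spot.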

Implementation Details

Features

The code for vertex and edge features is in `./scitsr/graph.py`.

You can get vertex features with `Vertex(vid, chunk, tab_h, tab_w).features` and edge features with `Edge(vertex1, vertex2).features`, where `tab_h` and `tab_w` denote the height (y-axis) and width (x-axis) of the table.

See `./scitsr/graph.py` for more details.
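
This README does not say how `tab_h` and `tab_w` are obtained. One plausible choice, sketched below, is the extent of the box that encloses all chunks. This is an assumption for illustration only and may differ from what the repository code does; `table_size` is a hypothetical helper.

```python
from typing import Iterable, Sequence, Tuple


def table_size(positions: Iterable[Sequence[float]]) -> Tuple[float, float]:
    """Return (tab_h, tab_w) for chunk boxes given as [x1, x2, y1, y2]."""
    boxes = list(positions)
    if not boxes:
        raise ValueError("at least one chunk position is required")
    x_min = min(b[0] for b in boxes)
    x_max = max(b[1] for b in boxes)
    y_min = min(b[2] for b in boxes)
    y_max = max(b[3] for b in boxes)
    # Height is measured along the y-axis, width along the x-axis.
    return y_max - y_min, x_max - x_min


tab_h, tab_w = table_size([[10.0, 30.0, 100.0, 110.0], [40.0, 90.0, 80.0, 95.0]])
```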

Evaluation

In the evaluation procedure, a table is first converted to a list of horizontally/vertically adjacent relations. We then compare the ground-truth relations with the output relations.
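
To illustrate the conversion step, the following simplified sketch derives adjacency relations from a list of cells in the structure label format. It is not the released `json2Relations`: it identifies cells by `id` rather than by text, and the names `Relation` and `adjacency_relations` are chosen here for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# (from_id, to_id, direction, number of blank cells in between)
Relation = Tuple[int, int, str, int]


def adjacency_relations(cells: List[Dict]) -> Set[Relation]:
    n_rows = max(c["end_row"] for c in cells) + 1
    n_cols = max(c["end_col"] for c in cells) + 1
    grid: List[List[Optional[int]]] = [[None] * n_cols for _ in range(n_rows)]
    for c in cells:
        for row in range(c["start_row"], c["end_row"] + 1):
            for col in range(c["start_col"], c["end_col"] + 1):
                grid[row][col] = c["id"]

    relations: Set[Relation] = set()

    def scan(line: List[Optional[int]], direction: str) -> None:
        """Link each cell to the next different cell along one row or column."""
        prev, blanks = None, 0
        for cell_id in line:
            if cell_id is None:
                blanks += 1
            elif cell_id != prev:
                if prev is not None:
                    relations.add((prev, cell_id, direction, blanks))
                prev, blanks = cell_id, 0

    for row_ids in grid:
        scan(row_ids, "horizontal")
    for col_ids in zip(*grid):
        scan(list(col_ids), "vertical")
    return relations


# Hypothetical table: cell 0 spans the top row, cells 1 and 2 sit below it.
cells = [
    {"id": 0, "start_row": 0, "end_row": 0, "start_col": 0, "end_col": 1},
    {"id": 1, "start_row": 1, "end_row": 1, "start_col": 0, "end_col": 0},
    {"id": 2, "start_row": 1, "end_row": 1, "start_col": 1, "end_col": 1},
]
relations = adjacency_relations(cells)
```

Using a set means a relation found along several rows or columns of a spanning cell is counted once.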

We release the evaluation scripts for comparing horizontally and vertically adjacent relations. In the following example (`./examples/eval.py`), we show how to use the scripts to calculate precision/recall/F1 for an output table.

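import json

# Import path assumed: both functions are taken to live in scitsr.eval,
# alongside Table2Relations.
from scitsr.eval import json2Relations, eval_relations

# json_path: path to a structure label file, i.e. structure/[ID].json
with open(json_path) as fp:
    json_obj = json.load(fp)
# convert the structure labels (a table in json format) to a list of relations
ground_truth_relations = json2Relations(json_obj, splitted_content=True)
# your_relations should be a List of Relation.
# Here we directly use the ground truth relations in the example.
your_relations = ground_truth_relations
precision, recall = eval_relations(
    gt=[ground_truth_relations], res=[your_relations], cmp_blank=True)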
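
The call above returns precision and recall only, so F1 has to be derived from them. The sketch below shows that step, together with a simplified exact-match comparison over sets of relations. The comparison is an illustration of the idea and not a reimplementation of `eval_relations`; both function names are hypothetical.

```python
from typing import Hashable, Set, Tuple


def precision_recall(gt: Set[Hashable], res: Set[Hashable]) -> Tuple[float, float]:
    """Exact-match precision and recall between two sets of relations."""
    correct = len(gt & res)
    precision = correct / len(res) if res else 0.0
    recall = correct / len(gt) if gt else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "x", "y"})
f1 = f1_score(p, r)
```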

Note: Your output tables should be represented as `List[Relation]`. You can also store a table as a `Table` object and then convert it to `List[Relation]` by using `scitsr.eval.Table2Relations`.

Citation

Please cite the paper if you find the resources useful.

@article{chi2019complicated,
  title={Complicated Table Structure Recognition},
  author={Chi, Zewen and Huang, Heyan and Xu, Heng-Da and Yu, Houjin and Yin, Wanxuan and Mao, Xian-Ling},
  journal={arXiv preprint arXiv:1908.04729},
  year={2019}
}
