Hybrid page layout analysis via tab-stop detection

Abstract

A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine.
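To make the idea of locating tab-stops concrete, here is a minimal sketch in Python. It is not Tesseract's implementation: it simply assumes that connected-component bounding boxes are already available as `(left, top, right, bottom)` tuples, and it treats any x position shared by enough left edges as a candidate tab-stop. The function name `candidate_tab_stops` and the parameters `tolerance` and `min_support` are hypothetical, chosen only for illustration.

```python
def candidate_tab_stops(boxes, tolerance=3, min_support=3):
    """Return x positions where many boxes share (almost) the same left edge.

    boxes: list of (left, top, right, bottom) tuples (assumed input format).
    tolerance: maximum spread in pixels of left edges within one cluster.
    min_support: minimum number of aligned boxes needed to report a tab-stop.
    """
    lefts = sorted(box[0] for box in boxes)
    stops = []
    cluster = []
    for x in lefts:
        # Close the current cluster once x drifts too far from its first edge.
        if cluster and x - cluster[0] > tolerance:
            if len(cluster) >= min_support:
                stops.append(round(sum(cluster) / len(cluster)))
            cluster = []
        cluster.append(x)
    # Flush the final cluster.
    if len(cluster) >= min_support:
        stops.append(round(sum(cluster) / len(cluster)))
    return stops


# Three boxes aligned near x=11, three near x=200, and one stray box at x=50.
boxes = [
    (10, 0, 90, 10), (11, 20, 95, 30), (12, 40, 80, 50),
    (200, 0, 290, 10), (201, 20, 280, 30), (199, 40, 285, 50),
    (50, 60, 70, 70),
]
stops = candidate_tab_stops(boxes)
```

In this toy input the stray box at x=50 has no aligned neighbors, so it does not produce a tab-stop. A real detector would also need right-edge stops and a check that the aligned edges are vertically close enough to belong together.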

Past Methods 1: Bottom-up

Analyze groups of pixels or connected components and classify them as text, image, graphic, blank, or line.
Spread, smear, or anneal groups of pixels using a neighborhood voting scheme, morphology, or Voronoi/graph algorithms.
Find connected components of labels to group pixels into typed regions.
Box up regions into rectangles where possible.
The morphological approach is very similar.
It is hard to include knowledge like "Columns should usually be the same size."
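The smear, group, and box-up steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not any particular published method: the page is a small binary grid (lists of 0/1), smearing is a horizontal run-length fill, and grouping uses 4-connected components. The names `smear_rows` and `bounding_boxes` are hypothetical.

```python
from collections import deque


def smear_rows(grid, max_gap):
    """Fill horizontal background gaps of at most max_gap pixels
    that lie between two foreground pixels on the same row."""
    out = [row[:] for row in grid]
    for row in out:
        last = None  # column of the most recent foreground pixel
        for x, value in enumerate(row):
            if value:
                if last is not None and 0 < x - last - 1 <= max_gap:
                    for k in range(last + 1, x):
                        row[k] = 1
                last = x
    return out


def bounding_boxes(grid):
    """Label 4-connected foreground components and return their
    bounding boxes as (left, top, right, bottom), inclusive, in scan order."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not grid[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < width and 0 <= ny < height \
                            and grid[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


page = [
    [1, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1],
]
raw_boxes = bounding_boxes(page)
smeared_boxes = bounding_boxes(smear_rows(page, max_gap=1))
```

Before smearing, the two nearby strokes on the left are separate components; after smearing with `max_gap=1` they merge into one region, while the distant stroke on the right stays separate. Nothing in this pipeline knows about columns, which is the limitation noted in the last point above.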

Past Methods 2: Top-down

Often starts with a (possibly pre-trained) model of the layout, e.g., a 2-column journal page.
Attempts to cut the image into the required parts, either with recursive vertical/horizontal cuts or by finding rectangles of whitespace.
Methods usually fail on non-rectangular regions.
Methods can often only deal with pages that fit the model.
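The recursive vertical/horizontal cutting mentioned above can be illustrated with a small recursive X-Y cut over bounding boxes. This is a generic sketch, assuming boxes are `(left, top, right, bottom)` tuples and that a cut is allowed wherever the projection of the boxes onto an axis has a gap of at least `min_gap` pixels. The names `xy_cut`, `_split`, and `min_gap` are hypothetical.

```python
def _split(boxes, lo, hi, min_gap):
    """Group boxes along one axis, starting a new group wherever the
    projection onto that axis has a gap of at least min_gap."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # farthest extent covered so far on this axis
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes, min_gap=10):
    """Recursively cut a set of boxes into regions.

    Tries vertical cuts first (gaps along x), then horizontal cuts
    (gaps along y). Returns a list of regions, each a sorted list of boxes.
    """
    if not boxes:
        return []
    for lo, hi in ((0, 2), (1, 3)):  # (left, right) then (top, bottom)
        groups = _split(boxes, lo, hi, min_gap)
        if len(groups) > 1:
            regions = []
            for group in groups:
                regions.extend(xy_cut(group, min_gap))
            return regions
    return [sorted(boxes)]  # no cut possible: this is a leaf region


# Two columns separated by a 20-pixel gutter; lines within a column are 10 pixels apart.
two_columns = [
    (0, 0, 40, 10), (0, 20, 40, 30),
    (60, 0, 100, 10), (60, 20, 100, 30),
]
regions = xy_cut(two_columns, min_gap=15)
```

With `min_gap=15` the gutter is wide enough to cut but the line spacing is not, so the result is one region per column. The sketch also shows the weakness listed above: a region that is not rectangular (for example, text wrapping around a figure) leaves no clean straight gap, so no cut is found.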

New Method: Hybrid

Hybrid layout

Last edited: 2017.12.04 18:35:57
