Abstract
A new hybrid page layout analysis algorithm is proposed. It uses bottom-up methods to form an initial data-type hypothesis and to locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine.
Past Methods 1: Bottom-up
- Analyze groups of pixels or connected components to classify them as text/image/graphic/blank/line.
- Spread/smear/anneal groups of pixels by some neighborhood voting scheme, morphology or Voronoi/graph algorithms.
- Find connected components of labels to group pixels into typed regions.
- Box-up regions into rectangles where possible.
- The morphological approach is very similar.
- Hard to include knowledge like "Columns should usually be the same size."
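The bottom-up steps above can be illustrated with a minimal sketch. It assumes a tiny binary page stored as a list of rows (1 = ink), uses horizontal run-length smearing as one possible spreading step, and then boxes up 4-connected components. The names `smear_rows`, `region_boxes` and `page` are hypothetical and chosen for illustration only. This is not the Tesseract implementation.

```python
from collections import deque

def smear_rows(grid, max_gap):
    """Horizontal run-length smearing: fill runs of 0s no longer than
    max_gap that lie between two 1s on the same row."""
    out = [row[:] for row in grid]
    for row in out:
        last_ink = None  # column of the most recent foreground pixel
        for x, value in enumerate(row):
            if value:
                if last_ink is not None and 0 < x - last_ink - 1 <= max_gap:
                    for k in range(last_ink + 1, x):
                        row[k] = 1
                last_ink = x
    return out

def region_boxes(grid):
    """Find 4-connected foreground components and return their bounding
    boxes as (left, top, right, bottom), all inclusive."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not grid[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, top = min(left, cx), min(top, cy)
                right, bottom = max(right, cx), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and grid[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes

# Hypothetical page: two "words", each made of two thin strokes.
page = [
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
]

print(region_boxes(page))                 # four separate strokes
print(region_boxes(smear_rows(page, 1)))  # strokes merged into two regions
```

Nothing in this sketch knows about columns. The only tunable is the local smearing distance, which is one way to see why global knowledge such as equal column widths is hard to express in a purely bottom-up method.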
Past Methods 2: Top-down
- Often starts with a (possibly pre-trained) model of layout, e.g. a 2-column journal page.
- Attempts to cut the image into the required parts, either with recursive vertical/horizontal cuts, or finding rectangles of whitespace.
- Methods usually fail on non-rectangular regions.
- Methods can often only deal with pages that fit the model.
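The recursive vertical/horizontal cutting mentioned above can be sketched as a simple recursive X-Y cut. The sketch assumes a binary page stored as a list of rows (1 = ink) and boxes given as half-open `(left, top, right, bottom)` tuples. The names `find_gap`, `xy_cut`, `page` and the `min_gap` parameter are hypothetical. It illustrates the general technique, not any specific published system.

```python
def find_gap(profile, min_gap):
    """Return the middle index of the widest interior run of empty entries
    that is at least min_gap long, or None if there is no such run."""
    best = None   # (length, start) of the widest qualifying run so far
    start = None  # start index of the current empty run
    for i, filled in enumerate(profile):
        if not filled:
            if start is None:
                start = i
        else:
            if start is not None and start > 0:
                length = i - start
                if length >= min_gap and (best is None or length > best[0]):
                    best = (length, start)
            start = None
    if best is None:
        return None
    return best[1] + best[0] // 2

def xy_cut(grid, box, min_gap):
    """Recursively split box at whitespace gaps, trying vertical cuts
    first and horizontal cuts second. Returns a list of leaf boxes."""
    l, t, r, b = box
    cols = [any(grid[y][x] for y in range(t, b)) for x in range(l, r)]
    rows = [any(grid[y][x] for x in range(l, r)) for y in range(t, b)]
    if not any(cols):
        return []  # nothing but whitespace
    # Shrink the box so that it tightly encloses the ink.
    first_c, last_c = cols.index(True), len(cols) - cols[::-1].index(True)
    first_r, last_r = rows.index(True), len(rows) - rows[::-1].index(True)
    cols, rows = cols[first_c:last_c], rows[first_r:last_r]
    l, r = l + first_c, l + last_c
    t, b = t + first_r, t + last_r
    cut = find_gap(cols, min_gap)
    if cut is not None:
        x = l + cut
        return xy_cut(grid, (l, t, x, b), min_gap) + xy_cut(grid, (x, t, r, b), min_gap)
    cut = find_gap(rows, min_gap)
    if cut is not None:
        y = t + cut
        return xy_cut(grid, (l, t, r, y), min_gap) + xy_cut(grid, (l, y, r, b), min_gap)
    return [(l, t, r, b)]

# Hypothetical page: a full-width title above two columns.
page = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
]

print(xy_cut(page, (0, 0, 9, 4), min_gap=1))
```

Every cut here must span the whole box in a straight line, so a region that wraps around another (an L-shaped text block, for example) leaves no clean gap and is returned as a single rectangle. That is a small-scale view of the failure on non-rectangular regions noted above.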