Hybrid page layout analysis via tab-stop detection

Abstract

  A new hybrid page layout analysis algorithm is proposed. It uses bottom-up methods to form an initial data-type hypothesis and to locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code is available as part of the Tesseract open source OCR engine.
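
  To make the idea of a tab-stop concrete, here is a minimal Python sketch that looks for x-positions where the left edges of many text boxes line up. It is an illustration only, not the Tesseract implementation: the function name left_tab_stops, the box format (left, top, right, bottom) and the parameters tolerance and min_support are all assumptions made for this example.

```python
def left_tab_stops(boxes, tolerance=3, min_support=3):
    """Return candidate left tab-stop x-positions.

    boxes: list of (left, top, right, bottom) tuples (hypothetical format).
    Left edges closer than `tolerance` to their sorted neighbor are grouped;
    groups with fewer than `min_support` members are discarded.
    """
    if not boxes:
        return []
    edges = sorted(b[0] for b in boxes)
    stops = []
    cluster = [edges[0]]
    for x in edges[1:]:
        if x - cluster[-1] <= tolerance:
            cluster.append(x)
        else:
            if len(cluster) >= min_support:
                # Use the middle element as the representative position.
                stops.append(cluster[len(cluster) // 2])
            cluster = [x]
    if len(cluster) >= min_support:
        stops.append(cluster[len(cluster) // 2])
    return stops


# Four lines aligned near x=10, one indented line at x=30,
# and three lines aligned near x=200 (a second column).
lines = [
    (10, 0, 90, 8), (11, 10, 90, 18), (10, 20, 90, 28), (12, 30, 90, 38),
    (30, 40, 90, 48),
    (200, 0, 280, 8), (201, 10, 280, 18), (199, 20, 280, 28),
]
stops = left_tab_stops(lines)
```

  The lone indented line is ignored because a single edge does not provide enough support, while the two groups of aligned edges survive as candidate column boundaries.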

Past Methods 1: Bottom-up

  • Analyze groups of pixels or connected components and classify them as text, image, graphic, blank, or line.
  • Spread/smear/anneal groups of pixels by some neighborhood voting scheme, morphology, or Voronoi/graph algorithms.
  • Find connected components of labels to group pixels into typed regions.
  • Box up regions into rectangles where possible.
  • The morphological approach is very similar.
  • Hard to include knowledge like "Columns should usually be the same size."
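
  The bottom-up steps above can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not a method from any particular paper: the page is a small binary grid (1 = foreground), smearing is a simple horizontal run-length fill, and the names smear_rows and component_boxes are hypothetical.

```python
from collections import deque


def smear_rows(grid, max_gap):
    """Fill background runs of length <= max_gap between two foreground pixels."""
    out = [row[:] for row in grid]
    for y, row in enumerate(grid):
        last_fg = None
        for x, value in enumerate(row):
            if value:
                if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                    for k in range(last_fg + 1, x):
                        out[y][k] = 1
                last_fg = x
    return out


def component_boxes(grid):
    """Return inclusive (left, top, right, bottom) boxes of 4-connected components."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(x, y)])
                left, top, right, bottom = x, y, x, y
                while queue:
                    cx, cy = queue.popleft()
                    left, right = min(left, cx), max(right, cx)
                    top, bottom = min(top, cy), max(bottom, cy)
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        if 0 <= nx < w and 0 <= ny < h and grid[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((nx, ny))
                boxes.append((left, top, right, bottom))
    return boxes


page = [
    [1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
raw_boxes = component_boxes(page)                    # three separate components
region_boxes = component_boxes(smear_rows(page, 1))  # nearby components merged
```

  Note that nothing in this pipeline knows about columns: regions emerge purely from local pixel neighborhoods, which is why global knowledge such as equal column widths is hard to include.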

Past Methods 2: Top-down

  • Often starts with a (possibly pre-trained) model of the layout, e.g. a 2-column journal page.
  • Attempts to cut the image into the required parts, either by recursive vertical/horizontal cuts or by finding rectangles of whitespace.
  • Methods usually fail on non-rectangular regions.
  • Methods can often only deal with pages that fit the model.
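
  The recursive vertical/horizontal cutting mentioned above can be sketched as follows. This is a simplified illustration that works on bounding boxes rather than pixels; the names widest_gap and xy_cut, the box format (left, top, right, bottom) and the min_gap parameter are assumptions made for this example.

```python
def widest_gap(boxes, lo, hi, min_gap):
    """Find the widest empty gap along one axis.

    lo/hi are the tuple indices of the interval on that axis
    (0, 2 for x; 1, 3 for y). Returns (gap_width, cut_position) or None.
    """
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best = None
    reach = intervals[0][1]  # furthest extent covered so far
    for start, end in intervals[1:]:
        gap = start - reach
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, (reach + start) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes, min_gap):
    """Recursively split boxes at the widest whitespace gap; return a list of groups."""
    if not boxes:
        return []
    vertical = widest_gap(boxes, 0, 2, min_gap)    # gap along x, cut is a vertical line
    horizontal = widest_gap(boxes, 1, 3, min_gap)  # gap along y, cut is a horizontal line
    if vertical is None and horizontal is None:
        return [boxes]
    if horizontal is None or (vertical is not None and vertical[0] >= horizontal[0]):
        axis, cut = 0, vertical[1]
    else:
        axis, cut = 1, horizontal[1]
    first = [b for b in boxes if b[axis] < cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# Two columns, each holding two text lines.
text_lines = [
    (0, 0, 40, 10), (0, 20, 40, 30),
    (60, 0, 100, 10), (60, 20, 100, 30),
]
columns = xy_cut(text_lines, min_gap=15)
```

  Because every cut is a straight line across the whole current region, a layout in which one region wraps around another offers no clean gap to cut through, which illustrates why such methods struggle with non-rectangular regions.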

New Method: Hybrid

Hybrid layout