Tesseract Source Code Analysis (1): Binarization and Page Layout Analysis

Main data structures in Tesseract 4.0

  1. Page analysis result: PAGE_RES (ccstruct/pageres.h).
  2. PAGE_RES has a field holding a list of block analysis results: BLOCK_RES_LIST.
  3. Block analysis result: BLOCK_RES (ccstruct/pageres.h).
  4. BLOCK_RES has a field holding a list of row analysis results: ROW_RES_LIST.
  5. Row analysis result: ROW_RES (ccstruct/pageres.h).
  6. ROW_RES has a field holding a list of word analysis results: WERD_RES_LIST.
  7. WERD_RES (ccstruct/pageres.h) is a collection of publicly accessible members that gathers the information about one word result.
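
The nesting described in the list above (page, then blocks, then rows, then words) can be pictured with a small stand-alone sketch. The type names below (PageResult, BlockResult, RowResult, WordResult) and the CountWords helper are hypothetical simplifications made up for illustration. The real classes in ccstruct/pageres.h use Tesseract's own list types and carry many more members.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for WERD_RES, ROW_RES, BLOCK_RES and
// PAGE_RES. std::vector replaces Tesseract's own list classes here.
struct WordResult {
  std::string text;  // placeholder for the many fields of a word result
};
struct RowResult {
  std::vector<WordResult> words;  // plays the role of WERD_RES_LIST
};
struct BlockResult {
  std::vector<RowResult> rows;  // plays the role of ROW_RES_LIST
};
struct PageResult {
  std::vector<BlockResult> blocks;  // plays the role of BLOCK_RES_LIST
};

// Walks the page -> block -> row -> word hierarchy and counts the words.
std::size_t CountWords(const PageResult& page) {
  std::size_t count = 0;
  for (const BlockResult& block : page.blocks) {
    for (const RowResult& row : block.rows) {
      count += row.words.size();
    }
  }
  return count;
}
```

Any traversal of a recognition result follows the same three nested loops, whatever is done with each word.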

Source code analysis

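Tesseract's text recognition pipeline consists of binarization, segmentation, recognition, and error correction. This article summarizes how the first two stages, binarization and the segmentation (preprocessing) steps, are carried out.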

Page layout analysis steps

Binarization

  • Algorithm: Otsu
  • Call stack
    1. main[api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages[api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage[api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize[api/baseapi.cpp] ->
    5. TessBaseAPI::FindLines[api/baseapi.cpp] ->
    6. TessBaseAPI::Threshold[api/baseapi.cpp] ->
    7. ImageThresholder::ThresholdToPix[ccmain/thresholder.cpp] ->
    8. ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]

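Otsu is a global binarization algorithm: it picks a single threshold for the whole image. If the image contains shadows and the shadows are uneven, the binarization result is poor. OCRus uses a local binarization algorithm, Wolf-Jolion, which gives good results even on images with shadows.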
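
To make the "global" nature of Otsu concrete, here is a minimal sketch of the textbook method: build a gray-level histogram, then choose the one threshold that maximizes the between-class variance. This is not the code in ccmain/thresholder.cpp. The function names (GrayHistogram, OtsuThreshold, Binarize) and the convention that dark pixels map to 1 are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a 256-bin histogram of 8-bit gray values.
std::array<long long, 256> GrayHistogram(const std::vector<std::uint8_t>& gray) {
  std::array<long long, 256> hist{};
  for (std::uint8_t v : gray) ++hist[v];
  return hist;
}

// Returns the threshold t that maximizes the between-class variance when
// values <= t form one class and values > t form the other.
int OtsuThreshold(const std::array<long long, 256>& hist) {
  long long total = 0;
  double sum_all = 0.0;
  for (int i = 0; i < 256; ++i) {
    total += hist[i];
    sum_all += static_cast<double>(i) * hist[i];
  }
  long long w0 = 0;   // number of pixels in the dark class so far
  double sum0 = 0.0;  // sum of gray values in the dark class so far
  double best_score = -1.0;
  int best_t = 0;
  for (int t = 0; t < 256; ++t) {
    w0 += hist[t];
    sum0 += static_cast<double>(t) * hist[t];
    const long long w1 = total - w0;
    if (w0 == 0) continue;  // dark class still empty
    if (w1 == 0) break;     // bright class empty from here on
    const double mu0 = sum0 / w0;
    const double mu1 = (sum_all - sum0) / w1;
    // Proportional to the between-class variance (constant factor dropped).
    const double score = static_cast<double>(w0) * w1 * (mu0 - mu1) * (mu0 - mu1);
    if (score > best_score) {
      best_score = score;
      best_t = t;
    }
  }
  return best_t;
}

// Applies the single global threshold: 1 = dark (text), 0 = bright.
std::vector<std::uint8_t> Binarize(const std::vector<std::uint8_t>& gray) {
  const int t = OtsuThreshold(GrayHistogram(gray));
  std::vector<std::uint8_t> out(gray.size());
  for (std::size_t i = 0; i < gray.size(); ++i) out[i] = gray[i] <= t ? 1 : 0;
  return out;
}
```

Because the same threshold is applied to every pixel, a shadow that darkens one side of the page pushes background pixels there below the threshold. A local method instead computes a threshold per neighborhood.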

Segmentation

Remove vertical lines

This step removes vertical and horizontal lines in the image.

  • Call stack
    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
    6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
    7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
    8. Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
    9. LineFinder::FindAndRemoveLines [textord/linefind.cpp]
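
The idea of stripping ruling lines from a binary image can be illustrated with a deliberately naive sketch: treat any horizontal or vertical run of foreground pixels that is at least min_len long as a line and erase it. This is a stand-in chosen for brevity, not the method implemented in LineFinder::FindAndRemoveLines. The Bitmap alias, the RemoveLongRuns name, and the min_len parameter are all assumptions of this example.

```cpp
#include <cstdint>
#include <vector>

using Bitmap = std::vector<std::vector<std::uint8_t>>;  // 1 = foreground

// Erases every horizontal or vertical foreground run of length >= min_len.
// Runs are measured on the source image, so a crossing of two lines is
// removed together with both lines.
Bitmap RemoveLongRuns(const Bitmap& src, int min_len) {
  const int h = static_cast<int>(src.size());
  const int w = h > 0 ? static_cast<int>(src[0].size()) : 0;
  Bitmap out = src;
  // Horizontal runs.
  for (int y = 0; y < h; ++y) {
    int x = 0;
    while (x < w) {
      if (!src[y][x]) { ++x; continue; }
      const int start = x;
      while (x < w && src[y][x]) ++x;
      if (x - start >= min_len) {
        for (int i = start; i < x; ++i) out[y][i] = 0;
      }
    }
  }
  // Vertical runs.
  for (int x = 0; x < w; ++x) {
    int y = 0;
    while (y < h) {
      if (!src[y][x]) { ++y; continue; }
      const int start = y;
      while (y < h && src[y][x]) ++y;
      if (y - start >= min_len) {
        for (int i = start; i < y; ++i) out[i][x] = 0;
      }
    }
  }
  return out;
}
```

A run-length test like this breaks down on skewed or broken lines and on large bold glyphs, which is one reason a real implementation needs more than a fixed length cutoff.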

Remove images

This step removes image regions from the page.

  • Call stack

    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
    6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
    7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
    8. Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
    9. ImageFind::FindImages [textord/imagefind.cpp]

    I have never gotten this function to work. Perhaps the input image has to satisfy certain conditions.

Filter connected component

This step generates all connected components and filters out the noise blobs.

  • Call stack

    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
    6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
    7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
    8. Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
    9. (i) Textord::find_components [textord/tordmain.cpp] ->
    {
        extract_edges [textord/edgblob.cpp] // extract outlines and assign the outlines to blobs
        assign_blobs_to_blocks2 [textord/edgblob.cpp] // assign normal, noise, and rejected blobs to the TO_BLOCK_LIST for further blob filtering
        Textord::filter_blobs [textord/tordmain.cpp] ->
        Textord::filter_noise_blobs [textord/tordmain.cpp] // move small blobs to a separate list
    }

    (ii) ColumnFinder::SetupAndFilterNoise [textord/colfind.cpp]

    This step produces intermediate results; for examples, see http://blog.csdn.net/kaelsass/article/details/46874627
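
    The two halves of this step, generating connected components and moving the small ones to a separate list, can be sketched independently of Tesseract. The code below labels 8-connected components with a breadth-first search and then splits them by pixel count. The Blob struct, the function names, and the min_pixels criterion are assumptions of this example. Tesseract itself works on outlines (extract_edges) and applies its own size rules in filter_noise_blobs.

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

using Bitmap = std::vector<std::vector<std::uint8_t>>;  // 1 = foreground

// Bounding box (inclusive coordinates) and pixel count of one component.
struct Blob {
  int left, top, right, bottom;
  int pixels;
};

// Labels 8-connected foreground components with a breadth-first search.
std::vector<Blob> FindBlobs(const Bitmap& img) {
  const int h = static_cast<int>(img.size());
  const int w = h > 0 ? static_cast<int>(img[0].size()) : 0;
  std::vector<std::vector<bool>> seen(h, std::vector<bool>(w, false));
  std::vector<Blob> blobs;
  for (int y0 = 0; y0 < h; ++y0) {
    for (int x0 = 0; x0 < w; ++x0) {
      if (!img[y0][x0] || seen[y0][x0]) continue;
      Blob blob{x0, y0, x0, y0, 0};
      std::queue<std::pair<int, int>> frontier;
      frontier.push(std::make_pair(x0, y0));
      seen[y0][x0] = true;
      while (!frontier.empty()) {
        const int x = frontier.front().first;
        const int y = frontier.front().second;
        frontier.pop();
        ++blob.pixels;
        blob.left = std::min(blob.left, x);
        blob.right = std::max(blob.right, x);
        blob.top = std::min(blob.top, y);
        blob.bottom = std::max(blob.bottom, y);
        for (int dy = -1; dy <= 1; ++dy) {
          for (int dx = -1; dx <= 1; ++dx) {
            const int nx = x + dx;
            const int ny = y + dy;
            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
            if (!img[ny][nx] || seen[ny][nx]) continue;
            seen[ny][nx] = true;
            frontier.push(std::make_pair(nx, ny));
          }
        }
      }
      blobs.push_back(blob);
    }
  }
  return blobs;
}

// Moves blobs with fewer than min_pixels pixels to a separate noise list.
void SplitNoise(const std::vector<Blob>& all, int min_pixels,
                std::vector<Blob>* kept, std::vector<Blob>* noise) {
  for (const Blob& blob : all) {
    (blob.pixels >= min_pixels ? kept : noise)->push_back(blob);
  }
}
```

    Keeping the noise blobs in their own list rather than deleting them mirrors the comment on filter_noise_blobs above: small blobs are moved aside, not discarded.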

Finding candidate tab-stop components

  • Call stack

    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
    6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
    7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
    8. ColumnFinder::FindBlocks [textord/colfind.cpp] ->
    9. TabFind::FindInitialTabVectors [textord/tabfind.cpp] ->
    10. TabFind::FindTabBoxes [textord/tabfind.cpp]

    This step finds the initial candidate tab-stop connected components (CCs) by a radial search starting at every filtered CC from preprocessing. For example results, see http://blog.csdn.net/kaelsass/article/details/46874627
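
    As a rough intuition for what a "candidate tab-stop component" is, the sketch below marks a component as a left tab-stop candidate when no other component sits close to its left on the same text line. This is a heavy simplification and not the search performed by TabFind::FindTabBoxes. The CcBox struct, the LeftTabCandidates name, and the radius parameter are assumptions of this example.

```cpp
#include <cstddef>
#include <vector>

// Bounding box of one connected component, inclusive coordinates.
struct CcBox {
  int left, top, right, bottom;
};

// True when the two boxes share at least one row of pixels.
bool VerticallyOverlaps(const CcBox& a, const CcBox& b) {
  return a.top <= b.bottom && b.top <= a.bottom;
}

// Returns the indices of boxes that have no vertically overlapping neighbor
// within `radius` pixels to their left. Such boxes are candidates for lying
// on a left margin or column edge.
std::vector<std::size_t> LeftTabCandidates(const std::vector<CcBox>& boxes,
                                           int radius) {
  std::vector<std::size_t> candidates;
  for (std::size_t i = 0; i < boxes.size(); ++i) {
    bool has_left_neighbor = false;
    for (std::size_t j = 0; j < boxes.size() && !has_left_neighbor; ++j) {
      if (j == i || !VerticallyOverlaps(boxes[i], boxes[j])) continue;
      const int gap = boxes[i].left - boxes[j].right;
      if (gap > 0 && gap <= radius) has_left_neighbor = true;
    }
    if (!has_left_neighbor) candidates.push_back(i);
  }
  return candidates;
}
```

    The quadratic loop over all pairs is only acceptable for a toy example. Searching outward from each component within a bounded radius, as the description above says, needs a spatial index over the components to stay fast on a full page.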

Finding the column layout

  • Call stack

    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
    6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
    7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
    8. ColumnFinder::FindBlocks [textord/colfind.cpp] ->
    9. ColumnFinder::FindBlocks (beginning at line 369) [textord/colfind.cpp]

    This step finds the column layout of the page.

Finding the regions

  • Call stack

    1. main [api/tesseractmain.cpp] ->
    2. TessBaseAPI::ProcessPages [api/baseapi.cpp] ->
    3. TessBaseAPI::ProcessPage [api/baseapi.cpp] ->
    4. TessBaseAPI::Recognize [api/baseapi.cpp] ->
    5. TessBaseAPI::FindLines [api/baseapi.cpp] ->
    6. Tesseract::SegmentPage [ccmain/pagesegmain.cpp] ->
    7. Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] ->
    8. ColumnFinder::FindBlocks [textord/colfind.cpp]

    This step identifies the different types of blocks.
