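I have recently been working on a formula detection project. This series of posts introduces a scanning-based method for formula detection, in three parts: first, a reading of the paper; second, a walkthrough of the open-source code; third, a reproduction on a local dataset. The paper and the code are available at: paper link, code link.
The previous post, "Formula Detection: ScanSSD Abstract", briefly covered the paper's abstract. This post continues with the "ScanSSD: Window-Level Detection" section.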
SCANSSD: WINDOW-LEVEL DETECTION
First, we use a sliding window to sample overlapping sub-images from the document page image. We then pass each window to a Single-Shot Detector (SSD [3]) to locate formula regions.
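Translation: sample overlapping sub-images means cropping sub-images whose areas overlap one another.
Interpretation: Overlapping sub-images are first sampled from the document page image with a sliding window. Each sub-image is then passed through the SSD network to locate formula regions.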
SSD simultaneously evaluates multiple formula region candidates laid out in a grid (see Figure 3), and then applies non-maximal suppression (NMS) to select the window-level detections.
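Translation: simultaneously (at the same time), laid out in a grid (arranged on a grid), non-maximal suppression (NMS).
SSD evaluates multiple candidate formula regions arranged on a grid at the same time, then applies non-maximal suppression to select the window-level detections.
Interpretation: window-level detections are the formula regions detected within a single window. NMS keeps the highest-scoring box among candidates that overlap heavily and discards the rest.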
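To make the NMS step concrete, here is a minimal greedy sketch in plain Python. The box format (x1, y1, x2, y2), the function names, and the 0.5 overlap threshold are assumptions for illustration; the quoted passage does not give the threshold ScanSSD uses.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh=0.5):
    """Return indices of the boxes kept by greedy non-maximal suppression."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])  # the second box is suppressed
```

The first two boxes overlap with an IoU of about 0.68, so only the higher-scoring one survives; the third box is far away and is kept.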
A. Sliding Windows
To produce sub-images for use in detection, starting from a 600 dpi page image we slide a 1200 × 1200 window with a vertical and horizontal stride (shift) of 120 pixels (10% of window size).
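Translation: To produce the sub-images used for detection, we start from a 600 dpi page image and slide a 1200 × 1200 window over it with a stride of 120 pixels, both vertically and horizontally.
Interpretation: Each window is 1200 × 1200 pixels and consecutive windows are offset by 120 pixels (10% of the window size), so adjacent windows overlap by 90%.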
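The window layout can be sketched as follows. The page size of 5100 × 6600 pixels (US Letter at 600 dpi) is an assumed example, and so is the edge handling: this sketch simply drops any trailing partial window, whereas the quoted text does not say how ScanSSD treats page borders.

```python
def window_origins(page_w, page_h, win=1200, stride=120):
    """Return the top-left (x, y) corner of every full sliding window."""
    xs = range(0, max(page_w - win, 0) + 1, stride)
    ys = range(0, max(page_h - win, 0) + 1, stride)
    return [(x, y) for y in ys for x in xs]


# Assumed example page: 8.5 in x 11 in at 600 dpi = 5100 x 6600 pixels.
origins = window_origins(5100, 6600)
print(len(origins))  # 33 columns x 46 rows = 1518 windows
```

A count in the low thousands per page is consistent in order of magnitude with the 656,717 sub-images the paper reports for 569 training pages.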
Our windows are roughly 10 text lines in height, which makes math formulas large enough for SSD to detect them reliably. The SSD detector is trained using ground truth math regions cropped at the boundary of each window, after scaling and translating formula bounding boxes appropriately.
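Translation: cropped (clipped, i.e. cut off at the window boundary).
The SSD detector is trained on ground-truth formula boxes that are cropped at the boundary of each window, after the boxes have been scaled and translated appropriately.
Interpretation: The ground truth is known in advance, so why do the boxes still need to be scaled and translated? The most likely reason is that the annotations are given in page coordinates while the network sees individual windows: each box has to be shifted into the window's coordinate frame (translation), clipped where it crosses the window edge, and rescaled if the window is resized to the network's input size.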
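A minimal sketch of that coordinate change, under stated assumptions: boxes are (x1, y1, x2, y2) in page pixels, the window is 1200 × 1200, and the window is resized to a 512 × 512 network input. The function name and the 512 input size are illustrative choices, not taken from the ScanSSD code.

```python
def box_to_window(box, origin, win=1200, out=512):
    """Map a page-level box into a resized window; return None if outside."""
    ox, oy = origin
    # Translate into the window's frame and crop at the window boundary.
    x1, y1 = max(box[0] - ox, 0), max(box[1] - oy, 0)
    x2, y2 = min(box[2] - ox, win), min(box[3] - oy, win)
    if x2 <= x1 or y2 <= y1:
        return None  # the formula does not intersect this window
    s = out / win  # scale from window pixels to network input pixels
    return (x1 * s, y1 * s, x2 * s, y2 * s)


# A formula at (1300, 600)-(1900, 660) on the page, seen from the window
# whose top-left corner is (1200, 480).
local_box = box_to_window((1300, 600, 1900, 660), (1200, 480))
```

A formula that straddles the window edge comes back clipped, which is what "cropped at the boundary of each window" describes.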
Advantages. There are four main advantages to using sliding windows. The first is data augmentation: only 569 page images are available in the training set, which is very small for training a deep neural network. Our sliding windows produce 656,717 sub-images.
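Translation: augmentation (enlarging the training set).
Interpretation: Sliding windows have four advantages. The first is data augmentation: the original training set has only 569 page images, and the sliding windows turn them into 656,717 sub-images, about 1,150 per page on average.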
Second, converting the original page image directly to 300 × 300 or 512 × 512 loses a great deal of visual information, and when we tried to detect formulas using subsampled page images recall was extremely low.
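Translation: visual information, subsampled (downsampled).
Second advantage: shrinking the original page image directly to 300 × 300 or 512 × 512 loses a great deal of visual information, and detecting formulas on such subsampled page images gave extremely low recall.
Interpretation: Subsampling an M × N image by a factor of s yields an image of size (M/s) × (N/s).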
Third, as we maintain the overlap between windows, the network sees formulas multiple times, and has multiple chances to detect a formula. This helps increase recall, because formulas appear in more regions of detection windows.
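Translation: maintain (keep).
Third advantage: because the windows overlap, the same formula appears in several windows, so the network sees it multiple times and has multiple chances to detect it, which helps recall.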
Finally, Liu et al. [3] mention that SSD is challenged when detecting small objects. Formulas with just one or two characters are common, but also small. Using high-resolution sub-images increases the relative size of math regions, which makes it easier for SSD to detect them.
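Translation: high-resolution.
Fourth advantage: SSD has difficulty detecting small objects, and formulas with only one or two characters are common but small. Using high-resolution sub-images increases the relative size of the formula regions, which makes them easier for SSD to detect.
Interpretation: The relative size grows because a window covers only a small part of the page. A formula therefore occupies a larger fraction of a window than of the full page, and it keeps more pixels once the image is resized to the network's input size.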
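A small worked example of the relative-size argument. All numbers here are assumptions chosen for illustration: a page 5100 pixels wide, a short formula 100 pixels wide, and a 512-pixel network input.

```python
def resized_width(formula_px, source_px, target_px=512):
    """Width of a formula after its source image is resized to target_px."""
    return formula_px * target_px / source_px


on_page = resized_width(100, 5100)    # whole page squeezed to 512 pixels
in_window = resized_width(100, 1200)  # 1200-pixel window resized to 512
print(round(on_page, 1), round(in_window, 1))
```

Under these assumptions the formula shrinks to roughly 10 pixels when the whole page is resized, but stays at roughly 43 pixels inside a window, a factor of 5100 / 1200 = 4.25 larger.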
Disadvantages. There are also a few disadvantages to using sliding windows versus detection within a single page image. The first is increased computational cost; this can be mitigated through parallelization, as each window may be processed independently.
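Translation: increased computational cost, mitigated through parallelization.
First disadvantage: higher computational cost (run time). This can be mitigated through parallelization, because each window can be processed independently.
Interpretation: Processing the windows in parallel should reduce the wall-clock time, although the total amount of computation stays the same.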
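Because the windows share no state, they can be mapped over a pool of workers. The sketch below uses Python's standard `concurrent.futures`; `detect_window` is a hypothetical placeholder for running SSD on one window and is not part of the ScanSSD code, where batching windows on a GPU would be the more typical route.

```python
from concurrent.futures import ThreadPoolExecutor


def detect_window(origin):
    """Hypothetical placeholder: a real system would run SSD on this window."""
    return {"origin": origin, "boxes": []}


def detect_page(origins, workers=4):
    """Run the detector on every window, several windows at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input windows.
        return list(pool.map(detect_window, origins))


results = detect_page([(0, 0), (120, 0), (240, 0)])
```

Keeping the results in input order makes it straightforward to map each window's detections back to page coordinates afterwards.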
Secondly, windowing cuts formulas if they do not fit in a window. This means that a large expression may be split into multiple sub-images; this makes it impossible to train the SSD network to detect large math expressions directly. To mitigate this issue, we train the network to detect formulas across windows. Furthermore, windowing requires that we stitch (combine) results from individual windows to obtain detection results at the level of the original page. We discuss how we address these problems using pooling methods in section V.
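Translation: To mitigate this issue.
Second disadvantage: a formula that does not fit in one window is cut, so a large expression may be split across several sub-images and the SSD network cannot be trained to detect it directly. To mitigate this, the network is trained to detect formulas across windows. In addition, the results from the individual windows have to be stitched together to obtain page-level detections. The paper addresses both problems with pooling methods in Section V.
Interpretation: How is the network trained to detect formulas across windows? The answer has to come from later in the paper.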
B. Region Matching and Default Boxes in SSD
SSD defines a fixed space of candidate detection regions organized in a spatial grid at multiple resolutions (‘default boxes’).
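Translation: SSD defines a fixed set of candidate detection regions, called default boxes, arranged on spatial grids at several resolutions.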
Each default box may be resized and translated by the SSD network to fit target regions, and is associated with a confidence score.
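Translation: default box (a predefined candidate box), target regions.
Each default box can be resized and translated by the SSD network to fit a target region more closely, and each is associated with a confidence score.
Interpretation: resize means adjusting the width and height of the default box; translate means shifting its position, i.e. moving its center.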
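The two operations can be sketched as a decoding step that turns predicted offsets into a box. This follows the offset parameterization of the SSD paper (center shifts scaled by the box size, log-scale width and height); the function name is illustrative, and many implementations additionally multiply the offsets by "variance" constants, which this sketch leaves out.

```python
import math


def decode(default_box, offsets):
    """Apply offsets (tx, ty, tw, th) to a default box (cx, cy, w, h)."""
    cx, cy, w, h = default_box
    tx, ty, tw, th = offsets
    return (cx + tx * w,       # translate: shift the center horizontally
            cy + ty * h,       # translate: shift the center vertically
            w * math.exp(tw),  # resize: scale the width
            h * math.exp(th))  # resize: scale the height


# Shift the box half its width to the right and double its width.
box = decode((100, 100, 40, 20), (0.5, 0.0, math.log(2), 0.0))
```

With all offsets at zero the default box is returned unchanged, so the network only has to learn corrections relative to each default box.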
Figure 3 shows default boxes of different sizes and aspect ratios overlaid on a 512 × 512 image.
Translation: aspect ratios (width-to-height ratios).
Figure 3 shows default boxes of different sizes and aspect ratios overlaid on a 512 × 512 image.
In SSD, each feature map is a pixel grid, but the associated default boxes are defined in the original image coordinate space.
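Translation: In SSD, each feature map is a pixel grid, but the associated default boxes are defined in the coordinate space of the original image.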
The image is analyzed at multiple scales; here for illustration the 32 × 32 grid of default boxes is shown. In practice, if we used only the 32 × 32 default boxes, we might miss smaller objects.
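Translation: for illustration (as an example).
The image is analyzed at multiple scales; as an example, Figure 3 shows only the 32 × 32 grid of default boxes. Here 32 × 32 is the resolution of the grid, not the size of a box. In practice, using only the 32 × 32 default boxes might miss smaller objects.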
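To show how a feature-map grid relates to image coordinates, the sketch below computes the default box centers for a 32 × 32 grid over a 512 × 512 image. Placing each center in the middle of its grid cell follows the SSD paper; the function name is illustrative.

```python
def grid_centers(grid=32, image=512):
    """Centers of the grid cells, expressed in image pixel coordinates."""
    step = image / grid  # each cell covers step x step image pixels
    return [((j + 0.5) * step, (i + 0.5) * step)
            for i in range(grid) for j in range(grid)]


centers = grid_centers()
print(len(centers), centers[0], centers[-1])  # 1024 (8.0, 8.0) (504.0, 504.0)
```

Each of the 1,024 cells is 16 pixels wide, and every default box attached to a cell is anchored at that cell's center.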
Our metric for matching ground truth to candidate detection regions is the same as SSD [3]. Each ground truth box is matched to a default box with the highest IOU, and also with default boxes with an IOU greater than 0.5. Matching targets to more than one default box simplifies learning by allowing the network to predict higher scores for more boxes. The matched default boxes are considered positive examples (POS) and the remaining default boxes are considered negative examples (NEG).
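Translation: Candidate regions are matched to the ground truth in the same way as in SSD. Each ground-truth box is matched to the default box with the highest IoU, and also to every default box whose IoU with it exceeds 0.5. Matched default boxes are positive examples (POS); all remaining default boxes are negative examples (NEG).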
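The matching rule can be sketched in a few lines. Boxes are assumed to be (x1, y1, x2, y2), and the function only returns which default boxes count as POS; real implementations also record which ground-truth box each positive is assigned to and resolve conflicts, which this sketch omits.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def positive_defaults(gt_boxes, default_boxes, thresh=0.5):
    """Return the indices of default boxes treated as positive examples."""
    positives = set()
    for gt in gt_boxes:
        ious = [iou(gt, d) for d in default_boxes]
        # The best-overlapping default box always matches, even below thresh.
        positives.add(max(range(len(ious)), key=ious.__getitem__))
        # Every default box above the threshold matches as well.
        positives.update(i for i, v in enumerate(ious) if v > thresh)
    return positives


defaults = [(0, 0, 10, 10), (0, 0, 10, 12), (100, 100, 110, 110)]
pos = positive_defaults([(0, 0, 10, 10)], defaults)  # indices 0 and 1
```

Every index not in the returned set is a negative example (NEG).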
The original SSD [3] architecture uses aspect ratios (width/height) of {1, 2, 3, 1/2, 1/3}. However, as we see in Figure 4, there are many wide formulas with an aspect ratio greater than 3 in the dataset. As a result, wider default boxes will have a higher chance of matching wide formulas. So, in addition to the default boxes used in the original SSD, we also add the wider default boxes used in TextBoxes [29], with aspect ratios {5, 7, 10}. In our early experiments, these wider default boxes increased recall for large formulas.
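Translation: The original SSD uses default boxes with aspect ratios (width/height) of {1, 2, 3, 1/2, 1/3}. Because the dataset contains many wide formulas with an aspect ratio greater than 3, the paper also adds the wider default boxes from TextBoxes, with aspect ratios {5, 7, 10}. In early experiments these wider boxes increased recall for large formulas.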
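To see what the extra aspect ratios do to the box shapes, the sketch below uses the SSD paper's convention that a box with scale s and aspect ratio r has width s·√r and height s/√r, so its area stays at s². The scale of 64 pixels is an assumed example value.

```python
import math

SSD_RATIOS = [1, 2, 3, 1 / 2, 1 / 3]  # aspect ratios of the original SSD
WIDE_RATIOS = [5, 7, 10]              # wider ratios borrowed from TextBoxes


def box_shapes(scale, ratios):
    """Return (width, height) for each aspect ratio at a fixed area."""
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in ratios]


for r, (w, h) in zip(WIDE_RATIOS, box_shapes(64, WIDE_RATIOS)):
    print(f"ratio {r}: {w:.0f} x {h:.0f} px")
```

The wide boxes are long and flat, which matches the shape of a single-line formula far better than a ratio of at most 3 can.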
C. Postprocessing
Figure 2 illustrates postprocessing in ScanSSD. We expand and/or shrink initial formula detections so that they are cropped around the connected components they contain and touch at their border. The goal is to capture entire characters belonging to a detection region, without additional padding.
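Translation: As Figure 2 shows, the initial detection boxes are expanded and/or shrunk so that they fit tightly around the connected components they contain or touch at their border. The goal is to capture the entire characters that belong to a detection region, with no extra padding.
Interpretation: Take the 0 in the figure above: at first only part of it lies inside the detection box; after the adjustment the whole character is inside.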
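One way to sketch this adjustment is a flood fill: start from every ink pixel inside the detection box, grow to the full connected components, and take their bounding box. The representation (a binary image as a list of rows, inclusive box coordinates, 8-connectivity) and the function name are assumptions for illustration, not the ScanSSD implementation.

```python
from collections import deque


def fit_to_components(img, box):
    """Snap box (x1, y1, x2, y2), inclusive pixels, to the ink it touches.

    img is a binary image as a list of rows, where 1 marks an ink pixel.
    """
    h, w = len(img), len(img[0])
    x1, y1, x2, y2 = box
    # Seed a flood fill with every ink pixel that lies inside the box.
    seeds = [(x, y)
             for y in range(max(y1, 0), min(y2, h - 1) + 1)
             for x in range(max(x1, 0), min(x2, w - 1) + 1)
             if img[y][x]]
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        x, y = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                nx, ny = x + dx, y + dy
                if (0 <= nx < w and 0 <= ny < h
                        and img[ny][nx] and (nx, ny) not in seen):
                    seen.add((nx, ny))
                    queue.append((nx, ny))
    if not seen:
        return box  # no ink inside the box: leave it unchanged
    xs = [p[0] for p in seen]
    ys = [p[1] for p in seen]
    return (min(xs), min(ys), max(xs), max(ys))
```

The same routine both expands a box that cuts through a character and shrinks a box that contains empty margin, because the result depends only on the components the box touches.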
This postprocessing is done at two stages: first, before stitching, and second, after pooling regions to obtain output formula detections.
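Translation: This postprocessing is applied at two stages: first before stitching, and again after the regions have been pooled into the output formula detections.
Interpretation: (1) The same formula may fall in several sliding windows, so the detections from the individual windows have to be stitched together on the page. (2) The stitched detections are then pooled into the final page-level formula regions. The box adjustment runs twice: once on the window-level detections before (1), and once more on the pooled regions after (2).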