First, we use a sliding window to sample overlapping sub-images from the document page image. We then pass each window to a Single-Shot Detector (SSD [3]) to locate formula regions.
Translation: sample overlapping sub-images (Chinese: 采样重叠子图像)
SSD simultaneously evaluates multiple formula region candidates laid out in a grid (see Figure 3), and then applies non-maximal suppression (NMS) to select the window-level detections.
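Translation: simultaneously (Chinese: 同时), laid out in a grid (网格), non-maximal suppression (非极大值抑制).
That is, SSD scores multiple candidate formula regions in the grid at the same time, then uses non-maximal suppression to select the window-level detections.
Note: "window-level detections" probably means the formula regions detected within a single window.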
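To make the NMS step concrete, here is a minimal sketch of greedy non-maximal suppression in plain Python. The box format, the function names, and the IOU threshold of 0.5 are assumptions for illustration; the excerpt does not state which threshold ScanSSD uses.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Of two heavily overlapping candidates, only the higher-scoring one survives; candidates elsewhere in the window are unaffected.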
A. Sliding Windows
To produce sub-images for use in detection, starting from a 600 dpi page image we slide a 1200 × 1200 window with a vertical and horizontal stride (shift) of 120 pixels (10% of window size).
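The window layout described above can be sketched as follows. The 1200-pixel window and 120-pixel stride come from the text; the function name and the handling of page edges are assumptions, since the excerpt does not say whether windows are padded past the page border.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in page pixels


def sliding_windows(page_w: int, page_h: int,
                    window: int = 1200, stride: int = 120) -> Iterator[Window]:
    """Yield square windows over a page, row by row.

    Assumption: only offsets that keep the window inside the page are used.
    A page smaller than the window yields a single window at the origin.
    """
    for y in range(0, max(page_h - window, 0) + 1, stride):
        for x in range(0, max(page_w - window, 0) + 1, stride):
            yield (x, y, x + window, y + window)
```

With a stride of 10% of the window size, neighbouring windows share 90% of their area, which is why most formulas appear in many sub-images.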
Our windows are roughly 10 text lines in height, which makes math formulas large enough for SSD to detect them reliably. The SSD detector is trained using ground truth math regions cropped at the boundary of each window, after scaling and translating formula bounding boxes appropriately.
Note: isn't the ground truth known in advance? Why do the formula bounding boxes still need to be scaled and translated?
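One plausible reading is that the ground truth is annotated in page coordinates, while each training sample is a window with its own origin that is resized before entering the network. The sketch below clips a page-level box to a window, shifts it to the window origin, and rescales it. The function name and the 512-pixel network input size are assumptions, not details confirmed by this excerpt.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_in_window(box: Box, window: Box, out_size: int = 512) -> Optional[Box]:
    """Clip a page-level box to a window and express it in resized-window pixels.

    Returns None when the box does not overlap the window.
    out_size is an assumed network input size.
    """
    wx0, wy0, wx1, wy1 = window
    # Crop the box at the window boundary.
    x0, y0 = max(box[0], wx0), max(box[1], wy0)
    x1, y1 = min(box[2], wx1), min(box[3], wy1)
    if x0 >= x1 or y0 >= y1:
        return None
    # Translate to the window origin, then scale to the network input size.
    sx = out_size / (wx1 - wx0)
    sy = out_size / (wy1 - wy0)
    return ((x0 - wx0) * sx, (y0 - wy0) * sy, (x1 - wx0) * sx, (y1 - wy0) * sy)
```

A formula that straddles a window edge therefore yields a truncated training box in that window, which matches the "cropped at the boundary" wording.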
Advantages. There are four main advantages to using sliding windows. The first is data augmentation: only 569 page images are available in the training set, which is very small for training a deep neural network. Our sliding windows produce 656,717 sub-images.
Second, converting the original page image directly to 300 × 300 or 512 × 512 loses a great deal of visual information, and when we tried to detect formulas using subsampled page images recall was extremely low.
Translation: visual information (Chinese: 视觉信息), subsampled (下采样).
Third, as we maintain the overlap between windows, the network sees formulas multiple times, and has multiple chances to detect a formula. This helps increase recall, because formulas appear in more regions of detection windows.
Finally, Liu et al. [3] mention that SSD is challenged when detecting small objects. Formulas with just one or two characters are common, but also small. Using high-resolution sub-images increases the relative size of math regions, which makes it easier for SSD to detect them.
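A rough calculation illustrates the size argument. All numbers below except the 1200-pixel window are hypothetical: a 50-pixel-tall symbol, a page 6600 pixels tall (11 inches at 600 dpi), and a 512-pixel network input.

```python
def resized_height(height_px: float, source_side: float, target_side: float = 512) -> float:
    """Height of an object after an image side is resized from source_side to target_side."""
    return height_px * target_side / source_side


symbol_px = 50  # hypothetical height of a one-character formula at 600 dpi

# Resizing the whole page (hypothetical 6600 px tall) down to the network input.
page_px = resized_height(symbol_px, source_side=6600)

# Resizing a single 1200 px window down to the same input size.
window_px = resized_height(symbol_px, source_side=1200)
```

Under these assumptions the symbol is about 4 pixels tall when the whole page is resized and about 21 pixels tall when a window is resized, a factor of 5.5.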
Disadvantages. There are also a few disadvantages to using sliding windows versus detection within a single page image. The first is increased computational cost; this can be mitigated through parallelization, as each window may be processed independently.
Translation: increased computational cost (Chinese: 增加计算成本), mitigated through parallelization (通过并行化缓解).
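Because windows are independent, detection can be mapped over them concurrently. The sketch below uses a thread pool from the Python standard library; `detect_window` is a hypothetical stand-in for the SSD forward pass, and a real system would more likely batch windows on a GPU.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def detect_all(windows: Sequence[Box],
               detect_window: Callable[[Box], List[Box]],
               workers: int = 4) -> List[List[Box]]:
    """Run a per-window detector over all windows in parallel.

    detect_window is a hypothetical callable standing in for the detector.
    Results are returned in the same order as the input windows.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(detect_window, windows))
```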
Secondly, windowing cuts formulas if they do not fit in a window. This means that a large expression may be split into multiple sub-images; this makes it impossible to train the SSD network to detect large math expressions directly. To mitigate this issue, we train the network to detect formulas across windows. Furthermore, windowing requires that we stitch (combine) results from individual windows to obtain detection results at the level of the original page. We discuss how we address these problems using pooling methods in section V.
Translation: To mitigate this issue (Chinese: 为了缓解这个问题).
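The first part of stitching is mapping each window-level detection back into page coordinates. The sketch below shows only that coordinate change; the pooling of overlapping detections from different windows is covered in section V of the paper and is not attempted here. The function name and the 512-pixel network input size are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def window_to_page(box: Box, window_origin: Tuple[float, float],
                   window_size: int = 1200, net_size: int = 512) -> Box:
    """Map a box from resized-window pixels back to page pixels.

    net_size is an assumed network input size; window_origin is the
    top-left corner of the window on the page.
    """
    scale = window_size / net_size  # undo the resize
    ox, oy = window_origin          # undo the shift to the window origin
    return (ox + box[0] * scale, oy + box[1] * scale,
            ox + box[2] * scale, oy + box[3] * scale)
```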
B. Region Matching and Default Boxes in SSD
SSD defines a fixed space of candidate detection regions organized in a spatial grid at multiple resolutions ('default boxes').
Each default box may be resized and translated by the SSD network to fit target regions, and is associated with a confidence score.
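Translation: default box (Chinese: 候选框), target regions (目标区域).
That is, the SSD network can resize and translate each default box to bring it closer to a target region, and each box carries a confidence score.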
Figure 1 shows default boxes of different sizes and aspect ratios overlaid on a 512×512 image.
Translation: aspect ratios (Chinese: 纵横比)
In SSD, each feature map is a pixel grid, but the associated default boxes are defined in the original image coordinate space.
The image is analyzed at multiple scales; here for illustration the 32 × 32 grid of default boxes is shown. In practice, if we used only the 32 × 32 default boxes, we might miss smaller objects.
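The relation between a feature-map grid and image coordinates can be sketched as below: each cell of a 32 × 32 grid over a 512 × 512 image covers 16 × 16 pixels, and default boxes are centred on the cells, as in the SSD paper. The function name is an assumption.

```python
from typing import List, Tuple


def grid_centers(grid: int = 32, image_size: int = 512) -> List[Tuple[float, float]]:
    """Centres (x, y) of default boxes for a grid x grid feature map, in image pixels."""
    step = image_size / grid  # pixels covered by one feature-map cell
    return [((col + 0.5) * step, (row + 0.5) * step)
            for row in range(grid)
            for col in range(grid)]
```

Coarser feature maps use the same formula with a smaller `grid`, which gives fewer, more widely spaced centres for larger default boxes.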
Our metric for matching ground truth to candidate detection regions is the same as SSD [3]. Each ground truth box is matched to a default box with the highest IOU, and also with default boxes with an IOU greater than 0.5. Matching targets to more than one default box simplifies learning by allowing the network to predict higher scores for more boxes. The matched default boxes are considered positive examples (POS) and the remaining default boxes are considered negative examples (NEG).
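The two matching rules can be sketched as follows. The function names and the return convention (index of the matched ground truth box, or -1 for NEG) are assumptions for illustration; the 0.5 threshold is the one stated in the text.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(gt_boxes: List[Box], default_boxes: List[Box],
                        threshold: float = 0.5) -> List[int]:
    """For each default box, return the matched ground truth index (POS) or -1 (NEG)."""
    matches = [-1] * len(default_boxes)
    # Rule: a default box is positive if its best IOU with a ground truth box exceeds the threshold.
    for d, dbox in enumerate(default_boxes):
        best = max(range(len(gt_boxes)), key=lambda g: iou(gt_boxes[g], dbox), default=None)
        if best is not None and iou(gt_boxes[best], dbox) > threshold:
            matches[d] = best
    # Rule: every ground truth box also claims its highest-IOU default box, even below the threshold.
    for g, gbox in enumerate(gt_boxes):
        if default_boxes:
            best_d = max(range(len(default_boxes)), key=lambda d: iou(gbox, default_boxes[d]))
            matches[best_d] = g
    return matches
```

The second rule guarantees that every formula has at least one positive default box, even when no default box overlaps it by more than 0.5.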
The original SSD [3] architecture uses aspect ratios (width/height) of {1, 2, 3, 1/2, 1/3}. However, as we see in Figure 4, there are many wide formulas with an aspect ratio greater than 3 in the dataset. As a result, wider default boxes will have a higher chance of matching wide formulas. So, in addition to the default boxes used in the original SSD, we also add the wider default boxes used in TextBoxes [29], with aspect ratios {5, 7, 10}. In our early experiments, these wider default boxes increased recall for large formulas.
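Translation: in short, the original SSD uses default boxes with aspect ratios {1, 2, 3, 1/2, 1/3}. Because some formulas in this paper's dataset are much wider, the authors add three more ratios, {5, 7, 10}.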
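To see what the extra ratios mean geometrically, the sketch below computes default box widths and heights the way the SSD paper does, keeping the area fixed at the square of the scale while varying width/height. The scale value used in the example is hypothetical, and the additional ratio-1 box that SSD adds at an intermediate scale is omitted.

```python
import math
from typing import List, Sequence, Tuple

SSD_RATIOS = (1, 2, 3, 1 / 2, 1 / 3)  # original SSD aspect ratios (width / height)
WIDE_RATIOS = (5, 7, 10)              # wider ratios added following TextBoxes


def box_shapes(scale: float, ratios: Sequence[float]) -> List[Tuple[float, float]]:
    """(width, height) pairs with area scale**2 and the requested width/height ratio."""
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in ratios]


# Hypothetical scale of 64 pixels, chosen only for illustration.
shapes = box_shapes(64, SSD_RATIOS + WIDE_RATIOS)
```

At the same scale, a ratio-10 box is roughly 202 pixels wide and 20 pixels tall, which is much closer to the shape of a long displayed formula than any of the original five ratios.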
C. Postprocessing
Figure 2 illustrates postprocessing in ScanSSD. We expand and/or shrink initial formula detections so that they are cropped around the connected components they contain and touch at their border. The goal is to capture entire characters belonging to a detection region, without additional padding.
This postprocessing is done at two stages: first, before stitching, and second, after pooling regions to obtain output formula detections.
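The expand-and-shrink step can be sketched as a flood fill over a binarized image: start from the ink pixels inside a detection, follow each connected component to its full extent, and take the tight bounding box of everything reached. This is an illustrative sketch, not ScanSSD's actual implementation; the function name, the inclusive box convention, and the use of 8-connectivity are assumptions.

```python
from collections import deque
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel indices


def fit_to_components(ink: List[List[int]], box: Box) -> Optional[Box]:
    """Refit a detection to the connected components it contains or touches.

    ink is a binary image (1 = ink pixel) indexed as ink[y][x].
    Returns None when the box contains no ink.
    """
    height, width = len(ink), len(ink[0])
    x0, y0, x1, y1 = box
    # Seed the search with every ink pixel inside the (clamped) detection box.
    queue = deque(
        (x, y)
        for y in range(max(y0, 0), min(y1, height - 1) + 1)
        for x in range(max(x0, 0), min(x1, width - 1) + 1)
        if ink[y][x]
    )
    seen = set(queue)
    # Flood fill outward so components crossing the box border are kept whole.
    while queue:
        x, y = queue.popleft()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nx, ny = x + dx, y + dy
                if (0 <= nx < width and 0 <= ny < height
                        and ink[ny][nx] and (nx, ny) not in seen):
                    seen.add((nx, ny))
                    queue.append((nx, ny))
    if not seen:
        return None
    xs = [x for x, _ in seen]
    ys = [y for _, y in seen]
    # The tight bounding box both expands to whole characters and removes padding.
    return (min(xs), min(ys), max(xs), max(ys))
```

A box that cuts through a character grows to include all of it, while empty margins around the formula are trimmed away.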