Title: Understanding and Diagnosing Visual Tracking Systems

Source: ICCV 2015

Project page (with MATLAB code): http://winsty.net/tracker_diagnose.html

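This paper is a good entry point into object tracking; after reading it I had a general picture of the field.

The paper first dissects existing visual tracking systems in depth: it breaks a tracker into several independent components and compares how each one affects performance. It then proposes a simple tracker built on HOG features and logistic regression. Simple as it is, it achieves results close to the state of the art on existing benchmarks.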

A tracking system usually starts by initializing the observation model with the given bounding box of the target in the first frame. In each of the following frames, the motion model first generates candidate regions or proposals for testing based on the estimation from the previous frame. The candidate regions or proposals are fed into the observation model to compute their probability of being the target. The one with the highest probability is then selected as the estimation result of the current frame. Based on the output of the observation model, the model updater decides whether the observation model needs any update and, if needed, the update frequency. Finally, if there are multiple trackers, the bounding boxes returned by the trackers are combined by the ensemble post-processor to obtain a more accurate estimate.

Five parts:

1. Motion Model: Based on the estimation from the previous frame, the motion model generates a set of candidate regions or bounding boxes which may contain the target in the current frame.

2. Feature Extractor: The feature extractor represents each candidate in the candidate set using some features.

3. Observation Model: The observation model judges whether a candidate is the target based on the features extracted from the candidate.

4. Model Updater: The model updater controls the strategy and frequency of updating the observation model. It has to strike a balance between model adaptation and drift.

5. Ensemble Post-processor: When a tracking system consists of multiple trackers, the ensemble post-processor takes the outputs of the constituent trackers and uses the ensemble learning approach to combine them into the final result.

Details of all these components

1. Motion Model: Particle Filter, Sliding Window, Radius Sliding Window

2. Feature Extractor: Raw Grayscale, Raw Color, Haar-like Features, HOG, HOG + Raw Color

3. Observation Model: Logistic Regression, Ridge Regression, SVM, Structured Output SVM (SO-SVM)

4. Model Updater:

① Update the model whenever the confidence of the target falls below a threshold.

② Update the model whenever the difference between the confidence of the target and that of the background examples is below a threshold.

5. Ensemble Post-processor:

① A loss function for bounding box majority voting, extended to incorporate tracker weights, trajectory continuity, and removal of bad trackers.

② A factorial hidden Markov model that considers the temporal smoothness between frames.

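Tracking systems are currently evaluated mainly with the Online Tracking Benchmark (OTB) and the Visual Object Tracking (VOT) challenge. These benchmarks mostly measure the accuracy and robustness of a tracker. However, such metrics make it hard to understand and diagnose where a tracker is strong and where it is weak. To address this, the authors propose a framework for understanding and diagnosing tracking systems, that is, for finding out exactly where a tracker's strengths and weaknesses lie.

The authors split a tracking system into five parts: motion model, feature extractor, observation model, model updater, and ensemble post-processor.

The key points of the paper are roughly as follows, with reference to: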

http://blog.csdn.net/u010515206/article/details/53406721

http://blog.csdn.net/hjl240/article/details/52225988

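Key points

Measuring how good a tracking algorithm is: accuracy and robustness

Accuracy is measured by the overlap ratio between the predicted region and the ground-truth region.

Robustness is judged by how often tracking fails.

The paper does not set out to build a new evaluation scheme; instead it splits the tracking system into five parts and evaluates each one separately.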

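Motion Model: when estimating the target in a frame, generates a set of candidate regions in the current frame that may contain the target.

Feature Extractor: extracts features from each candidate region and uses them to represent the candidate.

Observation Model: analyzes the features of each candidate region to decide whether it is the target region.

Model Updater: updates the observation model, controlling both the update strategy and when updates happen.

Ensemble Post-processor: when a tracking system contains several trackers, combines their results to produce the final tracking result.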

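How a tracking system runs

Given the first frame and the target region in it, the observation model is initialized. For each subsequent frame, the motion model first generates a set of candidate regions; features are extracted from each candidate; the observation model computes the probability that each candidate is the target, and the candidate with the highest probability is chosen as the target location. Based on the output of the observation model, the model updater then decides whether the observation model should be updated and, if so, how often. Finally, if several trackers are combined, their outputs are post-processed to produce the final result.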
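The loop described above can be sketched as a small Python skeleton. This is only an illustration of how the components plug together, not the authors' MATLAB code: the function name `track` and all callback names (`propose`, `extract`, `score`, `should_update`, `update`) are hypothetical, and the ensemble step is left out because it operates on the outputs of several such trackers.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def track(
    frames: Sequence[object],
    init_box: Box,
    propose: Callable[[Box], List[Box]],            # motion model
    extract: Callable[[object, Box], List[float]],  # feature extractor
    score: Callable[[List[float]], float],          # observation model
    should_update: Callable[[float], bool],         # model updater: decision
    update: Callable[[object, Box], None],          # model updater: action
) -> List[Box]:
    """Run a single tracker and return one estimated box per frame."""
    # Initialise the observation model from the first frame and its given box.
    update(frames[0], init_box)
    boxes = [init_box]
    for frame in frames[1:]:
        # Candidates are generated around the previous estimate.
        candidates = propose(boxes[-1])
        scores = [score(extract(frame, c)) for c in candidates]
        # The highest-scoring candidate becomes the estimate for this frame.
        best = max(range(len(candidates)), key=scores.__getitem__)
        boxes.append(candidates[best])
        # The updater decides from the model output whether to retrain.
        if should_update(scores[best]):
            update(frame, candidates[best])
    return boxes
```

Passing the components in as callbacks mirrors the paper's idea of swapping one component at a time while holding the others fixed.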

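The authors use two metrics. The first is the overlap (success) plot: given an overlap threshold that decides whether tracking succeeds in a frame, one obtains a curve of success rate against threshold, and the area under this curve (AUC) is the score. In each frame, with A the region predicted by the tracker and B the ground-truth region, the overlap ratio is a = |A∩B| / |A∪B|. For a threshold varying between 0 and 1, a frame counts as a success when its overlap ratio exceeds the threshold, and the success rate is computed over all frames of the video. The second is the center location error plot: the pixel distance between the centers of the estimated and ground-truth positions is computed, and a given distance threshold decides whether tracking succeeds.

Now for the interesting part: comparison experiments were run for each of the five components above.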
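As a rough sketch of the two metrics, the following Python computes the overlap ratio, the success rate at a threshold, an approximate AUC, and the center-error precision. The function names are made up for illustration, boxes are assumed to be axis-aligned `(x, y, width, height)` tuples, the AUC is approximated by averaging the success rate over evenly spaced thresholds, and the 20-pixel default is an assumption rather than a value taken from these notes.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def overlap(a: Box, b: Box) -> float:
    """Overlap ratio |A ∩ B| / |A ∪ B| of two axis-aligned boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a: Box, b: Box) -> float:
    """Pixel distance between the centers of two boxes."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return math.hypot(ax - bx, ay - by)


def success_rate(pred: Sequence[Box], truth: Sequence[Box], threshold: float) -> float:
    """Fraction of frames whose overlap ratio exceeds the threshold."""
    hits = sum(overlap(p, t) > threshold for p, t in zip(pred, truth))
    return hits / len(pred)


def success_auc(pred: Sequence[Box], truth: Sequence[Box], steps: int = 21) -> float:
    """Approximate area under the success plot over thresholds in [0, 1]."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(success_rate(pred, truth, t) for t in thresholds) / steps


def precision(pred: Sequence[Box], truth: Sequence[Box], max_pixels: float = 20.0) -> float:
    """Fraction of frames whose center error is within max_pixels (assumed default)."""
    hits = sum(center_error(p, t) <= max_pixels for p, t in zip(pred, truth))
    return hits / len(pred)
```

For example, two 10×10 boxes offset horizontally by 5 pixels intersect in 50 pixels and cover 150 pixels in total, so their overlap ratio is 1/3.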

Feature Extractor

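The authors experiment with the following five features.

1. Raw Grayscale: the image is simply resized to a fixed size and converted to grayscale, and the pixel values are used as features.

2. Raw Color: the pixel values of the image in the CIE Lab color space are used as features.

3. Haar-like Features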

4. HOG

5. HOG + Raw Color
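To make the simplest of these features concrete, here is a minimal sketch of a raw grayscale extractor. It assumes the patch is a nested list of 8-bit RGB tuples, uses nearest-neighbour resampling for the resize, and uses the common BT.601 luma weights for the grayscale conversion; the function name, the 4×4 default size, and these choices are illustrative assumptions, not details given in the notes.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B)


def raw_grayscale_feature(
    patch: Sequence[Sequence[Pixel]], size: Tuple[int, int] = (4, 4)
) -> List[float]:
    """Resize a patch to a fixed size, convert to gray, and flatten to a vector."""
    rows, cols = len(patch), len(patch[0])
    out_h, out_w = size
    feature = []
    for i in range(out_h):
        # Nearest-neighbour resampling: pick the source row for output row i.
        src_row = patch[min(rows - 1, i * rows // out_h)]
        for j in range(out_w):
            r, g, b = src_row[min(cols - 1, j * cols // out_w)]
            # BT.601 luma weights, scaled to [0, 1] (an assumed convention).
            feature.append((0.299 * r + 0.587 * g + 0.114 * b) / 255.0)
    return feature
```

Every candidate region yields a vector of the same length regardless of its size in the frame, which is what lets a fixed-dimension observation model score it.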

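The tracker that uses "HOG + Raw Color" features performs best.

Conclusion: features matter a great deal in a tracking system. Good features improve tracking accuracy. How to extract strongly expressive features still needs deeper study.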

Observation Model

The following four models are compared.

1. Logistic Regression

2. Ridge Regression

3. SVM

4. Structured Output SVM (SO-SVM)

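With weak features, a powerful classifier (SO-SVM, for example) improves the tracker considerably; with strong features, however, logistic regression as the classifier gives the best results.

Conclusion: when the features are weak, the quality of the classifier has a large effect on the final result; when the features are strong, it has little effect, and even a simple classifier can perform very well.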
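To show how small the "simple classifier" can be, here is a minimal logistic regression observation model trained by stochastic gradient descent. It is a sketch under assumptions: the class name, learning rate, L2 weight, and epoch count are all hypothetical, and the authors' actual training procedure is not described in these notes.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticObservationModel:
    """Scores a feature vector with the probability of being the target."""

    def __init__(self, dim: int, lr: float = 0.1, l2: float = 1e-3) -> None:
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr  # learning rate (assumed value)
        self.l2 = l2  # L2 regularisation strength (assumed value)

    def score(self, x: Sequence[float]) -> float:
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def fit(self, samples: List[Sequence[float]], labels: List[int], epochs: int = 50) -> None:
        """Labels are 1 for target patches and 0 for background patches."""
        for _ in range(epochs):
            for x, y in zip(samples, labels):
                err = self.score(x) - y  # gradient of the log loss w.r.t. the logit
                self.w = [
                    wi - self.lr * (err * xi + self.l2 * wi)
                    for wi, xi in zip(self.w, x)
                ]
                self.b -= self.lr * err
```

In a tracker, `fit` would be called with features of the current target box as positives and of nearby background boxes as negatives, and `score` would rank the candidates in the next frame.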

Motion Model

The following three models are tested.

1. Particle Filter

2. Sliding Window

3. Radius Sliding Window

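The particle filter and the sliding window differ in two main ways. First, the particle filter maintains a probabilistic estimate at every frame. When several candidate regions have a high probability of being the target, they are all kept for the following frames, which helps the tracker recover after a failure. The sliding window instead keeps only the candidate with the highest probability and discards the rest. Second, the particle filter framework handles scale change, aspect ratio change, and even rotation and skew more easily.

In other words, the particle filter carries several strong hypotheses forward to the next frame so that it can recover from mistakes, while the sliding window commits to a single best candidate. The sliding window also struggles with scale, aspect ratio, rotation, and skew, because searching over them exhaustively is computationally very expensive.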
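The difference in how candidates are generated can be sketched as follows. Two caveats: the sketch assumes the radius variant restricts the search to a circular neighbourhood instead of a square one, and `particle_proposals` shows only the sampling step, whereas a full particle filter would also keep weights and resample particles across frames. All function and parameter names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def sliding_window(prev: Box, radius: int, step: int = 1) -> List[Box]:
    """All translations inside a square neighbourhood of the previous box."""
    x, y, w, h = prev
    offsets = range(-radius, radius + 1, step)
    return [(x + dx, y + dy, w, h) for dx in offsets for dy in offsets]


def radius_sliding_window(prev: Box, radius: int, step: int = 1) -> List[Box]:
    """Same search, kept only inside a circle of the given radius (assumption)."""
    x, y, w, h = prev
    return [
        (bx, by, w, h)
        for bx, by, _, _ in sliding_window(prev, radius, step)
        if (bx - x) ** 2 + (by - y) ** 2 <= radius ** 2
    ]


def particle_proposals(
    prev: Box, n: int, sigma_xy: float, sigma_scale: float, rng: random.Random
) -> List[Box]:
    """Sample n boxes with Gaussian translation noise and log-normal scale noise."""
    x, y, w, h = prev
    boxes = []
    for _ in range(n):
        scale = math.exp(rng.gauss(0.0, sigma_scale))  # always positive
        boxes.append(
            (x + rng.gauss(0.0, sigma_xy), y + rng.gauss(0.0, sigma_xy), w * scale, h * scale)
        )
    return boxes
```

Adding scale to the sliding window would multiply the number of candidates by the number of scales tried, whereas the sampler covers scale with the same `n` boxes. That is the computational argument made above.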

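The particle filter handles scale change well but handles fast motion poorly. Can it handle both at once? To answer this, the authors first look at the translation parameter of the particle filter, which controls the tracker's search region. When the search region is too small, the tracker easily loses the target under fast motion; when it is large, the tracker tends to drift because of background distractors. The authors also note that the parameter is specified in pixels, so because videos have different resolutions, the same absolute pixel count gives different effective search regions. A simple fix is to scale the parameter by the video resolution, which is equivalent to resizing the video to a fixed size.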
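One way to scale a pixel-valued parameter by resolution is shown below. The function name and the 320×240 reference resolution are assumptions for illustration only; the notes do not say which reference size or scaling rule the authors use.

```python
import math
from typing import Tuple


def scaled_translation_sigma(
    base_sigma: float,
    frame_width: int,
    frame_height: int,
    reference_size: Tuple[int, int] = (320, 240),  # hypothetical reference resolution
) -> float:
    """Scale a translation parameter tuned at reference_size to another resolution.

    The linear scale factor is the square root of the pixel-count ratio, so a
    video with twice the width and height gets twice the search radius.
    """
    ref_w, ref_h = reference_size
    scale = math.sqrt((frame_width * frame_height) / (ref_w * ref_h))
    return base_sigma * scale
```

For example, a value of 4 pixels tuned at 320×240 becomes 8 pixels at 640×480, so the search region covers the same fraction of the frame.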

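This simple normalization step improves results, especially under fast motion. With it, the particle filter handles scale change and fast motion well at the same time. It also confirms that motion model parameters should be adapted to the video resolution.

Conclusion: the motion model has a relatively small effect on the tracking system. Setting its parameters properly, however, is very important for good results.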

Model Updater

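Since each observation model has its own update procedure, the model updater generally specifies when the model should be updated and how often.

The authors use the following two strategies:

1. Update the model whenever the confidence of the target falls below a threshold. This ensures the target always has high confidence. It is also the default updater in the authors' experiments.

2. Update the model when the difference between the confidence of the target and that of the background falls below a threshold. This strategy simply maintains a large margin between positive and negative examples rather than forcing the target to have high confidence. It helps in cases of occlusion or target disappearance.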
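The two strategies reduce to two small predicates. In this sketch the function names are hypothetical, and "the confidence of the background" is taken to be the highest-scoring background example, which is an assumption; the notes do not say how the background confidences are aggregated.

```python
from typing import Sequence


def update_on_low_confidence(target_conf: float, threshold: float) -> bool:
    """Strategy 1: update when the target's confidence drops below the threshold."""
    return target_conf < threshold


def update_on_small_margin(
    target_conf: float, background_confs: Sequence[float], threshold: float
) -> bool:
    """Strategy 2: update when the target barely beats the background.

    The margin is measured against the strongest background example (assumed).
    """
    return target_conf - max(background_confs) < threshold
```

Consider an occluded target with confidence 0.3 while all background examples score about 0.05: strategy 1 with a threshold of 0.5 would retrain on the occluded patch, while strategy 2 with a threshold of 0.2 would leave the model alone because the margin is still 0.25.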

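Different thresholds change the result by more than 10%. The best results of the two strategies are about the same, but the second strategy gives good results over a wider range of thresholds.

Conclusion: although model updating is often treated as an engineering trick, its effect on the result is significant and deserves deeper study.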

Ensemble Post-processor

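The analysis above shows that a single tracker can be unstable, for example because of its parameter settings. The ensemble post-processor can overcome this limitation.

Conclusion: the ensemble post-processor improves tracking performance, especially when the constituent trackers are diverse.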
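As a rough illustration of why combining trackers helps, the sketch below picks, for one frame, the box that agrees most with the boxes from the other trackers. This is a deliberately simplified stand-in for unweighted box voting, not the loss function or the factorial hidden Markov model from the paper: it ignores tracker weights, trajectory continuity, and removal of bad trackers. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def overlap(a: Box, b: Box) -> float:
    """Overlap ratio |A ∩ B| / |A ∪ B| of two axis-aligned boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def consensus_box(boxes: Sequence[Box]) -> Box:
    """Return the tracker output with the highest total overlap with the others."""

    def agreement(i: int) -> float:
        return sum(overlap(boxes[i], b) for j, b in enumerate(boxes) if j != i)

    return boxes[max(range(len(boxes)), key=agreement)]
```

A tracker that has drifted far away overlaps with none of the others and so can never win the vote, which is the intuition behind the stability gain.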

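Conclusion

By splitting the system into components and analyzing each in detail, the paper determines how much each part affects tracking performance. It finds that even a combination of very basic textbook components can achieve state-of-the-art tracking results, provided each component is carefully designed.

1. Feature extraction is the most important part of a tracking system.

2. If the features are strong, the observation model matters much less.

3. The motion model, model updater, and ensemble post-processor also affect the tracking system to some degree, so studying these three modules is worthwhile too.

Takeaway: how to find lightweight yet effective feature representations, principled model update strategies, and more advanced ensemble methods.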
