SA-Siam: A Twofold Siamese Network for Real-Time Object Tracking

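Title: A Twofold Siamese Network for Real-Time Object Tracking

Authors: Anfeng He, Chong Luo, Xinmei Tian, Wenjun Zeng

Venue: CVPR 2018

Field: single-object tracking

[code]: I am trying to reproduce the paper's results; the project is in progress. Discussion and feedback are welcome.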

New ideas: two SiamFC branches and channel attention.

Why does it work? 1. It combines deep representations (utilizes heterogeneous features), which is the main reason it outperforms its baseline, SiamFC; 2. a large amount of training data (ImageNet); 3. large search regions.

Abstract:

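The authors observe that semantic features learned for image classification and appearance features learned for image similarity matching complement each other. The two branches, the semantic branch (S-Net) and the appearance branch (A-Net), are both built on the SiamFC architecture and are trained separately. A-Net is essentially the same as SiamFC; S-Net uses a channel attention mechanism.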

1. Introduction

The key to design a high-performance tracker is to find expressive features and corresponding classifiers that are simultaneously discriminative and generalized. Being discriminative allows the tracker to differentiate the true target from the cluttered or even deceptive background. Being generalized means that a tracker would tolerate the appearance changes of the tracked object, even when the object is not known a priori.

Discriminative power of a tracker: the ability to separate the target from a complex (cluttered or deceptive) background.

Generalization ability of a tracker: the ability to cope with changes in the target's appearance.

For SiamFC, the generalization capability remains quite poor and it encounters difficulties when the target has significant appearance change. As a result, SiamFC still has a performance gap to the best online tracker.

SiamFC generalizes poorly: when the target's appearance changes substantially, the tracker drifts. The goal of the paper is therefore to improve the generalization capability of SiamFC.

It is widely understood that, in a deep CNN trained for image classification task, features from deeper layers contain stronger semantic information and are more invariant to object appearance changes. These semantic features are an ideal complement to the appearance features trained in a similarity learning problem.

In other words, high-level features from a CNN pretrained for image classification carry strong semantic information and are invariant to changes in the target's appearance (when the target deforms, the feature still represents the same target).

For the semantic branch, we further propose a channel attention mechanism to achieve a minimum degree of target adaptation. The motivation is that different objects activate different sets of feature channels. We shall give higher weights to channels that play more important roles in tracking specific targets. This is realized by computing channel-wise weights based on the channel responses at the target object and in the surrounding context. This simplest form of target adaptation improves the discrimination power of the tracker.

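Some feature channels (note: channels, not individual features) are very useful for a particular tracking target, while others contribute almost nothing for that target. The tracker should therefore give higher weights to channels that play more important roles in tracking specific targets.

Summary:

1. SiamFC has one weakness: when the target's appearance changes drastically, the tracker easily loses it. Semantic features of the target, by contrast, are invariant to appearance changes. Combining the two makes them complementary.

2. Different feature channels have different discriminative power for a specific tracking target. Some channels are very important for tracking certain targets, while others contribute almost nothing for those targets.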

2. Related Work

2.1. Siamese Network Based Trackers

A notable advantage of this method is that it needs no or little online training. Thus, real-time tracking can be easily achieved.

The advantage of a fully-convolutional network is that, instead of a candidate patch of the same size of the target patch, one can provide as input to the network a much larger search image and it will compute the similarity at all translated sub-windows on a dense grid in a single evaluation.
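As an illustration of the dense evaluation described above, here is a minimal NumPy sketch that slides a template feature map over a larger search feature map and scores every translated sub-window. The function name `cross_correlate` is hypothetical, the random features stand in for real network outputs, and the explicit loops are for readability only; the shapes 6×6×256 and 22×22×256 are the ones listed under "Data dimensions" later in these notes.

```python
import numpy as np

def cross_correlate(template, search):
    """Score every translated sub-window of `search` against `template`.

    template: (h, w, c) feature map, search: (H, W, c) feature map.
    Returns a response map of shape (H - h + 1, W - w + 1).
    """
    h, w, _ = template.shape
    H, W, _ = search.shape
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search[i:i + h, j:j + w, :]
            response[i, j] = np.sum(window * template)  # inner product as similarity
    return response

# Toy data (assumption): random features instead of real CNN outputs.
rng = np.random.default_rng(0)
template = rng.standard_normal((6, 6, 256))
search = rng.standard_normal((22, 22, 256))
search[5:11, 9:15, :] = template  # plant the target at offset (5, 9)

response = cross_correlate(template, search)
peak = np.unravel_index(np.argmax(response), response.shape)
print(response.shape, peak)
```

The response map has one score per sub-window position, and its peak lands where the planted target sits. A real implementation would compute the same quantity with a single convolution call instead of Python loops.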

Significantly better performance is achieved without much speed drop.

SA-Siam inherits network architecture from SiamFC. We intend to improve SiamFC with an innovative way to utilize heterogeneous features.

2.2. Ensemble Trackers

A common insight of these ensemble trackers is that it is possible to make a strong tracker by utilizing different layers of CNN features. Besides, the correlation across models should be weak. In SA-Siam design, the appearance branch and the semantic branch use features at very different abstraction levels. Besides, they are not jointly trained to avoid becoming homogeneous.

2.3. Adaptive Feature Selection

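Different features affect different tracking targets differently, so using every feature for tracking a single object is neither efficient nor effective. Recently, SENet demonstrates the effectiveness of channel-wise attention on image recognition tasks.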

In our SA-Siam network, we perform channel-wise attention based on the channel activations. It can be looked on as a type of target adaptation, which potentially improves the tracking performance.

3. Our Approach

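The fundamental idea behind this design: appearance features learned through similarity learning and semantic features learned for classification complement each other. This is the authors' key observation.

3.1. SA-Siam Network Architecture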

The two branches are separately trained and not combined until testing time.

The appearance branch

Similar to SiamFC.

The semantic branch:

Components: a pretrained CNN (AlexNet), conv4/conv5 features, a fusion module (1×1 ConvNet), a crop operation, and an attention module.

We only train the fusion module and the channel attention module.

During testing time

The response maps produced by the two branches are combined using a weight. Similar to SiamFC, multiple scales are searched; the authors find that using three scales strikes a good balance between performance and speed.

3.2. Channel Attention in Semantic Branch
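The test-time combination can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the function names `combine_responses` and `locate_target` are hypothetical, the weight is assumed to multiply the appearance response (with one minus the weight on the semantic response), and the target is taken at the global maximum over three scales with no scale penalty. The value 0.3 is the combination weight reported under "Hyperparameters" in these notes.

```python
import numpy as np

def combine_responses(resp_appearance, resp_semantic, weight=0.3):
    """Weighted combination of the two branches' response maps.

    Assumption: `weight` applies to the appearance branch.
    """
    return weight * resp_appearance + (1.0 - weight) * resp_semantic

def locate_target(combined):
    """Return (scale index, (row, col)) of the highest combined response.

    combined: array of shape (num_scales, H, W).
    """
    scale_idx, row, col = np.unravel_index(np.argmax(combined), combined.shape)
    return int(scale_idx), (int(row), int(col))

# Toy responses (assumption): three scales, 17x17 maps.
resp_a = np.zeros((3, 17, 17))
resp_s = np.zeros((3, 17, 17))
resp_a[1, 8, 8] = 1.0   # appearance branch peaks at the center of scale 1
resp_s[1, 8, 9] = 1.0   # semantic branch peaks one cell to the right
resp_s[2, 3, 3] = 0.5   # a weaker semantic response at another scale

combined = combine_responses(resp_a, resp_s)
best_scale, best_pos = locate_target(combined)
print(best_scale, best_pos)
```

With a weight of 0.3 the semantic peak dominates in this toy case, so the combined maximum follows the semantic branch.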

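High-level semantic features are robust to changes in the target's appearance, which makes the tracker more generalized but less discriminative, so localization is less precise. To improve the discriminative power of the semantic branch, the authors design a channel attention mechanism.

Intuitively, different channels play different roles in tracking different targets. Some channels are extremely important for tracking certain targets but dispensable for others. If we could adapt the channel importance to the tracking target, we achieve the minimum functionality of target adaptation. To do so, not only the target itself but also its surrounding background matters. Therefore, the input to the proposed attention module is not the target alone but a larger region that includes the background context.

Take the conv5 feature map as an example. Its spatial size is 22×22.

First, the feature map is divided into a 3×3 grid; the center cell is 6×6, the same size as the target region.

Next, max pooling is applied within each grid cell.

Then, a two-layer multilayer perceptron (MLP) produces a coefficient for the channel.

Finally, a sigmoid function with a bias generates the final output weight.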
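The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the random MLP weights, and the choice to share one MLP across all channels are assumptions. The hidden layer of nine ReLU neurons and the bias of 0.5 come from the implementation details in Section 4.1 of these notes; reading the bias as an offset added to the sigmoid output is also an assumption.

```python
import numpy as np

def grid_max_pool(channel_map, center=6):
    """Split a square map into a 3x3 grid whose middle cell is center x center,
    then max-pool each cell. Returns a vector of 9 values."""
    n = channel_map.shape[0]
    border = (n - center) // 2               # 8 for a 22x22 map
    edges = [0, border, border + center, n]  # [0, 8, 14, 22]
    pooled = [channel_map[edges[r]:edges[r + 1], edges[c]:edges[c + 1]].max()
              for r in range(3) for c in range(3)]
    return np.array(pooled)

def channel_weight(channel_map, w1, b1, w2, b2, bias=0.5):
    """Map one channel's 22x22 response to a scalar importance weight."""
    x = grid_max_pool(channel_map)                  # (9,) pooled responses
    hidden = np.maximum(0.0, w1 @ x + b1)           # hidden layer with ReLU
    logit = w2 @ hidden + b2                        # scalar output node
    return 1.0 / (1.0 + np.exp(-logit)) + bias      # sigmoid shifted by the bias

# Toy setup (assumption): random features and random, untrained MLP weights.
rng = np.random.default_rng(0)
features = rng.standard_normal((22, 22, 256))       # conv5-sized feature map
w1, b1 = 0.1 * rng.standard_normal((9, 9)), np.zeros(9)
w2, b2 = 0.1 * rng.standard_normal(9), 0.0

weights = np.array([channel_weight(features[:, :, k], w1, b1, w2, b2)
                    for k in range(features.shape[2])])
print(weights.shape, weights.min(), weights.max())
```

Under this reading every weight lies strictly between 0.5 and 1.5, so a channel can be down-weighted but never switched off entirely.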

3.3. Discussions of Design Choices

We separately train the two branches.

We do not fine-tune S-Net.

We keep A-Net as it is in SiamFC.

4. Experiments

4.1. Implementation Details

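Network structure: A-Net has exactly the same architecture as SiamFC. S-Net uses an AlexNet pretrained on ImageNet, with a small change to the stride so that the output of S-Net has the same size as that of A-Net.

In the attention module, the pooled features are stacked into a 9-dimensional vector. The following MLP has one hidden layer with nine neurons and uses the ReLU nonlinearity. The final output uses a sigmoid function with a bias of 0.5; this is to ensure that no channel will be suppressed to zero.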

Data dimensions:

input: 127×127×3, 255×255×3.

output: 6×6×256, 22×22×256.

conv4: 24×24×384.

conv5: 22×22×256.

response maps: 17×17.
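These sizes can be checked with simple arithmetic. The sketch below assumes the layer configuration of the SiamFC variant of AlexNet (kernel sizes and strides as listed in `LAYERS`, no padding); that configuration is not stated in these notes and is included here only as an assumption, and the helper names are hypothetical.

```python
def out_size(size, kernel, stride=1):
    """Spatial output size of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1

# (name, kernel, stride): assumed SiamFC-style AlexNet without padding.
LAYERS = [("conv1", 11, 2), ("pool1", 3, 2), ("conv2", 5, 1), ("pool2", 3, 2),
          ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1)]

def feature_sizes(input_size):
    """Return the spatial size after each layer for a square input."""
    sizes = {}
    size = input_size
    for name, kernel, stride in LAYERS:
        size = out_size(size, kernel, stride)
        sizes[name] = size
    return sizes

template_sizes = feature_sizes(127)   # target image
search_sizes = feature_sizes(255)     # search region
# Sliding a 6x6 template over a 22x22 map gives 22 - 6 + 1 positions per axis.
response_size = search_sizes["conv5"] - template_sizes["conv5"] + 1
print(template_sizes["conv5"], search_sizes["conv4"], search_sizes["conv5"], response_size)
```

Under these assumptions the 127×127 input yields a 6×6 map, the 255×255 input yields 24×24 at conv4 and 22×22 at conv5, and the response map is 17×17, which agrees with the dimensions listed above.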

Training:

ILSVRC-2015; only color images are used. Implemented in TensorFlow. The average testing speed is 50 fps.

Hyperparameters:

Combination weight = 0.3; three scales.

4.2. Datasets and Evaluation Metrics

OTB:

VOT:

4.3. Ablation Analysis

The semantic branch and the appearance branch complement each other.

Using multilevel features and channel attention brings gains.

Separate vs. joint training.

4.4. Comparison with State-of-the-Arts

OTB benchmarks.

VOT2015 benchmark.

VOT2016 benchmark.

VOT2017 benchmark.

5. Conclusion

In the future, we plan to continue exploring the effective fusion of deep features in the object tracking task.