Modeling and Propagating CNNs in a Tree Structure for Visual Tracking

  1. Abstract
    We present an online visual tracking algorithm that manages multiple target appearance models in a tree structure. The proposed algorithm employs Convolutional Neural Networks (CNNs) to represent target appearances, where multiple CNNs collaborate to estimate target states and determine the desirable paths for online model updates in the tree. Maintaining multiple CNNs in diverse branches of the tree structure makes it convenient to deal with multi-modality in target appearances and to preserve model reliability through smooth updates along tree paths.

The final target state is estimated by sampling target candidates around the state in the previous frame and identifying the best sample in terms of a weighted average score from a set of active CNNs.

  2. Introduction
    Most existing tracking algorithms update the target model assuming that target appearances change smoothly over time. However, this strategy may not be appropriate for handling more challenging situations such as occlusion, illumination variation, abrupt motion, and deformation, which may break the temporal smoothness assumption. Some algorithms employ multiple models [38, 28, 40], multi-modal representations [11], or nonlinear classifiers [10, 12] to address these issues. However, the constructed models are still not strong enough, and online model updates are limited to sequential learning in temporal order, which may not make the models sufficiently discriminative and diverse.

Online learning with CNNs, however, is not straightforward because neural networks tend to forget previously learned information quickly when they learn new information [27]. This property often causes drift, especially when background information contaminates target appearance models, targets are completely occluded by other objects, or tracking fails temporarily. The problem may be alleviated by maintaining multiple versions of target appearance models constructed at different time steps and selectively updating a subset of the models to keep a history of target appearances. This idea has been investigated in [22], where a pool of CNNs is used to model target appearances, but that work does not consider the reliability of each CNN when estimating target states and updating models.

We propose an online visual tracking algorithm that estimates the target state using the likelihoods obtained from multiple CNNs. The CNNs are maintained in a tree structure and updated online along the paths in the tree. Since each path keeps a separate history of target appearance changes, the proposed algorithm is effective in handling multi-modal target appearances and other exceptions such as short-term occlusions and tracking failures. In addition, since the new model corresponding to the current frame is constructed by fine-tuning the CNN that produces the highest likelihood for target state estimation, more consistent and reliable models can be generated through online learning with only a few training examples.

The main contributions of our paper are summarized below:
• We propose a visual tracking algorithm to manage target appearance models based on CNNs in a tree structure, where
the models are updated online along the path in the tree. This strategy enables us to learn more persistent models
through smooth updates.
• Our tracking algorithm employs multiple models to capture diverse target appearances and performs more robust
tracking even with challenges such as appearance changes, occlusions, and temporary tracking failures.

  3. Related Works
    Tracking-by-detection approaches formulate visual tracking as a discriminative object classification problem in a sequence of video frames. The techniques in this category typically learn classifiers to differentiate targets from surrounding backgrounds.

Tracking algorithms based on hand-crafted features [15, 38] often outperform CNN-based approaches. This is partly because CNNs are difficult to train online using noisily labeled data and overfit easily to a small number of training examples, so it is not straightforward to apply CNNs to visual tracking problems involving online learning. For example, [22], which is based on a shallow custom neural network, is not as successful as recent tracking algorithms based on shallow feature learning. However, CNN-based tracking algorithms have started to show competitive accuracy on the online tracking benchmark [37] by transferring CNNs pretrained on ImageNet [8].

Multiple models are often employed in generative tracking algorithms to handle target appearance variations and recover from tracking failures. Trackers based on sparse representation [28, 40] maintain multiple target templates to compute the likelihood of each sample by minimizing its reconstruction error, while [21] integrates multiple observation models via an MCMC framework. Nam et al. [29] integrate patch-matching results from multiple frames and estimate the posterior of the target state. Ensemble classifiers have also been applied to visual tracking problems. Tang et al. [32] proposed a co-tracking framework based on two support vector machines. An ensemble of weak classifiers is employed to estimate target states in [1, 3]. Zhang et al. [38] presented a framework based on multiple snapshots of SVM-based trackers to recover from tracking failures.

  4. Algorithm Overview
    Our algorithm maintains multiple target appearance models based on CNNs in a tree structure to preserve model consistency and handle appearance multi-modality effectively. The proposed approach consists of two main components as in ordinary tracking algorithms—state estimation and model update—whose procedures are illustrated in Figure 1. Note that both components require interaction between multiple CNNs.

When a new frame is given, we draw candidate samples around the target state estimated in the previous frame, and compute the likelihood of each sample based on the weighted average of the scores from multiple CNNs. The weight of each CNN is determined by the reliability of the path along which the CNN has been updated in the tree structure. The target state in the current frame is estimated by finding the candidate with the maximum likelihood. After tracking a predefined number of frames, a new CNN is derived from the existing CNN that has the highest weight among those contributing to target state estimation. This strategy helps ensure smooth model updates and maintain reliable models in practice.
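
The two steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the representation of a candidate state, and the idea of passing each CNN as a callable that returns a positive score in [0, 1] are all hypothetical, and how the reliability weights are computed from the tree paths is left outside the sketch.

```python
def estimate_target_state(candidates, active_cnns, weights):
    """Return the candidate with the highest weighted-average positive score.

    candidates:  candidate target states drawn around the previous state
                 (their representation is a placeholder here)
    active_cnns: callables mapping a candidate to a positive score in [0, 1]
    weights:     one non-negative reliability weight per active CNN (assumed given)
    """
    total = sum(weights)
    best_state, best_likelihood = None, float("-inf")
    for state in candidates:
        # Weighted average of the scores from all active CNNs.
        likelihood = sum(w * cnn(state) for w, cnn in zip(weights, active_cnns)) / total
        if likelihood > best_likelihood:
            best_state, best_likelihood = state, likelihood
    return best_state, best_likelihood


def select_parent(weights):
    """Index of the active CNN with the highest weight.

    The new CNN would be derived (fine-tuned) from this one.
    """
    return max(range(len(weights)), key=lambda i: weights[i])
```

In a tracker, `estimate_target_state` would be called once per frame, while `select_parent` would be called only after the predefined number of frames has been tracked.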

[Figure 1: state estimation and model update procedures (image not available)]

Our approach has something in common with [22], which employs a candidate pool of multiple CNNs and selects the k nearest CNNs based on prototype matching distances for tracking. Our algorithm differs from that approach in that it focuses on how to keep the multi-modality of multiple CNNs and maximize their reliability by introducing a novel model maintenance technique using a tree structure. Visual tracking based on a tree-structured graphical model has been investigated in [13], but that work focuses on identifying the optimal density propagation path for offline tracking. The idea in [29] is also related, but it mainly discusses posterior propagation on directed acyclic graphs for visual tracking.

  5. Proposed Algorithm
5.1 CNN Architecture
Our network consists of three convolutional layers and three fully connected layers. The convolution filters are identical to the ones in the VGG-M network [4] pretrained on ImageNet [8]. The last fully connected layer has 2 units for binary classification, while the preceding two fully connected layers are composed of 512 units each. All weights in these three fully connected layers are initialized randomly. The input to our network is a 75 × 75 RGB image, whose size is equal to the receptive field of the single unit (per channel) in the last convolutional layer. Note that, although we borrow the convolution filters from the VGG-M network, our network is smaller than the original VGG-M network. The network output for an input image x is a normalized vector [φ(x), 1 − φ(x)]^T, whose elements represent the scores for target and background, respectively.
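
The statement that a 75 × 75 input matches the receptive field of a single unit in the last convolutional layer can be checked with a short calculation. The kernel sizes and strides below are assumptions, since the text does not list them: they are the values commonly used for the first three convolutional layers of VGG-M (7 × 7 stride 2, 5 × 5 stride 2, 3 × 3 stride 1) with 3 × 3 stride-2 max pooling after the first two, and no padding.

```python
# (name, kernel, stride) per layer. ASSUMED values for VGG-M conv1-conv3 with
# 3x3 stride-2 pooling after conv1 and conv2; not stated in the text itself.
LAYERS = [
    ("conv1", 7, 2),
    ("pool1", 3, 2),
    ("conv2", 5, 2),
    ("pool2", 3, 2),
    ("conv3", 3, 1),
]


def receptive_field(layers):
    """Receptive field (in input pixels) of one unit after the last layer."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for _name, kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def output_size(input_size, layers):
    """Spatial output size for a square input, assuming no padding."""
    size = input_size
    for _name, kernel, stride in layers:
        size = (size - kernel) // stride + 1
    return size


print(receptive_field(LAYERS))   # receptive field of a conv3 unit
print(output_size(75, LAYERS))   # spatial size of the conv3 output for a 75x75 input
```

Under these assumed layer settings the receptive field comes out to 75 pixels and a 75 × 75 input yields a 1 × 1 map after the third convolution, which agrees with the description above.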

5.2 Tree Construction
We maintain a tree structure to manage multiple CNN-based target appearance models hierarchically. In the tree structure T = {V, E}, a vertex v ∈ V corresponds to a CNN and a directed edge (u, v) ∈ E defines the relationship between two CNNs. The score of an edge (u, v) is the affinity between its two end vertices, which is given by

$$ s(u, v) = \frac{1}{|F_v|} \sum_{t \in F_v} \phi_u(x_t^*) $$

where F_v is the set of consecutive frames used to train the CNN associated with v, x^∗_t is the estimated target state at frame t, and φ_u(·) is the positive score predicted by the CNN in u.
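
As a rough illustration of how such an edge score might be evaluated, the sketch below assumes the affinity is the mean positive score that the CNN in u assigns to the estimated target states over the frames in F_v. That averaging form is an assumption made for this example, and the names `edge_score`, `phi_u`, `estimated_states`, and `frames_v` are hypothetical.

```python
def edge_score(phi_u, estimated_states, frames_v):
    """Affinity of edge (u, v) under the ASSUMED averaging form.

    phi_u:            callable returning the positive score of the CNN in u
    estimated_states: mapping from frame index t to the estimated state x*_t
    frames_v:         frame indices used to train the CNN associated with v
    """
    if not frames_v:
        raise ValueError("frames_v must contain at least one frame")
    # Mean positive score of u's CNN on the target states estimated in F_v.
    return sum(phi_u(estimated_states[t]) for t in frames_v) / len(frames_v)
```

A score close to 1 would indicate that the CNN in u still recognizes the target appearances on which the CNN in v was trained.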

5.3 Target State Estimation using Multiple CNNs
