- 一句话介绍本文方法:
- a robust visual tracking method by casting tracking as a sparse approximation problem in a particle filter framework.
- 怎么处理 occlusion 和 corruption?
- occlusion, corruption and other challenging issues are addressed seamlessly through a set of trivial templates.
- 怎么 detect the target?
- Specifically, to find the tracking target at a new frame, each target candidate is sparsely represented in the space spanned by target templates and trivial templates.
- The sparsity is achieved by solving an L1-regularized least squares problem.
- Then the candidate with the smallest projection error is taken as the tracking target.
- 怎么 tracking?
- After that, tracking is continued using a Bayesian state inference framework in which a particle filter is used for propagating sample distributions over time.
- 意思是本文方法是 detection before tracking 咯,那就是基于前景目标建模方法里面的判别式的咯?
- Two additional components further improve the robustness of our approach:
-
- the nonnegativity constraints that help filter out clutter that is similar to tracked targets in reversed intensity patterns
-
- a dynamic template update scheme that keeps track of the most representative templates throughout the tracking procedure.
- This is done by adjusting template weights by using the coefficients in the sparse representation
-
-
一句话概况本文方法
- develop a robust visual tracking frame work by casting the tracking problem as finding a sparse approximation in a template subspace
-
怎么处理 occlusion 和 corruption?
-
作者说是受 SRC 那篇 TPAMI 的启发的
-
handling occlusion using trivial templates, such that each trivial template has only one non- zero element (see Fig. 1)
-
Then, during tracking, a target candidate is represented as a linear combination of the template set composed of both target templates (obtained from previous frames) and trivial templates.
-
-
如何用 sparse coefficient vector 来辨别 target 还是 occlusion?
- The number of target templates are far fewer than the number of trivial templates.
- 如果是 target
- Intuitively, a good target candidate can be efficiently represented by the target templates.
- This leads to a sparse coefficient vector, since coefficients corresponding to trivial templates (named trivial coefficients) tend to be zeros.
- 如果是 occlusion
- In the case of occlusion (and/or other unpleasant issues such as noise corruption or background clutter), a limited number of trivial coefficients will be activated, but the whole coefficient vector remains sparse.
- A bad target candidate, on the contrary, often leads to a dense representation (e.g., Fig. 3).
- Candidates similar to trivial templates have sparse representations, but they are easily filtered out for their large dissimilarities to target templates.
- Tracking
- Early works uses a Kalman filter to provide solutions that are optimal for a linear Gaussian model.
- The particle filter, also known as the sequential Monte Carlo method, recursively constructs the posterior probability density function of the state space using Monte Carlo integration.
- 怎么看待 Tracking
- Tracking can be considered as finding the minimum distance from the tracked object to the subspace represented by the training data or previous tracking results
- 就是 generative model
- Tracking can also be considered as a classification problem and a classifier is trained to distinguish the object from the background
- 就是 discriminative model
- Tracking can be considered as finding the minimum distance from the tracked object to the subspace represented by the training data or previous tracking results
-
particle filter is a Bayesian sequential importance sampling technique for estimating the posterior distribution of state variables characterizing a dynamic system.
-
It provides a convenient framework for estimating and propagating the posterior probability density function of state variables regardless of the underlying distribution.
-
It consists of essentially two steps:
- prediction
- update
-
数学预备
- Let
$x_t$ denote the state variable describing the affine motion parameters of an object at time$t$ , The predicting distribution of$x_t$ given all available observations$z_{1:t-1} = { z_1, z_2, \cdots, z_{t-1}}$ up to time$t-1$ , denoted by$p(x_t|z_{1:t-1})$ is recursively computed as
- Let
-
具体到 particle filter
-
后验概率
$p(x_t|z_{1:t})$ approximated by a finite set of$N$ samples${ x^i_t }_{i = 1,\cdots, N}$ with importance weights$w_t^i$ -
The candidate samples
$x^i_t$ are drawn from an importance distribution$q(x_t|x_{1:t-1},z_{1:t})$ and the weights of the samples are updated as -
The samples are resampled to generate a set of equally weighted particles according to their importance weights to avoid degeneracy.
- 怎么产生 samples 的?
-
In the case of the bootstrap filter
$q(x_t|x_{1:t-1},z_{1:t}) = p(z_t|x_t)$ and the weights become the observation likelihood$p(z_t|x_t)$
-
-
具体到 tracking framework
- apply an affine image warping to model the object motion of two consecutive frames.
- The state variable
$x_t$ is modeled by the six parameters of the affine transformation parameters$x_t = (\alpha_1, \alpha_2, \alpha_3, \alpha_4, t_x, t_y)$ -
${\alpha_1, \alpha_2, \alpha_3, \alpha_4}$ are the deformation parameters -
$(t_x, t_y)$ are the 2D translation parameters
-
-
$z_t$ 是 region of interest -
state transition distribution
$p(x_t|x_{t-1})$ 是 a Gaussian distribution - assume the six parameters of the affine transformation are independent
-
observation model
$p(z_t|x_t)$ reflects the similarity between a target candidate and the target templates-
$p(z_t|x_t)$ 在上面也被叫做 observation likelihood - 在本文中
$p(z_t|x_t)$ is formulated from the error approximated by the target templates using$\ell_1$ minimization.
-
-
a tracking result
$\textbf{y} \in \mathbb{R}^d$ approximately lies in the linear span of$\textbf{T}$ where
$\textbf{a} = (a_1,a_2,\cdots,a_n)^T \in \mathbb{R}^n$ is called a target coefficient vector -
To incorporate the effect of occlusion and noise, 公式(4) is rewritten as
-
$\epsilon$ is the error vector – a fraction of its entries are nonzero. - The nonzero entries of
$\epsilon$ indicate the pixels in$\textbf{y}$ that are corrupted or occluded.
-
-
use trivial templates
$\textbf{I} = [\textbf{i}_1, \textbf{i}_2, ..., \textbf{i}_d] \in \mathbb{R}^{d \times d}$ to capture the occlusion as- a trivial template
$\textbf{i}_i \in \mathbb{R}^d$ is a vector with only one nonzero entry,也就是说公式(6) 中的$\textbf{I}$ 是一个单位矩阵(identity matrix) -
$\mathbf{e} = (e_1,e_2,\cdots, e_d)^T \in \mathbb{R}^d$ is called a trivial coefficient vector
- a trivial template
-
作者认为 a tracking target can almost always represented by the target templates dominated by nonnegative coefficients
- the templates that are most similar to the tracking target are positively related to the target
- 这些 templates 要保持更新
- nonnegative coefficients 还有一个好处是能够 filter out clutter that is similar to target templates at reversed intensity patterns, which often happens when shadows are involved
-
如果直接对公式(6)的系数施加非负约束的话,就强迫 trivial templates 也要正相关的,但实际上 trivial templates 也可能是负相关的,只有 target templates 才是正相关,作者对此的解决办法是添加 negative trivial templates,model (6) 可以重写为
-
求解稀疏问题
- 前 10 个 templates 是 target templates,后面 360 个 templates 是 trivial templates
- 如果是 good target candidate,那么就跟右上方这幅画一样,target templates 对应的系数非常突出
- 如果是 bad target candidate,那么就跟右下方一样,trivial templates 对应的系数非常突出
-
find the tracking result by finding the smallest residual after projecting on the target template subspace, i.e.,
$\Vert \textbf{y} - \textbf{Ta} \Vert_2$
- 为什么要 template update
- A fixed appearance template is not sufficient to handle recent changes in the video
- If we do not update the template, the template cannot capture the appearance variations due to illumination or pose changes.
- update too often 也不行
- If we update the template too often, small errors are introduced each time the template is updated.
- The errors are accumulated and the tracker drifts from the target
-
还给每个 target template 赋了一个权重,说实话,我没看懂
-
$\ell_1$ favors the template with larger norm because of the regularization part$\Vert \textbf{c} \Vert_1$ - The larger norm of
$\Vert \textbf{t}_i \Vert$ is, the smaller coefficient$a_i$ is needed in the approximation$\Vert \textbf{y-Bc}\Vert_2$ - 作者 exploit the characteristic by introducing a weight
$w_i = \Vert \textbf{t}_i \Vert_2$ associated with each template$\textbf{t}_i$ - Intuitively, the larger the weight is, the more important the template is.
- 感觉这个没有道理啊,照理说,所有样本都应该归一化才对啊
-
- 最初的 target templates 是怎么产生的?
- At initialization, the first target template is manually selected from the first frame and applied zero-mean-unit-norm normalization.
- 不是做了 unit norm 了么,为啥 $\Vert \textbf{t}_i \Vert$ 还会不一样?
- The rest target templates are created by perturbating one pixel in four possible directions at the corner points of the first template in the first frame.
- At initialization, the first target template is manually selected from the first frame and applied zero-mean-unit-norm normalization.
- template replacement
- If the tracking result
$\textbf{y}$ is not similar to the current template set$\textbf{T}$ , it will replace the least important template in$\textbf{T}$ and be initialized to have the median weight of the current templates. - 怎么判断是不是 similar 啊?
- If the tracking result
- template updating
- weight updating
- The weight of each template increases when the appearance of the tracking result and template is close enough and decreases otherwise.
- tempalte 跟当前的 target 相近,就增加权重
- 说白了就是一个 template 做字典,然后求解
$\ell_1$ minimization 感觉不难啊 - 对我来说新的信息量就是 trivial templates 的使用,以及 weight update,不过 weight update 感觉没看懂
- 还有的问题就是,你的 particle filter 怎么最后没有用啊?
@article{Mei2009RobustVT,
title={Robust visual tracking using ℓ1 minimization},
author={Xue Mei and Haibin Ling},
journal={2009 IEEE 12th International Conference on Computer Vision},
year={2009},
pages={1436-1443}
}