Skip to content

Instantly share code, notes, and snippets.

@YimianDai
Last active July 30, 2019 05:31
Show Gist options
  • Save YimianDai/adaf12a2a3fbf680c0a0cd1a24b854c0 to your computer and use it in GitHub Desktop.
Save YimianDai/adaf12a2a3fbf680c0a0cd1a24b854c0 to your computer and use it in GitHub Desktop.
2009-ICCV-Robust visual tracking using L1 minimization

2009 - ICCV - Robust visual tracking using L1 minimization

摘要

  • 一句话介绍本文方法:
    • a robust visual tracking method by casting tracking as a sparse approximation problem in a particle filter framework.
  • 怎么处理 occlusion 和 corruption?
    • occlusion, corruption and other challenging issues are addressed seamlessly through a set of trivial templates.
  • 怎么 detect the target?
    • Specifically, to find the tracking target at a new frame, each target candidate is sparsely represented in the space spanned by target templates and trivial templates.
    • The sparsity is achieved by solving an L1-regularized least squares problem.
    • Then the candidate with the smallest projection error is taken as the tracking target.
  • 怎么 tracking?
    • After that, tracking is continued using a Bayesian state inference framework in which a particle filter is used for propagating sample distributions over time.
    • 意思是本文方法是 detection before tracking 咯,那就是基于前景目标建模方法里面的判别式的咯?
  • Two additional components further improve the robustness of our approach:
      1. the nonnegativity constraints that help filter out clutter that is similar to tracked targets in reversed intensity patterns
      1. a dynamic template update scheme that keeps track of the most representative templates throughout the tracking procedure.
      • This is done by adjusting template weights by using the coefficients in the sparse representation

Introduction

  • 一句话概况本文方法

    • develop a robust visual tracking frame work by casting the tracking problem as finding a sparse approximation in a template subspace
  • 怎么处理 occlusion 和 corruption?

    • 作者说是受 SRC 那篇 TPAMI 的启发的

    • handling occlusion using trivial templates, such that each trivial template has only one non- zero element (see Fig. 1)

      mark

    • Then, during tracking, a target candidate is represented as a linear combination of the template set composed of both target templates (obtained from previous frames) and trivial templates.

  • 如何用 sparse coefficient vector 来辨别 target 还是 occlusion?

    • The number of target templates are far fewer than the number of trivial templates.
    • 如果是 target
      • Intuitively, a good target candidate can be efficiently represented by the target templates.
      • This leads to a sparse coefficient vector, since coefficients corresponding to trivial templates (named trivial coefficients) tend to be zeros.
    • 如果是 occlusion
      • In the case of occlusion (and/or other unpleasant issues such as noise corruption or background clutter), a limited number of trivial coefficients will be activated, but the whole coefficient vector remains sparse.
    • A bad target candidate, on the contrary, often leads to a dense representation (e.g., Fig. 3).
      • Candidates similar to trivial templates have sparse representations, but they are easily filtered out for their large dissimilarities to target templates.

Related Work

  • Tracking
    • Early works uses a Kalman filter to provide solutions that are optimal for a linear Gaussian model.
    • The particle filter, also known as the sequential Monte Carlo method, recursively constructs the posterior probability density function of the state space using Monte Carlo integration.
  • 怎么看待 Tracking
    • Tracking can be considered as finding the minimum distance from the tracked object to the subspace represented by the training data or previous tracking results
      • 就是 generative model
    • Tracking can also be considered as a classification problem and a classifier is trained to distinguish the object from the background
      • 就是 discriminative model

Particle Filter

  • particle filter is a Bayesian sequential importance sampling technique for estimating the posterior distribution of state variables characterizing a dynamic system.

  • It provides a convenient framework for estimating and propagating the posterior probability density function of state variables regardless of the underlying distribution.

  • It consists of essentially two steps:

    • prediction
    • update
  • 数学预备

    • Let $x_t$ denote the state variable describing the affine motion parameters of an object at time $t$, The predicting distribution of $x_t$ given all available observations $z_{1:t-1} = { z_1, z_2, \cdots, z_{t-1}}$ up to time $t-1$, denoted by $p(x_t|z_{1:t-1})$ is recursively computed as

    mark

    • 公式(1)其实就是以前学过的全概率公式,只不过以前学的是离散的,这个是连续的

      mark

    • At time $t$, the observation $z_t$ is available and the state vector is updated using the Bayes rule

      mark

      • $p(x_t|z_{1:t})$posterior,后验概率
      • $p(z_t|x_t)$ denotes the observation likelihood
      • $p(x_t|z_{1:t-1})$state vector, 由公式(1)得到
      • $p(z_t|z_{1:t-1})$ 啥?
  • 具体到 particle filter

    • 后验概率 $p(x_t|z_{1:t})$ approximated by a finite set of $N$ samples ${ x^i_t }_{i = 1,\cdots, N}$ with importance weights $w_t^i$

    • The candidate samples $x^i_t$ are drawn from an importance distribution $q(x_t|x_{1:t-1},z_{1:t})$ and the weights of the samples are updated as

      mark

    • The samples are resampled to generate a set of equally weighted particles according to their importance weights to avoid degeneracy.

      • 怎么产生 samples 的?
    • In the case of the bootstrap filter $q(x_t|x_{1:t-1},z_{1:t}) = p(z_t|x_t)$ and the weights become the observation likelihood $p(z_t|x_t)$

  • 具体到 tracking framework

    • apply an affine image warping to model the object motion of two consecutive frames.
    • The state variable $x_t$ is modeled by the six parameters of the affine transformation parameters $x_t = (\alpha_1, \alpha_2, \alpha_3, \alpha_4, t_x, t_y)$
      • ${\alpha_1, \alpha_2, \alpha_3, \alpha_4}$ are the deformation parameters
      • $(t_x, t_y)$ are the 2D translation parameters
    • $z_t$ 是 region of interest
    • state transition distribution $p(x_t|x_{t-1})$ 是 a Gaussian distribution
    • assume the six parameters of the affine transformation are independent
    • observation model $p(z_t|x_t)$ reflects the similarity between a target candidate and the target templates
      • $p(z_t|x_t)$ 在上面也被叫做 observation likelihood
      • 在本文中 $p(z_t|x_t)$ is formulated from the error approximated by the target templates using $\ell_1$ minimization.

$\ell_1$ Minimization Tracking

Sparse Representation of a Tracking Target

  • a tracking result $\textbf{y} \in \mathbb{R}^d$ approximately lies in the linear span of $\textbf{T}$

    mark

    where $\textbf{a} = (a_1,a_2,\cdots,a_n)^T \in \mathbb{R}^n$ is called a target coefficient vector

  • To incorporate the effect of occlusion and noise, 公式(4) is rewritten as

    mark

    • $\epsilon$ is the error vector – a fraction of its entries are nonzero.
    • The nonzero entries of $\epsilon$ indicate the pixels in $\textbf{y}$ that are corrupted or occluded.
  • use trivial templates $\textbf{I} = [\textbf{i}_1, \textbf{i}_2, ..., \textbf{i}_d] \in \mathbb{R}^{d \times d}$ to capture the occlusion as

    mark

    • a trivial template $\textbf{i}_i \in \mathbb{R}^d$ is a vector with only one nonzero entry,也就是说公式(6) 中的 $\textbf{I}$ 是一个单位矩阵(identity matrix)
    • $\mathbf{e} = (e_1,e_2,\cdots, e_d)^T \in \mathbb{R}^d$ is called a trivial coefficient vector

Nonnegativity Constraints

  • 作者认为 a tracking target can almost always represented by the target templates dominated by nonnegative coefficients

    • the templates that are most similar to the tracking target are positively related to the target
    • 这些 templates 要保持更新
    • nonnegative coefficients 还有一个好处是能够 filter out clutter that is similar to target templates at reversed intensity patterns, which often happens when shadows are involved
  • 如果直接对公式(6)的系数施加非负约束的话,就强迫 trivial templates 也要正相关的,但实际上 trivial templates 也可能是负相关的,只有 target templates 才是正相关,作者对此的解决办法是添加 negative trivial templates,model (6) 可以重写为

    mark

Achieving Sparseness through $\ell_1$ Minimization

  • 求解稀疏问题

    mark

    • 前 10 个 templates 是 target templates,后面 360 个 templates 是 trivial templates
    • 如果是 good target candidate,那么就跟右上方这幅画一样,target templates 对应的系数非常突出
    • 如果是 bad target candidate,那么就跟右下方一样,trivial templates 对应的系数非常突出

    mark

  • find the tracking result by finding the smallest residual after projecting on the target template subspace, i.e., $\Vert \textbf{y} - \textbf{Ta} \Vert_2$

Template Update

  • 为什么要 template update
    • A fixed appearance template is not sufficient to handle recent changes in the video
    • If we do not update the template, the template cannot capture the appearance variations due to illumination or pose changes.
  • update too often 也不行
    • If we update the template too often, small errors are introduced each time the template is updated.
    • The errors are accumulated and the tracker drifts from the target
  • 还给每个 target template 赋了一个权重,说实话,我没看懂
    • $\ell_1$ favors the template with larger norm because of the regularization part $\Vert \textbf{c} \Vert_1$
    • The larger norm of $\Vert \textbf{t}_i \Vert$ is, the smaller coefficient $a_i$ is needed in the approximation $\Vert \textbf{y-Bc}\Vert_2$
    • 作者 exploit the characteristic by introducing a weight $w_i = \Vert \textbf{t}_i \Vert_2$ associated with each template $\textbf{t}_i$
    • Intuitively, the larger the weight is, the more important the template is.
      • 感觉这个没有道理啊,照理说,所有样本都应该归一化才对啊
  • 最初的 target templates 是怎么产生的?
    • At initialization, the first target template is manually selected from the first frame and applied zero-mean-unit-norm normalization.
      • 不是做了 unit norm 了么,为啥 $\Vert \textbf{t}_i \Vert$ 还会不一样?
    • The rest target templates are created by perturbating one pixel in four possible directions at the corner points of the first template in the first frame.
updating scheme
  • template replacement
    • If the tracking result $\textbf{y}$ is not similar to the current template set $\textbf{T}$, it will replace the least important template in $\textbf{T}$ and be initialized to have the median weight of the current templates.
    • 怎么判断是不是 similar 啊?
  • template updating
  • weight updating
    • The weight of each template increases when the appearance of the tracking result and template is close enough and decreases otherwise.
    • tempalte 跟当前的 target 相近,就增加权重

mark

我自己的总结

  • 说白了就是一个 template 做字典,然后求解 $\ell_1$ minimization 感觉不难啊
  • 对我来说新的信息量就是 trivial templates 的使用,以及 weight update,不过 weight update 感觉没看懂
  • 还有的问题就是,你的 particle filter 怎么最后没有用啊?
@article{Mei2009RobustVT,
  title={Robust visual tracking using ℓ1 minimization},
  author={Xue Mei and Haibin Ling},
  journal={2009 IEEE 12th International Conference on Computer Vision},
  year={2009},
  pages={1436-1443}
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment