- sparse coding methods in visual tracking can be categorized into
- appearance modeling based on sparse coding (AMSC)
- target searching based on sparse representation (TSSR)
- their combination
- Conclusions
- (1) AMSC methods significantly outperform TSSR methods
- (2) For AMSC methods, both discriminative dictionary and spatial order reserved pooling operators are important for achieving high tracking accuracy
- (3) For TSSR methods, the widely used identity pixel basis will degrade the performance when the target or candidate images are not aligned well or severe occlusion occurs
- This is also why PCA and sparse representation perform poorly in small target detection
- This is also where the IPI model could be improved (the idea in Guo's new paper, Low-Rank Prior in Single Patches for Nonpointwise Impulse Noise Removal)
- (4) For TSSR methods, $l_1$ norm minimization is not necessary. In contrast, $l_2$ norm minimization can obtain comparable performance but with lower computational cost.
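Conclusion (4) can be made concrete with a minimal sketch. It assumes the $l_2$ variant is ordinary ridge regression, which has a closed-form solution and therefore needs no iterative solver. The function name `l2_coefficients`, the weight `lam` and the toy data are illustrative choices, not taken from the paper.

```python
import numpy as np

def l2_coefficients(D, y, lam=0.1):
    """Closed-form ridge solution of min_u ||y - D u||_2^2 + lam * ||u||_2^2."""
    n_atoms = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(n_atoms), D.T @ y)

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 10))               # 10 hypothetical atoms, 64-dim features
D /= np.linalg.norm(D, axis=0)                  # unit-norm columns
y = D[:, 3] + 0.01 * rng.standard_normal(64)    # signal close to atom 3

u = l2_coefficients(D, y)
best = int(np.argmax(np.abs(u)))                # atom with the largest coefficient
```

The cost is a single linear solve per signal, whereas an $l_1$ solver typically iterates many times over the same matrices.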
- Usually, it is formulated as a search problem that aims at finding the candidate that best matches the target template as the tracking result.
- A target template is maintained over time and may be updated online once the tracking result is available.
- Before tracking starts at the current time, a set of candidates is sampled around the state of the target at the previous time.
- Both the target template and the candidates are represented using an appearance model.
- Then, a target searching strategy is used to find the candidate that best matches the template appearance as the tracking result.
- A typical tracking process contains several stages, as shown in Fig. 1.
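The stages listed above can be sketched as one tracking step. This is only an illustration under simplifying assumptions: candidates come from a dense integer grid rather than random sampling, the appearance model is the raw flattened patch, the search is a nearest-neighbour match in Euclidean distance, and the template is not updated. All function names are hypothetical.

```python
import numpy as np

def sample_candidates(state, radius=4):
    """Candidate (row, col) states on an integer grid around the previous state."""
    r0, c0 = state
    return [(r0 + dr, c0 + dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)]

def appearance(frame, state, size=8):
    """Placeholder appearance model: the raw patch at `state`, flattened."""
    r, c = state
    return frame[r:r + size, c:c + size].ravel()

def track_step(frame, prev_state, template, radius=4, size=8):
    """Sample candidates, describe each one, and return the best match."""
    candidates = sample_candidates(prev_state, radius)
    dists = [np.linalg.norm(appearance(frame, s, size) - template)
             for s in candidates]
    return candidates[int(np.argmin(dists))]

rng = np.random.default_rng(0)
frame = rng.random((64, 64))                  # synthetic frame
true_state = (30, 40)
template = appearance(frame, true_state)      # template taken at the true location
estimate = track_step(frame, (28, 41), template)
```

In the methods reviewed here, `appearance` is where AMSC plugs in and the matching inside `track_step` is where TSSR plugs in.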
- From the point of view of appearance modeling, discriminating the target from the background is a very basic ability and plays a key role in complex scenes where the contrast between the target and background is low.
- To achieve reliable tracking performance, it is also very important to handle target appearance variations during tracking, which contain both the intrinsic variations such as pose changes and shape deformation and extrinsic variations such as illumination and occlusion.
- To handle these variations, a good appearance model is desired to meet two requirements: adaptivity that adapts to the intrinsic appearance variations and robustness that is invariant to the extrinsic appearance variations.
- The model must be able to handle both extrinsic and intrinsic variations
- Computational complexity is a very important issue, since real-time tracking speed is a practical requirement of most subsequent high-level applications such as action recognition and retrieval.
- In addition, it is also possible to handle appearance variations in target searching stage, which is ignored by most existing methods.
- (1) What is the connection and difference between tracking methods based on sparse representation or sparse coding?
- classify these methods according to which stage (appearance modeling or target searching) sparse coding is used in.
- emphasize the difference between sparse representation and sparse coding
- Sparse representation, in fact, a sub-process of sparse coding, can be used to perform target searching (TSSR), which is the motivation of the pioneering work [19].
- On the other hand, sparse coding learns local representations of image patches, which can be used to model target appearance (AMSC).
- (2) Why sparse coding would be useful for visual tracking?
- no work trying to analyze the rationales behind these sparse coding based tracking methods.
- try to answer this question by analyzing
- the roles of sparse representation from the point of view of signal processing
- the roles of sparse coding from the point of view of biologically inspired representation mechanism of simple cells in visual cortex.
- (3) Does sparse coding really benefit visual tracking?
- $l_1$ norm minimization is not necessary.
- sparse coding, in fact, contains two processes: sparse representation and dictionary learning
- sparse representation focuses on representing an input signal using a given dictionary with a sparsity constraint on the representation coefficients.
- However, sparse coding focuses on learning a dictionary from natural images to represent the underlying structure primitives in images.
- After representing an image using the learned dictionary, the representation coefficients can be used as features to describe the appearance of the image, which is inspired by the properties of the receptive fields of simple cells in visual cortex.
- Sparse representation can be used to perform target searching.
- Sparse coding including both sparse representation and dictionary learning can be used to model the target appearance for visual tracking.
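In symbols, the distinction can be written as follows. This is the standard textbook formulation, assumed here because the paper's Eq. (1) and Eq. (4) are not reproduced in these notes; $y$ is the input signal, $D$ the dictionary with columns $d_k$, $u$ the coefficient vector and $\lambda$ a sparsity weight.

$$
\text{sparse representation } (D \text{ given}):\quad \hat{u} = \arg\min_{u} \tfrac{1}{2}\|y - Du\|_2^2 + \lambda \|u\|_1
$$

$$
\text{sparse coding } (D \text{ learned}):\quad \min_{D,\,\{u_i\}} \sum_{i=1}^{N} \Big( \tfrac{1}{2}\|y_i - Du_i\|_2^2 + \lambda \|u_i\|_1 \Big) \quad \text{s.t. } \|d_k\|_2 \le 1 \ \ \forall k
$$

For a fixed $D$ the second problem reduces to $N$ independent instances of the first, which is the sense in which sparse representation is a sub-process of sparse coding.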
- The sparse coding model was initially proposed to model natural image statistics.
- the representation coefficients can be used as features to describe the appearance of image patches.
- We call this ability of sparse coding local appearance representation (LAR).
- It is called local because it operates on image patches
- In this section, we explain why LAR is useful to model target appearance from the point of view of biologically inspired object representation.
- HMAX [20]: biologically inspired representation model
- starts with a grayscale image layer and alternates between "S" and "C" layers.
- The "S" layer uses Gaussian derivative filters to compute higher-order features by combining different types of units in the previous layers, which simulates the receptive fields of simple cells in visual cortex.
- Combining different types of units is probably the motivation behind SR-MPCM: its weighted scheme is exactly a way of combining different types
- The "C" layer achieves invariance by pooling units of the same type in the previous layer over local ranges.
- Improvements on HMAX:
- [30]: CVPR - 2005 - Object recognition with features inspired by visual cortex
- [31]: IJCV - 2008 - Object Class Recognition and Localization Using Sparse Features with Limited Receptive Fields
- difference between the original HMAX model and its modified versions
- The difference between the original HMAX model and its modified versions mainly focuses on the filters used in the ‘‘S’’ layer.
- Although the responses of the Gaussian derivative filters used in the original HMAX model have properties similar to those of the simple cells in striate cortex, the model is still non-biological because it neglects the response saturation of V1 cells.
- The HMAX model does not simulate the human brain very well
- The modified HMAX models use Gabor filters [32] to replace the Gaussian derivative filters because Gabor filters have been extensively used to model the receptive fields of simple cells.
- On the other hand, Gabor filters have more parameters and allow more accurate tuning than the Gaussian derivative filters.
- However, Gabor filters also have drawbacks
- First, it is difficult and time-consuming to set a large number of parameters to effectively simulate the response properties of receptive fields of simple cells in visual cortex.
- Second, Gabor filters are hand-designed with fixed formulation for any image datasets.
- However, the structure primitives underlying different image datasets are different. Therefore, it is not reasonable to use the hand-designed Gabor filters to compute the responses of images from different datasets.
- Compared with Gabor filters, the advantages of sparse coding are:
- sparse coding basis functions are directly learned from natural images
- are capable of adapting to different image datasets
- only a few parameters need to be tuned
- the learned basis functions capture the structure primitives underlying the image datasets
- Most importantly, the responses of sparse coding are localized, oriented and bandpass, which makes them more similar to the receptive fields of simple cells in visual cortex than Gabor filters are [33].
- Sparse coding is closer to the behavior of the visual cortex
- Therefore, sparse coding is more suitable for biologically inspired appearance modeling for visual tasks.
- the authors believe that the responses of sparse coding are more suitable to model target appearance for visual tracking.
- Therefore, sparse representation can be used to find, among all subspaces, the minimal subspace the input signal belongs to. We call this ability minimal subspace searching (MSS).
- In the context of face recognition, when the mth subspace is spanned by the training face images from the mth class and the input signal is a test face image, MSS can be used to recognize which class the test face image belongs to (Wright et al. [18]).
- This is the working principle of SRC; it gave me the concept of the subspace spanned by the samples of one class
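A brute-force sketch may help to spell out what "finding the subspace the signal belongs to" means. Note the substitution: instead of sparse representation, the code fits the signal to each class subspace separately by least squares and keeps the one with the smallest residual. The function name `minimal_subspace` and the random subspaces are hypothetical.

```python
import numpy as np

def minimal_subspace(y, subspaces):
    """Return the index of the subspace (column span) that fits y best,
    together with the residual norm for every subspace."""
    residuals = []
    for A in subspaces:
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residuals.append(float(np.linalg.norm(y - A @ coef)))
    return int(np.argmin(residuals)), residuals

rng = np.random.default_rng(2)
# Three hypothetical classes, each spanning a 5-dim subspace of a 50-dim space.
subspaces = [rng.standard_normal((50, 5)) for _ in range(3)]
y = subspaces[1] @ rng.standard_normal(5) + 0.01 * rng.standard_normal(50)

label, residuals = minimal_subspace(y, subspaces)
```

SRC [18] reaches a comparable decision differently: it solves one sparse problem over the concatenation of all class dictionaries and then compares class-wise residuals.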
- The pioneering work in this area:
- ICCV - 2009 - Robust visual tracking using L1 minimization
- Motivation:
- Tracking can be solved by recognizing each candidate as the target or background.
- casting visual tracking as such a supervised classification problem
- search the target template from all target candidates
- For example, the target template can be linearly represented by all target candidates.
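- Note: the target candidates serve as the dictionary that represents the given target template. That works too and is a very good idea; I had originally thought of using several target templates as the dictionary to represent each target candidate
- Using the target candidates as the dictionary to represent the given target template saves computation: sparse coding only has to be computed once, whereas my original idea needs one computation per target candidate. This is the contribution of [43]
- A reasonable assumption is that at least one candidate is similar enough to the target.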
- Therefore, the representation coefficients are sparse and can be computed using Eq. (4).
- The coefficients can be directly used to measure the similarities between the target template and candidates.
- The candidate with the largest coefficient can be chosen as the tracking result.
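The search described above can be sketched as follows. Two assumptions are made: Eq. (4), which is not reproduced in these notes, is taken to be the usual $l_1$-regularized least-squares problem, and it is solved with plain iterative soft-thresholding (ISTA), which is a generic solver and not necessarily the one used in the reviewed trackers. Names and toy data are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam=0.05, n_iter=500):
    """Minimize 0.5 * ||y - D u||_2^2 + lam * ||u||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2    # 1 / Lipschitz constant of the gradient
    u = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ u - y)
        u = soft_threshold(u - step * grad, step * lam)
    return u

rng = np.random.default_rng(1)
candidates = rng.standard_normal((100, 20))             # 20 candidates as dictionary columns
candidates /= np.linalg.norm(candidates, axis=0)
template = candidates[:, 7] + 0.01 * rng.standard_normal(100)

u = ista(candidates, template)    # one sparse problem for all candidates
tracked = int(np.argmax(u))       # candidate with the largest coefficient
```

Only one sparse problem is solved per frame, regardless of how many candidates are sampled.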
- how to handle target appearance variations during tracking
- the MSS ability of sparse representation can be used to solve this problem
- appearance variations can be treated as the representation error n in Eq. (1), which can also be represented by a dictionary O and the corresponding coefficients z
- When a suitable dictionary O is chosen, the representation coefficient vector $\tilde{u}$ is also sparse and can be computed using Eq. (4). Target searching can then be carried out as in the case without appearance variations discussed previously.
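Written out, the argument above amounts to stacking the two dictionaries. The form of Eq. (1) is assumed here to be $y = Du + n$ (it is not reproduced in these notes), and $\tilde{D}$ is a name introduced only for this sketch.

$$
y = Du + n, \qquad n = Oz \;\;\Longrightarrow\;\; y = \begin{bmatrix} D & O \end{bmatrix} \begin{bmatrix} u \\ z \end{bmatrix} = \tilde{D}\,\tilde{u}
$$

With the identity pixel basis $O = I$, each entry of $z$ absorbs the error at one pixel, so $z$ stays sparse only as long as the corrupted region is small.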
- ICA is a special case of sparse coding and also has properties similar to those of the receptive fields of simple cells in visual cortex
- abstract a basic framework of the appearance model based on sparse coding as shown in Fig. 3,
- contains three main layers
- image layer
- coding layer
- pooling layer.
- The image layer provides input to the coding layer, which can be a gray image or color image or a set of feature descriptors such as SIFT [36] extracted from the color or gray image.
- The coding layer uses sparse coding (or one of its special cases, such as ICA or K-means) to compute the coefficients that represent each image patch with the learned dictionary.
- The pooling layer computes the statistics of all coefficients to obtain a final feature representation of the input image.
- the goal of dictionary learning
- to learn a dictionary with which the coding coefficients of the target or background patches can be used as features to distinguish the target from the background well.
- The dictionary must be discriminative
- the role of pooling operator
- to compute the final feature vector z based on some statistics of the local codes $u_{i,k}$ obtained in the coding layer to model the appearance of the target.
- concatenating pooling
- obtains the final feature vector by concatenating all local codes: $z^T = [u^T_1, \ldots, u^T_N]$
- The advantage of the concatenating pooling
- the spatial order of local codes is preserved, which significantly increases the discriminative ability of the resulting feature vector
- drawbacks
- very sensitive to image noise because each local code is an element of the concatenated feature vector. Once noise appears in the image, the resulting feature vector will be directly affected by the noise.
- the dimensionality of the final feature vector is extremely high.
- max pooling operator
- computes the final feature vector as $z_k = \max \{ |u_{1,k}|, \ldots, |u_{N,k}| \}$
- motivated by the biophysical evidence in visual cortex [30]
- advantage
- invariant properties, such as position and scale invariances
- drawbacks
- not suitable for visual tracking due to the loss of discriminative ability: each dimension of the resulting feature vector is the maximum coefficient of its corresponding basis function over all patches, which ignores all non-maximum coefficients.
- average pooling
- uses all local codes to compute the final feature vector
- advantage
- increases the discriminative ability compared with the max pooling
- drawbacks
- loses the spatial order of local codes, which is similar to the disadvantage of the widely used color histogram representation.
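The three pooling operators can be compared on a small matrix of local codes. In this sketch `U` has one row per patch ($N$ rows) and one column per basis function ($K$ columns). The notes give no formula for average pooling, so taking the mean of absolute values is an assumption.

```python
import numpy as np

def concat_pooling(U):
    """z^T = [u_1^T, ..., u_N^T]: keeps spatial order, dimension N * K."""
    return U.reshape(-1)

def max_pooling(U):
    """z_k = max_i |u_{i,k}|: one value per basis function."""
    return np.max(np.abs(U), axis=0)

def average_pooling(U):
    """Mean of |u_{i,k}| over patches (assumed form): one value per basis function."""
    return np.mean(np.abs(U), axis=0)

# Local codes of N = 4 patches over K = 3 basis functions (toy values).
U = np.array([[0.0, -2.0,  0.0],
              [1.0,  0.0,  0.0],
              [0.0,  0.5, -3.0],
              [0.0,  0.0,  0.0]])

z_concat = concat_pooling(U)
z_max = max_pooling(U)
z_avg = average_pooling(U)

# Reversing the patch order changes only the concatenated vector.
U_shuffled = U[::-1]
```

Max and average pooling give the same vector for `U` and `U_shuffled`, which is exactly the loss of spatial order noted above.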
- The authors argue that the SRC of John Wright and Yi Ma exploits exactly the minimal subspace searching (MSS) ability
- The dictionary X consists of three parts: target dictionary T, background dictionary B and variation dictionary V
- Given any candidate y, dictionary X associated with coefficients u will produce a sparse representation of the candidate.
- When the candidate is from the background, the background dictionary will be used to represent it, which will result in the non-zero coefficients corresponding to the background dictionary.
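A small sketch of how the three-part dictionary separates target from background candidates. Several things here are assumptions: the atoms are random unit vectors, orthogonal matching pursuit (OMP) replaces $l_1$ minimization as a compact greedy sparse solver, and the decision rule (compare the coefficient mass on T with that on B) is only one plausible reading of the description above.

```python
import numpy as np

def omp(X, y, n_nonzero=3):
    """Greedy sparse solver (orthogonal matching pursuit); X has unit-norm columns."""
    residual, support, coef = y.copy(), [], np.zeros(0)
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(X.T @ residual)))   # atom most correlated with residual
        if k in support:
            break
        support.append(k)
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    u = np.zeros(X.shape[1])
    u[support] = coef
    return u

def classify(u, n_target, n_background):
    """Label a candidate by where its coefficient mass lies (assumed rule)."""
    target_mass = np.abs(u[:n_target]).sum()
    background_mass = np.abs(u[n_target:n_target + n_background]).sum()
    return "target" if target_mass > background_mass else "background"

rng = np.random.default_rng(4)

def unit_columns(rows, cols):
    M = rng.standard_normal((rows, cols))
    return M / np.linalg.norm(M, axis=0)

T, B, V = unit_columns(60, 5), unit_columns(60, 5), unit_columns(60, 5)
X = np.hstack([T, B, V])                               # X = [T, B, V]
from_background = B[:, 1] + 0.01 * rng.standard_normal(60)
from_target = T[:, 2] + 0.01 * rng.standard_normal(60)
labels = [classify(omp(X, y), 5, 5) for y in (from_background, from_target)]
```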
- use gradients along the horizontal and vertical directions to replace the downsampled intensity features [44]
- use multiple feature descriptors in the dictionary. Each dictionary atom is obtained by concatenating multiple feature descriptors extracted from the target image.
- When computing the representation coefficients using Eq. (4), the sparseness constraint plays an important role: it makes the different descriptors compete with each other to represent the target candidate, which achieves the purpose of fusing multiple kinds of feature descriptors.
- reduce the computational cost of each $l_1$-norm minimization
- choose low dimensional but more discriminative features
- use covariance matrix to represent the target or candidate
- covariance matrix enables efficient fusion of different types of features, where the spatial and statistical properties as well as their correlation are characterized, and its dimension is small
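A minimal sketch of a covariance descriptor for a grayscale patch. The per-pixel feature set (position, intensity, absolute gradients) is a common choice and an assumption here; the notes do not say which features the reviewed tracker uses.

```python
import numpy as np

def region_covariance(patch):
    """Covariance of per-pixel features [x, y, I, |dI/dx|, |dI/dy|] over a patch."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]                     # pixel coordinates
    gy, gx = np.gradient(patch.astype(float))       # derivatives along rows, columns
    feats = np.stack([xs.ravel(), ys.ravel(), patch.ravel(),
                      np.abs(gx).ravel(), np.abs(gy).ravel()])
    return np.cov(feats)                            # rows are variables: 5 x 5 matrix

rng = np.random.default_rng(5)
C_large = region_covariance(rng.random((32, 24)))
C_small = region_covariance(rng.random((10, 10)))
```

The descriptor is 5 x 5 whatever the patch size, which is what keeps the dimension of the subsequent minimization small.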
- reduce the number of $l_1$-norm minimizations
- each target candidate is represented by the dictionary. Therefore, the number of $l_1$ minimizations equals the number of candidates. To reduce the number of $l_1$ norm minimizations
- CVPR - 2011 - Real-time visual tracking using compressive sensing
- exploits the restricted isometry property (RIP) [16] in compressive sensing to reduce the dimension of the dictionary by multiplying a hashing matrix that guarantees the RIP in the two sides of Eq. (1).
- Kaihua Zhang's ECCV - 2012 - Real-time compressive tracking
- exploits RIP to reduce the dimension of the target representation and achieves real-time tracking.
- Why this paper does not classify the CT tracker:
- all the trackers reviewed in this paper are based on a generative model, e.g., the image patch is modeled as a linear superposition of basis functions in AMSC
- the CT tracker is based on a linear transformation model, which compresses the high dimensional feature vector into a low dimensional space while preserving some kind of distance metric
- the CT tracker is therefore significantly different from the reviewed methods.
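The distance-preserving compression mentioned above can be illustrated with a random projection. A dense Gaussian matrix is assumed for simplicity; the CT tracker itself may use a different (for example sparse) measurement matrix, and the dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n_high, n_low = 2000, 400

# Random projection, scaled so that squared norms are preserved in expectation.
R = rng.standard_normal((n_low, n_high)) / np.sqrt(n_low)

a = rng.standard_normal(n_high)     # two hypothetical high-dimensional feature vectors
b = rng.standard_normal(n_high)

dist_high = np.linalg.norm(a - b)
dist_low = np.linalg.norm(R @ a - R @ b)
ratio = dist_low / dist_high        # close to 1 when distances are preserved
```

Unlike the generative models in AMSC, nothing is reconstructed here: the projection only has to keep distances between feature vectors approximately intact.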
- tried to combine them together to build a sparsity based collaborative model for visual tracking [59].
- CVPR - 2012 - Robust object tracking via sparsity-based collaborative model
@article{Zhang2013SparseCB,
title={Sparse coding based visual tracking: Review and experimental comparison},
author={Shengping Zhang and Hongxun Yao and Xin Sun and Xiusheng Lu},
journal={Pattern Recognition},
year={2013},
volume={46},
pages={1772-1788}
}