TCSVT-2015-Background Prior-Based Salient Object Detection via Deep Reconstruction Residual

Abstract

Assumption of previous methods

Based on the assumption that foreground salient regions are distinctive within a certain context, most conventional approaches rely on a number of hand-designed features and their distinctiveness is measured using local or global contrast.
The same holds for small targets: the target assumption is that the target is distinctive within a certain context.
Both the local contrast measure and the IPI model rely on hand-designed features.

Shortcomings of previous methods

Although these approaches have been shown to be effective in dealing with simple images, their limited capability may cause difficulties when dealing with more complicated images.

Introduction

Underlying hypothesis and the problems to solve:

Based on the underlying hypothesis that the salient stimulus is distinct from its contextual stimuli, most existing saliency detection models need to solve two key problems:

1. extract effective features to represent the image
1. develop an optimal mechanism to measure the distinctiveness over the extracted features.
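- So HVS-based small target detection can also be split into two parts: feature extraction (often just a mean) and a distinctiveness measure. This is very evident in MPCM, where feature extraction computes means and the distinctiveness measure computes products and the like. In short, current HVS methods mix the two together; lifting them into a new framework that separates them would make it possible to take them a step further.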
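
To make the two-part split concrete, here is a minimal sketch that keeps feature extraction and the distinctiveness measure as separate functions. It is not MPCM or any published method: the cell-mean feature, the 8-neighbour center-surround difference, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def extract_mean_feature(image, k):
    """Feature extraction: mean intensity of each non-overlapping k x k cell."""
    h, w = image.shape
    h, w = h - h % k, w - w % k              # crop so the image tiles evenly
    cells = image[:h, :w].reshape(h // k, k, w // k, k)
    return cells.mean(axis=(1, 3))

def center_surround_contrast(features):
    """Distinctiveness: each cell's feature minus the mean of its 8 neighbours."""
    padded = np.pad(features, 1, mode="edge")  # replicate border cells
    rows, cols = features.shape
    neighbour_sum = np.zeros_like(features, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue                       # skip the center cell itself
            neighbour_sum += padded[1 + dy:1 + dy + rows, 1 + dx:1 + dx + cols]
    return features - neighbour_sum / 8.0

def saliency(image, k=3):
    """Compose the two stages; either one can be swapped independently."""
    return center_surround_contrast(extract_mean_feature(image.astype(float), k))

# Toy image: a bright 3x3 "target" that fills the central cell.
image = np.zeros((9, 9))
image[3:6, 3:6] = 1.0
contrast_map = saliency(image, k=3)
```

Because the stages are decoupled, a learned representation could in principle replace `extract_mean_feature` without touching the contrast step.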

Drawbacks of hand-designed features

All these feature representations are hand designed and require significant amounts of domain knowledge. However, hand-designed features in general suffer from poor generalization capability for different images, especially due to a lack of thorough understanding of the biological mechanisms and principles of human visual attention as well as weak human intuition involved.

Why not use sparse coding and low-rank matrix recovery?

Nevertheless, due to the shallow-structured architectures used, these methods still have limited representational power and are insufficient to capture high-level information and latent patterns of complex image data.

Aim of this paper

investigate the feasibility of learning more powerful representation directly from the raw image data themselves in an unsupervised way for the task of saliency detection.

Conventional approach to saliency detection

The saliency or distinctiveness is typically measured by image contrast computation over features, where various contrast measures have been presented. Depending on the extent of the context in which the contrast is calculated, these approaches can be classified into local-contrast-based methods and global-contrast-based methods.

Local-contrast-based methods estimate the saliency of an image pixel or an image patch by calculating the contrast against its local neighborhood, and some representative local methods include the center-surround difference [5], [6], [12], [13], incremental coding length [10], and self-resemblance [14].
Global-contrast-based methods characterize the saliency of an image region as the uniqueness in the entire image. Previous works found in the literature have proposed a variety of approaches to model the global contrast from different perspectives. To be specific, in [15] and [16], the global contrast is derived in the frequency domain with the hypothesis that salient regions are normally less frequent.
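
As a rough illustration of "uniqueness in the entire image", the sketch below scores each patch by its mean distance to every other patch. This is a spatial-domain toy, not the frequency-domain formulation of [15] and [16]; the function name and the 2-D feature vectors are assumptions.

```python
import numpy as np

def global_contrast(patches):
    """Saliency of each patch = mean Euclidean distance to every other patch."""
    diffs = patches[:, None, :] - patches[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # (n, n) pairwise distances
    n = patches.shape[0]
    return dists.sum(axis=1) / (n - 1)           # self-distance contributes 0

# Toy feature vectors: three similar "background" patches and one outlier.
patches = np.array([[0.1, 0.1], [0.1, 0.1], [0.1, 0.1], [0.9, 0.9]])
scores = global_contrast(patches)
```

The outlier patch receives the highest score because it is far from everything else, while the three similar patches are only far from the outlier.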

Drawbacks of conventional methods

In spite of extensive efforts, local and global contrast-based approaches still suffer from some drawbacks.

First, these approaches normally can highlight only object boundaries but fail to detect the whole target region uniformly as shown in the examples given in Fig. 1. This problem may be alleviated in some global-contrast-based methods while the results yielded are still unsatisfactory.
- Could be used in the motivation for my weighted scheme.
Second, although the salient objects often present high contrast, the inverse might not necessarily be true [11]. In many complex images (as shown in the third example of Fig. 1), the background contains small-scale high-contrast patterns, which may cause previous contrast-based methods to fail in such cases.
- Could be used where I explain the cause of the rare structure effect.

The fundamental reason conventional saliency detection falls short

Essentially, the true aim of salient object detection is to find objects that are distinctive from the image background. It needs to calculate the contrast between the objects and the image background and then select those with high contrast as salient objects.

However, the local and global contrast-based methods do not identify which regions form the image background. They blindly assume the neighboring regions or the entire image to be the background and then calculate the contrast between each location and the assumed background. As their assumed background may not be the real one, the determined contrast also becomes incorrect, which in turn reduces the performance of saliency detection.

Why current background prior-based methods still fall short

They do not account for the four cases below. Put plainly, manually added priors are incomplete.

1. The entire image boundary is a large and smoothly connected region (see the first row of Fig. 1).
- The target has no well-defined boundary; the transition between target and background is gradual.
1. The regions defined within the image boundary look different, whereas they may share certain latent pattern (see the second row of Fig. 1).
- That is, the target and some background components actually share the same patterns, such as shape.
1. The background is complex (for example, containing small-scale high-contrast patterns) and the regions of image boundary are different, as shown in the third row of Fig. 1.
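- The background itself is complex and contains high-contrast components. This also occurs in small target detection and is even more pronounced there: because small targets are weak, background components stand out even when their contrast is not especially high.
- The essence of the rare structure effect is mishighlighting the small-scale high-contrast background regions in the saliency maps.
1. Salient objects significantly touch the image boundary and parts of them are wrongly considered as the background, as shown in the fourth row of Fig. 1.
- That is, the target is relatively large and contains abrupt edges inside it, which violates the homogeneity assumption.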

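This paper's work

This paper is still a background prior-based method. Its novelty lies in using a stacked denoising autoencoder (SDAE) to model the image background. The benefits are:

No hand-designed features are needed; a more powerful representation is learned directly from the raw data in an unsupervised way.
It enables capturing the latent pattern of the input data hierarchically.
This second benefit lets the algorithm overcome the second case above, and it is also my motivation for applying representation learning to small target detection.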

How does this paper measure saliency?

the measure of contrast between the salient objects and the background is formulated as the reconstruction residuals in the deep-structured SDAE.
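
The idea of scoring saliency by reconstruction residual can be sketched with a much simpler background model. The example below deliberately swaps the deep SDAE for a linear one (PCA via SVD), which is an assumption made only to keep the sketch short and deterministic; the function names and toy data are hypothetical too.

```python
import numpy as np

def fit_linear_background_model(background, n_components):
    """Fit a linear 'autoencoder' (PCA) to background vectors; a stand-in for the SDAE."""
    mean = background.mean(axis=0)
    _, _, vt = np.linalg.svd(background - mean, full_matrices=False)
    return mean, vt[:n_components]           # rows span the background subspace

def reconstruction_residual(x, mean, basis):
    """Squared error between each row of x and its reconstruction from the model."""
    code = (x - mean) @ basis.T              # encode
    recon = code @ basis + mean              # decode
    return ((x - recon) ** 2).sum(axis=1)

rng = np.random.default_rng(0)
# Toy background: 4-D vectors that vary only along one direction.
direction = np.array([1.0, 1.0, 0.0, 0.0])
background = rng.normal(size=(50, 1)) * direction
mean, basis = fit_linear_background_model(background, n_components=1)

candidates = np.array([
    [2.0, 2.0, 0.0, 0.0],    # follows the background pattern
    [0.0, 0.0, 3.0, -3.0],   # does not fit the background model
])
residuals = reconstruction_residual(candidates, mean, basis)
```

A vector that the background model reconstructs well gets a residual near zero, while one outside the learned pattern gets a large residual and would be marked salient.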

In fact, the difference between this paper and conventional saliency detection is the same as the difference between IPI and HVS methods.

Unlike the previous works [18], [19], which mainly focused on the way to calculate the similarity or distinctiveness between a certain image patch and the image boundary, this paper pays more attention to modeling the background regions.

Proposed approach

Why a deep architecture?

This deep architecture allows the SAE to learn more complex mapping from the input to hidden representations and capture the latent patterns which reflect the most homogametic property shared among the training data.

This paper actually uses an SDAE rather than a plain SAE; see the following article:

P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Jan. 2010.
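
To make the denoising criterion concrete, here is a minimal single-layer sketch in NumPy: the input is corrupted with masking noise, and the network is trained to reconstruct the clean input. It is not the paper's network. The layer sizes, learning rate, noise level, squared-error loss, and toy data are all assumptions, and a real SDAE stacks several such layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(x, n_hidden=4, noise_level=0.3, lr=0.5, epochs=200, seed=0):
    """Train one denoising autoencoder layer with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    w2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        keep = rng.random(x.shape) >= noise_level   # masking noise: zero some inputs
        x_noisy = x * keep
        h = sigmoid(x_noisy @ w1 + b1)              # encode the corrupted input
        y = sigmoid(h @ w2 + b2)                    # decode
        err = y - x                                 # target is the CLEAN input
        losses.append(float((err ** 2).sum(axis=1).mean()))
        # Backpropagate the mean squared error through both sigmoid layers.
        grad_z2 = 2.0 * err * y * (1.0 - y) / n
        grad_z1 = (grad_z2 @ w2.T) * h * (1.0 - h)
        w2 -= lr * (h.T @ grad_z2)
        b2 -= lr * grad_z2.sum(axis=0)
        w1 -= lr * (x_noisy.T @ grad_z1)
        b1 -= lr * grad_z1.sum(axis=0)
    return (w1, b1, w2, b2), losses

def reconstruct(x, params):
    """Run the trained layer on uncorrupted input."""
    w1, b1, w2, b2 = params
    return sigmoid(sigmoid(x @ w1 + b1) @ w2 + b2)

# Toy "background" vectors clustered around one pattern, with values in [0, 1].
rng = np.random.default_rng(1)
base = np.array([0.2, 0.8, 0.2, 0.8, 0.2, 0.8])
x = np.clip(base + 0.05 * rng.normal(size=(64, 6)), 0.0, 1.0)
params, losses = train_dae(x)
recon = reconstruct(x, params)
```

Note that the decoder here has its own weights `w2` and `b2`, learned jointly with the encoder, and the output sigmoid keeps reconstructions in the same (0, 1) range as the normalized input.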

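Assumption of this paper

To obtain a more accurate object-background contrast, assume that the image boundary is mostly background.
- Assuming that the image boundary is mostly background is essentially a center bias. This also helps for small targets: a small target can appear anywhere, but because it is small, the whole image is mostly background.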
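
Under this assumption, background training samples can be drawn from the image border. The sketch below collects the non-overlapping blocks that touch the border. The block-based sampling and the function name are assumptions; these notes do not record how the paper actually samples boundary regions.

```python
import numpy as np

def boundary_patches(image, patch):
    """Collect non-overlapping patch x patch blocks that touch the image border."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    samples = []
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                block = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
                samples.append(block.ravel())   # flatten to a feature vector
    return np.array(samples)

image = np.arange(64, dtype=float).reshape(8, 8)
samples = boundary_patches(image, patch=2)      # 4x4 grid of blocks, 12 on the border
```

For small target detection, the same function could be relaxed to sample blocks from the whole image, since the note above argues that the image is mostly background anyway.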

Algorithm overview

As shown in Fig. 2, the algorithm consists of three parts:

multiscale inputs generation
salient region detection via deep reconstruction residual
post-processing
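
The three parts can be wired together as in the skeleton below. Only the three-stage structure comes from the paper; everything inside each stage is a placeholder assumption: block-average downsampling for the multiscale inputs, a simple border-deviation score standing in for the deep reconstruction residual, and scale averaging plus min-max normalization for post-processing.

```python
import numpy as np

def multiscale_inputs(image, factors):
    """Part 1 (assumed form): block-average the image at several scales."""
    h, w = image.shape
    return [image.reshape(h // f, f, w // f, f).mean(axis=(1, 3)) for f in factors]

def border_deviation(img):
    """Part 2 placeholder: deviation from the mean border value (not the SDAE residual)."""
    border = np.concatenate([img[0], img[-1], img[:, 0], img[:, -1]])
    return np.abs(img - border.mean())

def detect(image, score_fn=border_deviation, factors=(1, 2)):
    """Run the three stages; image sides must be divisible by every factor."""
    h, w = image.shape
    assert all(h % f == 0 and w % f == 0 for f in factors)
    maps = [score_fn(s) for s in multiscale_inputs(image, factors)]
    # Part 3 (assumed form): upsample each map, average across scales, normalize.
    fused = sum(np.kron(m, np.ones((f, f))) for m, f in zip(maps, factors)) / len(factors)
    span = fused.max() - fused.min()
    return (fused - fused.min()) / span if span > 0 else np.zeros_like(fused)

image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0            # bright object away from the border
saliency_map = detect(image)
```

Passing a different `score_fn`, for example one based on a trained background model, would change the detection stage without affecting the other two.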

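My questions

Why does the autoencoder use the sigmoid function for both encoding and decoding? Shouldn't decoding be the inverse transform of encoding?
I still need to look carefully at how exactly the SDAE is built.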

@article{Han2014BackgroundPB,
  title={Background prior-based salient object detection via deep reconstruction residual},
  author={Han, Junwei and Zhang, Dingwen and Hu, Xintao and Guo, Lei and Ren, Jinchang and Wu, Feng},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  volume={25},
  number={8},
  pages={1309--1321},
  year={2014},
  publisher={IEEE}
}
