2005 - CVPR - Object recognition with features inspired by visual cortex

Introduction

Why hierarchical approaches are popular

They are consistent with the hierarchical nature of primate visual cortex.
Hierarchical approaches have been shown to consistently outperform flat single-template (holistic) object recognition systems.
- In short, they simply work better.

Biological visual systems as guides

Why look to the HVS, and to the brain more broadly, for inspiration

Because humans and primates outperform the best machine vision systems by almost any measure, building a system that emulates object recognition in cortex has always been an attractive idea.

Current shortcomings of HVS-inspired work

However, for the most part, the use of visual neuroscience in computer vision has been limited to a justification of Gabor filters.
- This complaint still holds ten years later, at least in small-target detection, haha.
No real attention has been given to biologically plausible features of higher complexity.
While mainstream computer vision has always been inspired and challenged by human vision, it seems to never have advanced past the first stage of processing in the simple cells of primary visual cortex V1.
This paper is an attempt to bridge the gap between computer vision and neuroscience.

A quantitative summary of the standard model of object recognition in primate cortex

The first few hundred milliseconds of visual processing in primate cortex follow a mostly feedforward hierarchy.
At each stage, the receptive fields of neurons (i.e., the part of the visual field that could potentially elicit a neuron’s response) tend to get larger along with the complexity of their optimal stimuli (i.e., the set of stimuli that elicit a neuron’s response).
In its simplest version, the standard model consists of four layers of computational units where
- simple S units, which combine their inputs with Gaussian-like tuning to increase object selectivity,
- alternate with complex C units, which pool their inputs through a maximum operation, thereby introducing gradual invariance to scale and translation.
The model has been able to quantitatively duplicate the generalization properties exhibited by neurons in inferotemporal monkey cortex (the so-called view-tuned units) that remain highly selective for particular objects (a face, a hand, a toilet brush) while being invariant to ranges of scales and positions.
The model originally used a very simple static dictionary of features (for the recognition of segmented objects) although it was suggested in [16] that features in intermediate layers should instead be learned from visual experience.
- In other words, the features still need to be learned.
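
The two unit types described above can be sketched in a few lines. This is only an illustration of the idea, not the paper's implementation: the function names `s_unit` and `c_unit`, the tuning width `sigma`, and the toy 2x2 patches are all assumptions made for the example.

```python
import numpy as np

def s_unit(patch, prototype, sigma=1.0):
    """Gaussian-like tuning: the response is 1 when the patch equals the
    prototype and decays as the patch moves away from it (selectivity)."""
    diff = np.asarray(patch, dtype=float) - np.asarray(prototype, dtype=float)
    return float(np.exp(-np.sum(diff ** 2) / (2.0 * sigma ** 2)))

def c_unit(s_responses):
    """Max pooling over afferent S units, e.g. the same feature seen at
    nearby positions or scales (invariance)."""
    return float(np.max(s_responses))

# Hypothetical toy data: one preferred pattern and three candidate patches.
prototype = np.array([[0.0, 1.0], [1.0, 0.0]])
patches = [np.zeros((2, 2)), prototype, np.ones((2, 2))]

responses = [s_unit(p, prototype) for p in patches]
pooled = c_unit(responses)  # driven by the best-matching patch only
```

The C unit responds as strongly as its best-matching afferent regardless of which position that afferent sits at, which is where the gradual tolerance to shift and scale comes from.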

The C2 features

Our approach is summarized in Fig. 1

The first two layers correspond to primate primary visual cortex, V1, the first visual cortical stage, which contains simple (S1) and complex (C1) cells [8].
The S1 responses are obtained by applying to the input image a battery of Gabor filters, which can be described by the following equation:
- orientation θ, effective width σ, and wavelength λ all need to be tuned
- tuning profiles of S1 units match those of V1 parafoveal simple cells
- After removing filters that were incompatible with biological cells [8], we were left with a final set of 16 filters at 4 orientations
  - 16 scales × 4 orientations = 64 filters in total
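
The equation referred to in the S1 description is not reproduced in these notes. As best I recall it from the paper, the Gabor function has the form below. The aspect ratio γ is a symbol that does not appear elsewhere in these notes, so treat it (and the exact form) as something to check against the original.

```latex
G(x, y) = \exp\!\left(-\frac{X^{2} + \gamma^{2} Y^{2}}{2\sigma^{2}}\right)
          \cos\!\left(\frac{2\pi}{\lambda}\, X\right),
\qquad
X = x\cos\theta + y\sin\theta,
\quad
Y = -x\sin\theta + y\cos\theta
```

Here θ rotates the coordinate frame (orientation), σ sets the width of the Gaussian envelope, and λ sets the wavelength of the cosine carrier.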
The next stage – C1 – corresponds to complex cells
- show some tolerance to shift and size
- complex cells tend to have larger receptive fields (twice as large as simple cells), respond to oriented bars or edges anywhere within their receptive field [8] (shift invariance)
- are in general more broadly tuned to spatial frequency than simple cells [8] (scale invariance).
- Riesenhuber & Poggio proposed a max-like pooling operation for building position- and scale-tolerant C1 units
- Fig. 2 illustrates how pooling from S1 to C1 is done
  - S1 units come in 16 scales s arranged in 8 bands Σ (see Table 1).
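
A rough sketch of the S1-to-C1 pooling for a single band and a single orientation, assuming the S1 maps of the band's scales have been brought to the same spatial size. The function name `c1_pool` and the `grid` and `step` values are illustrative assumptions; the actual per-band values are in Table 1 of the paper.

```python
import numpy as np

def c1_pool(s1_band, grid=8, step=4):
    """Max over the scales in one band, then max over local grid x grid cells.

    s1_band: array of shape (n_scales, H, W) with S1 responses for one
    orientation. All maps are assumed to share the same spatial size.
    """
    s1_band = np.asarray(s1_band, dtype=float)
    over_scales = s1_band.max(axis=0)          # tolerance to scale
    h, w = over_scales.shape
    rows = range(0, h - grid + 1, step)        # overlapping cells
    cols = range(0, w - grid + 1, step)
    out = np.empty((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            # tolerance to shift: keep only the strongest response in the cell
            out[i, j] = over_scales[r:r + grid, c:c + grid].max()
    return out

# Hypothetical input: two scales of random "S1 responses" on a 16x16 grid.
rng = np.random.default_rng(0)
band = rng.random((2, 16, 16))
c1 = c1_pool(band, grid=8, step=4)
```

With 16 scales grouped in pairs, this operation would be repeated for each of the 8 bands and each of the 4 orientations.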

Discussion

Our system first computes a set of scale- and translation-invariant C2 features from a training set of images and then runs a standard discriminative classifier on the vector of features obtained from the input image.
This system belongs to a family of feedforward models of object recognition in cortex that have been shown to be able to duplicate the tuning properties of neurons in several visual cortical areas.
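
To make the "compute C2 features, then classify" pipeline concrete, here is a simplified sketch of one C2 feature as a global max over template matches. It assumes a single 2-D C1 map rather than the multi-band, multi-orientation maps of the paper, and the names `c2_feature`, `c2_vector` and the width `sigma` are hypothetical.

```python
import numpy as np

def c2_feature(c1_map, prototype, sigma=1.0):
    """Template matching followed by a global max: compare the prototype with
    every same-sized window of the C1 map and keep the best Gaussian match."""
    c1_map = np.asarray(c1_map, dtype=float)
    prototype = np.asarray(prototype, dtype=float)
    ph, pw = prototype.shape
    best = 0.0
    for r in range(c1_map.shape[0] - ph + 1):
        for c in range(c1_map.shape[1] - pw + 1):
            d2 = np.sum((c1_map[r:r + ph, c:c + pw] - prototype) ** 2)
            best = max(best, float(np.exp(-d2 / (2.0 * sigma ** 2))))
    return best

def c2_vector(c1_map, prototypes):
    """One C2 value per stored prototype; this vector goes to the classifier."""
    return np.array([c2_feature(c1_map, p) for p in prototypes])

# Hypothetical check of shift tolerance: the prototype is cut from image_a,
# and image_b is the same content shifted by (2, -1) pixels.
rng = np.random.default_rng(0)
image_a = rng.random((12, 12))
proto = image_a[3:7, 5:9].copy()
image_b = np.roll(image_a, shift=(2, -1), axis=(0, 1))

fa = c2_feature(image_a, proto)
fb = c2_feature(image_b, proto)
```

Because the max is taken over all positions, `fa` and `fb` are the same even though the matching patch has moved. The resulting feature vector can then be passed to any standard discriminative classifier.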
the tuning properties of view-tuned units in inferotemporal cortex
- respond to images of the learned object more strongly than to distractor objects, despite significant changes in position and size
The paper argues that state-of-the-art methods do not start from biology
- While a long-time goal for computer vision has been to build a system that achieves human-level recognition performance, state-of-the-art algorithms have been diverging from biology:
- for instance, some of the best existing systems use geometrical information about the constitutive parts of objects (constellation approaches rely on both appearance-based and shape-based models, and component-based systems use the relative position of the detected components along with their associated detection values).
- Biology is however unlikely to be able to use geometrical information – at least in the cortical stream dedicated to shape processing and object recognition.
- The system described in this paper respects the properties of cortical processing (including the absence of geometrical information) while showing performance at least comparable to the best computer vision systems.
Why does the paper's method work so well?
- The fact that this biologically-motivated model outperforms more complex computer vision systems might at first appear puzzling.
- The architecture performs only two major kinds of computations (template matching and max pooling)
- Perhaps part of the model’s strength comes from its built-in gradual shift- and scale-tolerance that closely mimics visual cortical processing, which has been finely tuned by evolution over thousands of years.
- It is also very likely that such hierarchical architectures ease the recognition problem by decomposing the task into several simpler ones at each layer
- Finally it is worth pointing out that the set of C2 features that is passed to the final classifier is very redundant, probably more redundant than for other approaches.
- While we showed that a relatively small number of features (about 50) is sufficient to achieve good error rates, performance can be increased significantly by adding many more features.

@article{Serre2005ObjectRW,
  title={Object recognition with features inspired by visual cortex},
  author={Thomas Serre and Lior Wolf and Tomaso A. Poggio},
  journal={2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)},
  year={2005},
  volume={2},
  pages={994-1000 vol. 2}
}
