Why do CNNs need millions of images to be trained? Neural networks are often described as replicas of the human brain. Yet our brains do not need millions of images before we can recognize an object, so why do ConvNets have this requirement?
The answer lies in the fundamentals of how a CNN performs image recognition or segmentation. A CNN works almost entirely from pixel intensities: it identifies the contents of an image largely irrespective of where those contents sit relative to each other. Because such spatial relationships are not recorded, a CNN cannot reliably recognize the same object in a different pose or from a different viewing angle. For example, if you train a CNN on your front profile and then feed it your side profile, it will probably fail to recognize your face, because the spatial coordinate information is missing.
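To make this concrete, here is a minimal NumPy sketch; the toy images, the two 2x2 "part" detectors, and the image sizes are all invented for illustration, not taken from any real network. Two images contain the same two parts in completely different arrangements, yet after detecting the parts and globally max-pooling the responses, both images produce an identical descriptor:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain 2D cross-correlation with 'valid' padding."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# Two toy "part" detectors: one fires on a horizontal bar, one on a vertical bar.
h_bar = np.array([[1.0, 1.0],
                  [0.0, 0.0]])
v_bar = np.array([[1.0, 0.0],
                  [1.0, 0.0]])

# Image A: horizontal bar at the top-left, vertical bar at the middle-right.
A = np.zeros((6, 6))
A[0, 0:2] = 1.0     # horizontal bar
A[3:5, 4] = 1.0     # vertical bar

# Image B: the same two parts, arranged completely differently.
B = np.zeros((6, 6))
B[4, 3:5] = 1.0     # horizontal bar, now bottom-centre
B[0:2, 0] = 1.0     # vertical bar, now top-left

# Global max pooling over each feature map throws away *where* a part fired.
pooled_A = [float(conv2d_valid(A, k).max()) for k in (h_bar, v_bar)]
pooled_B = [float(conv2d_valid(B, k).max()) for k in (h_bar, v_bar)]

print(pooled_A)  # [2.0, 2.0]
print(pooled_B)  # [2.0, 2.0] -- identical descriptor despite a different layout
```

Both images look the same to this "which parts are present" representation, which is exactly the blindness described above.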
To solve this problem, the idea of capsules has been proposed. Capsules work with spatial orientation, or pose, rather than pixel intensity alone. In essence, the network tries to interpret the image in three dimensions, so that the model becomes robust to spatial variations.
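Concretely, a capsule outputs a small vector rather than a single scalar. In the CapsNet formulation of Sabour, Frosst and Hinton (2017), the vector's orientation encodes the entity's pose, and its length, squashed into [0, 1), encodes the probability that the entity is present. Here is a minimal sketch of that squashing nonlinearity; the example pose vector is made up:

```python
import numpy as np

def squash(s, eps=1e-9):
    """CapsNet's squashing nonlinearity: preserves a vector's orientation
    while mapping its length into [0, 1) so it can act as a probability.
        v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||)
    """
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + eps))

# A hypothetical "face" capsule's raw pose vector, e.g. (x, y, scale, rotation).
pose = np.array([2.0, -1.0, 0.5, 3.0])

v = squash(pose)
print(float(np.linalg.norm(v)))  # ~0.93: probability that a face is present
print(v / np.linalg.norm(v))     # unit direction: the pose, orientation intact
```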
So let’s see what is wrong with CNNs and how Capsule Networks can overcome it:
- CNNs have too few levels of structure: neurons, layers, and the whole network. There are no entities.
- If the neurons in a layer are grouped to perform a particular function and output a compact result, that group is called a capsule. Capsules make the network modular.
- Capsules perform coincidence filtering to discard outliers: a capsule looks for a tight cluster among the predictions it receives, and when it finds one, it outputs a high probability that an entity of its type exists, along with the centre of gravity of the cluster (a numerical sketch appears just after this list).
- This approach filters well because high-dimensional coincidences rarely happen by chance.
- CNNs rely on a pooling operation, which is a poor choice for the following reasons:
  - It is a bad fit to the psychology of shape perception. Humans assign an intrinsic coordinate frame to an object in order to interpret its spatial orientation; pooling cannot account for that.
  - It solves the wrong problem. Pooling aims for spatial invariance rather than equivariance: CNNs try to make the neural activities invariant to small changes in viewpoint by merging the activities within a pool. The real goal should be equivariance, where a change in viewpoint leads to a corresponding change in the neural activities (a toy example at the end of this post illustrates the difference).
  - It fails to use the underlying linear structure of images.
  - Pooling is a poor way to do dynamic routing. We need to route each part of the input to the neurons that know how to deal with it.
- CNNs need to learn a separate model for each viewpoint, which requires a lot of training data.
- This happens because the manifold of images of the same rigid shape is highly non-linear in the space of pixel intensities.
- If we transform to a space in which this manifold is linear, extrapolating shape recognition to new viewpoints becomes much easier. This is exactly what CapsNet does.
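The coincidence-filtering idea from the list above can be shown numerically. The sketch below is a toy stand-in for routing-by-agreement, not the exact CapsNet algorithm: the part capsules, their vote coordinates, and the Gaussian re-weighting with its sigma are all invented for illustration. Three of the four pose votes agree, so the "face" entity gets a high score and the cluster's centre of gravity as its pose, while the outlier is effectively discarded:

```python
import numpy as np

# Hypothetical pose votes (x, y position of a "face") cast by four lower-level
# capsules. Three votes form a tight cluster; one is a spurious outlier.
votes = np.array([
    [10.1, 20.3],   # vote from a nose capsule
    [ 9.8, 19.9],   # vote from a left-eye capsule
    [10.0, 20.1],   # vote from a right-eye capsule
    [40.0,  5.0],   # spurious detection: an outlier
])

def coincidence_filter(votes, sigma=2.0, n_iters=5):
    """Toy coincidence filtering: repeatedly re-weight votes by how close
    they are to the consensus, so that only a tight cluster survives."""
    center = votes.mean(axis=0)
    for _ in range(n_iters):
        dist2 = np.sum((votes - center) ** 2, axis=1)
        weights = np.exp(-dist2 / (2 * sigma**2))  # outliers get ~0 weight
        center = weights @ votes / weights.sum()   # centre of gravity
    agreement = weights.mean()  # crude "entity is present" score in [0, 1]
    return center, agreement

center, agreement = coincidence_filter(votes)
print(center)                      # ~[10.0, 20.1]: the cluster's centre of gravity
print(round(float(agreement), 2))  # ~0.75: three of the four votes coincide
```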
These are some of the areas where CNNs perform poorly and Capsule Networks outperform them. In brief, CapsNet translates from the space of pixel intensities to the space of spatial relationships, and therefore handles viewpoint variations far better. For the same reason, CapsNet needs far fewer training images than a conventional CNN.
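To close, here is the toy example of invariance versus equivariance promised in the list above. The 1-D "image" and blob detector are invented for illustration: a globally max-pooled feature gives the same answer wherever the blob sits, while a capsule-style output that carries pose shifts along with the input:

```python
import numpy as np

def make_image(pos, size=12):
    """A 1-D toy 'image' containing a single bright blob at `pos`."""
    img = np.zeros(size)
    img[pos] = 1.0
    return img

def cnn_style(img):
    """Invariance: global max pooling says *that* a blob exists and gives
    the same answer no matter where it is."""
    return float(img.max())

def capsule_style(img):
    """Equivariance: report that the blob exists *and* its pose (position);
    when the input shifts, the pose output shifts with it."""
    return float(img.max()), int(np.argmax(img))

a = make_image(pos=3)
b = make_image(pos=7)   # same blob, different "viewpoint" (shifted by 4)

print(cnn_style(a), cnn_style(b))          # 1.0 1.0 -> identical (invariant)
print(capsule_style(a), capsule_style(b))  # (1.0, 3) (1.0, 7) -> pose tracks the shift
```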