Why do CNNs need millions of images to be trained? Neural networks are often described as a replication of the human brain. Yet our brain does not require millions of examples before we can recognize an object, so why do ConvNets have such a requirement?
The answer lies in how a CNN performs image recognition or segmentation. A CNN detects the contents of an image from pixel intensities, largely irrespective of where those contents sit relative to one another. Because the spatial relationships between detected features are not recorded (pooling layers, in particular, throw away precise positions), a CNN struggles to recognize the same object in a different pose or from a different viewing angle. For example, if you train a CNN on your front profile and then feed it your side profile, it will probably fail to recognize the face, because the spatial coordinate information is not preserved.
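The loss of positional information can be illustrated with max pooling, a standard CNN operation. In this minimal sketch (a naive 2x2 pooling, not any particular library's implementation), two feature maps with the same activations in different relative positions pool to an identical output, so the downstream layers cannot tell the arrangements apart:

```python
import numpy as np

def max_pool(x, size=2):
    """Naive non-overlapping max pooling over a 2D feature map."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Two maps with the same four activations, arranged differently.
a = np.array([[9, 0, 0, 0],
              [0, 0, 0, 7],
              [0, 0, 5, 0],
              [0, 3, 0, 0]])
b = np.array([[0, 9, 0, 7],
              [0, 0, 0, 0],
              [0, 0, 0, 5],
              [3, 0, 0, 0]])

# Pooling records only "which feature fired in each region",
# not where exactly -- both maps give the same pooled output.
print(max_pool(a))  # [[9 7]
print(max_pool(b))  #  [3 5]]  (identical for both inputs)
```

This is a toy demonstration of the argument above: after pooling, the relative geometry of the features within each region is gone.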
To address this problem, the concept of Capsule theory has been suggested. This theory works on the spatial orientation, or pose, of the detected features.
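The core idea can be sketched numerically. In capsule networks (Sabour, Frosst, and Hinton's formulation), a capsule outputs a vector rather than a scalar: the vector's direction encodes the pose of a feature, and its length encodes the probability that the feature is present. The "squash" nonlinearity below maps the length into [0, 1) while preserving the direction; the input values here are purely illustrative:

```python
import numpy as np

def squash(v, eps=1e-8):
    """Capsule squash nonlinearity: keeps the vector's direction (pose)
    and maps its length (presence probability) into [0, 1)."""
    norm = np.linalg.norm(v)
    return (norm ** 2 / (1 + norm ** 2)) * (v / (norm + eps))

# An illustrative raw capsule output encoding some feature's pose.
raw = np.array([3.0, 4.0])          # length 5.0
out = squash(raw)

print(np.linalg.norm(out))          # length squashed below 1 (high presence)
print(out / np.linalg.norm(out))    # direction (pose) is unchanged
```

Because the pose survives in the output vector, later capsules can check whether lower-level features agree on a consistent spatial arrangement, which is exactly the information plain max pooling discards.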