If we take the word 'Artificial' out of ANN, we are left with 'Neural Network', which points to our brain. An ANN tries to mimic the human brain, hence the word Artificial. So, to understand ANNs, let's start with the basics of a neuron. Simply speaking, a neuron is a single entity which takes some input, computes a result, and gives an output. Where does it take input from? It takes input from several other neurons and passes its output on to other neurons.
Here the dendrites are the inputs, the cell body is responsible for computing the result, and the output is sent out through the axon. Coming back to ANNs, the network comprises nodes: a node --> the cell nucleus; the node's inputs, initialized with some weights --> the dendrites and synapses; and its output --> the axon.
Now that the basic structure is understood, let's move on to the training process.
The steps involved in training an ANN are:
- Normalizing the inputs and initializing the weights
- Activation function for the Neuron
- Cost function
- Minimizing cost function: Gradient Descent
- Backpropagation
- Repeating the process
Now let’s look at the steps one by one.
1. Normalizing the inputs:
If the inputs to a network are not normalized, the learning process is slow and it takes longer to converge to the output. Normalizing the inputs involves 3 steps, as follows (a short code sketch follows the list):
i. The average of each input variable over the training set should be zero
ii. Scale the input variables so that their covariances are about the same
iii. Input variables should be uncorrelated if possible
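To make the three steps concrete, here is a minimal NumPy sketch. The function name normalize_inputs and the PCA-based decorrelation in step iii are illustrative choices, not prescribed by the text:

```python
import numpy as np

def normalize_inputs(X):
    """Normalize a (n_samples, n_features) training matrix per the three steps."""
    X = X - X.mean(axis=0)              # i. zero average per input variable
    X = X / (X.std(axis=0) + 1e-8)      # ii. roughly equal spread per variable
    cov = np.cov(X, rowvar=False)       # iii. decorrelate via a PCA rotation
    _, eigvecs = np.linalg.eigh(cov)
    return X @ eigvecs
```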
2. Activation Function for the Neuron:
The activation function is the rule the neuron uses to compute its result. Depending on the output of the activation function, the neuron either fires or does not fire.
There are many Activation Functions and some of them are as follows:
• Threshold function: This is a step function; the neuron fires if the input value is above the threshold value.
• Sigmoid Function: This function gives a probabilistic interpretation, and its output lies between 0 and 1. The neuron fires if the value of the function is close to 1 (generally > 0.5).
• Rectifier Function: This function computes max(x,0) where x is the input.
• Hyperbolic Tangent Function: This function is similar to the sigmoid function, with the difference that the output values range between -1 and +1.
The most commonly used function for the hidden layers is the rectifier function, while the sigmoid function is usually used for the output layer, since its output can be interpreted as a probability.
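As a rough illustration, the four functions above can be written in a few lines of NumPy (the function names and the default threshold of 0 are illustrative):

```python
import numpy as np

def threshold(x, theta=0.0):
    """Step function: fires (1) when the input exceeds the threshold."""
    return np.where(x >= theta, 1.0, 0.0)

def sigmoid(x):
    """Squashes the input into (0, 1); can be read as a firing probability."""
    return 1.0 / (1.0 + np.exp(-x))

def rectifier(x):
    """Computes max(x, 0), as described above."""
    return np.maximum(x, 0.0)

def hyperbolic_tangent(x):
    """Like the sigmoid, but the output ranges over (-1, 1)."""
    return np.tanh(x)
```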
3. Cost Function:
The output from the node (neuron) is called ŷ. This predicted output is compared with the actual output y to calculate the error via a cost function, for example C = ½(ŷ - y)². The aim when training an ANN is to minimize this cost function.
The information in the error is backpropagated and the weights are updated.
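The text does not fix a particular cost function; one common choice is the squared error, sketched here for a batch of predictions (mse_cost is a hypothetical helper name):

```python
import numpy as np

def mse_cost(y_hat, y):
    """Squared-error cost, averaged over the training examples."""
    return 0.5 * np.mean((y_hat - y) ** 2)

# e.g. mse_cost(np.array([0.9, 0.2]), np.array([1.0, 0.0])) -> 0.0125
```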
4. Gradient Descent:
To minimize the cost function, the gradient descent algorithm is used. Gradient descent works best when the cost function is convex. If it is not perfectly convex, gradient descent is likely to end up in a local minimum instead of the global minimum. To tackle such situations, stochastic gradient descent is used.
Unlike batch gradient descent, where all the input rows are fed to the network and the combined cost function is used to update the weights, the stochastic version takes one input row at a time, feeds it to the network, and updates the weights immediately; this repeats for every row. Stochastic gradient descent is faster because it does not require all the data to be held in working memory, and due to its stochastic (random) nature it can yield different results on each run even when the initial weights are the same.
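To make the batch-versus-stochastic distinction concrete, here is a sketch using a simple linear model with a squared-error cost as a stand-in for the network (the learning rate, epoch count, and the linear model itself are illustrative assumptions):

```python
import numpy as np

def batch_gd(X, y, lr=0.01, epochs=100):
    """All rows at once: one combined gradient per weight update."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of the combined cost
        w -= lr * grad
    return w

def stochastic_gd(X, y, lr=0.01, epochs=100, seed=None):
    """One row at a time: the weights are updated after every row."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):    # random row order each epoch
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w
```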
5. Backpropagation:
As already explained, the error is propagated backwards from the output layer through the network, and gradient descent uses the resulting gradients to update each weight in proportion to how much it contributed to the error.
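As a sketch of what one backpropagation update might look like, consider a tiny one-hidden-layer network with a rectifier hidden layer, a sigmoid output, and a squared-error cost (all names, sizes, and the learning rate are illustrative, not from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, y, W1, W2, lr=0.1):
    """One forward + backward pass for a single training row."""
    # forward pass
    h = np.maximum(W1 @ x, 0.0)                      # hidden layer (rectifier)
    y_hat = sigmoid(W2 @ h)                          # output layer (sigmoid)
    # backward pass: send the error back through the layers
    delta_out = (y_hat - y) * y_hat * (1.0 - y_hat)  # error signal at the output
    delta_hid = (W2.T @ delta_out) * (h > 0)         # error signal at the hidden layer
    # update each weight in proportion to its contribution to the error
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2
```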
6. Repeating the process:
Updating the weights after taking into consideration all the input rows constitutes one epoch. The entire process is run for several epochs, until the cost function is minimized, i.e. high accuracy is reached.
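Putting it all together, a training loop over several epochs might look like the following, reusing the backprop_step function from the previous sketch (the toy data, network sizes, and epoch count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy inputs (hypothetical data)
y = (X.sum(axis=1) > 0).astype(float)         # toy binary targets

W1 = rng.normal(scale=0.1, size=(4, 3))       # small random initial weights
W2 = rng.normal(scale=0.1, size=(1, 4))

for epoch in range(50):                       # one pass over all rows = one epoch
    for i in range(len(X)):
        W1, W2 = backprop_step(X[i], np.array([y[i]]), W1, W2)
```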