{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Cheat sheet on Neural Networks 1 \n", | |
"\n", | |
"The following article is a cheat sheet on neural networks. My sources are based on the following course and article:\n", | |
"- the excellent [Machine Learning course](https://www.coursera.org/learn/machine-learning) on Coursera from Professor Andrew Ng, Stanford,\n", | |
"- the very good [article from Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html), explaining the backpropagation algorithm.\n", | |
"\n", | |
"\n", | |
"## Why the neural networks are powerful ?\n", | |
"\n", | |
"It is proven mathematically that: \n", | |
"\n", | |
"> Suppose we’re given a [continuous] function f(x) which we’d like to compute to within some desired accuracy ϵ>0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies:\n", | |
"|g(x)−f(x)|<ϵ, for all inputs x. \n", | |
"\n", | |
"_Michael Nielsen — From the following [article](http://neuralnetworksanddeeplearning.com/chap4.html)_\n", | |
"\n", | |
"## Conventions \n", | |
"Let’s define a neural network with the following convention:\n", | |
"\n", | |
"L = total number of layers in the network. \n", | |
"$s_l$ = number of units (not counting bias unit) in layer l. \n", | |
"K = number of units in output layer ( = $s_L$ ). \n", | |
"\n", | |
"<img src=\"images/Neural_Network_definition.png\" />\n", | |
"\n", | |
"___With___:\n", | |
"\n", | |
"$$ \n", | |
"\\forall l \\in\\ {2, ..., L},\\ a^{(l)} \\in \\mathbb{R}^{s_{l}+1} and\\ \n", | |
"\\begin{cases}\n", | |
"a^{(l)}_0\\ =\\ 1\\ (bias\\ node)\\\\\n", | |
"a^{(l)}_i = g(z^{(l)}_i), \\forall i \\in {1, ..., s_l}\\\\ \n", | |
"\\end{cases}\\\\\n", | |
"z^{(l)}_i = [ \\theta^{(l)}_i ]^T . a^{(l-1)}, \\forall i \\in {1, ..., s_l}\\\\ \n", | |
"with\\ \\theta^{(l)}_i \\in \\mathbb{R}^{s_{l-1}+1}\\ and\\ z^{(l)} \\in \\mathbb{R}^{s_{l}}\\ \n", | |
"$$\n", | |
"\n", | |
"We define the matrix θ of the weights for the layer l as following:\n", | |
"\n", | |
"$$\n", | |
"\\theta^{(l)} \\in \\mathbb{R}^{s_l \\times (s_{(l-1)}+1)}\n", | |
"$$\n", | |
"\n", | |
"$$\n", | |
"\\theta^{(l)} = \n", | |
"\\begin{bmatrix}\n", | |
" [ \\theta^{(l)}_1 ]^T \\\\\n", | |
" [ \\theta^{(l)}_2 ]^T \\\\\n", | |
" \\vdots \\\\\n", | |
" [ \\theta^{(l)}_{s_{l}} ]^T\n", | |
"\\end{bmatrix} =\n", | |
"\\begin{bmatrix}\n", | |
" \\theta_{1,0} & \\dots & \\theta_{1,j} & \\dots & \\theta_{1,s_{l-1}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\theta_{i,0} & \\dots & \\theta_{i,j} & \\dots & \\theta_{i,s_{l-1}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\theta_{s_l,0} & \\dots & \\theta_{s_l,j} & \\dots & \\theta_{s_l,s_{l-1}} \\\\\n", | |
"\\end{bmatrix}\n", | |
"$$\n", | |
"\n", | |
"Hence, we have the following relation: \n", | |
"$$\n", | |
"a^{(l)} =\n", | |
"\\left(\\begin{smallmatrix}1 \\\\ g(\\theta^{(l)}.a^{(l-1)})\\end{smallmatrix}\\right)\n", | |
"$$\n", | |
"\n", | |
"\n", | |
"## The cost function of a Neural Network\n", | |
"\n", | |
"The training set is defined by: $ { (x^1,y^1), ..., (x^m,y^m) } $\n", | |
"\n", | |
"x and y are vectors, with respectively the same dimensions as the input and output layers of the neural network. \n", | |
"\n", | |
"The cost function of a neural network is the following:\n", | |
"\n", | |
"\n", | |
"$$\n", | |
"J(\\theta) = - \\frac{1}{m} \\sum_{i=1}^m \\sum_{k=1}^K \\left[ cost( a^{(L)}_k, y^{(i)}_k) \\right] + \\frac{\\lambda}{2m}\\sum_{l=2}^{L} \\sum_{j=0}^{s_l} \\sum_{i=1}^{s_{l+1}} ( \\theta_{i,j}^{(l)})^2\n", | |
"$$\n", | |
"\n", | |
"$a^{(L)}_k$ is the output of the neural network, and is dependent of the weights 𝜃 of the neural network. \n", | |
"\n", | |
"Now, the objective is to train the neural network and find the minimum of the cost function J(𝜃).\n", | |
"\n", | |
"## Mathematic reminder: the chain rule\n", | |
"\n", | |
"Let’s define the functions f, g and h as following:\n", | |
"\n", | |
"$$ f:\\mathbb{R}^n \\rightarrow \\mathbb{R} $$\n", | |
"\n", | |
"$$ g:\\mathbb{R}^p \\rightarrow \\mathbb{R}^n $$\n", | |
"\n", | |
"$$ h = f \\circ g $$\n", | |
"\n", | |
"The derivative of h is given by the chain rule theorem:\n", | |
"\n", | |
"$$\n", | |
"\\forall_i \\in \\{ \\!1, ... , \\!p \\}, \n", | |
"\\frac{\\partial h}{\\partial x_i} = \n", | |
"\\sum_{k=1}^{n} \\frac{\\partial f}{\\partial g_k} \\frac{\\partial g_k}{\\partial x_i}\n", | |
"$$\n", | |
"\n", | |
"(See the following [course online](https://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/) on partial derivation from the MIT)\n", | |
"\n", | |
"\n", | |
"## The backpropagation algorithm\n", | |
"\n", | |
"We use the __gradient descent__ to find the minimum of J on 𝜃: $ \\min\\limits_{\\theta} J(\\theta)$\n", | |
"\n", | |
"The gradient descent requires to compute: \n", | |
"\n", | |
"$$ \\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} $$\n", | |
"\n", | |
"___In the following parts, we consider only the first part of J(θ) (as if the regularisation term λ=0). The partial derivative of the second term of J(θ) is easy to compute.___\n", | |
"\n", | |
"\n", | |
"### Definition of ẟ\n", | |
"\n", | |
"Let’s define the function ẟ. When ẟ of the layer l is multiplied by the output of the layer (l-1), we obtain the partial derivative of the cost function on θ.\n", | |
"\n", | |
"Let’s use the chain rule and develop this derivative on z:\n", | |
"\n", | |
"$$ \n", | |
"\\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} \n", | |
"=\n", | |
"\\sum^{s_l}_{k = 1} \\frac{\\partial J(\\theta)}{\\partial z^{(l)}_k} \\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}}\n", | |
"$$\n", | |
"\n", | |
"(Remind that J is dependent of z)\n", | |
"\n", | |
"As: \n", | |
"$$z^{(l)}_k = [ \\theta^{(l)}_k ]^T . a^{(l-1)} = \\sum_{p=0}^{s_{l-1}} \\theta^{(l)}_{k,p} \\times a^{(l-1)}_p$$\n", | |
"\n", | |
"$$\\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}} = 0\\ for\\ k\\ ≠\\ i\\ and\\ p\\ ≠\\ j\\ in\\ the\\ sum.$$\n", | |
"\n", | |
"$$And\\ \\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}} = a^{(l-1)}_j\\ for\\ k\\ =\\ i\\ and\\ p\\ =\\ j\\ in\\ the\\ sum.$$\n", | |
"\n", | |
"We define the __output error 𝛿__: \n", | |
"$$ \n", | |
"\\delta^{(l)}_k = \n", | |
"\\frac{\\partial J(\\theta)}{\\partial z^{(l)}_k},\\ \\forall k \\in {1, ..., s_l},\\\n", | |
"and\\ \n", | |
"\\delta^{(l)} = \\nabla_{z^{(l)}} J(\\theta) \\in \\mathbb{R}^{s_{l}}\n", | |
"$$\n", | |
"\n", | |
"So we have:\n", | |
"\n", | |
"---\n", | |
"\n", | |
"$$ \n", | |
"\\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} \n", | |
"=\n", | |
"\\delta^{(l)}_i . a^{(l-1)}_j\n", | |
"$$\n", | |
"\n", | |
"---\n", | |
"\n", | |
"### Value of ẟ for the layer L\n", | |
"\n", | |
"Now let’s find 𝛿 for the output layer (layer L):\n", | |
"\n", | |
"$$\n", | |
"\\delta^L_i = \\frac{\\partial J(\\theta)}{\\partial z^{(L)}_i} = \\sum^{s_{L}}_{k = 0} \\frac{\\partial J(\\theta)}{\\partial a^{(L)}_k} \\frac{\\partial a^{(L)}_k}{\\partial z^{(L)}_i}\n", | |
"$$\n", | |
"\n", | |
"As:\n", | |
"\n", | |
"$$\n", | |
"a^{(l)}_k = g(z^{(l)}_k),\\ \\frac{\\partial a^{(L)}_k}{\\partial z^{(L)}_i} = 0\\ if\\ k\\ ≠\\ i\n", | |
"$$\n", | |
"\n", | |
"Hence:\n", | |
"\n", | |
"---\n", | |
"\n", | |
"$$\n", | |
"\\delta^L_i = \n", | |
"\\frac{\\partial J(\\theta)}{\\partial a^{(L)}_i} \\frac{\\partial g(z^{(L)}_i)}{\\partial z^{(L)}_i} =\n", | |
"\\frac{\\partial J(\\theta)}{\\partial a^{(L)}_i} . g'(z^{(L)}_i)\n", | |
"$$\n", | |
"\n", | |
"---\n", | |
"\n", | |
"### Relationship of ẟ between the layer l and the layer (l-1)\n", | |
"\n", | |
"Now, we try to find a relation between 𝛿 of the layer l and 𝛿 of the next layer (l+1):\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(l)}_i = \\frac{\\partial J(\\theta)}{\\partial z^{(l)}_i} = \\sum^{s_{l+1}}_{k = 1} \\frac{\\partial J(\\theta)}{\\partial z^{(l+1)}_k} \\frac{\\partial z^{(l+1)}_k}{\\partial z^{(l)}_i}\n", | |
"$$\n", | |
"\n", | |
"With:\n", | |
"\n", | |
"$$\n", | |
"z^{(l+1)}_k = [ \\theta^{(l+1)}_k ]^T . g(z^{(l)}_p) = \\sum_{p=0}^{s_{l}} \\theta^{(l+1)}_{k,p} \\times g(z^{(l)}_p)\n", | |
"$$\n", | |
"\n", | |
"And:\n", | |
"\n", | |
"$$\n", | |
"\\frac{\\partial z^{(l+1)}_k}{\\partial z^{(l)}_i} = \\theta^{(l+1)}_{k,i} . g'(z^{(l)}_i)\\ \\ for\\ p\\ =\\ i,\\ else\\ 0.\n", | |
"$$\n", | |
"\n", | |
"Hence:\n", | |
"\n", | |
"---\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(l)}_i \n", | |
"=\n", | |
"\\sum^{s_{l+1}}_{k = 0} \\delta^{(l+1)}_k . \\theta^{(l+1)}_{k,i} . g'(z^{(l)}_i) \n", | |
"$$\n", | |
"\n", | |
"---\n", | |
"\n", | |
"### The backpropagation algorithm explained\n", | |
"\n", | |
"We have the following:\n", | |
"- we have found a function ẟ for the layer l such that when we multiply this function by the output of the layer (l-1), we obtain the partial derivative of the cost function J on the weights θ of the layer l,\n", | |
"- the function ẟ for the layer l has a relation with ẟ of the layer (l+1),\n", | |
"- as we have a training set, we can compute the values of ẟ for the layer L.\n", | |
"\n", | |
"So, we start to compute the values of ẟ for the layer L. As we have a relation between ẟ for the layer l and ẟ for the layer (l+1), we can compute the values for the layers (L-1), (L-2), …, 2.\n", | |
"\n", | |
"We can then compute the partial derivative of the cost function J on the weights θ of the layer l, by multiplying ẟ for the layer l by the output of the layer (l-1).\n", | |
"\n", | |
"\n", | |
"### The vectorized backpropagation’s equations\n", | |
"\n", | |
"The __first equation__ gives the partial derivatives of J with respect to θ. We have added the regularization term.\n", | |
"\n", | |
"$$\n", | |
"\\nabla_{\\theta^{(l)}} J(\\theta) = \n", | |
"\\begin{bmatrix}\n", | |
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,s_{l-1}}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,s_{l-1}}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,s_{l-1}}} \\\\\n", | |
"\\end{bmatrix}\n", | |
"= \\delta^{(l)} \\otimes [a^{(l-1)}]^T + \\frac{\\lambda}{m} \\theta^{(l)}\n", | |
"$$\n", | |
"\n", | |
"The __second equation__ gives the relation between ẟ in layer l and ẟ in layer (l+1):\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(l)} = [(\\theta^{(l+1)})^T . \\delta^{(l+1)}] \\odot g'(z^l)\n", | |
"$$\n", | |
"\n", | |
"The __third equation__ gives the value of ẟ for the layer L:\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(L)} = \\nabla_{a^{(L)}} J(\\theta) \\odot g'(z^L)\n", | |
"$$\n", | |
"\n", | |
"\n", | |
"## Conclusion\n", | |
"\n", | |
"This cheat sheet explains the backpropagation algorithm used to train a neural network. I have created this article after following the great [Machine Learning’s course of Professor Andrew Ng](https://www.coursera.org/learn/machine-learning) on Coursera. The conventions used in this article are not exactly the ones used in the course of Professor Ng, nor exactly the ones used in [the article of Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html).\n", | |
"\n", | |
"If you notice any error, please do not hesitate to contact me. \n", | |
"\n", | |
"<div style=\"text-align: right\"> To Victor, Oscar and all those will follow </div>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |