{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Cheat sheet on Neural Networks 1 \n", | |
"\n", | |
"The following article is a cheat sheet on neural networks. My sources are based on the following course and article:\n", | |
"- the excellent [Machine Learning course](https://www.coursera.org/learn/machine-learning) on Coursera from Professor Andrew Ng, Stanford,\n", | |
"- the very good [article from Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html), explaining the backpropagation algorithm.\n", | |
"\n", | |
"\n", | |
"## Why the neural networks are powerful ?\n", | |
"\n", | |
"It is proven mathematically that: \n", | |
"\n", | |
"> Suppose we’re given a [continuous] function f(x) which we’d like to compute to within some desired accuracy ϵ>0. The guarantee is that by using enough hidden neurons we can always find a neural network whose output g(x) satisfies:\n", | |
"|g(x)−f(x)|<ϵ, for all inputs x. \n", | |
"\n", | |
"_Michael Nielsen — From the following [article](http://neuralnetworksanddeeplearning.com/chap4.html)_\n", | |
"\n", | |
"## Conventions \n", | |
"Let’s define a neural network with the following convention:\n", | |
"\n", | |
"L = total number of layers in the network. \n", | |
"$s_l$ = number of units (not counting bias unit) in layer l. \n", | |
"K = number of units in output layer ( = $s_L$ ). \n", | |
"\n", | |
"<img src=\"images/Neural_Network_definition.png\" />\n", | |
"\n", | |
"___With___:\n", | |
"\n", | |
"$$ \n", | |
"\\forall l \\in\\ {2, ..., L},\\ a^{(l)} \\in \\mathbb{R}^{s_{l}+1} and\\ \n", | |
"\\begin{cases}\n", | |
"a^{(l)}_0\\ =\\ 1\\ (bias\\ node)\\\\\n", | |
"a^{(l)}_i = g(z^{(l)}_i), \\forall i \\in {1, ..., s_l}\\\\ \n", | |
"\\end{cases}\\\\\n", | |
"z^{(l)}_i = [ \\theta^{(l)}_i ]^T . a^{(l-1)}, \\forall i \\in {1, ..., s_l}\\\\ \n", | |
"with\\ \\theta^{(l)}_i \\in \\mathbb{R}^{s_{l-1}+1}\\ and\\ z^{(l)} \\in \\mathbb{R}^{s_{l}}\\ \n", | |
"$$\n", | |
"\n", | |
"We define the matrix θ of the weights for the layer l as following:\n", | |
"\n", | |
"$$\n", | |
"\\theta^{(l)} \\in \\mathbb{R}^{s_l \\times (s_{(l-1)}+1)}\n", | |
"$$\n", | |
"\n", | |
"$$\n", | |
"\\theta^{(l)} = \n", | |
"\\begin{bmatrix}\n", | |
" [ \\theta^{(l)}_1 ]^T \\\\\n", | |
" [ \\theta^{(l)}_2 ]^T \\\\\n", | |
" \\vdots \\\\\n", | |
" [ \\theta^{(l)}_{s_{l}} ]^T\n", | |
"\\end{bmatrix} =\n", | |
"\\begin{bmatrix}\n", | |
" \\theta_{1,0} & \\dots & \\theta_{1,j} & \\dots & \\theta_{1,s_{l-1}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\theta_{i,0} & \\dots & \\theta_{i,j} & \\dots & \\theta_{i,s_{l-1}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\theta_{s_l,0} & \\dots & \\theta_{s_l,j} & \\dots & \\theta_{s_l,s_{l-1}} \\\\\n", | |
"\\end{bmatrix}\n", | |
"$$\n", | |
"\n", | |
"Hence, we have the following relation: \n", | |
"$$\n", | |
"a^{(l)} =\n", | |
"\\left(\\begin{smallmatrix}1 \\\\ g(\\theta^{(l)}.a^{(l-1)})\\end{smallmatrix}\\right)\n", | |
"$$\n", | |
"\n", | |
"\n", | |
"## The cost function of a Neural Network\n", | |
"\n", | |
"The training set is defined by: $ { (x^1,y^1), ..., (x^m,y^m) } $\n", | |
"\n", | |
"x and y are vectors, with respectively the same dimensions as the input and output layers of the neural network. \n", | |
"\n", | |
"The cost function of a neural network is the following:\n", | |
"\n", | |
"\n", | |
"$$\n", | |
"J(\\theta) = - \\frac{1}{m} \\sum_{i=1}^m \\sum_{k=1}^K \\left[ cost( a^{(L)}_k, y^{(i)}_k) \\right] + \\frac{\\lambda}{2m}\\sum_{l=2}^{L} \\sum_{j=0}^{s_l} \\sum_{i=1}^{s_{l+1}} ( \\theta_{i,j}^{(l)})^2\n", | |
"$$\n", | |
"\n", | |
"$a^{(L)}_k$ is the output of the neural network, and is dependent of the weights 𝜃 of the neural network. \n", | |
"\n", | |
"Now, the objective is to train the neural network and find the minimum of the cost function J(𝜃).\n", | |
"\n", | |
"## Mathematic reminder: the chain rule\n", | |
"\n", | |
"Let’s define the functions f, g and h as following:\n", | |
"\n", | |
"$$ f:\\mathbb{R}^n \\rightarrow \\mathbb{R} $$\n", | |
"\n", | |
"$$ g:\\mathbb{R}^p \\rightarrow \\mathbb{R}^n $$\n", | |
"\n", | |
"$$ h = f \\circ g $$\n", | |
"\n", | |
"The derivative of h is given by the chain rule theorem:\n", | |
"\n", | |
"$$\n", | |
"\\forall_i \\in \\{ \\!1, ... , \\!p \\}, \n", | |
"\\frac{\\partial h}{\\partial x_i} = \n", | |
"\\sum_{k=1}^{n} \\frac{\\partial f}{\\partial g_k} \\frac{\\partial g_k}{\\partial x_i}\n", | |
"$$\n", | |
"\n", | |
"(See the following [course online](https://ocw.mit.edu/courses/mathematics/18-02sc-multivariable-calculus-fall-2010/2.-partial-derivatives/) on partial derivation from the MIT)\n", | |
"\n", | |
"\n", | |
"## The backpropagation algorithm\n", | |
"\n", | |
"We use the __gradient descent__ to find the minimum of J on 𝜃: $ \\min\\limits_{\\theta} J(\\theta)$\n", | |
"\n", | |
"The gradient descent requires to compute: \n", | |
"\n", | |
"$$ \\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} $$\n", | |
"\n", | |
"___In the following parts, we consider only the first part of J(θ) (as if the regularisation term λ=0). The partial derivative of the second term of J(θ) is easy to compute.___\n", | |
"\n", | |
"\n", | |
"### Definition of ẟ\n", | |
"\n", | |
"Let’s define the function ẟ. When ẟ of the layer l is multiplied by the output of the layer (l-1), we obtain the partial derivative of the cost function on θ.\n", | |
"\n", | |
"Let’s use the chain rule and develop this derivative on z:\n", | |
"\n", | |
"$$ \n", | |
"\\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} \n", | |
"=\n", | |
"\\sum^{s_l}_{k = 1} \\frac{\\partial J(\\theta)}{\\partial z^{(l)}_k} \\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}}\n", | |
"$$\n", | |
"\n", | |
"(Remind that J is dependent of z)\n", | |
"\n", | |
"As: \n", | |
"$$z^{(l)}_k = [ \\theta^{(l)}_k ]^T . a^{(l-1)} = \\sum_{p=0}^{s_{l-1}} \\theta^{(l)}_{k,p} \\times a^{(l-1)}_p$$\n", | |
"\n", | |
"$$\\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}} = 0\\ for\\ k\\ ≠\\ i\\ and\\ p\\ ≠\\ j\\ in\\ the\\ sum.$$\n", | |
"\n", | |
"$$And\\ \\frac{\\partial z^{(l)}_k}{\\partial \\theta^{(l)}_{i,j}} = a^{(l-1)}_j\\ for\\ k\\ =\\ i\\ and\\ p\\ =\\ j\\ in\\ the\\ sum.$$\n", | |
"\n", | |
"We define the __output error 𝛿__: \n", | |
"$$ \n", | |
"\\delta^{(l)}_k = \n", | |
"\\frac{\\partial J(\\theta)}{\\partial z^{(l)}_k},\\ \\forall k \\in {1, ..., s_l},\\\n", | |
"and\\ \n", | |
"\\delta^{(l)} = \\nabla_{z^{(l)}} J(\\theta) \\in \\mathbb{R}^{s_{l}}\n", | |
"$$\n", | |
"\n", | |
"So we have:\n", | |
"\n", | |
"---\n", | |
"\n", | |
"$$ \n", | |
"\\frac{\\partial J(\\theta)}{\\partial \\theta^{(l)}_{i,j}} \n", | |
"=\n", | |
"\\delta^{(l)}_i . a^{(l-1)}_j\n", | |
"$$\n", | |
"\n", | |
"---\n", | |
"\n", | |
"### Value of ẟ for the layer L\n", | |
"\n", | |
"Now let’s find 𝛿 for the output layer (layer L):\n", | |
"\n", | |
"$$\n", | |
"\\delta^L_i = \\frac{\\partial J(\\theta)}{\\partial z^{(L)}_i} = \\sum^{s_{L}}_{k = 0} \\frac{\\partial J(\\theta)}{\\partial a^{(L)}_k} \\frac{\\partial a^{(L)}_k}{\\partial z^{(L)}_i}\n", | |
"$$\n", | |
"\n", | |
"As:\n", | |
"\n", | |
"$$\n", | |
"a^{(l)}_k = g(z^{(l)}_k),\\ \\frac{\\partial a^{(L)}_k}{\\partial z^{(L)}_i} = 0\\ if\\ k\\ ≠\\ i\n", | |
"$$\n", | |
"\n", | |
"Hence:\n", | |
"\n", | |
"---\n", | |
"\n", | |
"$$\n", | |
"\\delta^L_i = \n", | |
"\\frac{\\partial J(\\theta)}{\\partial a^{(L)}_i} \\frac{\\partial g(z^{(L)}_i)}{\\partial z^{(L)}_i} =\n", | |
"\\frac{\\partial J(\\theta)}{\\partial a^{(L)}_i} . g'(z^{(L)}_i)\n", | |
"$$\n", | |
"\n", | |
"---\n", | |
"\n", | |
"### Relationship of ẟ between the layer l and the layer (l-1)\n", | |
"\n", | |
"Now, we try to find a relation between 𝛿 of the layer l and 𝛿 of the next layer (l+1):\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(l)}_i = \\frac{\\partial J(\\theta)}{\\partial z^{(l)}_i} = \\sum^{s_{l+1}}_{k = 1} \\frac{\\partial J(\\theta)}{\\partial z^{(l+1)}_k} \\frac{\\partial z^{(l+1)}_k}{\\partial z^{(l)}_i}\n", | |
"$$\n", | |
"\n", | |
"With:\n", | |
"\n", | |
"$$\n", | |
"z^{(l+1)}_k = [ \\theta^{(l+1)}_k ]^T . g(z^{(l)}_p) = \\sum_{p=0}^{s_{l}} \\theta^{(l+1)}_{k,p} \\times g(z^{(l)}_p)\n", | |
"$$\n", | |
"\n", | |
"And:\n", | |
"\n", | |
"$$\n", | |
"\\frac{\\partial z^{(l+1)}_k}{\\partial z^{(l)}_i} = \\theta^{(l+1)}_{k,i} . g'(z^{(l)}_i)\\ \\ for\\ p\\ =\\ i,\\ else\\ 0.\n", | |
"$$\n", | |
"\n", | |
"Hence:\n", | |
"\n", | |
"---\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(l)}_i \n", | |
"=\n", | |
"\\sum^{s_{l+1}}_{k = 0} \\delta^{(l+1)}_k . \\theta^{(l+1)}_{k,i} . g'(z^{(l)}_i) \n", | |
"$$\n", | |
"\n", | |
"---\n", | |
"\n", | |
"### The backpropagation algorithm explained\n", | |
"\n", | |
"We have the following:\n", | |
"- we have found a function ẟ for the layer l such that when we multiply this function by the output of the layer (l-1), we obtain the partial derivative of the cost function J on the weights θ of the layer l,\n", | |
"- the function ẟ for the layer l has a relation with ẟ of the layer (l+1),\n", | |
"- as we have a training set, we can compute the values of ẟ for the layer L.\n", | |
"\n", | |
"So, we start to compute the values of ẟ for the layer L. As we have a relation between ẟ for the layer l and ẟ for the layer (l+1), we can compute the values for the layers (L-1), (L-2), …, 2.\n", | |
"\n", | |
"We can then compute the partial derivative of the cost function J on the weights θ of the layer l, by multiplying ẟ for the layer l by the output of the layer (l-1).\n", | |
"\n", | |
"\n", | |
"### The vectorized backpropagation’s equations\n", | |
"\n", | |
"The __first equation__ gives the partial derivatives of J with respect to θ. We have added the regularization term.\n", | |
"\n", | |
"$$\n", | |
"\\nabla_{\\theta^{(l)}} J(\\theta) = \n", | |
"\\begin{bmatrix}\n", | |
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{1,s_{l-1}}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{i,s_{l-1}}} \\\\\n", | |
" \\vdots & & \\vdots & & \\vdots \\\\\n", | |
" \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,0}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,j}} & \\dots & \\frac{\\partial J}{\\partial \\theta^{(l)}_{s_l,s_{l-1}}} \\\\\n", | |
"\\end{bmatrix}\n", | |
"= \\delta^{(l)} \\otimes [a^{(l-1)}]^T + \\frac{\\lambda}{m} \\theta^{(l)}\n", | |
"$$\n", | |
"\n", | |
"The __second equation__ gives the relation between ẟ in layer l and ẟ in layer (l+1):\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(l)} = [(\\theta^{(l+1)})^T . \\delta^{(l+1)}] \\odot g'(z^l)\n", | |
"$$\n", | |
"\n", | |
"The __third equation__ gives the value of ẟ for the layer L:\n", | |
"\n", | |
"$$\n", | |
"\\delta^{(L)} = \\nabla_{a^{(L)}} J(\\theta) \\odot g'(z^L)\n", | |
"$$\n", | |
"\n", | |
"\n", | |
"## Conclusion\n", | |
"\n", | |
"This cheat sheet explains the backpropagation algorithm used to train a neural network. I have created this article after following the great [Machine Learning’s course of Professor Andrew Ng](https://www.coursera.org/learn/machine-learning) on Coursera. The conventions used in this article are not exactly the ones used in the course of Professor Ng, nor exactly the ones used in [the article of Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html).\n", | |
"\n", | |
"If you notice any error, please do not hesitate to contact me. \n", | |
"\n", | |
"<div style=\"text-align: right\"> To Victor, Oscar and all those will follow </div>" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 1 | |
} |