- Input
    - $\vb{x}$: input (vector of features)
    - $\vb{y}$: target output
    - $C(\vb{y}, f^L(\mat{W}^L f^{L-1}(\mat{W}^{L-1} \cdots f^1(\mat{W}^1 \vb{x}) \cdots)))$: loss function or "cost function"
    - $L$: number of layers
    - $\mat{W}^l = (w^l_{jk})$: the weights between layer $l-1$ and layer $l$, where $w^l_{jk}$ is the weight between the $k$-th node in layer $l-1$ and the $j$-th node in layer $l$ (see the initialization sketch below)
    - $f^l$ or $σ^l$: activation function at layer $l$
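To make the index convention concrete: $\mat{W}^l$ has shape $n_l × n_{l-1}$, so row $j$ holds the weights $w^l_{jk}$ feeding node $j$ of layer $l$. A minimal NumPy sketch; the layer sizes and random initialization here are illustrative assumptions, not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative layer sizes n_0, ..., n_L (an assumption, not from the notes).
sizes = [3, 4, 2]

# W^l has shape (n_l, n_{l-1}): entry [j, k] is w^l_jk, the weight from
# node k in layer l-1 to node j in layer l.
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
```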
- Intermediary functions (see the forward-pass sketch after this list)
    - weighted input: $$ \vb{z}^l ≡ \mat{W}^l \vb{a}^{l-1} + \vb{b}^l $$
    - activation at layer $l$: $$ \vb{a}^l ≡ σ(\vb{z}^l) = σ(\mat{W}^l \vb{a}^{l-1} + \vb{b}^l) $$
    - the gradient of the cost with respect to the weighted input of layer $l$: $$ \vb{δ}^l ≡ \pdv{C}{\vb{z}^l} $$
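These intermediary quantities are exactly what a forward pass computes. A minimal sketch continuing the initialization above, assuming a sigmoid for the generic activation $σ$:

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the weighted inputs z^1..z^L and activations a^0..a^L."""
    zs, activations = [], [x]
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b        # z^l = W^l a^{l-1} + b^l
        a = sigmoid(z)       # a^l = sigma(z^l)
        zs.append(z)
        activations.append(a)
    return zs, activations
```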
- Parameters (these drive the descent loop sketched below)
    - $ε$: target precision
    - $γ$: step size
    - $\hat{ι}$: max iterations
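The three parameters govern the stopping behaviour of a plain gradient-descent loop, matching the result contract of the algorithm below ($\mat{W}^*$ with cost below $ε$, or $∅$). A schematic sketch; the callable interface (`cost`, `grad`) is a hypothetical illustration:

```python
def gradient_descent(params, cost, grad, eps, gamma, max_iter):
    """Step against the gradient with step size gamma until the cost
    falls below eps; return None (the empty result) after max_iter tries."""
    for _ in range(max_iter):
        if cost(params) < eps:   # target precision reached
            return params
        params = [p - gamma * g for p, g in zip(params, grad(params))]
    return None
```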
- An equation for the error in the output layer: \begin{equation} \label{eq1}\tag{1} \vb{δ}^L = ∇_a C ⊙ σ'(\vb{z}^L) \end{equation}
- An equation for the error $\vb{δ}^l$ in terms of the error in the next layer: \begin{equation} \label{eq2}\tag{2} \begin{aligned} \vb{δ}^l &= ((\mat{W}^{l+1})^T \vb{δ}^{l+1}) ⊙ σ'(\vb{z}^l) \\ \text{i.e. } δ^l_j &= ∑_k w^{l+1}_{kj} δ^{l+1}_k \, σ'(z^l_j) \end{aligned} \end{equation}
- An equation for the rate of change of the cost with respect to any bias in the network: \begin{equation} \label{eq3}\tag{3} \pdv{C}{\vb{b}^l} = \vb{δ}^l \end{equation}
- An equation for the rate of change of the cost with respect to any weight in the network:
\begin{equation} \label{eq4}\tag{4}
\begin{aligned}
\pdv{C}{\mat{W}^l} &= \vb{δ}^l (\vb{a}^{l-1})^T \\
\text{i.e. } \pdv{C}{w^l_{jk}} &= a^{l-1}_k δ^l_j
\end{aligned}
\end{equation}
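Equations \eqref{eq1}–\eqref{eq4} combine into a single backward pass. A sketch continuing the code above, assuming the quadratic cost $C = \frac{1}{2}\norm{\vb{y} - \vb{a}^L}^2$ (a common choice, not fixed by the notes), for which $∇_a C = \vb{a}^L - \vb{y}$:

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward(x, y, weights, biases):
    """Gradients of C w.r.t. every W^l and b^l via equations (1)-(4);
    the quadratic cost is an assumption, so grad_a C = a^L - y."""
    zs, activations = forward(x, weights, biases)
    grad_W = [None] * len(weights)
    grad_b = [None] * len(biases)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])            # eq. (1)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta                                            # eq. (3)
        grad_W[l] = np.outer(delta, activations[l])                  # eq. (4)
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid_prime(zs[l-1])  # eq. (2)
    return grad_W, grad_b
```

With `cost` and `grad` wrappers around `forward` and `backward`, these pieces slot directly into the `gradient_descent` loop sketched above.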
\begin{algorithm}
\SetAlgoLined
\KwData{$x, y, C, \mat{W}_0, \vb{b}_0, f^l$; given: $σ$, $ε$, $γ$, $\hat{ι}$}
\KwResult{$\mat{W}^*$ s.t. $C(x, y; \mat{W}^*) < ε$, or $∅$}
Set the corresponding activation $\vb{a}^0 ← \vb{x}$ for the input layer\;