# 2018-09-12 Why Alignment?
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3 &\text{otherwise}\end{cases}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\overlabel}[2]{\overset{#1}{\overline{#2}}}
\newcommand{\loss}[0]{\mathcal L}
$$
This exploration started when we were trying to optimize a network with layerwise targets. We observed that we were getting the same performance from optimizing only the local losses as we were from optimizing the global loss.
# The Alignment Surprise
We have a network defined by:
\begin{align}
s_1 &= h(x, \phi_1) \\
s_2 &= h(s_1, \phi_2) \\
&\;\;\vdots \\
s_L &= h(s_{L-1}, \phi_L)
\end{align}
We then define a set of layerwise targets $s_1^*, \ldots, s_L^*$, and a loss based on the distance from these targets:
$$
\mathcal L = \sum_{i=1}^L \ell_i = \sum_{i=1}^L\|s_i - s_i^*\|^2
$$
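To make the setup concrete, here is a minimal sketch in NumPy. Treating $h$ as a tanh layer, and the particular widths, seed, and helper names, are illustrative assumptions, not details from the original experiment:

```python
# A minimal sketch of the setup. Treating h as a tanh layer is an
# assumption for illustration; widths and seeds are arbitrary.
import numpy as np

rng = np.random.RandomState(0)
n_layers, width = 20, 50

x = rng.randn(width)
phis = [rng.randn(width, width) / np.sqrt(width) for _ in range(n_layers)]
targets = [rng.randn(width) for _ in range(n_layers)]  # random targets s_i^*

def forward(x, phis):
    """Return the activations s_1, ..., s_L, with s_i = h(s_{i-1}, phi_i)."""
    ss, s = [], x
    for phi in phis:
        s = np.tanh(phi @ s)
        ss.append(s)
    return ss

def total_loss(ss, targets):
    """L = sum_i ||s_i - s_i^*||^2"""
    return sum(np.sum((s - t) ** 2) for s, t in zip(ss, targets))

print(total_loss(forward(x, phis), targets))
```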
For a given layer's parameters $\phi_i$, the loss-gradient is the sum of a "local" component and "distant" components:
\begin{align}
\lderiv{\phi_i} &= \overlabel{global}{ \overlabel{local}{\pderiv{\ell_i}{\phi_i}} + \sum_{j:j>i} \overlabel{distant}{\pderiv{\ell_j}{\phi_i}}}
= \overlabel{global}{ \overlabel{local}{\pderiv{\ell_i}{s_i}\pderiv{s_i}{\phi_i}} + \sum_{j:j>i} \overlabel{distant}{\pderiv{\ell_j}{s_j}\pderiv{s_j}{s_i}\pderiv{s_i}{\phi_i}}}
\end{align}
The **surprise** is that we empirically find that our *local* gradient usually aligns not only with the *global* gradient but also with the *distant* gradients, i.e.
$$
\mathcal S\left(\pderiv{\ell_i}{\phi_i}, \pderiv{\ell_j}{\phi_i}\right) \overset{\text{usually}}{>} 0 \quad \text{for } j>i
$$
where $\mathcal S$ is the cosine-similarity metric. For example, with a randomly generated twenty-layer network with random targets, it is almost always the case that all $\frac{19\cdot 20}{2}$ pairwise inter-layer gradients are positively aligned. The strength of the similarity ranges from ~0.65 when $j=i+1$ to ~0.14 when $j=i+19$.
This is unintuitive: why should the gradient due to the losses in downstream layers happen to align with the gradient of the local layer's loss? Remember that the "targets" we are using for the loss are just randomly generated points.
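One way to check this claim numerically is a sketch like the following, using PyTorch autograd; the tanh architecture and dimensions are assumptions carried over from the sketch above, not from the original experiment:

```python
# Sketch: measure the alignment S(d ell_i / d phi_i, d ell_j / d phi_i)
# with autograd. The tanh layers and dimensions are assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_layers, width = 20, 50

x = torch.randn(width)
phis = [(torch.randn(width, width) / width ** 0.5).requires_grad_()
        for _ in range(n_layers)]
targets = [torch.randn(width) for _ in range(n_layers)]

ss, s = [], x
for phi in phis:          # forward pass, keeping every activation s_i
    s = torch.tanh(phi @ s)
    ss.append(s)
losses = [torch.sum((s - t) ** 2) for s, t in zip(ss, targets)]

def grad(loss_j, phi_i):
    """d ell_j / d phi_i as a flat vector."""
    g, = torch.autograd.grad(loss_j, phi_i, retain_graph=True)
    return g.flatten()

i = 0  # inspect the first layer; any i < j works
local = grad(losses[i], phis[i])
for j in range(i + 1, n_layers):
    distant = grad(losses[j], phis[i])
    sim = F.cosine_similarity(local, distant, dim=0).item()
    print(f"S(local, distant from layer {j + 1}) = {sim:.3f}")
```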
# Understanding the alignment
Suppose we have a two-layer linear network with randomly-drawn parameters $w$ which produces activations $s$:
\begin{align}
s_1 &= w_1 x \\
s_2 &= w_2 s_1
\end{align}
We randomly generate targets $s^*$, and define layerwise losses:
\begin{align}
\ell_1 &= \|s_1 - s_1^* \|^2 \\
\ell_2 &= \|s_2 - s_2^* \|^2
\end{align}
We observe that:
$$
\mathcal S\left(\pderiv{\ell_1}{w_1}, \pderiv{\ell_2}{w_1}\right) \overset{\text{usually}}{>} 0
$$
where $\mathcal S$ is the cosine-similarity. It seems that our "local" gradient $\pderiv{\ell_1}{w_1}$ tends to be aligned with our "distant" gradient $\pderiv{\ell_2}{w_1}$.
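A quick numerical check of this observation, using the closed-form gradients of the linear network (they are written out in the next section); the dimensions and trial count here are arbitrary choices:

```python
# Sketch: check that the local gradient d ell_1/d w_1 and the distant
# gradient d ell_2/d w_1 are usually positively aligned.
import numpy as np

rng = np.random.RandomState(0)
d, n_trials = 30, 1000

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = []
for _ in range(n_trials):
    x = rng.randn(d)
    w1 = rng.randn(d, d) / np.sqrt(d)
    w2 = rng.randn(d, d) / np.sqrt(d)
    s1, s2 = w1 @ x, w2 @ w1 @ x
    s1t, s2t = rng.randn(d), rng.randn(d)           # random targets
    g_local = 2 * np.outer(s1 - s1t, x)             # d ell_1 / d w_1
    g_distant = 2 * np.outer(w2.T @ (s2 - s2t), x)  # d ell_2 / d w_1
    sims.append(cosine(g_local, g_distant))

print("fraction positively aligned:", np.mean(np.array(sims) > 0))
```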
# So what is going on?
Let's write out the gradients:
\begin{align}
\text{local:} \quad \pderiv{\ell_1}{w_1}^T &= \pderiv{\ell_1}{s_1} \pderiv{s_1}{w_1} \\
&= (s_1 - s_1^*)^T x \\
&= \overlabel{internal_1}{s_1^T x} - \overlabel{external_1}{s_1^{*T} x} \\
\text{distant:} \quad \pderiv{\ell_2}{w_1}^T &= \pderiv{\ell_2}{s_2} \pderiv{s_2}{s_1} \pderiv{s_1}{w_1} \\
&= ((s_2 - s_2^*) \cdot w_2^T)^T x \\
&= \overlabel{internal_2}{((s_1 \cdot w_2) \cdot w_2^T)^T x} - \overlabel{external_2}{(s_2^* w_2^T)^T x}
\end{align}
Here we can see the cause of the alignment. Each gradient is composed of two terms: an *internal* term, which depends only on the input (and weights), and an *external* term, which also depends on the target. While the *external* terms are multiplied by the (arbitrary) targets, the *internal* terms of the two loss-gradients tend to be aligned.
Namely, if $w_2$ acts like an autoencoder for inputs $s_1$, then $s_1 w_2 w_2^T \approx s_1$, and therefore the internal terms of the two loss-gradients align: $s_1 w_2 w_2^T x \propto s_1 x$.
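As a sanity check of this claim, here is a sketch with an exactly orthogonal $w_2$, an idealized stand-in for the autoencoder-like case (so $w_2^T w_2 = I$); the dimensions are arbitrary:

```python
# Sketch: with an exactly orthogonal w2 (so w2.T @ w2 = I), the internal
# term of the distant gradient equals the internal term of the local one.
import numpy as np

rng = np.random.RandomState(1)
d = 30
x = rng.randn(d)
w1 = rng.randn(d, d) / np.sqrt(d)
w2, _ = np.linalg.qr(rng.randn(d, d))  # orthogonal matrix

s1 = w1 @ x
internal_1 = np.outer(s1, x)                # s_1 x^T
internal_2 = np.outer(w2.T @ (w2 @ s1), x)  # (w2^T w2 s_1) x^T
print(np.allclose(internal_1, internal_2))  # True: the terms coincide
```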
The alignment happens so long as:
\begin{align}
\left(\overlabel{in_1}{s_1^T x} - \overlabel{ex_1}{s_1^{*T} x}\right)\cdot \left( \overlabel{in_2}{((s_1 \cdot w_2) \cdot w_2^T)^T x} - \overlabel{ex_2}{(s_2^* w_2^T)^T x}\right) &> 0 \\
in_1\cdot in_2 - in_1\cdot ex_2 - ex_1\cdot in_2 + ex_1\cdot ex_2 &> 0
\end{align}
So alignment is strongest when the internal terms are aligned with each other, the external terms are aligned with each other, and the internal term from each layer is anti-aligned with the external term from the other layer.
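A short sketch verifying that the expansion is exact and that the internal-internal term is always positive (in the column-vector convention used here, $in_1 \cdot in_2 = \|w_2 s_1\|^2 \|x\|^2$); the dimensions are again arbitrary:

```python
# Sketch: verify the four-term expansion and the sign of the
# internal-internal term, in the column-vector convention.
import numpy as np

rng = np.random.RandomState(2)
d = 30
x = rng.randn(d)
w1 = rng.randn(d, d) / np.sqrt(d)
w2 = rng.randn(d, d) / np.sqrt(d)
s1 = w1 @ x
s1t, s2t = rng.randn(d), rng.randn(d)

in1 = np.outer(s1, x)                # internal term, layer 1
ex1 = np.outer(s1t, x)               # external term, layer 1
in2 = np.outer(w2.T @ (w2 @ s1), x)  # internal term, layer 2
ex2 = np.outer(w2.T @ s2t, x)        # external term, layer 2

dot = lambda a, b: float(np.sum(a * b))
total = dot(in1 - ex1, in2 - ex2)
expanded = dot(in1, in2) - dot(in1, ex2) - dot(ex1, in2) + dot(ex1, ex2)
print(np.isclose(total, expanded))  # the expansion is exact
print(dot(in1, in2) > 0)            # in1.in2 = ||w2 s1||^2 ||x||^2 > 0
```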
# What does this tell us?