# 2018-09-12 Why Alignment?
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3 &\text{otherwise}\end{cases}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\overlabel}[2]{\overset{#1}{\overline{#2}}}
\newcommand{\loss}[0]{\mathcal L}
$$
This exploration started when we were trying to optimize a network with layerwise targets. We observed that we were getting the same performance from optimizing only the local losses as we were from optimizing the global loss.
# The Alignment Surprise
We have a network defined by:
\begin{align}
s_1 &= h(x, \phi_1) \\
s_2 &= h(s_1, \phi_2) \\
&\;\;\vdots \\
s_L &= h(s_{L-1}, \phi_L)
\end{align}
We then define a set of layerwise targets $s_1^*, \ldots, s_L^*$, and a loss based on the distance from these targets:
$$
\mathcal L = \sum_{i=1}^L \ell_i = \sum_{i=1}^L\|s_i - s_i^*\|^2
$$
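To make the setup concrete, here is a minimal sketch in NumPy. Treating $h$ as a tanh layer, and the particular widths, seed, and helper names, are illustrative assumptions, not details from the original experiment:

```python
# A minimal sketch of the setup. Treating h as a tanh layer is an
# assumption for illustration; widths and seeds are arbitrary.
import numpy as np

rng = np.random.RandomState(0)
n_layers, width = 20, 50

x = rng.randn(width)
phis = [rng.randn(width, width) / np.sqrt(width) for _ in range(n_layers)]
targets = [rng.randn(width) for _ in range(n_layers)]  # random targets s_i^*

def forward(x, phis):
    """Return the activations s_1, ..., s_L, with s_i = h(s_{i-1}, phi_i)."""
    ss, s = [], x
    for phi in phis:
        s = np.tanh(phi @ s)
        ss.append(s)
    return ss

def total_loss(ss, targets):
    """L = sum_i ||s_i - s_i^*||^2"""
    return sum(np.sum((s - t) ** 2) for s, t in zip(ss, targets))

print(total_loss(forward(x, phis), targets))
```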
For a given layer's parameters $\phi_i$, the loss-gradient is the sum of a "local" component and "distant" components:
\begin{align}
\lderiv{\phi_i} &= \overlabel{global}{ \overlabel{local}{\pderiv{\ell_i}{\phi_i}} + \sum_{j:j>i} \overlabel{distant}{\pderiv{\ell_j}{\phi_i}}}
= \overlabel{global}{ \overlabel{local}{\pderiv{\ell_i}{s_i}\pderiv{s_i}{\phi_i}} + \sum_{j:j>i} \overlabel{distant}{\pderiv{\ell_j}{s_j}\pderiv{s_j}{s_i}\pderiv{s_i}{\phi_i}}}
\end{align}
The **surprise** is that we empirically find that our *local* gradient usually aligns not only with the *global* gradient but also with the *distant* gradients, i.e.
$$
\mathcal S\left(\pderiv{\ell_i}{\phi_i}, \pderiv{\ell_j}{\phi_i}\right) \overset{\text{usually}}{>} 0 \quad \text{for } j>i
$$
where $\mathcal S$ is the cosine-similarity metric. For example, with a randomly generated twenty-layer network with random targets, it is almost always the case that all $\frac{19\cdot 20}{2}$ pairwise inter-layer gradients are positively aligned. The strength of the similarity ranges from ~0.65 when $j=i+1$ to ~0.14 when $j=i+19$.
This is unintuitive: why should the gradient due to the losses in downstream layers happen to align with the gradient of the local layer's loss? Remember that the "targets" we are using for the loss are just randomly generated points.
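One way to check this claim numerically is a sketch like the following, using PyTorch autograd; the tanh architecture and dimensions are assumptions carried over from the sketch above, not from the original experiment:

```python
# Sketch: measure the alignment S(d ell_i / d phi_i, d ell_j / d phi_i)
# with autograd. The tanh layers and dimensions are assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_layers, width = 20, 50

x = torch.randn(width)
phis = [(torch.randn(width, width) / width ** 0.5).requires_grad_()
        for _ in range(n_layers)]
targets = [torch.randn(width) for _ in range(n_layers)]

ss, s = [], x
for phi in phis:          # forward pass, keeping every activation s_i
    s = torch.tanh(phi @ s)
    ss.append(s)
losses = [torch.sum((s - t) ** 2) for s, t in zip(ss, targets)]

def grad(loss_j, phi_i):
    """d ell_j / d phi_i as a flat vector."""
    g, = torch.autograd.grad(loss_j, phi_i, retain_graph=True)
    return g.flatten()

i = 0  # inspect the first layer; any i < j works
local = grad(losses[i], phis[i])
for j in range(i + 1, n_layers):
    distant = grad(losses[j], phis[i])
    sim = F.cosine_similarity(local, distant, dim=0).item()
    print(f"S(local, distant from layer {j + 1}) = {sim:.3f}")
```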
# Understanding the alignment
Suppose we have a two-layer linear network with randomly-drawn parameters $w$ which produces activations $s$:
\begin{align}
s_1 &= w_1 x \\
s_2 &= w_2 s_1
\end{align}
We randomly generate targets $s^*$, and define layerwise losses:
\begin{align}
\ell_1 &= \|s_1 - s_1^* \|^2 \\
\ell_2 &= \|s_2 - s_2^* \|^2
\end{align}
We observe that:
$$
\mathcal S\left(\pderiv{\ell_1}{w_1}, \pderiv{\ell_2}{w_1}\right) \overset{\text{usually}}{>} 0
$$
where $\mathcal S$ is the cosine-similarity. It seems that our "local" gradient $\pderiv{\ell_1}{w_1}$ tends to be aligned with our "distant" gradient $\pderiv{\ell_2}{w_1}$.
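A quick numerical check of this observation, using the closed-form gradients of the linear network (they are written out in the next section); the dimensions and trial count here are arbitrary choices:

```python
# Sketch: check that the local gradient d ell_1/d w_1 and the distant
# gradient d ell_2/d w_1 are usually positively aligned.
import numpy as np

rng = np.random.RandomState(0)
d, n_trials = 30, 1000

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = []
for _ in range(n_trials):
    x = rng.randn(d)
    w1 = rng.randn(d, d) / np.sqrt(d)
    w2 = rng.randn(d, d) / np.sqrt(d)
    s1, s2 = w1 @ x, w2 @ w1 @ x
    s1t, s2t = rng.randn(d), rng.randn(d)           # random targets
    g_local = 2 * np.outer(s1 - s1t, x)             # d ell_1 / d w_1
    g_distant = 2 * np.outer(w2.T @ (s2 - s2t), x)  # d ell_2 / d w_1
    sims.append(cosine(g_local, g_distant))

print("fraction positively aligned:", np.mean(np.array(sims) > 0))
```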
# So what is going on?
Let's write out the gradients:
\begin{align}
\text{local:} \quad \pderiv{\ell_1}{w_1}^T &= \pderiv{\ell_1}{s_1} \pderiv{s_1}{w_1} \\
&= (s_1 - s_1^*)^T x \\
&= \overlabel{internal_1}{s_1^T x} - \overlabel{external_1}{s_1^{*T} x} \\
\text{distant:} \quad \pderiv{\ell_2}{w_1}^T &= \pderiv{\ell_2}{s_2} \pderiv{s_2}{s_1} \pderiv{s_1}{w_1} \\
&= ((s_2 - s_2^*) \cdot w_2^T)^T x \\
&= \overlabel{internal_2}{((s_1 \cdot w_2) \cdot w_2^T)^T x} - \overlabel{external_2}{(s_2^* w_2^T)^T x}
\end{align}
Here we can see the cause of the alignment. Each gradient is composed of two terms: an *internal* term, which depends only on the input (and weights), and an *external* term, which also depends on the target. While the *external* terms are multiplied by the (arbitrary) targets, the *internal* terms of the two loss-gradients tend to be aligned.
Namely, if $w_2$ acts like an autoencoder for inputs $s_1$, then $s_1 w_2 w_2^T \approx s_1$, and therefore the internal terms of the two loss-gradients align: $s_1 w_2 w_2^T x \propto s_1 x$.
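As a sanity check of this claim, here is a sketch with an exactly orthogonal $w_2$, an idealized stand-in for the autoencoder-like case (so $w_2^T w_2 = I$); the dimensions are arbitrary:

```python
# Sketch: with an exactly orthogonal w2 (so w2.T @ w2 = I), the internal
# term of the distant gradient equals the internal term of the local one.
import numpy as np

rng = np.random.RandomState(1)
d = 30
x = rng.randn(d)
w1 = rng.randn(d, d) / np.sqrt(d)
w2, _ = np.linalg.qr(rng.randn(d, d))  # orthogonal matrix

s1 = w1 @ x
internal_1 = np.outer(s1, x)                # s_1 x^T
internal_2 = np.outer(w2.T @ (w2 @ s1), x)  # (w2^T w2 s_1) x^T
print(np.allclose(internal_1, internal_2))  # True: the terms coincide
```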
The alignment happens so long as:
\begin{align}
\left(\overlabel{in_1}{s_1^T x} - \overlabel{ex_1}{s_1^{*T} x}\right)\cdot \left( \overlabel{in_2}{((s_1 \cdot w_2) \cdot w_2^T)^T x} - \overlabel{ex_2}{(s_2^* w_2^T)^T x}\right) &> 0 \\
in_1\cdot in_2 - in_1\cdot ex_2 - ex_1\cdot in_2 + ex_1\cdot ex_2 &> 0
\end{align}
So alignment is strongest when the internal terms are aligned with each other, the external terms are aligned with each other, and the internal term from each layer is anti-aligned with the external term from the other layer.
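A short sketch verifying that the expansion is exact and that the internal-internal term is always positive (in the column-vector convention used here, $in_1 \cdot in_2 = \|w_2 s_1\|^2 \|x\|^2$); the dimensions are again arbitrary:

```python
# Sketch: verify the four-term expansion and the sign of the
# internal-internal term, in the column-vector convention.
import numpy as np

rng = np.random.RandomState(2)
d = 30
x = rng.randn(d)
w1 = rng.randn(d, d) / np.sqrt(d)
w2 = rng.randn(d, d) / np.sqrt(d)
s1 = w1 @ x
s1t, s2t = rng.randn(d), rng.randn(d)

in1 = np.outer(s1, x)                # internal term, layer 1
ex1 = np.outer(s1t, x)               # external term, layer 1
in2 = np.outer(w2.T @ (w2 @ s1), x)  # internal term, layer 2
ex2 = np.outer(w2.T @ s2t, x)        # external term, layer 2

dot = lambda a, b: float(np.sum(a * b))
total = dot(in1 - ex1, in2 - ex2)
expanded = dot(in1, in2) - dot(in1, ex2) - dot(ex1, in2) + dot(ex1, ex2)
print(np.isclose(total, expanded))  # the expansion is exact
print(dot(in1, in2) > 0)            # in1.in2 = ||w2 s1||^2 ||x||^2 > 0
```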
# What does this tell us?