$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}$
$\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}$
$\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
$\newcommand{\blue}[1]{\color{blue}{#1}}$
$\newcommand{\red}[1]{\color{red}{#1}}$
$\newcommand{\numel}[1]{|#1|}$
$\newcommand{\oversize}[3]{\overset{\big[#2 \times #3 \big]}{#1}}$
$\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}$
# Forward Gradient Estimates for Online Learning
## Motivation
In Real-Time Recurrent Learning (RTRL), we propagate the gradient of the state with respect to the parameters forward in time using forward-mode automatic differentiation:
$$
\pderivdim{s_t}{\theta}{|S|}{|\Theta|} = \oversize{\pderiv{s_t}{s_{t-1}}}{|S|}{|S|} \oversize{\pderiv {s_{t-1}}{\theta}}{|S|}{|\Theta|} + \oversize{ \left.\pderiv{s_t}{\theta}\right|_{s_{t-1}}}{|S|}{|\Theta|}
$$
where $\left.\pderiv{s_t}{\theta}\right|_{s_{t-1}}$ denotes the gradient of $s_t$ with respect to $\theta$ while holding $s_{t-1}$ constant.
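As a concrete illustration of this recursion, here is a minimal numpy sketch for an assumed toy state update $s_t = \tanh(W s_{t-1})$ with $\theta = \mathrm{vec}(W)$; the model and sizes are made up for the example, not part of the setup above.

```python
import numpy as np

# A toy sketch of the recursion above for an assumed state update s_t = tanh(W s_{t-1}),
# with theta = vec(W).  J carries ds_t/dtheta (shape [|S| x |Theta|]) forward in time.
rng = np.random.default_rng(0)
S = 4                                  # |S|
W = 0.5 * rng.standard_normal((S, S))  # the parameters, as a matrix
s = rng.standard_normal(S)
J = np.zeros((S, S * S))               # ds_0/dtheta = 0

for t in range(10):
    s_new = np.tanh(W @ s)
    D = np.diag(1.0 - s_new ** 2)            # derivative of tanh at the pre-activation
    ds_dsprev = D @ W                        # [|S| x |S|]:      ds_t/ds_{t-1}
    partial = D @ np.kron(np.eye(S), s)      # [|S| x |Theta|]:  ds_t/dtheta with s_{t-1} held fixed
    J = ds_dsprev @ J + partial              # the RTRL update
    s = s_new
```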
The [UORO paper](https://arxiv.org/abs/1702.05043) makes a rank-1 approximation:
$$
\pderiv{s_t}{\theta} \approx \tilde s \otimes \tilde \theta
$$
where $\tilde s\in \mathbb R^{|S|}$ and $\tilde \theta \in \mathbb R^{|\Theta|}$ are random variables defined such that $\mathbb E[\tilde s \otimes \tilde \theta] = \pderiv{s_t}{\theta}$. In other words, they give an unbiased estimate of the gradient.
We want an algorithm that also generates these $(\tilde s, \tilde \theta)$ pairs, but in a deterministic way, so that their average converges to the true gradient faster (at a rate of $1/t$ rather than $1/\sqrt t$).
In the following, to simplify the problem, we reformulate it without reference to gradients, renaming variables as follows:
\begin{align}
\pderiv {s_t}{\theta}\in \mathbb R^{|S|\times |\Theta|} &\rightarrow A_t \in \mathbb R ^{M \times N}\\
\tilde s_t \in\mathbb R^{|S|} &\rightarrow u_t \in \mathbb R ^M\\
\tilde \theta_t \in\mathbb R^{|\Theta|} &\rightarrow v_t \in \mathbb R ^N\\
\end{align}
## The problem
Given a sequence of matrices:
$$
A_t \in \mathbb R^{M\times N}: t\in \mathbb N
$$
we are looking for a process that produces a series of rank-1 approximations:
\begin{align}
A_t \approx u_t \otimes v_t : u_t \in \mathbb R^M , v_t \in \mathbb R^N
\end{align}
such that:
1. The $(u_t, v_t)$ pairs are generated *online* (i.e. $(u_t, v_t)$ is computed before $A_{t+1}$ becomes available).
2. a. The mean error converges for any $A_t: t\in \mathbb N$:
\begin{align}
\lim_{T\rightarrow \infty} \frac1T \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\| = 0
\end{align}
b. (A stricter requirement) The mean error converges faster than $1/\sqrt T$:
\begin{align}
\lim_{T\rightarrow \infty} \frac1{\sqrt{T}} \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\| = 0
\end{align}
3. Our process stores less than $\mathcal O(M\cdot N)$ memory between iterations. Our end goal is a version of RTRL that is cheaper in both computation and memory, and this goal is undermined if we must store large matrices. TODO: be more precise about the actual requirement here.
## Solutions
### 1. UORO
Algorithm:
\begin{align}
\nu &\leftarrow U(\{-1, +1\})^M \\
u_t &\leftarrow \nu \\
v_t &\leftarrow \nu \cdot A_t
\end{align}
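A minimal numpy sketch of this estimator, together with a Monte-Carlo check that it is unbiased for a fixed $A$ (the sizes and the helper name `uoro_rank1` are assumptions for illustration, not from the paper):

```python
import numpy as np

# A sketch of the UORO-style rank-1 estimate, plus a Monte-Carlo check that it is
# unbiased for a fixed A (sizes and names here are assumptions for illustration).
rng = np.random.default_rng(0)
M, N = 50, 100
A = rng.standard_normal((M, N))

def uoro_rank1(A):
    nu = rng.choice([-1.0, 1.0], size=A.shape[0])   # random sign vector
    return nu, nu @ A                               # u_t = nu, v_t = nu . A_t

est = np.zeros_like(A)
T = 10000
for t in range(1, T + 1):
    u, v = uoro_rank1(A)
    est += (np.outer(u, v) - est) / t               # running mean of u_t (x) v_t
print(np.linalg.norm(est - A) / np.linalg.norm(A))  # shrinks roughly like 1/sqrt(T)
```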
Conditions:
1. **Pass**: Clearly online.
2a. **Pass**: Must converge, since $\mathbb E[u_t\otimes v_t] = A_t$ (as shown in [the paper](https://arxiv.org/abs/1702.05043)).
2b. **Fail always**: See experiments; convergence is $1/\sqrt t$, which is too slow.
3. **Pass**: No memory carried between iterations.
### 2. Orthogonal Sequences
This is a small variation on UORO. The intuition is that UORO draws random rank-1 samples of $A$ whose noise only averages out over time. By choosing the $\nu$ sequence more cleverly, we can cancel that noise explicitly, giving $1/t$ (instead of $1/\sqrt t$) convergence.
We generate a random orthonormal basis $B\in \mathbb R^{M\times M}$, and then generate our rank-1 approximations as:
\begin{align}
\nu &\leftarrow \sqrt M \cdot B_{\bullet, (t \bmod M)}\\
u_t &\leftarrow \nu \\
v_t &\leftarrow \nu \cdot A_t
\end{align}
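A sketch of the same estimator with the deterministic $\nu$ sequence (the orthonormal basis $B$ is built here via a QR factorization of a random matrix; names and sizes are assumptions for illustration):

```python
import numpy as np

# A sketch of the deterministic variant: build a random orthonormal basis B once
# and cycle through its scaled columns instead of sampling nu at random.
rng = np.random.default_rng(0)
M, N = 50, 100
B, _ = np.linalg.qr(rng.standard_normal((M, M)))    # columns of B are orthonormal

def orthogonal_rank1(A, t):
    nu = np.sqrt(M) * B[:, t % M]                   # nu_t = sqrt(M) * B[:, t mod M]
    return nu, nu @ A                               # u_t = nu, v_t = nu . A_t
```

For a fixed $A$, averaging any $M$ consecutive estimates gives $\frac1M\sum_t M\, b_t b_t^\top A = B B^\top A = A$ exactly, which is the sense in which the noise is cancelled rather than merely averaged out.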
Conditions:
1. **Pass**: Clearly online.
2a. **Fail sometimes**: Fails with periodic $A_t$ (see experiment).
2b. **Fail sometimes**: Fails with periodic $A_t$ (see experiment).
3. **Pass**, though we do have to store a fixed $M\times M$ matrix $B$.
### 3. Iterated Convergence
The intuition behind this one is that we keep a running residual $A'$ and alternate between greedily fitting $u$ given $v$ and $v$ given $u$ to it.
Initialization:
\begin{align}
v_0 &\leftarrow \mathcal N(0, 1)^N \\
A' &\leftarrow 0^{M\times N}
\end{align}
Then, at each step $t$:
\begin{align}
A' &\leftarrow A'+ A_t \\
u_t &\leftarrow \argmin{u_t}\left\|A'-u_t\otimes v_{t-1}\right\| = \frac{A'\cdot v_{t-1}}{\|v_{t-1}\|^2}\\
u_t &\leftarrow \frac{u_t}{\|u_t\|} \text{ ... for numerical stability}\\
v_t &\leftarrow \argmin{v_t}\left\|A'-u_t\otimes v_t\right\| = \frac{u_t\cdot A'}{\|u_{t}\|^2} = u_t\cdot A' \\
A' &\leftarrow A' - u_t\otimes v_t
\end{align}
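A minimal numpy sketch of this procedure, written as a generator over the stream of $A_t$ (the function name and structure are assumptions; it just follows the update equations above):

```python
import numpy as np

# A sketch of the iterated procedure as a generator over the stream of A_t.
rng = np.random.default_rng(0)

def iterated_rank1(A_stream, M, N):
    v = rng.standard_normal(N)
    A_res = np.zeros((M, N))                 # A', the running residual
    for A in A_stream:
        A_res += A
        u = A_res @ v / (v @ v)              # argmin_u ||A' - u (x) v||
        u /= np.linalg.norm(u)               # normalize for numerical stability
        v = u @ A_res                        # argmin_v ||A' - u (x) v||, since ||u|| = 1
        A_res -= np.outer(u, v)              # carry the unexplained part forward
        yield u, v
```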
(Note the similarity to herding / Sigma-Delta modulation):
\begin{align}
\phi &\leftarrow \phi + x \\
s &\leftarrow \argmin{s\in\mathbb Z} |\phi-s| \\
\phi &\leftarrow \phi - s\\
\end{align}
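For comparison, a toy sketch of the one-dimensional Sigma-Delta loop (the input values are arbitrary example numbers):

```python
# The same carry-the-residual pattern in one dimension.
phi = 0.0
for x in [0.3, 0.7, 0.4, 0.9]:
    phi += x              # accumulate the signal
    s = round(phi)        # greedy nearest-integer approximation
    phi -= s              # keep the quantization error for the next step
    print(s)              # the running mean of s tracks the running mean of x
```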
Conditions:
1. **Pass**: Clearly online.
2a. **Pass, probably**: See experiment.
2b. **Pass, probably**: Converges at a rate of $1/T$ (see experiment).
3. **Fail**: Requires storing $A'\in \mathbb R^{M\times N}$ between iterations.
# Experiments
We compare the three methods on four different settings of $A_t\in \mathbb R ^{50\times 100}$.
We plot the norm of the mean error as a function of $T$:
$$
(\text{Norm of Mean Error}) = \frac1T \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\|
$$
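Each curve is computed with something like the following (a sketch; the helper name is an assumption):

```python
import numpy as np

# Running norm of the mean error, (1/T) || sum_t (A_t - u_t (x) v_t) ||, for T = 1..len(A_seq).
def norm_of_mean_error(A_seq, uv_seq):
    cum = None
    errors = []
    for t, (A, (u, v)) in enumerate(zip(A_seq, uv_seq), start=1):
        diff = A - np.outer(u, v)
        cum = diff if cum is None else cum + diff
        errors.append(np.linalg.norm(cum) / t)
    return errors
```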
## 1: Fixed A
Fixed matrix: $A_t \leftarrow A_0 \sim \mathcal N (0, 1)^{M\times N}$

## 2: Slowly changing A
If $A_t$ actually corresponds to $\pderiv{s_t}{\theta}$, this somewhat resembles the realistic scenario.
\begin{align}
\alpha &\leftarrow 0.01\\
A_0 &\sim \mathcal N (0, 1)^{M\times N} \\
A_t &\sim (1-\alpha)\cdot A_{t-1} + \alpha\cdot \mathcal N(0, 1)^{M\times N}
\end{align}
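A sketch of how this drifting sequence can be generated ($\alpha$ and the sizes are taken from above; the function name is an assumption):

```python
import numpy as np

# Generate the slowly-drifting A_t used in this experiment.
rng = np.random.default_rng(0)
M, N, alpha = 50, 100, 0.01
A = rng.standard_normal((M, N))          # A_0

def step_A(A):
    return (1 - alpha) * A + alpha * rng.standard_normal((M, N))
```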
 | |
## 3: Completely changing A
$$
A_t \sim \mathcal N(0, 1)^{M\times N}
$$

(Note that the UORO and Orthogonal curves overlay each other here.)
## 4: Periodic A
To highlight a shortcoming of the orthogonal approach, we run an experiment where $A_t$ is periodic with period 100 (note that $M=50$):
$$
A_t \sim \begin{cases}\mathcal N(0, 1)^{M\times N} & \text{if } t<100\\A_{t-100} & \text{otherwise}\end{cases}
$$

We see that our Orthogonal approximation is "biased" because of aliasing between the 50-periodic $\nu_t$ and the 100-periodic $A_t$.
# So... Now what?
Clearly our iterated convergence solution is pretty good. But it has the annoying requirement of $\mathcal O(M\cdot N)$ memory, which can be a lot. Can we find some way to implement the iterated algorithm more efficiently? Is there perhaps a way to trade some bias for faster convergence?