$\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}$
$\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}$
$\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}$
$\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}$
$\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}$
$\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$
$\newcommand{\blue}[1]{\color{blue}{#1}}$
$\newcommand{\red}[1]{\color{red}{#1}}$
$\newcommand{\numel}[1]{|#1|}$
$\newcommand{\oversize}[3]{\overset{\big[#2 \times #3 \big]}{#1}}$
$\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}$
# Forward Gradient Estimates for Online Learning
## Motivation
In Real-Time Recurrent Learning (RTRL), we propagate the gradient of the state with respect to the parameters forward in time using forward-mode automatic differentiation:
$$
\pderivdim{s_t}{\theta}{|S|}{|\Theta|} = \oversize{\pderiv{s_t}{s_{t-1}}}{|S|}{|S|} \oversize{\pderiv {s_{t-1}}{\theta}}{|S|}{|\Theta|} + \oversize{ \left.\pderiv{s_t}{\theta}\right|_{s_{t-1}}}{|S|}{|\Theta|}
$$
where $\left.\pderiv{s_t}{\theta}\right|_{s_{t-1}}$ denotes the gradient of $s_t$ with respect to $\theta$ while holding $s_{t-1}$ constant.
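As a concrete illustration of this recursion, here is a minimal numpy sketch for an assumed toy state update $s_t = \tanh(W s_{t-1})$ with $\theta = \mathrm{vec}(W)$; the model and sizes are made up for the example, not part of the setup above.

```python
import numpy as np

# A toy sketch of the recursion above for an assumed state update s_t = tanh(W s_{t-1}),
# with theta = vec(W).  J carries ds_t/dtheta (shape [|S| x |Theta|]) forward in time.
rng = np.random.default_rng(0)
S = 4                                  # |S|
W = 0.5 * rng.standard_normal((S, S))  # the parameters, as a matrix
s = rng.standard_normal(S)
J = np.zeros((S, S * S))               # ds_0/dtheta = 0

for t in range(10):
    s_new = np.tanh(W @ s)
    D = np.diag(1.0 - s_new ** 2)            # derivative of tanh at the pre-activation
    ds_dsprev = D @ W                        # [|S| x |S|]:      ds_t/ds_{t-1}
    partial = D @ np.kron(np.eye(S), s)      # [|S| x |Theta|]:  ds_t/dtheta with s_{t-1} held fixed
    J = ds_dsprev @ J + partial              # the RTRL update
    s = s_new
```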
The [UORO paper](https://arxiv.org/abs/1702.05043) makes a rank-1 approximation:
$$
\pderiv{s_t}{\theta} \approx \tilde s \otimes \tilde \theta
$$
where $\tilde s\in \mathbb R^{|S|}$ and $\tilde \theta \in \mathbb R^{|\Theta|}$ are random variables defined such that $\mathbb E[\tilde s \otimes \tilde \theta] = \pderiv{s_t}{\theta}$. In other words, they give an unbiased estimate of the gradient.
We want an algorithm that also generates these $(\tilde s, \tilde \theta)$ pairs, but in a deterministic way, so that their average converges to the true gradient faster (at a rate of $1/t$ rather than $1/\sqrt t$).
In the following, to simplify the problem, we reformulate it without reference to gradients, renaming variables as follows:
\begin{align}
\pderiv {s_t}{\theta}\in \mathbb R^{|S|\times |\Theta|} &\rightarrow A_t \in \mathbb R ^{M \times N}\\
\tilde s_t \in\mathbb R^{|S|} &\rightarrow u_t \in \mathbb R ^M\\
\tilde \theta_t \in\mathbb R^{|\Theta|} &\rightarrow v_t \in \mathbb R ^N\\
\end{align}
## The problem
Given a sequence of matrices:
$$
A_t \in \mathbb R^{M\times N}: t\in \mathbb N
$$
we are looking for a process that produces a series of rank-1 approximations:
\begin{align}
A_t \approx u_t \otimes v_t : u_t \in \mathbb R^M , v_t \in \mathbb R^N
\end{align}
such that:
1. The $(u_t, v_t)$ pairs are generated *online* (i.e. $(u_t, v_t)$ is computed before $A_{t+1}$ becomes available).
2. a. The mean error converges for any $A_t: t\in \mathbb N$:
\begin{align}
\lim_{T\rightarrow \infty} \frac1T \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\| = 0
\end{align}
b. (A stricter requirement) The mean error converges faster than $1/\sqrt T$:
\begin{align}
\lim_{T\rightarrow \infty} \frac1{\sqrt{T}} \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\| = 0
\end{align}
3. Our process stores less than $\mathcal O(M\cdot N)$ memory between iterations. Our end goal is a version of RTRL that is cheaper in both computation and memory, and this goal is undermined if we must store large matrices. TODO: be more precise about the actual requirement here.
## Solutions
### 1. UORO
Algorithm:
\begin{align}
\nu &\leftarrow U(\{-1, +1\})^M \\
u_t &\leftarrow \nu \\
v_t &\leftarrow \nu \cdot A_t
\end{align}
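A minimal numpy sketch of this estimator, together with a Monte-Carlo check that it is unbiased for a fixed $A$ (the sizes and the helper name `uoro_rank1` are assumptions for illustration, not from the paper):

```python
import numpy as np

# A sketch of the UORO-style rank-1 estimate, plus a Monte-Carlo check that it is
# unbiased for a fixed A (sizes and names here are assumptions for illustration).
rng = np.random.default_rng(0)
M, N = 50, 100
A = rng.standard_normal((M, N))

def uoro_rank1(A):
    nu = rng.choice([-1.0, 1.0], size=A.shape[0])   # random sign vector
    return nu, nu @ A                               # u_t = nu, v_t = nu . A_t

est = np.zeros_like(A)
T = 10000
for t in range(1, T + 1):
    u, v = uoro_rank1(A)
    est += (np.outer(u, v) - est) / t               # running mean of u_t (x) v_t
print(np.linalg.norm(est - A) / np.linalg.norm(A))  # shrinks roughly like 1/sqrt(T)
```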
Conditions:
1. **Pass**: Clearly online.
2a. **Pass**: Must converge, since $\mathbb E[u_t\otimes v_t] = A_t$ (as shown in [the paper](https://arxiv.org/abs/1702.05043)).
2b. **Fail always**: See experiments; convergence is $1/\sqrt t$, which is too slow.
3. **Pass**: No memory carried between iterations.
### 2. Orthogonal Sequences
This is a small variation on UORO. The intuition is that UORO draws random rank-1 samples of $A$ whose noise only averages out over time. By choosing the $\nu$ sequence more cleverly, we can cancel that noise explicitly, giving $1/t$ (instead of $1/\sqrt t$) convergence.
We generate a random orthonormal basis $B\in \mathbb R^{M\times M}$, and then generate our rank-1 approximations as:
\begin{align}
\nu &\leftarrow \sqrt M \cdot B_{\bullet, (t \bmod M)}\\
u_t &\leftarrow \nu \\
v_t &\leftarrow \nu \cdot A_t
\end{align}
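A sketch of the same estimator with the deterministic $\nu$ sequence (the orthonormal basis $B$ is built here via a QR factorization of a random matrix; names and sizes are assumptions for illustration):

```python
import numpy as np

# A sketch of the deterministic variant: build a random orthonormal basis B once
# and cycle through its scaled columns instead of sampling nu at random.
rng = np.random.default_rng(0)
M, N = 50, 100
B, _ = np.linalg.qr(rng.standard_normal((M, M)))    # columns of B are orthonormal

def orthogonal_rank1(A, t):
    nu = np.sqrt(M) * B[:, t % M]                   # nu_t = sqrt(M) * B[:, t mod M]
    return nu, nu @ A                               # u_t = nu, v_t = nu . A_t
```

For a fixed $A$, averaging any $M$ consecutive estimates gives $\frac1M\sum_t M\, b_t b_t^\top A = B B^\top A = A$ exactly, which is the sense in which the noise is cancelled rather than merely averaged out.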
Conditions:
1. **Pass**: Clearly online.
2a. **Fail sometimes**: Fails with periodic $A_t$ (see experiment).
2b. **Fail sometimes**: Fails with periodic $A_t$ (see experiment).
3. **Pass**, though we do have to store a fixed $M\times M$ matrix $B$.
### 3. Iterated Convergence
The intuition behind this one is that we keep a running residual $A'$ and alternate between greedily fitting $u$ given $v$ and $v$ given $u$ to it.
Initialization:
\begin{align}
v_0 &\leftarrow \mathcal N(0, 1)^N \\
A' &\leftarrow 0^{M\times N}
\end{align}
Then, at each step $t$:
\begin{align}
A' &\leftarrow A'+ A_t \\
u_t &\leftarrow \argmin{u_t}\left\|A'-u_t\otimes v_{t-1}\right\| = \frac{A'\cdot v_{t-1}}{\|v_{t-1}\|^2}\\
u_t &\leftarrow \frac{u_t}{\|u_t\|} \text{ ... for numerical stability}\\
v_t &\leftarrow \argmin{v_t}\left\|A'-u_t\otimes v_t\right\| = \frac{u_t\cdot A'}{\|u_{t}\|^2} = u_t\cdot A' \\
A' &\leftarrow A' - u_t\otimes v_t
\end{align}
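A minimal numpy sketch of this procedure, written as a generator over the stream of $A_t$ (the function name and structure are assumptions; it just follows the update equations above):

```python
import numpy as np

# A sketch of the iterated procedure as a generator over the stream of A_t.
rng = np.random.default_rng(0)

def iterated_rank1(A_stream, M, N):
    v = rng.standard_normal(N)
    A_res = np.zeros((M, N))                 # A', the running residual
    for A in A_stream:
        A_res += A
        u = A_res @ v / (v @ v)              # argmin_u ||A' - u (x) v||
        u /= np.linalg.norm(u)               # normalize for numerical stability
        v = u @ A_res                        # argmin_v ||A' - u (x) v||, since ||u|| = 1
        A_res -= np.outer(u, v)              # carry the unexplained part forward
        yield u, v
```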
(Note the similarity to herding / Sigma-Delta modulation):
\begin{align}
\phi &\leftarrow \phi + x \\
s &\leftarrow \argmin{s\in\mathbb Z} |\phi-s| \\
\phi &\leftarrow \phi - s\\
\end{align}
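For comparison, a toy sketch of the one-dimensional Sigma-Delta loop (the input values are arbitrary example numbers):

```python
# The same carry-the-residual pattern in one dimension.
phi = 0.0
for x in [0.3, 0.7, 0.4, 0.9]:
    phi += x              # accumulate the signal
    s = round(phi)        # greedy nearest-integer approximation
    phi -= s              # keep the quantization error for the next step
    print(s)              # the running mean of s tracks the running mean of x
```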
Conditions:
1. **Pass**: Clearly online.
2a. **Pass, probably**: See experiment.
2b. **Pass, probably**: Converges at a rate of $1/T$ (see experiment).
3. **Fail**: Requires storing $A'\in \mathbb R^{M\times N}$ between iterations.
# Experiments
We compare the three methods on four different settings of $A_t\in \mathbb R ^{50\times 100}$.
We plot the norm of the mean error as a function of $T$:
$$
(\text{Norm of Mean Error}) = \frac1T \left\| \sum_t^T \left(A_t - u_t \otimes v_t\right)\right\|
$$
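Each curve is computed with something like the following (a sketch; the helper name is an assumption):

```python
import numpy as np

# Running norm of the mean error, (1/T) || sum_t (A_t - u_t (x) v_t) ||, for T = 1..len(A_seq).
def norm_of_mean_error(A_seq, uv_seq):
    cum = None
    errors = []
    for t, (A, (u, v)) in enumerate(zip(A_seq, uv_seq), start=1):
        diff = A - np.outer(u, v)
        cum = diff if cum is None else cum + diff
        errors.append(np.linalg.norm(cum) / t)
    return errors
```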
## 1: Fixed A
Fixed matrix: $A_t \leftarrow A_0 \sim \mathcal N (0, 1)^{M\times N}$

## 2: Slowly changing A
If $A_t$ actually corresponds to $\pderiv{s_t}{\theta}$, this somewhat resembles the realistic scenario.
\begin{align}
\alpha &\leftarrow 0.01\\
A_0 &\sim \mathcal N (0, 1)^{M\times N} \\
A_t &\sim (1-\alpha)\cdot A_{t-1} + \alpha\cdot \mathcal N(0, 1)^{M\times N}
\end{align}
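A sketch of how this drifting sequence can be generated ($\alpha$ and the sizes are taken from above; the function name is an assumption):

```python
import numpy as np

# Generate the slowly-drifting A_t used in this experiment.
rng = np.random.default_rng(0)
M, N, alpha = 50, 100, 0.01
A = rng.standard_normal((M, N))          # A_0

def step_A(A):
    return (1 - alpha) * A + alpha * rng.standard_normal((M, N))
```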
 | |
## 3: Completely changing A
$$
A_t \sim \mathcal N(0, 1)^{M\times N}
$$

(Note that the UORO and Orthogonal curves overlay each other here.)
## 4: Periodic A
To highlight a shortcoming of the orthogonal approach, we run an experiment where $A_t$ is periodic with period 100 (note that $M=50$):
$$
A_t \sim \begin{cases}\mathcal N(0, 1)^{M\times N} & \text{if } t<100\\A_{t-100} & \text{otherwise}\end{cases}
$$

We see that our Orthogonal approximation is "biased" because of aliasing between the 50-periodic $\nu_t$ and the 100-periodic $A_t$.
# So... Now what?
Clearly our iterated convergence solution is pretty good. But it has the annoying requirement of $\mathcal O(M\cdot N)$ memory, which can be a lot. Can we find some way to implement the iterated algorithm more efficiently? Is there perhaps a way to trade some bias for faster convergence?