2017-11-20 Online Learning Update
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\numel}[1]{|#1|}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3 &\text{otherwise}\end{cases}}
\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}
\newcommand{\softmax}{\operatorname{softmax}}
\newcommand{\Bern}{\operatorname{Bern}}
\newcommand{\Cat}{\operatorname{Cat}}
\newcommand{\sigm}{\operatorname{sigm}}
\newcommand{\logfrac}[2]{\log \left( \frac{#1}{#2} \right)}
$$
# Online learning update: | |
Goal: We are looking to do low-rank approximations of gradients in order to speed up convergence on online learning problems.
We define a rank-1 approximation operator $\text{R}_1: |S|\times |\Theta| \rightarrow (|S|, |\Theta|)$, which maps a full matrix to a pair of vectors whose outer product approximates it.
We can do it all in one step: | |
$$
\widehat{ \pderiv{s_t}{\theta}} = \text{R}_1\left( \pderiv{s_t}{s_{t-1}} \widehat{\pderiv{s_{t-1}}{\theta}} + \pderivgiven{s_t}{\theta}{s_{t-1}}\right)
$$
Or we can break it into two steps (as is done in UORO):
$$
\widehat{ \pderiv{s_t}{\theta}} = \text{R}_1\left( \pderiv{s_t}{s_{t-1}} \widehat{\pderiv{s_{t-1}}{\theta}}+\text{R}_1\left( \pderivgiven{s_t}{\theta}{s_{t-1}}\right)\right)
$$
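As a concrete reference point, here is a minimal numpy sketch of the one-step update, assuming (purely for illustration) that $\text{R}_1$ is implemented as the best rank-1 approximation given by the top singular pair; the approaches below replace this with cheaper stochastic or iterative choices. The function names are hypothetical.
```python
import numpy as np

def rank1_svd(A):
    # One possible R_1: best rank-1 approximation of A, via its top singular pair.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s[0] * U[:, 0], Vt[0, :]          # (u, v) with A ≈ u ⊗ v

def one_step_update(ds_dsprev, u_prev, v_prev, ds_dtheta_direct):
    # R_1( (ds_t/ds_{t-1}) · \hat{ds_{t-1}/dθ} + ∂s_t/∂θ|_{s_{t-1}} )
    full = ds_dsprev @ np.outer(u_prev, v_prev) + ds_dtheta_direct
    return rank1_svd(full)
```
Note that this sketch materializes the full $|S|\times|\Theta|$ matrix inside $\text{R}_1$, which is exactly what we want to avoid in practice.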
In the end, we want something that, on average, does a good job of estimating the gradient, i.e. we want the loss
$$
\mathcal L = \left\|\frac 1T \sum_{t=1}^T \left(\pderiv {s_t}{\theta}- \tilde s_t \otimes \tilde \theta_t \right)\right\|
$$
to be minimized, where $\tilde s_t \otimes \tilde \theta_t$ is our rank-1 estimate of $\pderiv{s_t}{\theta}$ at step $t$.
# Approaches: | |
Suppose $A$ is the matrix we want to approximate.
## UORO: | |
(See paper for details) | |
If you draw $\nu \sim \{-1, +1\}^M$, you can approximate a matrix $A\in\mathbb R^{M\times N}$ as | |
$$
\tilde A = \nu \otimes (\nu^T \cdot A)
$$
where $e_i\in \{0, 1\}^M$ denotes the $i$-th one-hot vector (used in the derivation below).
The reasoning is that
\begin{align}
\mathbb E_\nu [\tilde A] &= \mathbb E\left[\nu \otimes (\nu^T \cdot A)\right] \\
&= \mathbb E_\nu\left[ \left(\sum_i\nu_i e_i\right) \otimes \left(\sum_i \nu_i \cdot A_{i, \cdot}\right)\right] \\
&= \mathbb E_\nu\left[ \left(\sum_i\nu_i^2 \, e_i \otimes A_{i, \cdot}\right) + \sum _i \sum_{j\neq i}\nu_i\nu_j \, e_j \otimes A_{i, \cdot}\right] \\
&= \sum_i e_i \otimes A_{i, \cdot} = A,\\
\end{align}
since $\mathbb E[\nu_i^2] = 1$ and $\mathbb E[\nu_i\nu_j] = 0$ for $i\neq j$.
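As a quick sanity check of this unbiasedness argument, a small numpy experiment (the dimensions and sample count are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 3
A = rng.standard_normal((M, N))

# Average many independent rank-1 estimates nu ⊗ (nu^T A) and compare to A.
n_samples = 100_000
nus = rng.choice([-1.0, 1.0], size=(n_samples, M))      # Rademacher sign vectors
estimate = (nus[:, :, None] * (nus @ A)[:, None, :]).mean(axis=0)
print(np.max(np.abs(estimate - A)))                     # small: E[Ã] ≈ A
```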
## Hadamard/Orthogonal | |
We observe that UORO is based on the cancellation of rows over time; we can accelerate this method by selecting $\nu$ such that we achieve optimally fast cancellation. This will work if the $\nu$'s are drawn from the rows of an orthogonal (e.g. Hadamard) matrix.
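A minimal sketch of one way to realize this (my reading of "drawn from an orthogonal matrix"): take the $\nu$'s as successive rows of a Hadamard matrix, so that the cross-terms cancel exactly after one full cycle rather than only in expectation.
```python
import numpy as np
from scipy.linalg import hadamard

M, N = 8, 3                        # M must be a power of 2 for scipy's Hadamard construction
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
H = hadamard(M)                    # rows are mutually orthogonal ±1 vectors

# Cycling nu through the rows of H, the cross-terms cancel exactly after M steps:
estimate = sum(np.outer(H[t], H[t] @ A) for t in range(M)) / M
print(np.allclose(estimate, A))    # True: exact recovery after one full cycle
```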
## Iterated: | |
The intuition behind this one is that, when approximating $A'$ as a rank-1 product $u\otimes v$, we alternate between greedily minimizing over $u$ given $v$, and over $v$ given $u$.
\begin{align}
v_0 &\leftarrow \mathcal N(0, 1)^N &&\text{(initialize)}\\
A' &\leftarrow 0^{M\times N} \\
\\
A' &\leftarrow A'+ A_t &&\text{(then, at each time step } t\text{)}\\
u_t &\leftarrow \argmin{u_t}\left|A'-u_t\otimes v_{t-1}\right| = \frac{A'\cdot v_{t-1}}{\|v_{t-1}\|^2}\\
u_t &\leftarrow \frac{u_t}{\|u_t\|} \text{ ... for numerical stability}\\
v_t &\leftarrow \argmin{v_t}\left|A'-u_t\otimes v_t\right| = \frac{u_t\cdot A'}{\|u_{t}\|^2} = u_t\cdot A' \\
A' &\leftarrow A' - u_t\otimes v_t
\end{align}
The problem is that we have to store the full (dense) matrix $A'$.
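For reference, a direct numpy transcription of the updates above (a sketch; the function and variable names are mine). It makes the storage issue explicit: the dense residual $A'$ has to be kept around.
```python
import numpy as np

def iterated_rank1_step(A_resid, A_t, v_prev, eps=1e-12):
    # A' <- A' + A_t, then alternate the two greedy least-squares solves.
    A_resid = A_resid + A_t
    u = A_resid @ v_prev / (v_prev @ v_prev + eps)   # argmin_u |A' - u ⊗ v_{t-1}|
    u = u / (np.linalg.norm(u) + eps)                # normalize u for numerical stability
    v = u @ A_resid                                  # argmin_v |A' - u ⊗ v|, since ||u|| = 1
    A_resid = A_resid - np.outer(u, v)               # remove the rank-1 part just extracted
    return A_resid, u, v
```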
# Experiments | |
See [previous documents for plots on matrix-approximation](https://stackedit.io/viewer#!provider=gist&gistId=b3a668193e5c83512176b282abc03a7b&filename=iterated-matrix-decomposition) | |
Generate a random stream of data $(x_t, y_t)$, and compare the true gradient $\pderiv {s_t}{\theta}$ to the approximation: $\widehat{\pderiv {s_t}{\theta}}$ | |
 | |
Again, it seems that our iterated approach works very well. Why, then, do we get such a drastic failure when we run it on $a^nb^n(1, 32)$?
 | |
It could have something to do with compounding errors. | |
If $\widehat{\pderiv {s_{t-1}}{\theta}}$ is an unbiased estimator of $\pderiv {s_{t-1}}{\theta}$, then
$$
\widehat{\pderiv {s_{t}}{\theta}} = \pderiv {s_t}{s_{t-1}} \widehat{\pderiv {s_{t-1}}{\theta}} + \pderivgiven{s_t}{\theta}{s_{t-1}}
$$
is an unbiased estimator of $\pderiv {s_{t}}{\theta}$ (because unbiasedness is preserved under affine transformations).
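Spelling that argument out, with the expectation taken over the estimator's randomness and $\pderiv{s_t}{s_{t-1}}$, $\pderivgiven{s_t}{\theta}{s_{t-1}}$ treated as fixed:
$$
\mathbb E\left[\widehat{\pderiv {s_{t}}{\theta}}\right]
= \pderiv {s_t}{s_{t-1}} \, \mathbb E\left[\widehat{\pderiv {s_{t-1}}{\theta}}\right] + \pderivgiven{s_t}{\theta}{s_{t-1}}
= \pderiv {s_t}{s_{t-1}} \pderiv {s_{t-1}}{\theta} + \pderivgiven{s_t}{\theta}{s_{t-1}}
= \pderiv {s_{t}}{\theta}
$$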
However, the iterated estimator is not even stochastic (so clearly not unbiased). | |
We need to understand when
$$
\widehat{ \pderiv{s_t}{\theta}} = \text{R}_1\left( \pderiv{s_t}{s_{t-1}} \widehat{\pderiv{s_{t-1}}{\theta}} + \pderivgiven{s_t}{\theta}{s_{t-1}}\right)
$$
does not converge.