2017-11-20 Online Learning Update
$$
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\pderivsq}[2]{\frac{\partial^2 #1}{\partial #2^2}}
\newcommand{\lderiv}[1]{\frac{\partial \mathcal L}{\partial #1}}
\newcommand{\pderivgiven}[3]{\left.\frac{\partial #1}{\partial #2}\right|_{#3}}
\newcommand{\norm}[1]{\frac12\| #1 \|_2^2}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}
\newcommand{\blue}[1]{\color{blue}{#1}}
\newcommand{\red}[1]{\color{red}{#1}}
\newcommand{\numel}[1]{|#1|}
\newcommand{\switch}[3]{\begin{cases} #2 & \text{if } {#1} \\ #3 &\text{otherwise}\end{cases}}
\newcommand{\pderivdim}[4]{\overset{\big[#3 \times #4 \big]}{\frac{\partial #1}{\partial #2}}}
\newcommand{\softmax}{\operatorname{softmax}}
\newcommand{\Bern}{\operatorname{Bern}}
\newcommand{\Cat}{\operatorname{Cat}}
\newcommand{\sigm}{\operatorname{sigm}}
\newcommand{\logfrac}[2]{\log \left( \frac{#1}{#2} \right)}
$$
# Online learning update: | |
Goal: We are looking to do low-rank approximations of gradients in order to speed up convergence on online learning problems.
We define a rank-1 approximation operator $\text{R}_1: |S|\times |\Theta| \rightarrow (|S|, |\Theta|)$, which maps a full matrix to a pair of vectors whose outer product approximates it.
We can do it all in one step: | |
$$
\widehat{ \pderiv{s_t}{\theta}} = \text{R}_1\left( \pderiv{s_t}{s_{t-1}} \widehat{\pderiv{s_{t-1}}{\theta}} + \pderivgiven{s_t}{\theta}{s_{t-1}}\right)
$$
Or we can break it into two steps (as is done in UORO):
$$
\widehat{ \pderiv{s_t}{\theta}} = \text{R}_1\left( \pderiv{s_t}{s_{t-1}} \widehat{\pderiv{s_{t-1}}{\theta}}+\text{R}_1\left( \pderivgiven{s_t}{\theta}{s_{t-1}}\right)\right)
$$
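As a concrete reference point, here is a minimal numpy sketch of the one-step update, assuming (purely for illustration) that $\text{R}_1$ is implemented as the best rank-1 approximation given by the top singular pair; the approaches below replace this with cheaper stochastic or iterative choices. The function names are hypothetical.
```python
import numpy as np

def rank1_svd(A):
    # One possible R_1: best rank-1 approximation of A, via its top singular pair.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return s[0] * U[:, 0], Vt[0, :]          # (u, v) with A ≈ u ⊗ v

def one_step_update(ds_dsprev, u_prev, v_prev, ds_dtheta_direct):
    # R_1( (ds_t/ds_{t-1}) · \hat{ds_{t-1}/dθ} + ∂s_t/∂θ|_{s_{t-1}} )
    full = ds_dsprev @ np.outer(u_prev, v_prev) + ds_dtheta_direct
    return rank1_svd(full)
```
Note that this sketch materializes the full $|S|\times|\Theta|$ matrix inside $\text{R}_1$, which is exactly what we want to avoid in practice.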
In the end, we want something that, on average, does a good job of estimating the gradient, i.e. we want the loss
$$
\mathcal L = \left\|\frac 1T \sum_{t=1}^T \left(\pderiv {s_t}{\theta}- \tilde s_t \otimes \tilde \theta_t \right)\right\|
$$
to be minimized, where $\tilde s_t \otimes \tilde \theta_t$ is our rank-1 estimate of $\pderiv{s_t}{\theta}$ at step $t$.
# Approaches: | |
Suppose $A$ is the matrix we want to approximate.
## UORO: | |
(See paper for details) | |
If you draw $\nu \sim \{-1, +1\}^M$, you can approximate a matrix $A\in\mathbb R^{M\times N}$ as | |
$$
\tilde A = \nu \otimes (\nu^T \cdot A)
$$
where $e_i\in \{0, 1\}^M$ denotes the $i$-th one-hot vector (used in the derivation below).
The reasoning is that
\begin{align}
\mathbb E_\nu [\tilde A] &= \mathbb E\left[\nu \otimes (\nu^T \cdot A)\right] \\
&= \mathbb E_\nu\left[ \left(\sum_i\nu_i e_i\right) \otimes \left(\sum_i \nu_i \cdot A_{i, \cdot}\right)\right] \\
&= \mathbb E_\nu\left[ \left(\sum_i\nu_i^2 \, e_i \otimes A_{i, \cdot}\right) + \sum _i \sum_{j\neq i}\nu_i\nu_j \, e_j \otimes A_{i, \cdot}\right] \\
&= \sum_i e_i \otimes A_{i, \cdot} = A,\\
\end{align}
since $\mathbb E[\nu_i^2] = 1$ and $\mathbb E[\nu_i\nu_j] = 0$ for $i\neq j$.
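As a quick sanity check of this unbiasedness argument, a small numpy experiment (the dimensions and sample count are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 3
A = rng.standard_normal((M, N))

# Average many independent rank-1 estimates nu ⊗ (nu^T A) and compare to A.
n_samples = 100_000
nus = rng.choice([-1.0, 1.0], size=(n_samples, M))      # Rademacher sign vectors
estimate = (nus[:, :, None] * (nus @ A)[:, None, :]).mean(axis=0)
print(np.max(np.abs(estimate - A)))                     # small: E[Ã] ≈ A
```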
## Hadamard/Orthogonal | |
We observe that UORO is based on the cancellation of rows over time; we can accelerate this method by selecting $\nu$ such that we achieve optimally fast cancellation. This will work if the $\nu$'s are drawn from the rows of an orthogonal (e.g. Hadamard) matrix.
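A minimal sketch of one way to realize this (my reading of "drawn from an orthogonal matrix"): take the $\nu$'s as successive rows of a Hadamard matrix, so that the cross-terms cancel exactly after one full cycle rather than only in expectation.
```python
import numpy as np
from scipy.linalg import hadamard

M, N = 8, 3                        # M must be a power of 2 for scipy's Hadamard construction
rng = np.random.default_rng(0)
A = rng.standard_normal((M, N))
H = hadamard(M)                    # rows are mutually orthogonal ±1 vectors

# Cycling nu through the rows of H, the cross-terms cancel exactly after M steps:
estimate = sum(np.outer(H[t], H[t] @ A) for t in range(M)) / M
print(np.allclose(estimate, A))    # True: exact recovery after one full cycle
```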
## Iterated: | |
The intuition behind this one is that, when approximating $A'$ as a rank-1 product $u\otimes v$, we alternate between greedily minimizing over $u$ given $v$, and over $v$ given $u$.
\begin{align}
v_0 &\leftarrow \mathcal N(0, 1)^N &&\text{(initialize)}\\
A' &\leftarrow 0^{M\times N} \\
\\
A' &\leftarrow A'+ A_t &&\text{(then, at each time step } t\text{)}\\
u_t &\leftarrow \argmin{u_t}\left|A'-u_t\otimes v_{t-1}\right| = \frac{A'\cdot v_{t-1}}{\|v_{t-1}\|^2}\\
u_t &\leftarrow \frac{u_t}{\|u_t\|} \text{ ... for numerical stability}\\
v_t &\leftarrow \argmin{v_t}\left|A'-u_t\otimes v_t\right| = \frac{u_t\cdot A'}{\|u_{t}\|^2} = u_t\cdot A' \\
A' &\leftarrow A' - u_t\otimes v_t
\end{align}
The problem is that we have to store the full (dense) matrix $A'$.
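For reference, a direct numpy transcription of the updates above (a sketch; the function and variable names are mine). It makes the storage issue explicit: the dense residual $A'$ has to be kept around.
```python
import numpy as np

def iterated_rank1_step(A_resid, A_t, v_prev, eps=1e-12):
    # A' <- A' + A_t, then alternate the two greedy least-squares solves.
    A_resid = A_resid + A_t
    u = A_resid @ v_prev / (v_prev @ v_prev + eps)   # argmin_u |A' - u ⊗ v_{t-1}|
    u = u / (np.linalg.norm(u) + eps)                # normalize u for numerical stability
    v = u @ A_resid                                  # argmin_v |A' - u ⊗ v|, since ||u|| = 1
    A_resid = A_resid - np.outer(u, v)               # remove the rank-1 part just extracted
    return A_resid, u, v
```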
# Experiments | |
See [previous documents for plots on matrix-approximation](https://stackedit.io/viewer#!provider=gist&gistId=b3a668193e5c83512176b282abc03a7b&filename=iterated-matrix-decomposition) | |
Generate a random stream of data $(x_t, y_t)$, and compare the true gradient $\pderiv {s_t}{\theta}$ to the approximation: $\widehat{\pderiv {s_t}{\theta}}$ | |
 | |
Again, it seems that our iterated approach works very well. Why, then, do we get such a drastic failure when we run it on $a^nb^n(1, 32)$?
 | |
It could have something to do with compounding errors. | |
If $\widehat{\pderiv {s_{t-1}}{\theta}}$ is an unbiased estimator of $\pderiv {s_{t-1}}{\theta}$, then
$$
\widehat{\pderiv {s_{t}}{\theta}} = \pderiv {s_t}{s_{t-1}} \widehat{\pderiv {s_{t-1}}{\theta}} + \pderivgiven{s_t}{\theta}{s_{t-1}}
$$
is an unbiased estimator of $\pderiv {s_{t}}{\theta}$ (because unbiasedness is preserved under affine transformations).
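Spelling that argument out, with the expectation taken over the estimator's randomness and $\pderiv{s_t}{s_{t-1}}$, $\pderivgiven{s_t}{\theta}{s_{t-1}}$ treated as fixed:
$$
\mathbb E\left[\widehat{\pderiv {s_{t}}{\theta}}\right]
= \pderiv {s_t}{s_{t-1}} \, \mathbb E\left[\widehat{\pderiv {s_{t-1}}{\theta}}\right] + \pderivgiven{s_t}{\theta}{s_{t-1}}
= \pderiv {s_t}{s_{t-1}} \pderiv {s_{t-1}}{\theta} + \pderivgiven{s_t}{\theta}{s_{t-1}}
= \pderiv {s_{t}}{\theta}
$$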
However, the iterated estimator is not even stochastic (so clearly not unbiased). | |
We need to understand when
$$
\widehat{ \pderiv{s_t}{\theta}} = \text{R}_1\left( \pderiv{s_t}{s_{t-1}} \widehat{\pderiv{s_{t-1}}{\theta}} + \pderivgiven{s_t}{\theta}{s_{t-1}}\right)
$$
does not converge.