
@Nikolaj-K
Created November 11, 2024 21:38
Expected gradient and score

==== Integration by parts ====

$\int_a^b (p\cdot f)'\, dx = (p\cdot f)\big\vert_a^b = (p\cdot f)(b) - (p\cdot f)(a)$

$\implies$

$\int_a^b p\cdot f'\, dx = -\int_a^b p'\cdot f\, dx + (p\cdot f)(b) - (p\cdot f)(a)$
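As a numerical sanity check of this identity, one can pick concrete functions (the choices below, a standard normal density $p$ and $f = \sin$ on $[a,b]$, are assumed examples, not from the text) and compare both sides:

```python
import numpy as np

# Check: \int_a^b p * f' dx = -\int_a^b p' * f dx + (p*f)(b) - (p*f)(a)
# with p = standard normal density and f = sin (assumed example functions).
a, b = -1.0, 2.0
x = np.linspace(a, b, 200_001)

p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
p_prime = -x * p                      # p'(x) for the standard normal
f, f_prime = np.sin(x), np.cos(x)

def integrate(y, x):
    # simple trapezoidal rule
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

lhs = integrate(p * f_prime, x)
rhs = -integrate(p_prime * f, x) + p[-1] * f[-1] - p[0] * f[0]
print(lhs, rhs)  # agree up to discretization error
```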

=== Score ===

$s_{\theta^*} := \frac{p'}{p} = \frac{d}{dx}\log p(x)$. Note that $p' = s_{\theta^*}\cdot p$.

For convenience here, write $g$ for $g(x)$ when the argument is clear from context. The notation comes from an ML context: assuming this score can be learned by fitting parameters $\theta$, with $\theta^*$ denoting an optimum. Having $s_{\theta^*}$, we can use it to effectively swim towards $p$'s extrema (and reconstruct $p$'s neighborhood relative to that extremum).
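For a concrete instance of the definition (an assumed example, not from the text), the score of a Gaussian $\mathcal{N}(\mu, \sigma^2)$ is $s(x) = -(x-\mu)/\sigma^2$, and the identity $p' = s\cdot p$ can be verified numerically:

```python
import numpy as np

# Score of a Gaussian (assumed example): s(x) = d/dx log p(x) = -(x - mu) / sigma^2.
# Verify p' = s * p against a central finite difference.
mu, sigma = 1.0, 0.5

def p(x):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def score(x):
    return -(x - mu) / sigma**2

x = np.linspace(-1.0, 3.0, 9)
h = 1e-6
p_prime_numeric = (p(x + h) - p(x - h)) / (2 * h)   # central difference
err = np.max(np.abs(p_prime_numeric - score(x) * p(x)))
print(err)  # ~0
```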

=== Probability measure variant ===

$\int_a^b p\cdot f'\, dx = -\int_a^b p'\cdot f\, dx + (p\cdot f)(b) - (p\cdot f)(a)$

Let $\mu$ be the probability measure on $[a,b]$ with $d\mu = p\, dx$.

$E_\mu[\frac{d}{dx}f] = E_\mu[-s_{\theta^*} \cdot f] + (p\cdot f)(b) - (p\cdot f)(a)$

or again together,

$E_\mu\left[\left(s_{\theta^*} + \frac{d}{dx}\right) f\right] = (p\cdot f)(b) - (p\cdot f)(a)$

If either $f$ or $p$ is zero at the bounds, the right hand side is zero. Then, in expectation, multiplication by $-s_{\theta^*}$ acts as gradient application. Knowing $s_{\theta^*}$ encodes having already taken the derivative (but of $p$), and the gradient (of any $f$) can then, in expectation, be computed by local point evaluation.
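This zero-boundary case can be checked by Monte Carlo: for the standard normal (an assumed example), $s(x) = -x$ and boundary terms vanish at $\pm\infty$, so $E_\mu[(s + \frac{d}{dx})f]$ should be zero for, say, $f = \tanh$:

```python
import numpy as np

# Monte Carlo check that multiplication by -s acts as the derivative in
# expectation when boundary terms vanish: E_mu[s * f + f'] = 0.
# Assumed example: standard normal (score s(x) = -x) and f = tanh.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

s = -x                          # score of the standard normal
f = np.tanh(x)
f_prime = 1 - np.tanh(x)**2     # d/dx tanh(x)

estimate = np.mean(s * f + f_prime)
print(estimate)  # ~0 up to Monte Carlo error
```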

=== Special case ===

Let $f$ be a translate of the density itself, i.e. $f(x) = p(x-d)$, so $f'(x) = p'(x-d)$. Assuming vanishing boundary terms,

$E_\mu[p'(x-d)] = \int \left(-\frac{p'(x)}{p(x)}\right) p(x-d)\, p(x)\, dx = \int \left(-p'(x)\right) p(x-d)\, dx = E_\mu[-p'(x+d)]$

(See 4 drawings)
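The two ends of this chain of equalities can be compared by Monte Carlo for a standard normal $p$ (an assumed example):

```python
import numpy as np

# Monte Carlo check of the translation special case for the standard normal:
# E_mu[p'(x - d)] = E_mu[-p'(x + d)], sampling x ~ p.
rng = np.random.default_rng(1)
x = rng.standard_normal(2_000_000)
d = 0.7

def p_prime(y):
    # derivative of the standard normal density: p'(y) = -y * p(y)
    return -y * np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)

left = np.mean(p_prime(x - d))
right = np.mean(-p_prime(x + d))
print(left, right)  # approximately equal
```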

==== Optimization ====

To find $\theta$, we want to minimize

$E_\mu[\vert\vert s_{\theta} - s_{\theta^*} \vert\vert^2] = E_\mu[s_{\theta}^2 - 2 s_{\theta^*} s_{\theta} + s_{\theta^*}^2] = E_\mu\left[\left(s_{\theta} + 2\frac{d}{dx}\right) s_{\theta}\right] + c$

where $c = E_\mu[s_{\theta^*}^2]$ is a nonnegative constant independent of $\theta$ (by the same integration by parts, equal to minus the mean score gradient of the true distribution itself). These calculations generalize from $\frac{d}{dx}$ to $\nabla_x$.

Note that we have replaced the required knowledge of the functional form of $s_{\theta^*}$ with the need to compute the derivative $\frac{d}{dx} s_{\theta}$ of the model itself.
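As a toy instance of this objective (an assumed example, not from the text): fit a linear score family $s_\theta(x) = \theta x$ to samples from a standard normal. The objective $E_\mu[s_\theta^2 + 2\frac{d}{dx}s_\theta] = \theta^2 E[x^2] + 2\theta$ is quadratic in $\theta$, so the minimizer $\theta = -1/E[x^2]$ is available in closed form and should recover the true score coefficient $-1$, without ever evaluating $s_{\theta^*}$:

```python
import numpy as np

# Score matching sketch (assumed toy model): fit s_theta(x) = theta * x to
# standard normal samples by minimizing the implicit objective
#   J(theta) = E[s_theta(x)^2 + 2 * d/dx s_theta(x)] = theta^2 E[x^2] + 2 theta,
# which needs no knowledge of the true score s(x) = -x.
rng = np.random.default_rng(2)
x = rng.standard_normal(500_000)

theta_hat = -1.0 / np.mean(x**2)   # closed-form minimizer of the quadratic J
print(theta_hat)  # close to -1, the coefficient of the true score -x
```

The same implicit objective works for richer model families; there the derivative term $\frac{d}{dx}s_\theta$ is typically obtained by automatic differentiation rather than in closed form.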
