@ogrisel
Created January 9, 2012 09:07
Closed form formula for the eps estimate from Johnson Lindenstrauss' lemma

Goal

Find a closed form formula for the estimated epsilon (the maximum distortion of the ratios of squared pairwise distances) in the Johnson-Lindenstrauss lemma. The goal is to implement it as a pure Python or pure numpy function that computes eps from the number of points (samples or observations) n and the target dimension d.
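For reference, the bound being inverted is the usual JL lower bound on the target dimensionality, d >= 4 * log(n) / (eps ** 2 / 2 - eps ** 3 / 3); the idea is to take it at equality and solve for eps given n and d.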

Problem

The root expression found by sympy is hard to rewrite as a Python function that returns a complex value: the intermediate expressions tend to overflow when evaluated. sympy is able to compute the numerical value with evalf, but lambdify does not compile the expression into a numerically stable numpy implementation.

Warning: invalid value encountered in double_scalars
Warning: invalid value encountered in double_scalars
Warning: invalid value encountered in double_scalars
Warning: invalid value encountered in double_scalars
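One possible workaround, sketched below and not part of the original gist: give up on the symbolic closed form and find the smallest positive real root of the underlying cubic numerically, e.g. with numpy.roots.

import numpy as np

def jl_eps_numeric(d_value, n_value):
    """Smallest positive real eps such that d >= 4*log(n) / (eps**2/2 - eps**3/3).

    Returns nan when the cubic has no positive real root, i.e. when d is too
    small for any eps to satisfy the bound.
    """
    # eps**2/2 - eps**3/3 = 4*log(n)/d  <=>  eps**3 - 3/2*eps**2 + 12*log(n)/d = 0
    coeffs = [1.0, -1.5, 0.0, 12.0 * np.log(n_value) / d_value]
    roots = np.roots(coeffs)
    real_roots = roots.real[np.abs(roots.imag) < 1e-12]
    positive = real_roots[real_roots > 0]
    return positive.min() if positive.size else np.nan

This sidesteps the overflowing intermediate terms of the closed form and should stay finite for large d.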
import sympy as sp
import numpy as np
import pylab as pl

d, n = sp.symbols('d n', integer=True, positive=True)
eps = sp.symbols('eps', real=True, positive=True)

jl_bound = 4 * sp.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)

# solve the inverted expression as sympy is much faster at finding the roots of
# a polynomial and the point eps=0 is not interesting to us anyway.
solutions = sp.solve(sp.Equality(1 / d, 1 / jl_bound), eps)

for sol_idx, sol in enumerate(solutions):
    print "Solution #%d" % sol_idx
    sp.pprint(sol)
    print


def plot_with_sympy(d_range=np.logspace(1, 4, 16), n_value=100):
    """Plot the complex roots of the JL bound polynomial"""
    for sol_idx, sol in enumerate(solutions):
        real = np.zeros(d_range.shape[0])
        imag = np.zeros(d_range.shape[0])
        for i, d_value in enumerate(d_range):
            eps_value = sol.evalf(subs={d: d_value, n: n_value}, chop=True)
            real[i], imag[i] = eps_value.as_real_imag()
        pl.plot(real, imag, label='solution #%d' % sol_idx)
    pl.legend(loc='best')
    pl.xlabel("Real part of the root")
    pl.ylabel("Imaginary part of the root")
    pl.title("Roots path in the complex plane:\n"
             "n=100, d in range 10-10000")
    pl.show()


# manual inspection of the three solutions tells us that the minimal positive
# real solution is the second among 3 (one negative real and two complex with a
# positive real part and a sometimes null imaginary part)
sol = solutions[1]

# Unfortunately those evaluations are not numerically stable when using numpy
jl_eps_mpmath = sp.lambdify((d, n), sol, "mpmath")
jl_eps_numpy = sp.lambdify((d, n), sol, "numpy")

print "Numerical evaluation with lambdify:"
print

n_value = 100
for d_value in np.logspace(1, 4, 4):
    print "mpmath: d=%d, n=%d, eps=%r" % (
        d_value, n_value, jl_eps_mpmath(d_value, n_value))
    print "numpy: d=%d, n=%d, eps=%r" % (
        d_value, n_value, jl_eps_numpy(d_value, n_value))
    print

# Second bad news: sympy cannot find an explicit closed form formula for the
# real and imaginary parts: uncommenting and executing the following will
# trigger a RuntimeError after a while:
# RuntimeError: maximum recursion depth exceeded while calling a Python object
#sp.expand(sol, complex=True)
Solution #0

    -(-1/2 + sqrt(3)*I/2)*C + 1/2 - 1/(4*(-1/2 + sqrt(3)*I/2)*C)

Solution #1

    -(-1/2 - sqrt(3)*I/2)*C + 1/2 - 1/(4*(-1/2 - sqrt(3)*I/2)*C)

Solution #2

    -C + 1/2 - 1/(4*C)

where C = (sqrt((-1/4 + 12*log(n)/d)**2/4 - 1/64) - 1/8 + 6*log(n)/d)**(1/3)
Numerical evaluation with lambdify:
d=10, n=100, eps=(1.4421024856682543+1.3829935938054487j)
d=100, n=100, eps=(1.0113474568391143+0.18554962906858613j)
d=1000, n=100, eps=(nan+nanj)
d=10000, n=100, eps=(nan+nanj)
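
As an additional sanity check (not in the original gist), the retained root can be compared against a direct numerical solve of the cubic with sp.nsolve for a value of d where the numpy-compiled expression returns nan; evalf on the symbolic solution stays stable there:

# cubic equivalent to 1/d == 1/jl_bound for n=100, d=10000, reusing the symbols above
cubic = eps ** 3 - sp.Rational(3, 2) * eps ** 2 + 12 * sp.log(100) / 10000
print sp.nsolve(cubic, eps, 0.05)                    # direct numerical root
print sol.evalf(subs={d: 10000, n: 100}, chop=True)  # closed form via evalf

Both should agree on the small positive root (about 0.06 for n=100, d=10000).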
@jaidevd commented Jan 9, 2012

Well, I have no idea what this is. It will be some time before I even understand the problem. However, I wrote an implementation of Sammon's mapping in MATLAB last year. JL appears to be similar to Sammon, in that it maps points from a higher-dimensional space to lower dimensions while preserving Euclidean distances. I know this is a very crude likeness, but are the two indeed similar? Where do you intend to use this?

@ogrisel (author) commented Jan 9, 2012

JL is just a theoretical bound on the distortion introduced by a projection with a random matrix (e.g. a random Gaussian matrix). I intend to use this as a helper function to predict a bound on the quality of the embedding produced by a random projection in the scikit-learn machine learning library: https://github.com/ogrisel/scikit-learn/blob/random_projection/sklearn/random_projection.py
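
For reference, the other direction of the bound (the minimal target dimensionality that guarantees a given eps for n points) is straightforward to evaluate; here is a minimal sketch with a hypothetical helper name, not necessarily matching the actual scikit-learn API:

import numpy as np

def jl_min_dim(n_samples, eps=0.1):
    # minimal target dimensionality guaranteeing a squared-distance distortion
    # of at most eps for n_samples points, per the bound used above
    return int(np.ceil(4 * np.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3)))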
