Density ratio estimation through Bregman divergence optimization

Thom Benjamin Volker


The density ratio estimation problem is to estimate the ratio of two probability density functions $p_{nu}(x)$ and $p_{de}(x)$ from samples $\{nu_i\}_{i=1}^n$ and $\{de_i\}_{i=1}^m$ drawn from $p_{nu}(x)$ and $p_{de}(x)$, respectively. Density ratio estimation is important in many machine learning applications, such as domain adaptation, covariate shift correction, and importance weighting. Here, we proceed by specifying a model for the density ratio itself, rather than modeling the numerator and denominator densities separately and taking their ratio. We estimate the density ratio model by minimizing an empirical approximation of the Bregman divergence between the true density ratio and the model.

To do this, we first generate some simple data. We draw $n_{nu} = n_{de} = 100$ samples from two normal distributions with different means and variances, such that

$$\begin{aligned} p_{nu}(x) &= \mathcal{N}(1, 1), \\ p_{de}(x) &= \mathcal{N}(0, 2). \end{aligned}$$

set.seed(12)
n <- 100
nu <- rnorm(n, 1, 1)
de <- rnorm(n, 0, 2)

We now have the following two densities:

curve(dnorm(x, 1, 1), -5, 5, col = "blue", lwd = 2)
curve(dnorm(x, 0, 2), -5, 5, col = "red", lwd = 2, add = TRUE)
legend("topleft", legend = c("nu", "de"), col = c("blue", "red"), lwd = 2)

We define a kernel model for the density ratio as $$\hat{r}(x) = \boldsymbol{K}(x, ce) \boldsymbol{\alpha},$$ where $\boldsymbol{\alpha}$ is the parameter vector to be estimated, and $\boldsymbol{K}(x, ce)$ is a kernel function with centers $ce$. We use the Gaussian kernel for this purpose, which is defined as follows:

$$\begin{aligned} K^{nu}_{i,j} &= \exp\left(-\frac{||nu_i - ce_j||^2_2}{2\sigma^2} \right), \\ K^{de}_{i,j} &= \exp\left(-\frac{||de_i - ce_j||^2_2}{2\sigma^2} \right), \end{aligned}$$

where $ce_j$ are $n_{ce} = 100$ evenly spaced centers between the minimum and maximum of the combined data, and $\sigma = 1$ is the bandwidth parameter.

nce <- 100
ce  <- seq(min(c(nu, de)), max(c(nu, de)), length.out = nce)

# Distance matrices between the samples and the centers, and the resulting
# Gaussian kernel matrices with bandwidth sigma = 1 (internal helpers from
# the densityratio package)
Dnu <- densityratio:::distance(as.matrix(nu), as.matrix(ce), FALSE)
Dde <- densityratio:::distance(as.matrix(de), as.matrix(ce), FALSE)
Knu <- densityratio:::kernel_gaussian(Dnu, 1)
Kde <- densityratio:::kernel_gaussian(Dde, 1)
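
For reference, since these are internal (unexported) helpers, the same Gaussian kernel matrices can also be computed directly in base R. A minimal sketch, assuming that densityratio:::distance() returns squared Euclidean distances:

# Pairwise squared differences between samples and centers, then the kernel
sq_diff <- function(x, ce) outer(x, ce, function(a, b) (a - b)^2)
sigma <- 1
Knu_base <- exp(-sq_diff(nu, ce) / (2 * sigma^2))
Kde_base <- exp(-sq_diff(de, ce) / (2 * sigma^2))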

Accordingly, we have the following true density ratio function:

curve(dnorm(x, 1, 1)/dnorm(x, 0, 2), -5, 5, lwd = 2)
legend("topleft", legend = "True density ratio", lwd = 2)

Note that the density ratio is not itself a density; in fact, its integral does not necessarily converge. For instance, swapping the numerator and denominator densities above yields a ratio that grows like $e^{3x^2/8}$ in the tails, which is not integrable.

Bregman divergence

We define the Bregman divergence as

$$\text{BR}_f = \int p_{de}(x) \left(f(r(x)) - f(\hat{r}(x)) - f'(\hat{r}(x))(r(x) - \hat{r}(x)) \right) \text{d}x,$$

where $f$ is a differentiable and strictly convex function with derivative $f'$. Note that the first part

$$\int p_{de}(x) f(r(x)) \text{d}x,$$

is a constant that does not depend on the density ratio model $\hat{r}(x)$ and can thus be ignored. Using $p_{de}(x) r(x) = p_{nu}(x)$ to rewrite the cross term, we are left with the part

$$\tilde{\text{BR}}_f = \int p_{de}(x) \left(f'(\hat{r}(x))\hat{r}(x) - f(\hat{r}(x)) \right) \text{d}x - \int p_{nu}(x) f'(\hat{r}(x)) \text{d}x.$$

We can estimate this quantity with the following empirical approximation: $$\hat{\text{BR}}_f = \frac{1}{n_{de}} \sum_{i = 1}^{n_{de}} \left( f'(\boldsymbol{K}^{de}_i\boldsymbol\alpha)\boldsymbol{K}^{de}_i\boldsymbol\alpha - f(\boldsymbol{K}^{de}_i\boldsymbol\alpha)\right) - \frac{1}{n_{nu}} \sum_{i = 1}^{n_{nu}} f'(\boldsymbol{K}^{nu}_i\boldsymbol\alpha),$$ where $\boldsymbol{K}^{de}_i$ and $\boldsymbol{K}^{nu}_i$ denote the $i$th rows of the kernel matrices. This is the objective function we aim to minimize over the model parameters $\boldsymbol{\alpha}$. In R, this can be written as follows:

BR <- function(alpha, Knu, nnu, Kde, nde, lambda = 0, fx, gx) {
  # fx is f, gx its derivative f'; lambda controls an optional ridge penalty
  rde <- Kde %*% alpha  # estimated density ratio at the denominator samples
  rnu <- Knu %*% alpha  # estimated density ratio at the numerator samples
  mean(gx(rde) * rde - fx(rde)) - mean(gx(rnu)) + lambda/2 * sum(alpha^2)
}
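
For larger problems, it can help to also supply the analytic gradient to optim(). Differentiating the objective with respect to $\boldsymbol\alpha$ gives $\frac{1}{n_{de}} (\boldsymbol{K}^{de})^T \left(f''(\hat{r}_{de}) \hat{r}_{de}\right) - \frac{1}{n_{nu}} (\boldsymbol{K}^{nu})^T f''(\hat{r}_{nu}) + \lambda \boldsymbol\alpha$, with products taken elementwise. A sketch, where the argument hx for the second derivative $f''$ is our own addition (to pass it through optim(), BR() would also need to accept it, for example via a ... argument):

BR_grad <- function(alpha, Knu, nnu, Kde, nde, lambda = 0, fx, gx, hx) {
  rde <- Kde %*% alpha
  rnu <- Knu %*% alpha
  # d/d(alpha) of gx(r)r - fx(r) reduces to hx(r) * r per observation
  c(crossprod(Kde, hx(rde) * rde) / nde - crossprod(Knu, hx(rnu)) / nnu) +
    lambda * alpha
}

For the Pearson divergence below, hx would be function(x) rep(1, length(x)); for the Kullback-Leibler divergence, function(x) 1/x.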

Unconstrained least-squares importance fitting through Bregman divergence minimization

If we set $f(x) = \frac{1}{2}(x - 1)^2$, the Bregman divergence becomes the squared loss (also known as the Pearson divergence), and the objective reduces (up to constants) to $$\hat{\text{BR}}_{pe} = \frac{1}{2n_{de}} \sum_{i = 1}^{n_{de}} \left(\boldsymbol{K}^{de}_i \boldsymbol\alpha \right)^2 - \frac{1}{n_{nu}} \sum_{i = 1}^{n_{nu}} \boldsymbol{K}^{nu}_i \boldsymbol\alpha.$$ This objective coincides with the one minimized by the ulsif() function in the densityratio package, which includes an additional regularization term $\frac{\lambda}{2} \boldsymbol{\alpha}^T\boldsymbol{\alpha}$.

Estimating the parameters of this model can be done with the optim() function in R. We first define $f$ and its derivative, and evaluate the objective at a random parameter vector:

f_pe <- function(x) 1/2 * (x - 1)^2  # f(x) for the Pearson divergence
g_pe <- function(x) x - 1            # its derivative f'(x)

BR(runif(nce), Knu, n, Kde, n, fx = f_pe, gx = g_pe)
#> [1] 84.67895

We now estimate the parameters, first without regularization and then with regularization ($\lambda = 1$).

out <- optim(rep(0, nce), BR, Knu = Knu, nnu = n, Kde = Kde, 
             nde = n, lambda = 0, fx = f_pe, gx = g_pe,
             method = "BFGS")

out_reg <- optim(rep(0, nce), BR, Knu = Knu, nnu = n, Kde = Kde, 
                 nde = n, lambda = 1, fx = f_pe, gx = g_pe,
                 method = "BFGS",
                 control = list(maxit = 10000, reltol = 1e-16))
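
Because the Pearson objective is quadratic in $\boldsymbol\alpha$, the regularized minimizer is also available in closed form (this analytic solution is what uLSIF exploits). A minimal sketch under the definitions above:

H <- crossprod(Kde) / n  # (1/n_de) * t(Kde) %*% Kde
h <- colMeans(Knu)       # (1/n_nu) * t(Knu) %*% 1
alpha_closed <- solve(H + 1 * diag(nce), h)  # lambda = 1

This should agree with out_reg$par up to the optimizer's tolerance.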

We can now plot the estimated density ratio function and compare it to the true density ratio function.

curve(dnorm(x, 1, 1)/dnorm(x, 0, 2), -5, 5, lwd = 2, ylim = c(0, 4))
points(nu, Knu %*% out$par, col = "blue")
points(nu, Knu %*% out_reg$par, col = "red")
legend("topleft", legend = c("True density ratio", "Estimated density ratio", "Estimated density ratio (regularized)"), lwd = 2, col = c("black", "blue", "red"))

We can also compare the estimated density ratio function to the one estimated by the ulsif() function in the densityratio package.

library(densityratio)
u <- ulsif(nu, de, centers = ce, sigma = 1, lambda = 1, intercept = FALSE)

curve(dnorm(x, 1, 1)/dnorm(x, 0, 2), -5, 5, lwd = 2, ylim = c(0, 4))
points(nu, Knu %*% out$par, col = "blue")
points(nu, Knu %*% out_reg$par, col = "red")
points(nu, predict(u, nu), col = "green")
legend("topleft", legend = c("True density ratio", "Estimated density ratio", "Estimated density ratio (regularized)", "Estimated density ratio (ulsif)"), lwd = 2, col = c("black", "blue", "red", "green"))

And we see that the solutions are numerically identical.

all.equal(c(Knu %*% out_reg$par), c(predict(u, nu)))
#> [1] TRUE

Kullback-Leibler importance estimation procedure through Bregman divergence minimization

Now, if we change the function $f$ to $f(x) = x\log x - x$ (equivalently $x \log x$; affine terms do not affect the Bregman divergence), the Bregman divergence becomes the Kullback-Leibler divergence, which we know from the kliep() function in the densityratio package. Note that we now have to use a different optimization routine, as negative density ratio values are not allowed (the objective is undefined due to the $\log x$ term). Hence, we use the L-BFGS-B method, which allows for box constraints.
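
With this choice, $f'(x) = \log x$, so that $f'(\hat{r})\hat{r} - f(\hat{r}) = \hat{r}$, and the empirical objective simplifies to $$\hat{\text{BR}}_{kl} = \frac{1}{n_{de}} \sum_{i = 1}^{n_{de}} \boldsymbol{K}^{de}_i \boldsymbol\alpha - \frac{1}{n_{nu}} \sum_{i = 1}^{n_{nu}} \log\left(\boldsymbol{K}^{nu}_i \boldsymbol\alpha\right).$$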

f_kl <- function(x) x * log(x) - x  # f(x) for the KL divergence
g_kl <- function(x) log(x)          # its derivative f'(x)

BR(runif(nce), Knu, n, Kde, n, fx = f_kl, gx = g_kl)
#> [1] 11.1285
out_KL <- optim(rep(1e-12, nce), BR, Knu = Knu, nnu = n, Kde = Kde,
                nde = n, lambda = 0, fx = f_kl, gx = g_kl,
                method = "L-BFGS-B",
                control = list(maxit = 10000, reltol = 1e-16),
                lower = 0)
#> Warning in optim(rep(1e-12, nce), BR, Knu = Knu, nnu = n, Kde = Kde, nde = n, :
#> method L-BFGS-B uses 'factr' (and 'pgtol') instead of 'reltol' and 'abstol'

k <- kliep(nu, de, centers = ce, sigma = 1, cv = TRUE, 
           maxit = 100000, epsilon = 10^(1:-8))
#> Warning in check.nfold(cv, nfold, sigma, nnu): Only a single 'sigma' value is specified, while cross-validation serves to optimize 'sigma'.
curve(dnorm(x, 1, 1)/dnorm(x, 0, 2), -5, 5, lwd = 2, ylim = c(0, 4))
points(nu, Knu %*% out_KL$par, col = "purple")
points(nu, predict(k, nu), col = "orange")
legend("topleft", legend = c("True density ratio", "Estimated density ratio (Bregman KL)", "Estimated density ratio (kliep)"), lwd = 2, col = c("black", "purple", "orange"))

Now we see that the solutions differ, which is likely because kliep() uses a somewhat different optimization routine with additional constraints. In particular, kliep() uses a projected gradient algorithm in which negative parameters are set to zero in each iteration, and the density ratio model is normalized such that $\int \hat{r}(x) p_{de}(x) \text{d}x = \int p_{nu}(x) \text{d}x = 1$.
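
To make the two solutions more comparable, one could impose the same normalization on the Bregman estimate; a minimal sketch (our own post-processing, not part of the original analysis):

# Rescale alpha so that the average estimated ratio over the denominator
# samples equals one, the sample analogue of int r(x) p_de(x) dx = 1
alpha_norm <- out_KL$par / mean(Kde %*% out_KL$par)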

Created on 2024-04-09 with reprex v2.0.2
