How density ratio estimation can be used to reweigh a sample

Density ratio estimation can be used to construct weights for analyses, that is, if the sample of interest does not correspond to the target population in some way, one can reweight the sample of interest such that it is closer to the target population. One way to do that is by estimating the density ratio $$r(x) = \frac{p_{\text{nu}}(x)}{p_{\text{de}}(x)}$$ between the two samples, and reweight the analyses of the target sample $x_{\text{de}}$ with the estimated weights $\hat{r}(x)$. If the analyses of interest is least-squares regression, the solution to the inference problem is given by the optimization problem $$(\mathbf{X^\top W X)\hat{\beta}} = \mathbf{X^\top W y},$$ where $\mathbf{X}$ is an $n \times p$ data matrix, $\mathbf{W}$ is a diagonal matrix with the weights $\hat{r}(x)$ on the diagonal, and $\mathbf{y}$ is the outcome of interest.

As shown below, this approach can also be used to reweight analyses on synthetic data, when the synthetic data generation process differs from the true data generating process. In the following example, we have $n = 250$ observed samples in the matrix $X_{\text{real}}$, which follows a multivariate Gaussian distribution $$X_{\text{real}} \sim \mathcal{N}(\mathbf{\mu}, \mathbf{\Sigma}),$$ with parameters

$$\mathbf{\mu} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} ~~~~~~~~~~~~ \mathbf{\Sigma} = \begin{bmatrix} 1 & 0.3 \\ 0.3 & 1 \end{bmatrix}.$$

Additionally, we have three synthetic data generation mechanisms, which are defined as follows:

$$\begin{aligned} X_{\text{syn}_1} &\sim \mathcal{N} \Bigg( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \Bigg) \\\ X_{\text{syn}_2} &\sim \mathcal{U} \Bigg( \begin{bmatrix} -2 \\ -2 \end{bmatrix}, \begin{bmatrix} 4 \\ 4 \end{bmatrix} \Bigg) \\\ X_{\text{syn}_3} &\sim \mathcal{N} \Bigg( \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} \sqrt{2} & 0 \\ 0 & \sqrt{2} \end{bmatrix} \Bigg) \end{aligned}$$

where $\mathcal{U}$ denotes the multivariate uniform distribution, with parameters $\alpha$ and $\beta$, denoted by the two vectors in $\mathcal{U}$.

The density ratio $r_i(x)$ is estimated directly as $$\hat{r_i}(x) = \frac{p(X_{\text{real}})}{p(X_{\text{syn}_i})}$$ using unconstrained least-squares importance fitting. In R we use the package densityratio for this. Additionally, we consider three weighting approaches. The first one is the unweighted analyses, which serves as as benchmark. The second one optimizes the regularization parameter $\lambda$ and the kernel width $\sigma$ through cross-validation. The third method optimizes the kernel width $\sigma$ through cross-validation, but fixes the regularization parameter to $\lambda = 0$, which implies that there is no regularization performed.

Subsequently, we perform weighted least squares, in which the variable X_1 is regressed on X_2. In R, this can be done as follows.

# Load packages
library(densityratio)
library(ggplot2)
library(dplyr)
library(purrr)

# Set seed for reproducibility
set.seed(123)

# Set sample size
N <- 250  
# Set number of variables
P <- 2
# Set covariance structure
S <- diag(P)
rho <- 0.3
S[S==0] <- rho

# Generate observed data
Xreal <- rnorm(N*P) |> matrix(N, P) %*% chol(S)

# Generate syn1
Xsyn1 <- rnorm(N*P) |> matrix(N, P)
# Generate syn2
Xsyn2 <- runif(N*P, -2, 4) |> matrix(N, P)
# Generate syn3
Xsyn3 <- rnorm(N*P, 1, sqrt(2)) |> matrix(N, P)

# density ratio weights with cross-validation for kernel width and lambda
r1a <- densityratio::ulsif(Xreal, Xsyn1)
r2a <- densityratio::ulsif(Xreal, Xsyn2)
r3a <- densityratio::ulsif(Xreal, Xsyn3)

# density ratio weights with cross-validation for kernel width (no shrinkage)
r1b <- densityratio::ulsif(Xreal, Xsyn1, lambda = 0)
r2b <- densityratio::ulsif(Xreal, Xsyn2, lambda = 0)
r3b <- densityratio::ulsif(Xreal, Xsyn3, lambda = 0)

# Real data analysis
real_analysis <- lm(Xreal[,1] ~ Xreal[,2])

# First synthetic data analysis (correct mean, incorrect covariance)
unweighted1 <- lm(Xsyn1[,1] ~ Xsyn1[,2])
# Weighted analysis after model selection for sigma and lambda
weighted1 <- lm(Xsyn1[,1] ~ Xsyn1[,2], weights = predict(r1a, Xsyn1) |> pmax(0))
# Weighted analysis after model selection for sigma only (no regularization)
weighted_noreg1 <- lm(Xsyn1[,1] ~ Xsyn1[,2], weights = predict(r1b, Xsyn1) |> pmax(0))


# Second synthetic data analysis (incorrect mean, incorrect covariance
# and incorrect distribution (uniform(-2, 4)))
unweighted2 <- lm(Xsyn2[,1] ~ Xsyn2[,2])
# Weighted analysis after model selection for sigma and lambda
weighted2 <- lm(Xsyn2[,1] ~ Xsyn2[,2], weights = predict(r2a, Xsyn2) |> pmax(0))
# Weighted analysis after model selection for sigma only (no regularization)
weighted_noreg2 <- lm(Xsyn2[,1] ~ Xsyn2[,2], weights = predict(r2b, Xsyn2) |> pmax(0))

# Third synthetic data analysis (incorrect mean, incorrect covariance
# correct distribution)
unweighted3 <- lm(Xsyn3[,1] ~ Xsyn3[,2])
# Weighted analysis after model selection for sigma and lambda
weighted3 <- lm(Xsyn3[,1] ~ Xsyn3[,2], weights = predict(r3a, Xsyn3) |> pmax(0))
# Weighted analysis after model selection for sigma only (no regularization)
weighted_noreg3 <- lm(Xsyn3[,1] ~ Xsyn3[,2], weights = predict(r3b, Xsyn3) |> pmax(0))

# Plot data
ggplot() +
  geom_point(aes(x = Xsyn1[,1], y = Xsyn1[,2], col = "Xsyn1")) +
  geom_point(aes(x = Xsyn2[,1], y = Xsyn2[,2], col = "Xsyn2")) +
  geom_point(aes(x = Xsyn3[,1], y = Xsyn3[,2], col = "Xsyn3")) +
  geom_point(aes(x = Xreal[,1], y = Xreal[,2], col = "Xreal")) +
  theme_minimal() +
  scale_color_viridis_d()

Although the points are a bit overlapping, the elliptical distribution of the real data can be seen in purple (darkest), blue (slightly lighter) corresponds to the distribution in which only the covariances are misspecified, green (again slightly lighter) corresponds to the distribution itself is misspecified, and yellow (lightest) corresponds to a multivariate distribution with both the means and (co-)variances misspecified.

list(
  Real = real_analysis, 
  Unw1 = unweighted1, 
  Unw2 = unweighted2, 
  Unw3 = unweighted3,
  Wght1 = weighted1, 
  Wght2 = weighted2, 
  Wght3 = weighted3, 
  WghtNoReg1 = weighted_noreg1,
  WghtNoReg2 = weighted_noreg2,
  WghtNoReg3 = weighted_noreg3
) |>
  map_dfr(~ coef(.x) |> setNames(NULL)) |>
  tidyr::pivot_longer(cols = -c(Real), 
                      names_to = c(".value", "SynMethod"),
                      names_pattern = "(.*)(.)") |>
  mutate(Coef = rep(c("b0", "b1"), each = 3)) |>
  select(Coef, SynMethod, Real, Unw, Wght, WghtNoReg)
#> # A tibble: 6 × 6
#>   Coef  SynMethod    Real     Unw    Wght WghtNoReg
#>   <chr> <chr>       <dbl>   <dbl>   <dbl>     <dbl>
#> 1 b0    1         -0.0315 -0.0348 -0.0883   -0.0407
#> 2 b0    2         -0.0315  1.01   -0.0735   -0.0229
#> 3 b0    3         -0.0315  1.01    0.0108   -0.0114
#> 4 b1    1          0.321   0.0184  0.243     0.298 
#> 5 b1    2          0.321  -0.0347  0.246     0.0820
#> 6 b1    3          0.321   0.0428  0.234     0.278

^{Created on 2023-09-18 with reprex v2.0.2}

Accordingly, it can be seen that the unweighted analyses leads to substantial bias for all three synthetic data sets in both the intercept (except for the first synthetic dataset in which the mean structure is correct) and the slope. When reweighting the analyses, the bias decreases for all three synthetic data sets, but there is substantial bias in the slopes. When performing no regularization, the bias decreases somewhat for the first and third synthetic data set, but substantially increases for the second synthetic data set, which is generated from the bivariate uniform distribution.

thomvolker/densityratioweighting.md

How density ratio estimation can be used to reweigh a sample