Density ratio estimation can be used to construct weights for analyses, that is, if the sample of interest does
not correspond to the target population in some way, one can reweight the sample of interest such that it is
closer to the target population. One way to do that is by estimating the density ratio
As shown below, this approach can also be used to reweight analyses on synthetic data, when the synthetic data
generation process differs from the true data generating process. In the following example, we have
Additionally, we have three synthetic data generation mechanisms, which are defined as follows:
where
The density ratio R
we use the package densityratio
for this.
Additionally, we consider three weighting approaches. The first one is the unweighted analyses, which serves as
as benchmark. The second one optimizes the regularization parameter
Subsequently, we perform weighted least squares, in which the variable X_1 is regressed on X_2.
In R
, this can be done as follows.
# Load packages
library(densityratio)
library(ggplot2)
library(dplyr)
library(purrr)
# Set seed for reproducibility
set.seed(123)
# Set sample size
N <- 250
# Set number of variables
P <- 2
# Set covariance structure
S <- diag(P)
rho <- 0.3
S[S==0] <- rho
# Generate observed data
Xreal <- rnorm(N*P) |> matrix(N, P) %*% chol(S)
# Generate syn1
Xsyn1 <- rnorm(N*P) |> matrix(N, P)
# Generate syn2
Xsyn2 <- runif(N*P, -2, 4) |> matrix(N, P)
# Generate syn3
Xsyn3 <- rnorm(N*P, 1, sqrt(2)) |> matrix(N, P)
# density ratio weights with cross-validation for kernel width and lambda
r1a <- densityratio::ulsif(Xreal, Xsyn1)
r2a <- densityratio::ulsif(Xreal, Xsyn2)
r3a <- densityratio::ulsif(Xreal, Xsyn3)
# density ratio weights with cross-validation for kernel width (no shrinkage)
r1b <- densityratio::ulsif(Xreal, Xsyn1, lambda = 0)
r2b <- densityratio::ulsif(Xreal, Xsyn2, lambda = 0)
r3b <- densityratio::ulsif(Xreal, Xsyn3, lambda = 0)
# Real data analysis
real_analysis <- lm(Xreal[,1] ~ Xreal[,2])
# First synthetic data analysis (correct mean, incorrect covariance)
unweighted1 <- lm(Xsyn1[,1] ~ Xsyn1[,2])
# Weighted analysis after model selection for sigma and lambda
weighted1 <- lm(Xsyn1[,1] ~ Xsyn1[,2], weights = predict(r1a, Xsyn1) |> pmax(0))
# Weighted analysis after model selection for sigma only (no regularization)
weighted_noreg1 <- lm(Xsyn1[,1] ~ Xsyn1[,2], weights = predict(r1b, Xsyn1) |> pmax(0))
# Second synthetic data analysis (incorrect mean, incorrect covariance
# and incorrect distribution (uniform(-2, 4)))
unweighted2 <- lm(Xsyn2[,1] ~ Xsyn2[,2])
# Weighted analysis after model selection for sigma and lambda
weighted2 <- lm(Xsyn2[,1] ~ Xsyn2[,2], weights = predict(r2a, Xsyn2) |> pmax(0))
# Weighted analysis after model selection for sigma only (no regularization)
weighted_noreg2 <- lm(Xsyn2[,1] ~ Xsyn2[,2], weights = predict(r2b, Xsyn2) |> pmax(0))
# Third synthetic data analysis (incorrect mean, incorrect covariance
# correct distribution)
unweighted3 <- lm(Xsyn3[,1] ~ Xsyn3[,2])
# Weighted analysis after model selection for sigma and lambda
weighted3 <- lm(Xsyn3[,1] ~ Xsyn3[,2], weights = predict(r3a, Xsyn3) |> pmax(0))
# Weighted analysis after model selection for sigma only (no regularization)
weighted_noreg3 <- lm(Xsyn3[,1] ~ Xsyn3[,2], weights = predict(r3b, Xsyn3) |> pmax(0))
# Plot data
ggplot() +
geom_point(aes(x = Xsyn1[,1], y = Xsyn1[,2], col = "Xsyn1")) +
geom_point(aes(x = Xsyn2[,1], y = Xsyn2[,2], col = "Xsyn2")) +
geom_point(aes(x = Xsyn3[,1], y = Xsyn3[,2], col = "Xsyn3")) +
geom_point(aes(x = Xreal[,1], y = Xreal[,2], col = "Xreal")) +
theme_minimal() +
scale_color_viridis_d()
Although the points are a bit overlapping, the elliptical distribution of the real data can be seen in purple (darkest), blue (slightly lighter) corresponds to the distribution in which only the covariances are misspecified, green (again slightly lighter) corresponds to the distribution itself is misspecified, and yellow (lightest) corresponds to a multivariate distribution with both the means and (co-)variances misspecified.
list(
Real = real_analysis,
Unw1 = unweighted1,
Unw2 = unweighted2,
Unw3 = unweighted3,
Wght1 = weighted1,
Wght2 = weighted2,
Wght3 = weighted3,
WghtNoReg1 = weighted_noreg1,
WghtNoReg2 = weighted_noreg2,
WghtNoReg3 = weighted_noreg3
) |>
map_dfr(~ coef(.x) |> setNames(NULL)) |>
tidyr::pivot_longer(cols = -c(Real),
names_to = c(".value", "SynMethod"),
names_pattern = "(.*)(.)") |>
mutate(Coef = rep(c("b0", "b1"), each = 3)) |>
select(Coef, SynMethod, Real, Unw, Wght, WghtNoReg)
#> # A tibble: 6 × 6
#> Coef SynMethod Real Unw Wght WghtNoReg
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 b0 1 -0.0315 -0.0348 -0.0883 -0.0407
#> 2 b0 2 -0.0315 1.01 -0.0735 -0.0229
#> 3 b0 3 -0.0315 1.01 0.0108 -0.0114
#> 4 b1 1 0.321 0.0184 0.243 0.298
#> 5 b1 2 0.321 -0.0347 0.246 0.0820
#> 6 b1 3 0.321 0.0428 0.234 0.278
Created on 2023-09-18 with reprex v2.0.2
Accordingly, it can be seen that the unweighted analyses leads to substantial bias for all three synthetic data sets in both the intercept (except for the first synthetic dataset in which the mean structure is correct) and the slope. When reweighting the analyses, the bias decreases for all three synthetic data sets, but there is substantial bias in the slopes. When performing no regularization, the bias decreases somewhat for the first and third synthetic data set, but substantially increases for the second synthetic data set, which is generated from the bivariate uniform distribution.