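# CI overlap: length of the intersection relative to the width of each
# interval, averaged over the observed and synthetic intervals.
ci_overlap <- function(obs_l, obs_u, syn_l, syn_u) {
  obs_ol <- (min(obs_u, syn_u) - max(obs_l, syn_l)) / (obs_u - obs_l)
  syn_ol <- (min(obs_u, syn_u) - max(obs_l, syn_l)) / (syn_u - syn_l)
  (obs_ol + syn_ol) / 2
}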
set.seed(123)
nsim <- 1000000
n <- 1000
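# One replicate: draw observed data, synthesize from its estimated mean and
# sd, and compute the overlap of the two 95% CIs for the mean.
sim_ci <- function(n) {
  obs <- rnorm(n)
  syn <- rnorm(n, mean(obs), sd(obs))
  ci_obs <- confint(lm(obs ~ 1))
  ci_syn <- confint(lm(syn ~ 1))
  ci_overlap(
    ci_obs[1, 1],
    ci_obs[1, 2],
    ci_syn[1, 1],
    ci_syn[1, 2]
  )
}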
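# Note: set.seed() above seeds only the main process, not the workers; call
# parallel::clusterSetRNGStream(cl, 123) after makeCluster() for reproducible draws.
cl <- parallel::makeCluster(18) # number of workers; adjust to the available cores
parallel::clusterExport(cl, c("sim_ci", "n", "ci_overlap"))
pbapply::pboptions(type = "none") # suppress the progress bar
ci_overlaps <- pbapply::pbreplicate(nsim, sim_ci(n), cl = cl)
parallel::stopCluster(cl)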
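To get a feel for the measure, the overlap function can be evaluated on a few hand-picked intervals. The sketch below repeats the definition of ci_overlap so that it runs on its own; the intervals are illustrative values, not taken from the simulation.

```r
# Same definition as above, repeated so the block is self-contained.
ci_overlap <- function(obs_l, obs_u, syn_l, syn_u) {
  intersection <- min(obs_u, syn_u) - max(obs_l, syn_l)
  (intersection / (obs_u - obs_l) + intersection / (syn_u - syn_l)) / 2
}

ci_overlap(0, 1, 0, 1)     # identical intervals: 1
ci_overlap(0, 1, 0.5, 1.5) # shifted by half a width: 0.5
ci_overlap(0, 2, 0.5, 1.5) # nested: (1/2 + 1/1) / 2 = 0.75
ci_overlap(0, 1, 2, 3)     # disjoint: negative, here -1
```

Because the intersection length is not truncated at zero, disjoint intervals yield a negative value. For two intervals of equal width, the measure is therefore linear in the distance between their centers, which is what allows an expected deviation to be plugged in directly.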
Consider two CIs, one for the observed data and one for the synthetic data. We can treat the observed data CI as fixed, because the synthetic data are always generated with the parameters estimated from the observed data. Accordingly, the expected CI of the observed data is approximately 0 +/- z * sd / sqrt(n), where sd is the standard deviation of the data (1 here) and z is the 0.975 quantile of the t distribution with n - 1 degrees of freedom (about 1.96 for large n). The expected CI of the synthetic data is similar, but typically not identical. Because the variance of the synthetic data equals the variance of the observed data on average, the CI obtained on the synthetic data has, on average, the same width as the CI obtained on the observed data. Its location differs, however: the synthetic data estimate varies around the observed data estimate with a standard deviation equal to the standard error, so the CI obtained on the synthetic data lies, on average, one mean absolute deviation away from the CI obtained on the observed data. Accordingly, we can fill in the following numbers:
s <- 1 / sqrt(n) # standard error of the mean in the observed and synthetic data (sd = 1)
z <- qt(0.975, df = n - 1) # 0.975 quantile of the t distribution with df = n - 1
d <- (2 * s) / (sqrt(2 * pi)) # mean absolute deviation of the synthetic data estimate, s * sqrt(2 / pi)
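The expression for d is the mean absolute deviation of a normal variable with mean 0 and standard deviation s, which is s * sqrt(2 / pi) (algebraically the same as 2 * s / sqrt(2 * pi)). As a quick sanity check, it can be compared against a Monte Carlo estimate. The names dev and d_mc and the number of draws are chosen for illustration only, and the normality of the deviation is an assumption of this sketch.

```r
set.seed(1)
n <- 1000
s <- 1 / sqrt(n)            # standard error of the mean
d <- (2 * s) / sqrt(2 * pi) # closed-form mean absolute deviation

# Deviation of the synthetic mean from the observed mean, assumed N(0, s^2)
dev <- rnorm(1e6, mean = 0, sd = s)
d_mc <- mean(abs(dev))

c(closed_form = d, monte_carlo = d_mc)
```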
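The CI overlap equals the length of the intersection of the two intervals,
divided by the width of each interval in turn, and averaged over the two.
The expectations of the lower and upper bounds of the observed data CI
are equal to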
lo <- -z * s
uo <- z * s
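With probability 0.5, the synthetic data estimate is higher than the
observed data estimate. In that case, the expected bounds of the synthetic
data CI are the observed bounds shifted upward by d (the opposite case is
symmetric and yields the same overlap):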
ls <- d + lo
us <- d + uo
Let us fill in the formula of the CI overlap with these numbers.
((min(uo, us) - max(lo, ls)) / (uo - lo) + ((min(uo, us) - max(lo, ls)) / (us - ls))) / 2
#> [1] 0.7967009
mean(ci_overlaps)
#> [1] 0.7966697
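Because both intervals have the same expected width 2 * z * s and are shifted by d relative to each other, the filled-in formula simplifies to 1 - d / (2 * z * s) = 1 - 1 / (z * sqrt(2 * pi)). Under this approximation the sample size enters only through the t quantile. A minimal sketch, where expected_overlap is a helper name introduced here for illustration:

```r
# Expected CI overlap under the approximation above;
# n affects the result only through the t quantile z.
expected_overlap <- function(n) {
  z <- qt(0.975, df = n - 1)
  1 - 1 / (z * sqrt(2 * pi))
}

expected_overlap(1000) # same value as the filled-in formula, 0.7967009
sapply(c(10, 100, 1000, 1e6), expected_overlap)
```

As n grows, z approaches qnorm(0.975), so the expected overlap tends to 1 - 1 / (qnorm(0.975) * sqrt(2 * pi)), roughly 0.796.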
Follow-up questions
What happens if you do proper synthesis (i.e., the parameters of the synthetic data model are drawn with noise rather than fixed at their estimates)? Then there are two independent deviations to keep track of, and they have to be added together.
What happens if you don't have a single variance parameter, but a variance-covariance matrix?
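For the first question, one possible reading (an assumption, not something established above) is that proper synthesis draws the mean of the synthesis model from its approximate sampling distribution, N(mean(obs), sd(obs)^2 / n), while the standard deviation stays fixed at its estimate. The synthetic data estimate then deviates from the observed data estimate through two independent sources, each with variance s^2, so the variances add and the expected absolute deviation becomes sqrt(2) * d. Under the same approximation as before, this suggests an expected overlap of 1 - 1 / (z * sqrt(pi)). The sketch below compares that guess with a small simulation; t_ci and sim_ci_proper are hypothetical helpers, and the intervals are computed directly rather than via confint(lm(...)) to keep it fast.

```r
ci_overlap <- function(obs_l, obs_u, syn_l, syn_u) {
  intersection <- min(obs_u, syn_u) - max(obs_l, syn_l)
  (intersection / (obs_u - obs_l) + intersection / (syn_u - syn_l)) / 2
}

# t-based 95% CI for the mean (the same interval as confint(lm(x ~ 1)))
t_ci <- function(x) {
  mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
}

# Assumed form of proper synthesis: the mean is drawn with noise,
# the standard deviation is fixed at its estimate.
sim_ci_proper <- function(n) {
  obs <- rnorm(n)
  mu_star <- rnorm(1, mean(obs), sd(obs) / sqrt(n))
  syn <- rnorm(n, mu_star, sd(obs))
  ci_obs <- t_ci(obs)
  ci_syn <- t_ci(syn)
  ci_overlap(ci_obs[1], ci_obs[2], ci_syn[1], ci_syn[2])
}

set.seed(123)
n <- 1000
simulated <- mean(replicate(5000, sim_ci_proper(n)))
z <- qt(0.975, df = n - 1)
approximated <- 1 - 1 / (z * sqrt(pi))
c(simulated = simulated, approximated = approximated)
```

If the standard deviation were also drawn with noise, the interval widths would vary more between the two data sets, and this simple approximation would need a further correction.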
Created on 2024-01-19 with reprex v2.0.2