What is the expected confidence interval overlap under a correct synthesis model?

ci_overlap <- function(obs_l, obs_u, syn_l, syn_u) {
  obs_ol <- (min(obs_u, syn_u) - max(obs_l, syn_l)) / (obs_u - obs_l)
  syn_ol <- (min(obs_u, syn_u) - max(obs_l, syn_l)) / (syn_u - syn_l)
  (obs_ol + syn_ol) / 2
}

set.seed(123)

nsim <- 1000000
n <- 1000

sim_ci <- function(n) {
  obs <- rnorm(n)
  syn <- rnorm(n, mean(obs), sd(obs))
  ci_obs <- confint(lm(obs ~ 1))
  ci_syn <- confint(lm(syn ~ 1))
  ci_overlap(
    ci_obs[1, 1],
    ci_obs[1, 2],
    ci_syn[1, 1],
    ci_syn[1, 2]
  )
}

cl <- parallel::makeCluster(18)
parallel::clusterExport(cl, list("sim_ci", "n", "ci_overlap"))
pbapply::pboptions(type = "none")
ci_overlaps <- pbapply::pbreplicate(nsim, sim_ci(n), cl = cl)
parallel::stopCluster(cl)

Consider two CI's, one for the observed data and one for the synthetic data. We can consider the observed data CI fixed, as the synthetic data is always generated with the parameters of the observed data. Accordingly, the expected CI of the observed data is equal to 0 +/- 1.96 * s / sqrt(n). The expected CI of the synthetic data is pretty similar, but typically not the same. As the variance of the synthetic data is on average the same as the variance of the observed data, the width of the CI obtained on the synthetic data is on average equal to the width of the CI obtained on the observed data. However, the CI obtained on the synthetic data is typically one average absolute deviation away from the CI obtained on the observed data. Accordingly, we can fill in the following numbers

s <- 1 / sqrt(n) # standard error of the mean in the observed and synthetic data
z <- qt(0.975, df = n - 1) # quantiles of t distribution with df = n - 1
d <- (2 * s) / (sqrt(2 * pi)) # average absolute deviation of synthetic data estimate

The CI overlap equals: $$\frac{\min(u_o, u_s) - \max(lo, ls)}{2(u_o - l_o)} + \frac{\min(u_o, u_s) - max(l_o, l_s)}{2(us - ls)},$$ where $u_o$ and $l_o$ are the upper and lower bounds of the observed data CI, and $u_s$ and $l_s$ are the upper and lower bounds of the synthetic data CI.

Then, the expectations of the upper and lower bounds of the observed data CI are equal to $0 \pm 1.96s$.

lo <- -z * s
uo <- z * s

Then, with probability 0.5, the synthetic data estimate is higher than the observed data estimate. If this is the case, the expected deviation equals $\frac{2s}{\sqrt{2 \pi}}$. By symmetry, the same holds for the lower bound.

ls <- d + lo
us <- d + uo

Let us fill in the formula of the CI overlap with these numbers.

((min(uo, us) - max(lo, ls)) / (uo - lo) + ((min(uo, us) - max(lo, ls)) / (us - ls))) / 2
#> [1] 0.7967009

mean(ci_overlaps)
#> [1] 0.7966697

Follow-up questions

What happens if you do proper synthesis (i.e., the parameters in the synthetic data model are included with noise)? Then you have two independent deviations that you have to keep track of, and add them together.

What happens if you don't have a single variance parameter, but a variance- covariance matrix?

^{Created on 2024-01-19 with reprex v2.0.2}

thomvolker/ci_overlap.md

What is the expected confidence interval overlap under a correct synthesis model?