
Divergence-based testing using density ratio estimation techniques

Thom Benjamin Volker

In this notebook, we explore properties of divergence-based two-sample testing, using the methods implemented in the densityratio package (Volker et al., 2023). We consider three settings: (1) the two groups are drawn from the same distribution; (2) the two groups are drawn from distributions with different means but the same variance-covariance matrix; and (3) the two groups are drawn from distributions with the same means but different covariance matrices. Subsequently, we compare the performance of the divergence-based estimators, evaluating type I and type II error rates under each scenario, and benchmark them against Hotelling’s $T^2$ test and the Fasano-Franceschini test, a multivariate extension of the well-known Kolmogorov-Smirnov test.

Data-generating mechanism

In setting one, we generate two groups of $n_\text{nu} = n_\text{de} = 200$ samples from the same $5$-dimensional standard normal distribution. In setting two, we vary the means of the two groups: the first group retains mean vector $\mathbf{0}$, while the second group has mean vector $\mathbf{0.1}$. In setting three, we keep the mean vectors of both groups at $\mathbf{0}$, but vary the covariance matrices: the first group has covariance matrix $V_1 = \text{diag}(1, 1, 1, 1, 1)$, while the second group has covariance matrix $V_3 = 0.3 + 0.7 \cdot \text{diag}(1, 1, 1, 1, 1)$, that is, unit variances with all pairwise covariances equal to $0.3$.

library(densityratio)

set.seed(123)

nsim <- 500
P <- 5
N <- 200

M1 <- rep(0, P)
M2 <- rep(0.1, P)

V1 <- diag(P)
V3 <- 0.3 + 0.7 * diag(P)

gen_data <- function(N, P, Mnu, Mde, Vnu, Vde) {
  # N x P matrices with the mean vector repeated in every row
  munu <- matrix(rep(Mnu, each = N), N)
  mude <- matrix(rep(Mde, each = N), N)
  # multiply standard normal draws by the Cholesky factor to impose the covariance
  nu <- (rnorm(N*P) |> matrix(N)) %*% chol(Vnu) + munu
  de <- (rnorm(N*P) |> matrix(N)) %*% chol(Vde) + mude
  list(nu = nu, de = de)
}
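
As a quick sanity check (not part of the simulations below), we can draw a single data set from the third setting and verify that the sample moments are close to their targets: means near $\mathbf{0}$, unit variances, and off-diagonal covariances near $0.3$.

check <- gen_data(N, P, M1, M1, V1, V3)
round(colMeans(check$de), 2)  # should be close to zero
round(cov(check$de), 2)       # diagonal near 1, off-diagonal near 0.3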

Subsequently, we write a function that tests whether two groups of samples are drawn from the same underlying distribution using several different tests, and returns the $p$-value of each test.

two_sample_test <- function(nu, de) {
  # classical benchmarks: Hotelling's T2 and the Fasano-Franceschini test
  hotelling <- DescTools::HotellingsT2Test(nu, de)$p.value[1]
  kolmogorov <- fasano.franceschini.test::fasano.franceschini.test(nu, de, verbose = FALSE)$p.value
  # divergence-based tests from densityratio, via summary(..., test = TRUE)
  ulsif_est    <- summary(
    ulsif(nu, de, nsigma = 10, nlambda = 5, ncenters = 100, progressbar = FALSE), 
    test = TRUE,
    progressbar = FALSE)$p_value
  kliep_est    <- summary(
    kliep(nu, de, nsigma = 10, ncenters = 100, progressbar = FALSE), 
    test = TRUE, 
    progressbar = FALSE)$p_value
  kmm_est      <- summary(
    kmm(nu, de, nsigma = 10, ncenters = 100, progressbar = FALSE), 
    test = TRUE, 
    progressbar = FALSE)$p_value
  spectral_est <- summary(
    spectral(nu, de, nsigma = 10, ncenters = 100, progressbar = FALSE), 
    test = TRUE,
    progressbar = FALSE)$p_value
  naive_est    <- summary(naive(nu, de), test = TRUE, progressbar = FALSE)$p_value
  
  c(
    hotelling = hotelling,
    kolmogorov = kolmogorov,
    ulsif = ulsif_est,
    kliep = kliep_est,
    kmm = kmm_est,
    spectral = spectral_est,
    naive = naive_est
  )
}
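
Before running the full simulations, a single call illustrates the output format: a named vector with one $p$-value per test (the exact values depend on the random seed).

dat <- gen_data(N, P, M1, M1, V1, V1)
two_sample_test(dat$nu, dat$de)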

Setting 1: Same distribution

We first generate two groups of samples from the same standard normal distribution to evaluate the type I error rate of each method. We repeat this procedure 500 times and report the proportion of $p$-values below the nominal $\alpha = 0.05$ level.

cl <- parallel::makeCluster(18)
invisible(parallel::clusterCall(cl, fun = \() {
  library(densityratio)
  pbapply::pboptions(type = "none")
}))
parallel::clusterExport(cl, c("N", "P", "M1", "V1", "gen_data", "two_sample_test"))
out1 <- pbapply::pbreplicate(
  nsim, {
    dat <- gen_data(N, P, M1, M1, V1, V1)
    two_sample_test(dat$nu, dat$de)
  },
  cl = cl
)

parallel::stopCluster(cl)

rowMeans(out1 < 0.05)
         hotelling kolmogorov.p-value              ulsif              kliep 
             0.040              0.050              0.052              0.058 
               kmm           spectral              naive 
             0.056              0.052              0.042 

While there is some variation in the type I error rates, all are reasonably close to the nominal $0.05$ level. That is, all methods are able to control the type I error rate.
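
This is also what we would expect from Monte Carlo error alone: with nsim = 500 replications, the standard error of an estimated rejection rate near $0.05$ is about $0.01$, so rates roughly between $0.03$ and $0.07$ are consistent with the nominal level.

# Monte Carlo standard error of a rejection rate of 0.05 over nsim replications
sqrt(0.05 * (1 - 0.05) / nsim)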

Setting 2: Different means

We then turn to the second setting, where we generate two groups of samples from normal distributions with different means.

cl <- parallel::makeCluster(18)
invisible(parallel::clusterCall(cl, fun = \() {
  library(densityratio)
  pbapply::pboptions(type = "none")
}))
parallel::clusterExport(cl, c("N", "P", "M1", "M2", "V1", "gen_data", "two_sample_test"))
out2 <- pbapply::pbreplicate(
  nsim, {
    dat <- gen_data(N, P, M1, M2, V1, V1)
    two_sample_test(dat$nu, dat$de)
  },
  cl = cl
)

parallel::stopCluster(cl)

rowMeans(out2 < 0.05)
         hotelling kolmogorov.p-value              ulsif              kliep 
             0.368              0.216              0.238              0.180 
               kmm           spectral              naive 
             0.142              0.148              0.132 

Here, we notice that Hotelling’s $T^2$ test performs best, while the other methods perform roughly on par with one another. This is not surprising, as a pure mean shift under a common normal covariance is exactly the scenario for which Hotelling’s $T^2$ is designed: it is the uniformly most powerful invariant test in this setting. Still, it is good to see that the other methods retain reasonable power.
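
For reference, Hotelling’s $T^2$ statistic compares the difference in sample means against the pooled sample covariance. A minimal hand-rolled version (a sketch of the textbook formula, not the DescTools implementation) looks as follows:

hotelling_t2 <- function(X, Y) {
  n1 <- nrow(X); n2 <- nrow(Y); p <- ncol(X)
  d  <- colMeans(X) - colMeans(Y)
  # pooled covariance matrix
  Sp <- ((n1 - 1) * cov(X) + (n2 - 1) * cov(Y)) / (n1 + n2 - 2)
  T2 <- (n1 * n2) / (n1 + n2) * drop(t(d) %*% solve(Sp, d))
  # exact F transformation of T2 under multivariate normality
  Fstat <- (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2
  pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)
}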

Setting 3: Different covariances

Finally, we generate two groups of samples with the same mean vector but different covariance matrices, so that the groups differ only in their second moments.

cl <- parallel::makeCluster(18)
invisible(parallel::clusterCall(cl, fun = \() {
  library(densityratio)
  pbapply::pboptions(type = "none")
}))
parallel::clusterExport(cl, c("N", "P", "M1", "V1", "V3", "gen_data", "two_sample_test"))

out3 <- pbapply::pbreplicate(
  nsim, {
    dat <- gen_data(N, P, M1, M1, V1, V3)
    two_sample_test(dat$nu, dat$de)
  },
  cl = cl
)

parallel::stopCluster(cl)

rowMeans(out3 < 0.05)
         hotelling kolmogorov.p-value              ulsif              kliep 
             0.070              0.750              0.550              0.950 
               kmm           spectral              naive 
             1.000              0.955              0.540 

Finally, in the regime where the two groups have different covariance matrices, the performance of Hotelling’s $T^2$ test deteriorates (as expected, since it can only detect differences in means), while the divergence-based tests (particularly kliep, kmm, and spectral) excel, even outperforming the Fasano-Franceschini test. Hence, divergence-based tests can be very useful when two groups are expected to differ in higher-order moments.
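
To make the mechanics of such a test concrete, the sketch below implements a permutation test on a uLSIF-style Pearson divergence estimate. This is an illustration of the idea only, not the densityratio implementation: the bandwidth sigma, the regularization parameter lambda, and the fixed set of kernel centers are arbitrary choices made for brevity, whereas the package searches over grids of these parameters (the nsigma and nlambda arguments used above).

pearson_divergence <- function(nu, de, centers, sigma = 1, lambda = 0.1) {
  # Gaussian kernel matrix between the rows of X and the rows of C
  gauss <- function(X, C) {
    D2 <- outer(rowSums(X^2), rowSums(C^2), "+") - 2 * X %*% t(C)
    exp(-D2 / (2 * sigma^2))
  }
  Knu <- gauss(nu, centers)
  Kde <- gauss(de, centers)
  H <- crossprod(Kde) / nrow(de)  # approximates E_de[k(x) k(x)']
  h <- colMeans(Knu)              # approximates E_nu[k(x)]
  theta <- solve(H + lambda * diag(ncol(Kde)), h)  # ridge-regularized least squares
  # plug-in Pearson divergence estimate: E_nu[r] - E_de[r^2] / 2 - 1/2
  sum(h * theta) - drop(t(theta) %*% H %*% theta) / 2 - 0.5
}

divergence_perm_test <- function(nu, de, nperm = 200) {
  centers <- nu[sample(nrow(nu), min(100, nrow(nu))), , drop = FALSE]
  obs <- pearson_divergence(nu, de, centers)
  pooled <- rbind(nu, de)
  # recompute the divergence after randomly reassigning group labels
  null <- replicate(nperm, {
    idx <- sample(nrow(pooled))
    pearson_divergence(pooled[idx[seq_len(nrow(nu))], , drop = FALSE],
                       pooled[idx[-seq_len(nrow(nu))], , drop = FALSE],
                       centers)
  })
  (1 + sum(null >= obs)) / (nperm + 1)  # permutation p-value
}

Under the null hypothesis the group labels are exchangeable, so the observed divergence should look like a draw from the permutation distribution; a divergence that is large relative to that distribution yields a small $p$-value.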

Cite as

Volker, T. B. (2025). Divergence-based testing using density ratio estimation techniques. https://gist.github.com/thomvolker/58197e535ec458752bccbb5b611046ce
