Skip to content

Instantly share code, notes, and snippets.

@mihaiconstantin
Last active March 5, 2025 15:11
Show Gist options
  • Save mihaiconstantin/bd22dbb482457caabe2b0299d188e680 to your computer and use it in GitHub Desktop.
Save mihaiconstantin/bd22dbb482457caabe2b0299d188e680 to your computer and use it in GitHub Desktop.
This gist presents the general workflow for a Structural Equation Modeling (SEM) simulation, with an accent on sample size analysis.
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Example of simple SEM simulation #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# #
# Author(s): #
# - Mihai Constantin ([email protected]) #
# #
# File description: #
# - This file contains the general workflow for a Structural Equation #
# Modeling (SEM) simulation, with an accent on sample size analysis. #
# #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# #
# License: #
# - The contents of this file (i.e., including the comments, source #
# code, and other text) by Mihai Constantin are licensed under the #
# CC BY 4.0 license. See https://creativecommons.org/licenses/by/4.0. #
# #
# You can provide attribution by citing as follows: #
# #
# Constantin, M. (2025, March 5). Example of simple SEM simulation in `R` #
# with an accent on the sample size. #
# https://gist.github.com/mihaiconstantin/bd22dbb482457caabe2b0299d188e680 #
# #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# Start by loading the necessary libraries.
library(lavaan)
library(ggplot2)
library(progress)
# Before starting any simulation, it helps to think of what the goal is. What do
# we want to achieve with the simulation? In this case, we want to evaluate the
# impact of sample size on the model fit. Note that using causal language is
# warranted, since that is exactly what a simulation is, an experiment where we
# manipulate the sample size to see how it affects the model fit. Therefore, we
# our dependent variable is the model fit, and the independent variable is the
# sample size.
# That being said, let's start by determining what "good model fit" means. This
# is a subjective decision, but it is important to make it explicit before
# running the simulation. For instance, we could say that a good model fit is a
# heuristic where the CFI (Comparative Fit Index) is greater than 0.95 and the
# RMSEA (Root Mean Square Error of Approximation) is less than 0.05. It is
# important here to let the research question be your guide, and the literature
# your support.
# Let us define a function that evaluates the fit based on the CFI and RMSEA.
# This function takes in the `fit_measures` object obtained from the
# `lavaan::fitMeasures` function and return one if the model has good fit
# according to our heuristic, and zero otherwise.
has_good_fit <- function(fit_measures) {
# Extract the CFI.
cfi <- fit_measures["cfi"]
# Extract the RMSEA.
rmsea <- fit_measures["rmsea"]
# If both CFI and RMSEA are good.
if (cfi >= 0.95 && rmsea < 0.05) {
# Return `1`.
return(1)
}
# Otherwise, return `0`.
return(0)
}
# At this point, we are ready to specify the model we want to fit. For this
# example, we will use a simple model full SEM with four latent variables. For
# the measurement part of the model, we have four latent variables, each measured
# by four indicators. The structural part of the model consists of two latent
# variables, with the third regressed on the first two, and the fourth regressed
# on the first three. The model is specified below.
# Specify the model.
model <- "
# Measurement model.
# Define the first latent variable.
eta1 =~ y1 + y2 + y3 + y4
# Define the second latent variable.
eta2 =~ y5 + y6 + y7 + y8
# Define the third latent variable.
eta3 =~ y9 + y10 + y11 + y12
# Define the fourth latent variable.
eta4 =~ y13 + y14 + y15 + y16
# Structural model.
# Latent regressions.
eta3 ~ eta1 + eta2
eta4 ~ eta1 + eta2 + eta3
"
# Arguably the hardest part of the SEM simulation is specifying the population
# model. You can think of the population model as the "true" model that you are
# trying to estimate. In other words, if you had collected an infinite sample
# size, then the model you would fit would be the population model. What makes
# specifying the population model difficult is that you need to know the true
# values of the parameters. In practice, this is rarely the case, and the
# population model is often a best guess based on theory and previous research.
# Specify the population model.
model_population <- "
# Measurement model.
# Define the first latent variable.
eta1 =~ 1 * y1 + 0.70 * y2 + 0.80 * y3 + 0.65 * y4
# Define the second latent variable.
eta2 =~ 1 * y5 + 0.65 * y6 + 0.75 * y7 + 0.80 * y8
# Define the third latent variable.
eta3 =~ 1 * y9 + 0.75 * y10 + 0.70 * y11 + 0.65 * y12
# Define the fourth latent variable.
eta4 =~ 1 * y13 + 0.75 * y14 + 0.65 * y15 + 0.80 * y16
# Residual variances.
y1 ~~ y1
y2 ~~ (1 - 0.70^2) * y2
y3 ~~ (1 - 0.80^2) * y3
y4 ~~ (1 - 0.65^2) * y4
y5 ~~ y5
y6 ~~ (1 - 0.65^2) * y6
y7 ~~ (1 - 0.75^2) * y7
y8 ~~ (1 - 0.80^2) * y8
y9 ~~ y9
y10 ~~ (1 - 0.75^2) * y10
y11 ~~ (1 - 0.70^2) * y11
y12 ~~ (1 - 0.65^2) * y12
y13 ~~ y13
y14 ~~ (1 - 0.75^2) * y14
y15 ~~ (1 - 0.65^2) * y15
y16 ~~ (1 - 0.80^2) * y16
# Structural model.
# Latent regressions.
eta3 ~ 0.50 * eta1 + 0.30 * eta2
eta4 ~ 0.40 * eta1 + 0.25 * eta2 + 0.35 * eta3
# Latent variances.
# Fix exogenous factor variances to 1.
eta1 ~~ 1 * eta1
eta2 ~~ 1 * eta2
# Fix (i.e., or freely estimate the endogenous factor variances if you may).
eta3 ~~ 1 * eta3
eta4 ~~ 1 * eta4
# Latent covariances.
eta1 ~~ 0.30 * eta2
"
# We need two more ingredients before we can inquire about the sample size.
# First, we need to be able to generate data from our population model. In
# simple terms, we need a way to construct the (population) model-implied
# covariance matrix. Then, we can use this covariance matrix to draw samples
# from a multivariate normal distribution. In less simple terms, recall that our
# model for for the vector `y` of random variables is:
#
# y = Λη + ε,
#
# where η is further modeled as:
#
# η = Bη + ζ
# = (I - B)⁻¹ ζ,
#
# with assumptions:
#
# y ~ 𝒩(0, Σ)
# ζ ~ 𝒩(0, Ψ)
# ε ~ 𝒩(0, Θ).
#
# Then, our model-implied covariance matrix is the variance of `y`, which is
# obtained by:
#
# Σ = Var(y)
# = Var(Λη + ε)
# = Var(Λη) + Var(ε)
# = Λ Var(η) Λᵀ + Θ
# = Λ Var((I - B)⁻¹ ζ) Λᵀ + Θ
# = Λ (I - B)⁻¹ Var(ζ) (I - B)⁻¹ᵀ Λᵀ + Θ
# = Λ (I - B)⁻¹ Ψ (I - B)⁻¹ᵀ Λᵀ + Θ.
#
# Therefore, through the `lavaan` syntax we first indicated which specific
# parameters in the matrices Λ, Ψ, and Θ we want to estimate. Then, using
# `lavaan` syntax, we selected values for those parameters in line with our
# hypothesis about what the values of those parameters are in the population.
# Please note that this latter step is likely the part we are most wrong about.
# Finally, using the values will pick, we can calculate the model-implied
# covariance matrix Σ as shown above, the we can use it to take samples from a
# multivariate normal distribution. All these steps are very conveniently
# abstracted away in the `lavaan::simulateData` function (see Rosseel 2012).
# Suppose we are interested in generating 1000 samples from the population
# model. We can do this as indicated below.
# Define the sample size.
sample_size <- 200
# Generate data from the population model.
data <- simulateData(model = model_population, sample.nobs = sample_size)
# View the first few rows of the data.
head(data)
# The second ingredient we need is a way we fit our model to the data. In simple
# terms, the process of fitting a SEM model is a numerical optimization problem.
# The gist (i.e., pun intended) is that we select different values for our model
# parameters, recompute the model-implied covariance matrix Σ, and compare it to
# the observed covariance matrix S (i.e., the one observed in the data). We then
# adjust our model parameters to minimize the discrepancy between Σ and S. In
# most cases, when using Maximum Likelihood Estimation, this discrepancy is
# operationalized by a fit function defined as follows:
#
# F_ML = trace(S Σ(θ)⁻¹) - ln |S Σ(θ)⁻¹| - p,
#
# where:
# • trace(·) calculates the sum of the main diagonal of a matrix
# • ln is the natural logarithm and | · | indicates the matrix determinant
# • S is the sample (observed) covariance matrix
# • Σ(θ) is the model-implied covariance matrix
# • θ are the model parameters used to calculate Σ(θ)
# • p is the number of observed variables
#
# This expression is derived based on the assumption of multivariate normality:
#
# y ~ 𝒩(0, Σ).
#
# In the less useful case that our model fits the data perfectly (i.e., when we
# have zero degrees of freedom), we observe that S = Σ(θ) and the entire
# expression above simplifies to zero. In practice, we are not interested in
# such saturated models, but rather in models that are parsimonious and
# generalizable.
# This entire process of "trying out" different values for the model parameters
# are calculating the fit function is repeated many times until we are
# "satisfied" with the outcome (e.g., when further changes no longer yield
# meaningful improvements). This process is in fact a numerical optimization,
# problem and it is, again, conveniently abstracted away in the `lavaan::sem`
# function. We can fit our model to the data as indicated below.
# Fit the model to the simulated data.
fit <- sem(model, data = data)
# At this point, we can calculate a plethora of fit indices to evaluate how well
# our model fits the data. The `lavaan::fitMeasures` function provides a
# convenient way to calculate these fit indices. For instance, we can print the
# estimated model parameters and inspect the fit indices as indicated below.
# Summarize the model fit.
summary(fit, fit.measures = TRUE, standardized = TRUE)
# Provided that the sample size was "sufficiently large", we should expect to
# see that the estimated model parameters are close to the true values specified
# in the population model.
# Let us also evaluate our model fit based on the heuristic we defined earlier.
# We can do this by passing the fit measures to the `has_good_fit` function we
# defined earlier.
# First, extract the fit measures.
fit_measures <- fitMeasures(fit)
# Evaluate the model fit.
has_good_fit(fit_measures)
# We can already see from the `summary` output that the model fit is good.
# Therefore, we can reasonably expect the `has_good_fit` function to return `1`.
# Let us now try to repeat the process of another sample size. It would be a bit
# error-prone to do this manually, so we can leverage `R` to automate this
# process. To this, we can define a function. In general, you think of a
# function as a process that converts an input into an output. In our case, the
# input is the sample size, and the output is the result of the model fit
# evaluation according to our heuristic. The process is the exact steps we
# discussed above. The function does not change the process in any way, it
# merely makes it more convenient to repeat.
# Note. For the sake of clarity, we assume some arguments are defined in the
# global environment. This is not a good practice, and you should always aim to
# pass all necessary arguments to the function explicitly. However, this is but
# a demonstration aimed for educational purposes.
# Define a function to run the simulation for a single sample size.
run_simulation <- function(sample_size) {
# Attempt to fit the model.
output <- tryCatch(
expr = {
# Simulate data.
data <- lavaan::simulateData(model_population, sample.nobs = sample_size)
# Fit the model.
fit <- sem(model, data = data)
# Extract the fit measures.
fit_measures <- fitMeasures(fit)
# Evaluate the model fit.
result <- has_good_fit(fit_measures)
# Return the result.
return(result)
},
# In case of during the data generation process, estimation, or evaluation.
error = function(e) {
# Return `NA` (i.e., a missing value).
return(NA)
}
)
# Return the result.
return(output)
}
# Run the simulation function for several sample sizes.
run_simulation(50)
run_simulation(100)
run_simulation(150)
# With this simulation function in hand, we can now run the simulation for
# multiple sample sizes. Suppose, we are interested in evaluating the model fit
# for sample sizes ranging from 50 to 200 in increments of 5. We can obtain
# those sample sizes as indicated below.
# Define a range of sample sizes.
sample_sizes <- seq(50, 200, by = 5)
# Print the sample sizes.
print(sample_sizes)
# In general, it is a good practice to run the simulation multiple times for
# each sample size. This is because our simulation process is stochastic (i.e.,
# our generated data is a sample from a random process). Therefore, the model
# fit outcome is subject to random fluctuations. By running the simulation
# multiple times, we can obtain a more stable estimate of the model fit for each
# sample size. In this case, say we run the simulation 100 times for each sample
# size.
# Define the number of replications.
replications <- 100
# While at it, let us also define a progress bar to gauge the progress of the
# simulation. This may come in handy, especially when running simulations that
# take a long time to complete.
# Define a progress bar.
bar <- progress_bar$new(total = length(sample_sizes) * replications)
# Run the simulation for each sample size selected
results <- sapply(sample_sizes, function(sample_size) {
# For each sample size, replicate the simulation.
result <- replicate(
# How many times to replicate the simulation.
n = replications,
# The expression to replicate (i.e., our simulation function)
expr = {
# Increment the progress bar.
bar$tick()
# Run the simulation function.
run_simulation(sample_size)
}
)
# Return the result.
return(result)
})
# Once we are done with the simulation, also remove the progress bar.
rm(bar)
# Add the column names to the results for clarity.
colnames(results) <- sample_sizes
# Add the row names to the results for clarity.
rownames(results) <- 1:replications
# If we print the results, we can see that we have a matrix where each row
# corresponds to a replication and each column corresponds to a sample size. The
# values in the matrix are the results of the model fit evaluation according to
# our heuristic. If we go ahead and calculate the means of each column, we can
# obtain the proportion of good model fit for each sample size.
# Take the column means.
results_means <- colMeans(results, na.rm = TRUE)
# Finally, we can wrap up the simulation by plotting the results. We can plot the
# proportion of good model fit as a function of the sample size. We can also add
# a horizontal line at 0.8 to indicate some threshold of interest.
# Prepare the data.
data <- data.frame(
sample_size = sample_sizes,
proportion_good_fit = results_means
)
# Plot the results.
ggplot(data, aes(x = sample_size, y = proportion_good_fit)) +
# Plot the line.
geom_line(
color = "gray",
size = 1.5
) +
# Add vertical lines from the points to each sample size on the x-axis.
geom_segment(
aes(xend = sample_size, yend = 0),
color = "black",
linetype = "dotted",
size = 0.3
) +
# Add the point values.
geom_point(
size = 2.5
) +
# Add a horizontal line at 0.8.
geom_hline(
yintercept = 0.8,
linetype = "dashed",
color = "darkred"
) +
# Restrict the y-axis.
scale_y_continuous(
limits = c(0, 1),
breaks = seq(0, 1, by = 0.1)
) +
# Add x-axis labels.
scale_x_continuous(
breaks = sample_sizes
) +
# Lab names.
labs(
title = paste0("Proportion of \"Good Fit\" by Sample Size (based on ", replications, " replications)"),
x = "Sample Size",
y = "Proportion of Good Fit"
) +
# Set the main theme.
theme_bw() +
# Adjust the theme.
theme(
# Increase font size of the labels.
axis.text = element_text(size = 12),
axis.title = element_text(size = 14),
# Hide major grid.
panel.grid.major = element_blank(),
# Rotate the x-axis labels.
axis.text.x = element_text(angle = 45, hjust = 1),
# Add some space between the plot and the labels.
axis.title.y = element_text(
margin = margin(t = 0, r = 5, b = 0, l = 0)
),
axis.title.x = element_text(
margin = margin(t = 15, r = 0, b = 0, l = 0)
)
)
# Through this document, we have discussed the general workflow for a Structural
# Equation Modeling (SEM) simulation, with an accent on sample size analysis. We
# have also discussed the importance of defining what "good model fit" means,
# and how to evaluate the model fit based on the heuristic we defined. We have
# also discussed the importance of replicating our simulation to obtain a more
# stable estimate of the model fit. Finally, we saw how to plot the results of
# the simulation to visualize the impact of sample size on the model fit. Of
# course, we only scratched the surface here, and there are many more aspects of
# SEM simulations that we could explore, as well as ways to make these
# simulations execute faster (e.g., using parallelization via the `parabar`
# package). Additionally, for another interesting approach to calculating sample
# requirements in SEM, readers may also consider consulting the paper Jak et al.
# (2021). Nevertheless, I hope this document can serve as a gentle starting
# point for anyone interested in SEM simulations.
# References
#
# Jak, S., Jorgensen, T. D., Verdam, M. G. E., Oort, F. J., & Elffers, L.
# (2021). Analytical power calculations for structural equation modeling: A
# tutorial and Shiny app. Behavior Research Methods, 53(4), 1385–1406.
# https://doi.org/10.3758/s13428-020-01479-0
#
# Rosseel, Y. (2012). Lavaan: An R Package for Structural Equation Modeling.
# Journal of Statistical Software, 48(2). https://doi.org/10.18637/jss.v048.i02
@mihaiconstantin
Copy link
Author

Example of a plot generated by following the steps discussed above.

sem-simulation-plot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment