Last active
March 5, 2025 15:11
-
-
Save mihaiconstantin/bd22dbb482457caabe2b0299d188e680 to your computer and use it in GitHub Desktop.
This gist presents the general workflow for a Structural Equation Modeling (SEM) simulation, with an accent on sample size analysis.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # | |
# Example of simple SEM simulation # | |
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # | |
# # | |
# Author(s): # | |
# - Mihai Constantin ([email protected]) # | |
# # | |
# File description: # | |
# - This file contains the general workflow for a Structural Equation # | |
# Modeling (SEM) simulation, with an accent on sample size analysis. # | |
# # | |
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # | |
# # | |
# License: # | |
# - The contents of this file (i.e., including the comments, source # | |
# code, and other text) by Mihai Constantin are licensed under the # | |
# CC BY 4.0 license. See https://creativecommons.org/licenses/by/4.0. # | |
# # | |
# You can provide attribution by citing as follows: # | |
# # | |
# Constantin, M. (2025, March 5). Example of simple SEM simulation in `R` # | |
# with an accent on the sample size. # | |
# https://gist.github.com/mihaiconstantin/bd22dbb482457caabe2b0299d188e680 # | |
# # | |
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # | |
# Start by loading the necessary libraries. | |
library(lavaan) | |
library(ggplot2) | |
library(progress) | |
# Before starting any simulation, it helps to think of what the goal is. What do | |
# we want to achieve with the simulation? In this case, we want to evaluate the | |
# impact of sample size on the model fit. Note that using causal language is | |
# warranted, since that is exactly what a simulation is, an experiment where we | |
# manipulate the sample size to see how it affects the model fit. Therefore, we | |
# our dependent variable is the model fit, and the independent variable is the | |
# sample size. | |
# That being said, let's start by determining what "good model fit" means. This | |
# is a subjective decision, but it is important to make it explicit before | |
# running the simulation. For instance, we could say that a good model fit is a | |
# heuristic where the CFI (Comparative Fit Index) is greater than 0.95 and the | |
# RMSEA (Root Mean Square Error of Approximation) is less than 0.05. It is | |
# important here to let the research question be your guide, and the literature | |
# your support. | |
# Let us define a function that evaluates the fit based on the CFI and RMSEA. | |
# This function takes in the `fit_measures` object obtained from the | |
# `lavaan::fitMeasures` function and return one if the model has good fit | |
# according to our heuristic, and zero otherwise. | |
has_good_fit <- function(fit_measures) { | |
# Extract the CFI. | |
cfi <- fit_measures["cfi"] | |
# Extract the RMSEA. | |
rmsea <- fit_measures["rmsea"] | |
# If both CFI and RMSEA are good. | |
if (cfi >= 0.95 && rmsea < 0.05) { | |
# Return `1`. | |
return(1) | |
} | |
# Otherwise, return `0`. | |
return(0) | |
} | |
# At this point, we are ready to specify the model we want to fit. For this | |
# example, we will use a simple model full SEM with four latent variables. For | |
# the measurement part of the model, we have four latent variables, each measured | |
# by four indicators. The structural part of the model consists of two latent | |
# variables, with the third regressed on the first two, and the fourth regressed | |
# on the first three. The model is specified below. | |
# Specify the model. | |
model <- " | |
# Measurement model. | |
# Define the first latent variable. | |
eta1 =~ y1 + y2 + y3 + y4 | |
# Define the second latent variable. | |
eta2 =~ y5 + y6 + y7 + y8 | |
# Define the third latent variable. | |
eta3 =~ y9 + y10 + y11 + y12 | |
# Define the fourth latent variable. | |
eta4 =~ y13 + y14 + y15 + y16 | |
# Structural model. | |
# Latent regressions. | |
eta3 ~ eta1 + eta2 | |
eta4 ~ eta1 + eta2 + eta3 | |
" | |
# Arguably the hardest part of the SEM simulation is specifying the population | |
# model. You can think of the population model as the "true" model that you are | |
# trying to estimate. In other words, if you had collected an infinite sample | |
# size, then the model you would fit would be the population model. What makes | |
# specifying the population model difficult is that you need to know the true | |
# values of the parameters. In practice, this is rarely the case, and the | |
# population model is often a best guess based on theory and previous research. | |
# Specify the population model. | |
model_population <- " | |
# Measurement model. | |
# Define the first latent variable. | |
eta1 =~ 1 * y1 + 0.70 * y2 + 0.80 * y3 + 0.65 * y4 | |
# Define the second latent variable. | |
eta2 =~ 1 * y5 + 0.65 * y6 + 0.75 * y7 + 0.80 * y8 | |
# Define the third latent variable. | |
eta3 =~ 1 * y9 + 0.75 * y10 + 0.70 * y11 + 0.65 * y12 | |
# Define the fourth latent variable. | |
eta4 =~ 1 * y13 + 0.75 * y14 + 0.65 * y15 + 0.80 * y16 | |
# Residual variances. | |
y1 ~~ y1 | |
y2 ~~ (1 - 0.70^2) * y2 | |
y3 ~~ (1 - 0.80^2) * y3 | |
y4 ~~ (1 - 0.65^2) * y4 | |
y5 ~~ y5 | |
y6 ~~ (1 - 0.65^2) * y6 | |
y7 ~~ (1 - 0.75^2) * y7 | |
y8 ~~ (1 - 0.80^2) * y8 | |
y9 ~~ y9 | |
y10 ~~ (1 - 0.75^2) * y10 | |
y11 ~~ (1 - 0.70^2) * y11 | |
y12 ~~ (1 - 0.65^2) * y12 | |
y13 ~~ y13 | |
y14 ~~ (1 - 0.75^2) * y14 | |
y15 ~~ (1 - 0.65^2) * y15 | |
y16 ~~ (1 - 0.80^2) * y16 | |
# Structural model. | |
# Latent regressions. | |
eta3 ~ 0.50 * eta1 + 0.30 * eta2 | |
eta4 ~ 0.40 * eta1 + 0.25 * eta2 + 0.35 * eta3 | |
# Latent variances. | |
# Fix exogenous factor variances to 1. | |
eta1 ~~ 1 * eta1 | |
eta2 ~~ 1 * eta2 | |
# Fix (i.e., or freely estimate the endogenous factor variances if you may). | |
eta3 ~~ 1 * eta3 | |
eta4 ~~ 1 * eta4 | |
# Latent covariances. | |
eta1 ~~ 0.30 * eta2 | |
" | |
# We need two more ingredients before we can inquire about the sample size. | |
# First, we need to be able to generate data from our population model. In | |
# simple terms, we need a way to construct the (population) model-implied | |
# covariance matrix. Then, we can use this covariance matrix to draw samples | |
# from a multivariate normal distribution. In less simple terms, recall that our | |
# model for for the vector `y` of random variables is: | |
# | |
# y = Λη + ε, | |
# | |
# where η is further modeled as: | |
# | |
# η = Bη + ζ | |
# = (I - B)⁻¹ ζ, | |
# | |
# with assumptions: | |
# | |
# y ~ 𝒩(0, Σ) | |
# ζ ~ 𝒩(0, Ψ) | |
# ε ~ 𝒩(0, Θ). | |
# | |
# Then, our model-implied covariance matrix is the variance of `y`, which is | |
# obtained by: | |
# | |
# Σ = Var(y) | |
# = Var(Λη + ε) | |
# = Var(Λη) + Var(ε) | |
# = Λ Var(η) Λᵀ + Θ | |
# = Λ Var((I - B)⁻¹ ζ) Λᵀ + Θ | |
# = Λ (I - B)⁻¹ Var(ζ) (I - B)⁻¹ᵀ Λᵀ + Θ | |
# = Λ (I - B)⁻¹ Ψ (I - B)⁻¹ᵀ Λᵀ + Θ. | |
# | |
# Therefore, through the `lavaan` syntax we first indicated which specific | |
# parameters in the matrices Λ, Ψ, and Θ we want to estimate. Then, using | |
# `lavaan` syntax, we selected values for those parameters in line with our | |
# hypothesis about what the values of those parameters are in the population. | |
# Please note that this latter step is likely the part we are most wrong about. | |
# Finally, using the values will pick, we can calculate the model-implied | |
# covariance matrix Σ as shown above, the we can use it to take samples from a | |
# multivariate normal distribution. All these steps are very conveniently | |
# abstracted away in the `lavaan::simulateData` function (see Rosseel 2012). | |
# Suppose we are interested in generating 1000 samples from the population | |
# model. We can do this as indicated below. | |
# Define the sample size. | |
sample_size <- 200 | |
# Generate data from the population model. | |
data <- simulateData(model = model_population, sample.nobs = sample_size) | |
# View the first few rows of the data. | |
head(data) | |
# The second ingredient we need is a way we fit our model to the data. In simple | |
# terms, the process of fitting a SEM model is a numerical optimization problem. | |
# The gist (i.e., pun intended) is that we select different values for our model | |
# parameters, recompute the model-implied covariance matrix Σ, and compare it to | |
# the observed covariance matrix S (i.e., the one observed in the data). We then | |
# adjust our model parameters to minimize the discrepancy between Σ and S. In | |
# most cases, when using Maximum Likelihood Estimation, this discrepancy is | |
# operationalized by a fit function defined as follows: | |
# | |
# F_ML = trace(S Σ(θ)⁻¹) - ln |S Σ(θ)⁻¹| - p, | |
# | |
# where: | |
# • trace(·) calculates the sum of the main diagonal of a matrix | |
# • ln is the natural logarithm and | · | indicates the matrix determinant | |
# • S is the sample (observed) covariance matrix | |
# • Σ(θ) is the model-implied covariance matrix | |
# • θ are the model parameters used to calculate Σ(θ) | |
# • p is the number of observed variables | |
# | |
# This expression is derived based on the assumption of multivariate normality: | |
# | |
# y ~ 𝒩(0, Σ). | |
# | |
# In the less useful case that our model fits the data perfectly (i.e., when we | |
# have zero degrees of freedom), we observe that S = Σ(θ) and the entire | |
# expression above simplifies to zero. In practice, we are not interested in | |
# such saturated models, but rather in models that are parsimonious and | |
# generalizable. | |
# This entire process of "trying out" different values for the model parameters | |
# are calculating the fit function is repeated many times until we are | |
# "satisfied" with the outcome (e.g., when further changes no longer yield | |
# meaningful improvements). This process is in fact a numerical optimization, | |
# problem and it is, again, conveniently abstracted away in the `lavaan::sem` | |
# function. We can fit our model to the data as indicated below. | |
# Fit the model to the simulated data. | |
fit <- sem(model, data = data) | |
# At this point, we can calculate a plethora of fit indices to evaluate how well | |
# our model fits the data. The `lavaan::fitMeasures` function provides a | |
# convenient way to calculate these fit indices. For instance, we can print the | |
# estimated model parameters and inspect the fit indices as indicated below. | |
# Summarize the model fit. | |
summary(fit, fit.measures = TRUE, standardized = TRUE) | |
# Provided that the sample size was "sufficiently large", we should expect to | |
# see that the estimated model parameters are close to the true values specified | |
# in the population model. | |
# Let us also evaluate our model fit based on the heuristic we defined earlier. | |
# We can do this by passing the fit measures to the `has_good_fit` function we | |
# defined earlier. | |
# First, extract the fit measures. | |
fit_measures <- fitMeasures(fit) | |
# Evaluate the model fit. | |
has_good_fit(fit_measures) | |
# We can already see from the `summary` output that the model fit is good. | |
# Therefore, we can reasonably expect the `has_good_fit` function to return `1`. | |
# Let us now try to repeat the process of another sample size. It would be a bit | |
# error-prone to do this manually, so we can leverage `R` to automate this | |
# process. To this, we can define a function. In general, you think of a | |
# function as a process that converts an input into an output. In our case, the | |
# input is the sample size, and the output is the result of the model fit | |
# evaluation according to our heuristic. The process is the exact steps we | |
# discussed above. The function does not change the process in any way, it | |
# merely makes it more convenient to repeat. | |
# Note. For the sake of clarity, we assume some arguments are defined in the | |
# global environment. This is not a good practice, and you should always aim to | |
# pass all necessary arguments to the function explicitly. However, this is but | |
# a demonstration aimed for educational purposes. | |
# Define a function to run the simulation for a single sample size. | |
run_simulation <- function(sample_size) { | |
# Attempt to fit the model. | |
output <- tryCatch( | |
expr = { | |
# Simulate data. | |
data <- lavaan::simulateData(model_population, sample.nobs = sample_size) | |
# Fit the model. | |
fit <- sem(model, data = data) | |
# Extract the fit measures. | |
fit_measures <- fitMeasures(fit) | |
# Evaluate the model fit. | |
result <- has_good_fit(fit_measures) | |
# Return the result. | |
return(result) | |
}, | |
# In case of during the data generation process, estimation, or evaluation. | |
error = function(e) { | |
# Return `NA` (i.e., a missing value). | |
return(NA) | |
} | |
) | |
# Return the result. | |
return(output) | |
} | |
# Run the simulation function for several sample sizes. | |
run_simulation(50) | |
run_simulation(100) | |
run_simulation(150) | |
# With this simulation function in hand, we can now run the simulation for | |
# multiple sample sizes. Suppose, we are interested in evaluating the model fit | |
# for sample sizes ranging from 50 to 200 in increments of 5. We can obtain | |
# those sample sizes as indicated below. | |
# Define a range of sample sizes. | |
sample_sizes <- seq(50, 200, by = 5) | |
# Print the sample sizes. | |
print(sample_sizes) | |
# In general, it is a good practice to run the simulation multiple times for | |
# each sample size. This is because our simulation process is stochastic (i.e., | |
# our generated data is a sample from a random process). Therefore, the model | |
# fit outcome is subject to random fluctuations. By running the simulation | |
# multiple times, we can obtain a more stable estimate of the model fit for each | |
# sample size. In this case, say we run the simulation 100 times for each sample | |
# size. | |
# Define the number of replications. | |
replications <- 100 | |
# While at it, let us also define a progress bar to gauge the progress of the | |
# simulation. This may come in handy, especially when running simulations that | |
# take a long time to complete. | |
# Define a progress bar. | |
bar <- progress_bar$new(total = length(sample_sizes) * replications) | |
# Run the simulation for each sample size selected | |
results <- sapply(sample_sizes, function(sample_size) { | |
# For each sample size, replicate the simulation. | |
result <- replicate( | |
# How many times to replicate the simulation. | |
n = replications, | |
# The expression to replicate (i.e., our simulation function) | |
expr = { | |
# Increment the progress bar. | |
bar$tick() | |
# Run the simulation function. | |
run_simulation(sample_size) | |
} | |
) | |
# Return the result. | |
return(result) | |
}) | |
# Once we are done with the simulation, also remove the progress bar. | |
rm(bar) | |
# Add the column names to the results for clarity. | |
colnames(results) <- sample_sizes | |
# Add the row names to the results for clarity. | |
rownames(results) <- 1:replications | |
# If we print the results, we can see that we have a matrix where each row | |
# corresponds to a replication and each column corresponds to a sample size. The | |
# values in the matrix are the results of the model fit evaluation according to | |
# our heuristic. If we go ahead and calculate the means of each column, we can | |
# obtain the proportion of good model fit for each sample size. | |
# Take the column means. | |
results_means <- colMeans(results, na.rm = TRUE) | |
# Finally, we can wrap up the simulation by plotting the results. We can plot the | |
# proportion of good model fit as a function of the sample size. We can also add | |
# a horizontal line at 0.8 to indicate some threshold of interest. | |
# Prepare the data. | |
data <- data.frame( | |
sample_size = sample_sizes, | |
proportion_good_fit = results_means | |
) | |
# Plot the results. | |
ggplot(data, aes(x = sample_size, y = proportion_good_fit)) + | |
# Plot the line. | |
geom_line( | |
color = "gray", | |
size = 1.5 | |
) + | |
# Add vertical lines from the points to each sample size on the x-axis. | |
geom_segment( | |
aes(xend = sample_size, yend = 0), | |
color = "black", | |
linetype = "dotted", | |
size = 0.3 | |
) + | |
# Add the point values. | |
geom_point( | |
size = 2.5 | |
) + | |
# Add a horizontal line at 0.8. | |
geom_hline( | |
yintercept = 0.8, | |
linetype = "dashed", | |
color = "darkred" | |
) + | |
# Restrict the y-axis. | |
scale_y_continuous( | |
limits = c(0, 1), | |
breaks = seq(0, 1, by = 0.1) | |
) + | |
# Add x-axis labels. | |
scale_x_continuous( | |
breaks = sample_sizes | |
) + | |
# Lab names. | |
labs( | |
title = paste0("Proportion of \"Good Fit\" by Sample Size (based on ", replications, " replications)"), | |
x = "Sample Size", | |
y = "Proportion of Good Fit" | |
) + | |
# Set the main theme. | |
theme_bw() + | |
# Adjust the theme. | |
theme( | |
# Increase font size of the labels. | |
axis.text = element_text(size = 12), | |
axis.title = element_text(size = 14), | |
# Hide major grid. | |
panel.grid.major = element_blank(), | |
# Rotate the x-axis labels. | |
axis.text.x = element_text(angle = 45, hjust = 1), | |
# Add some space between the plot and the labels. | |
axis.title.y = element_text( | |
margin = margin(t = 0, r = 5, b = 0, l = 0) | |
), | |
axis.title.x = element_text( | |
margin = margin(t = 15, r = 0, b = 0, l = 0) | |
) | |
) | |
# Through this document, we have discussed the general workflow for a Structural | |
# Equation Modeling (SEM) simulation, with an accent on sample size analysis. We | |
# have also discussed the importance of defining what "good model fit" means, | |
# and how to evaluate the model fit based on the heuristic we defined. We have | |
# also discussed the importance of replicating our simulation to obtain a more | |
# stable estimate of the model fit. Finally, we saw how to plot the results of | |
# the simulation to visualize the impact of sample size on the model fit. Of | |
# course, we only scratched the surface here, and there are many more aspects of | |
# SEM simulations that we could explore, as well as ways to make these | |
# simulations execute faster (e.g., using parallelization via the `parabar` | |
# package). Additionally, for another interesting approach to calculating sample | |
# requirements in SEM, readers may also consider consulting the paper Jak et al. | |
# (2021). Nevertheless, I hope this document can serve as a gentle starting | |
# point for anyone interested in SEM simulations. | |
# References | |
# | |
# Jak, S., Jorgensen, T. D., Verdam, M. G. E., Oort, F. J., & Elffers, L. | |
# (2021). Analytical power calculations for structural equation modeling: A | |
# tutorial and Shiny app. Behavior Research Methods, 53(4), 1385–1406. | |
# https://doi.org/10.3758/s13428-020-01479-0 | |
# | |
# Rosseel, Y. (2012). Lavaan: An R Package for Structural Equation Modeling. | |
# Journal of Statistical Software, 48(2). https://doi.org/10.18637/jss.v048.i02 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Example of a plot generated by following the steps discussed above.