mihaiconstantin · March 5, 2025 15:11 · mihaiconstantin · Mar 5, 2025
diff --git a/sem-simulation.R b/sem-simulation.R
 # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
 #                    Example of simple SEM simulation                         #
 # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
 #                                                                             #
 # Author(s):                                                                  #
 #   - Mihai Constantin ([email protected])                            #
 #                                                                             #
 # File description:                                                           #
 #   - This file contains the general workflow for a Structural Equation       #
 #     Modeling (SEM) simulation, with an accent on sample size analysis.      #
 #                                                                             #
 # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
 #                                                                             #
 # License:                                                                    #
 #   - The contents of this file (i.e., including the comments, source         #
 #     code, and other text) by Mihai Constantin are licensed under the        #
 #     CC BY 4.0 license. See https://creativecommons.org/licenses/by/4.0.     #
 #                                                                             #
 # You can provide attribution by citing as follows:                           #
 #                                                                             #
 #  Constantin, M. (2025, March 5). Example of simple SEM simulation in `R`    #
 #  with an accent on the sample size.                                         #
 #  https://gist.github.com/mihaiconstantin/bd22dbb482457caabe2b0299d188e680   #
 #                                                                             #
 # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #


 # Start by loading the necessary libraries.
 library(lavaan)
 library(ggplot2)
 library(progress)


 # Before starting any simulation, it helps to think of what the goal is. What do
 # we want to achieve with the simulation? In this case, we want to evaluate the
 # impact of sample size on the model fit. Note that using causal language is
 # warranted, since that is exactly what a simulation is, an experiment where we
 # manipulate the sample size to see how it affects the model fit. Therefore, we
 # our dependent variable is the model fit, and the independent variable is the
 # sample size.

 # That being said, let's start by determining what "good model fit" means. This
 # is a subjective decision, but it is important to make it explicit before
 # running the simulation. For instance, we could say that a good model fit is a
 # heuristic where the CFI (Comparative Fit Index) is greater than 0.95 and the
 # RMSEA (Root Mean Square Error of Approximation) is less than 0.05. It is
 # important here to let the research question be your guide, and the literature
 # your support.

 # Let us define a function that evaluates the fit based on the CFI and RMSEA.
 # This function takes in the `fit_measures` object obtained from the
 # `lavaan::fitMeasures` function and return one if the model has good fit
 # according to our heuristic, and zero otherwise.
 has_good_fit <- function(fit_measures) {
    # Extract the CFI.
    cfi <- fit_measures["cfi"]

    # Extract the RMSEA.
    rmsea <- fit_measures["rmsea"]

    # If both CFI and RMSEA are good.
    if (cfi >= 0.95 && rmsea < 0.05) {
        # Return `1`.
        return(1)
    }

    # Otherwise, return `0`.
    return(0)
 }

 # At this point, we are ready to specify the model we want to fit. For this
 # example, we will use a simple model full SEM with four latent variables. For
 # the measurement part of the model, we have four latent variables, each measured
 # by four indicators. The structural part of the model consists of two latent
 # variables, with the third regressed on the first two, and the fourth regressed
 # on the first three. The model is specified below.

 # Specify the model.
 model <- "
    # Measurement model.

    # Define the first latent variable.
    eta1 =~ y1 + y2 + y3 + y4

    # Define the second latent variable.
    eta2 =~ y5 + y6 + y7 + y8

    # Define the third latent variable.
    eta3 =~ y9 + y10 + y11 + y12

    # Define the fourth latent variable.
    eta4 =~ y13 + y14 + y15 + y16

    # Structural model.

    # Latent regressions.
    eta3 ~ eta1 + eta2
    eta4 ~ eta1 + eta2 + eta3
 "

 # Arguably the hardest part of the SEM simulation is specifying the population
 # model. You can think of the population model as the "true" model that you are
 # trying to estimate. In other words, if you had collected an infinite sample
 # size, then the model you would fit would be the population model. What makes
 # specifying the population model difficult is that you need to know the true
 # values of the parameters. In practice, this is rarely the case, and the
 # population model is often a best guess based on theory and previous research.

 #  Specify the population model.
 model_population <- "
    # Measurement model.

    # Define the first latent variable.
    eta1 =~ 1 * y1 + 0.70 * y2 + 0.80 * y3 + 0.65 * y4

    # Define the second latent variable.
    eta2 =~ 1 * y5 + 0.65 * y6 + 0.75 * y7 + 0.80 * y8

    # Define the third latent variable.
    eta3 =~ 1 * y9 + 0.75 * y10 + 0.70 * y11 + 0.65 * y12

    # Define the fourth latent variable.
    eta4 =~ 1 * y13 + 0.75 * y14 + 0.65 * y15 + 0.80 * y16

    # Residual variances.
    y1 ~~ y1
    y2 ~~ (1 - 0.70^2) * y2
    y3 ~~ (1 - 0.80^2) * y3
    y4 ~~ (1 - 0.65^2) * y4

    y5 ~~ y5
    y6 ~~ (1 - 0.65^2) * y6
    y7 ~~ (1 - 0.75^2) * y7
    y8 ~~ (1 - 0.80^2) * y8

    y9 ~~ y9
    y10 ~~ (1 - 0.75^2) * y10
    y11 ~~ (1 - 0.70^2) * y11
    y12 ~~ (1 - 0.65^2) * y12

    y13 ~~ y13
    y14 ~~ (1 - 0.75^2) * y14
    y15 ~~ (1 - 0.65^2) * y15
    y16 ~~ (1 - 0.80^2) * y16

    # Structural model.

    # Latent regressions.
    eta3 ~ 0.50 * eta1 + 0.30 * eta2
    eta4 ~ 0.40 * eta1 + 0.25 * eta2 + 0.35 * eta3

    # Latent variances.

    # Fix exogenous factor variances to 1.
    eta1 ~~ 1 * eta1
    eta2 ~~ 1 * eta2

    # Fix (i.e., or freely estimate the endogenous factor variances if you may).
    eta3 ~~ 1 * eta3
    eta4 ~~ 1 * eta4

    # Latent covariances.
    eta1 ~~ 0.30 * eta2
 "

 # We need two more ingredients before we can inquire about the sample size.
 # First, we need to be able to generate data from our population model. In
 # simple terms, we need a way to construct the (population) model-implied
 # covariance matrix. Then, we can use this covariance matrix to draw samples
 # from a multivariate normal distribution. In less simple terms, recall that our
 # model for for the vector `y` of random variables is:
 #
 #   y = Λη + ε,
 #
 # where η is further modeled as:
 #
 #   η = Bη + ζ
 #     = (I - B)⁻¹ ζ,
 #
 # with assumptions:
 #
 #   y ~ 𝒩(0, Σ)
 #   ζ ~ 𝒩(0, Ψ)
 #   ε ~ 𝒩(0, Θ).
 #
 # Then, our model-implied covariance matrix is the variance of `y`, which is
 # obtained by:
 #
 # Σ = Var(y)
 #   = Var(Λη + ε)
 #   = Var(Λη) + Var(ε)
 #   = Λ Var(η) Λᵀ + Θ
 #   = Λ Var((I - B)⁻¹ ζ) Λᵀ + Θ
 #   = Λ (I - B)⁻¹ Var(ζ) (I - B)⁻¹ᵀ Λᵀ + Θ
 #   = Λ (I - B)⁻¹ Ψ (I - B)⁻¹ᵀ Λᵀ + Θ.
 #
 # Therefore, through the `lavaan` syntax we first indicated which specific
 # parameters in the matrices Λ, Ψ, and Θ we want to estimate. Then, using
 # `lavaan` syntax, we selected values for those parameters in line with our
 # hypothesis about what the values of those parameters are in the population.
 # Please note that this latter step is likely the part we are most wrong about.
 # Finally, using the values will pick, we can calculate the model-implied
 # covariance matrix Σ as shown above, the we can use it to take samples from a
 # multivariate normal distribution. All these steps are very conveniently
 # abstracted away in the `lavaan::simulateData` function (see Rosseel 2012).
 # Suppose we are interested in generating 1000 samples from the population
 # model. We can do this as indicated below.

 # Define the sample size.
 sample_size <- 200

 # Generate data from the population model.
 data <- simulateData(model = model_population, sample.nobs = sample_size)

 # View the first few rows of the data.
 head(data)

 # The second ingredient we need is a way we fit our model to the data. In simple
 # terms, the process of fitting a SEM model is a numerical optimization problem.
 # The gist (i.e., pun intended) is that we select different values for our model
 # parameters, recompute the model-implied covariance matrix Σ, and compare it to
 # the observed covariance matrix S (i.e., the one observed in the data). We then
 # adjust our model parameters to minimize the discrepancy between Σ and S. In
 # most cases, when using Maximum Likelihood Estimation, this discrepancy is
 # operationalized by a fit function defined as follows:
 #
 # F_ML = trace(S Σ(θ)⁻¹) - ln |S Σ(θ)⁻¹| - p,
 #
 # where:
 #    • trace(·) calculates the sum of the main diagonal of a matrix
 #    • ln is the natural logarithm and | · | indicates the matrix determinant
 #    • S is the sample (observed) covariance matrix
 #    • Σ(θ) is the model-implied covariance matrix
 #    • θ are the model parameters used to calculate Σ(θ)
 #    • p is the number of observed variables
 #
 # This expression is derived based on the assumption of multivariate normality:
 #
 #    y ~ 𝒩(0, Σ).
 #
 # In the less useful case that our model fits the data perfectly (i.e., when we
 # have zero degrees of freedom), we observe that S = Σ(θ) and the entire
 # expression above simplifies to zero. In practice, we are not interested in
 # such saturated models, but rather in models that are parsimonious and
 # generalizable.

 # This entire process of "trying out" different values for the model parameters
 # are calculating the fit function is repeated many times until we are
 # "satisfied" with the outcome (e.g., when further changes no longer yield
 # meaningful improvements). This process is in fact a numerical optimization,
 # problem and it is, again, conveniently abstracted away in the `lavaan::sem`
 # function. We can fit our model to the data as indicated below.

 # Fit the model to the simulated data.
 fit <- sem(model, data = data)

 # At this point, we can calculate a plethora of fit indices to evaluate how well
 # our model fits the data. The `lavaan::fitMeasures` function provides a
 # convenient way to calculate these fit indices. For instance, we can print the
 # estimated model parameters and inspect the fit indices as indicated below.

 # Summarize the model fit.
 summary(fit, fit.measures = TRUE, standardized = TRUE)

 # Provided that the sample size was "sufficiently large", we should expect to
 # see that the estimated model parameters are close to the true values specified
 # in the population model.

 # Let us also evaluate our model fit based on the heuristic we defined earlier.
 # We can do this by passing the fit measures to the `has_good_fit` function we
 # defined earlier.

 # First, extract the fit measures.
 fit_measures <- fitMeasures(fit)

 # Evaluate the model fit.
 has_good_fit(fit_measures)

 # We can already see from the `summary` output that the model fit is good.
 # Therefore, we can reasonably expect the `has_good_fit` function to return `1`.

 # Let us now try to repeat the process of another sample size. It would be a bit
 # error-prone to do this manually, so we can leverage `R` to automate this
 # process. To this, we can define a function. In general, you think of a
 # function as a process that converts an input into an output. In our case, the
 # input is the sample size, and the output is the result of the model fit
 # evaluation according to our heuristic. The process is the exact steps we
 # discussed above. The function does not change the process in any way, it
 # merely makes it more convenient to repeat.

 # Note. For the sake of clarity, we assume some arguments are defined in the
 # global environment. This is not a good practice, and you should always aim to
 # pass all necessary arguments to the function explicitly. However, this is but
 # a demonstration aimed for educational purposes.

 # Define a function to run the simulation for a single sample size.
 run_simulation <- function(sample_size) {
    # Attempt to fit the model.
    output <- tryCatch(
        expr = {
            # Simulate data.
            data <- lavaan::simulateData(model_population, sample.nobs = sample_size)

            # Fit the model.
            fit <- sem(model, data = data)

            # Extract the fit measures.
            fit_measures <- fitMeasures(fit)

            # Evaluate the model fit.
            result <- has_good_fit(fit_measures)

            # Return the result.
            return(result)
        },

        # In case of during the data generation process, estimation, or evaluation.
        error = function(e) {
            # Return `NA` (i.e., a missing value).
            return(NA)
        }
    )

    # Return the result.
    return(output)
 }

 # Run the simulation function for several sample sizes.
 run_simulation(50)
 run_simulation(100)
 run_simulation(150)


 # With this simulation function in hand, we can now run the simulation for
 # multiple sample sizes. Suppose, we are interested in evaluating the model fit
 # for sample sizes ranging from 50 to 200 in increments of 5. We can obtain
 # those sample sizes as indicated below.

 # Define a range of sample sizes.
 sample_sizes <- seq(50, 200, by = 5)

 # Print the sample sizes.
 print(sample_sizes)

 # In general, it is a good practice to run the simulation multiple times for
 # each sample size. This is because our simulation process is stochastic (i.e.,
 # our generated data is a sample from a random process). Therefore, the model
 # fit outcome is subject to random fluctuations. By running the simulation
 # multiple times, we can obtain a more stable estimate of the model fit for each
 # sample size. In this case, say we run the simulation 100 times for each sample
 # size.

 # Define the number of replications.
 replications <- 100

 # While at it, let us also define a progress bar to gauge the progress of the
 # simulation. This may come in handy, especially when running simulations that
 # take a long time to complete.

 # Define a progress bar.
 bar <- progress_bar$new(total = length(sample_sizes) * replications)

 # Run the simulation for each sample size selected
 results <- sapply(sample_sizes, function(sample_size) {
    # For each sample size, replicate the simulation.
    result <- replicate(
        # How many times to replicate the simulation.
        n = replications,

        # The expression to replicate (i.e., our simulation function)
        expr = {
            # Increment the progress bar.
            bar$tick()

            # Run the simulation function.
            run_simulation(sample_size)
        }
    )
    # Return the result.
    return(result)
 })

 # Once we are done with the simulation, also remove the progress bar.
 rm(bar)

 # Add the column names to the results for clarity.
 colnames(results) <- sample_sizes

 # Add the row names to the results for clarity.
 rownames(results) <- 1:replications

 # If we print the results, we can see that we have a matrix where each row
 # corresponds to a replication and each column corresponds to a sample size. The
 # values in the matrix are the results of the model fit evaluation according to
 # our heuristic. If we go ahead and calculate the means of each column, we can
 # obtain the proportion of good model fit for each sample size.

 # Take the column means.
 results_means <- colMeans(results, na.rm = TRUE)

 # Finally, we can wrap up the simulation by plotting the results. We can plot the
 # proportion of good model fit as a function of the sample size. We can also add
 # a horizontal line at 0.8 to indicate some threshold of interest.

 # Prepare the data.
 data <- data.frame(
    sample_size = sample_sizes,
    proportion_good_fit = results_means
 )

 # Plot the results.
 ggplot(data, aes(x = sample_size, y = proportion_good_fit)) +
    # Plot the line.
    geom_line(
        color = "gray",
        size = 1.5
    ) +

    # Add vertical lines from the points to each sample size on the x-axis.
    geom_segment(
        aes(xend = sample_size, yend = 0),
        color = "black",
        linetype = "dotted",
        size = 0.3
    ) +

    # Add the point values.
    geom_point(
        size = 2.5
    ) +

    # Add a horizontal line at 0.8.
    geom_hline(
        yintercept = 0.8,
        linetype = "dashed",
        color = "darkred"
    ) +

    # Restrict the y-axis.
    scale_y_continuous(
        limits = c(0, 1),
        breaks = seq(0, 1, by = 0.1)
    ) +

    # Add x-axis labels.
    scale_x_continuous(
        breaks = sample_sizes
    ) +

    # Lab names.
    labs(
        title = paste0("Proportion of \"Good Fit\" by Sample Size (based on ", replications, " replications)"),
        x = "Sample Size",
        y = "Proportion of Good Fit"
    ) +

    # Set the main theme.
    theme_bw() +

    # Adjust the theme.
    theme(
        # Increase font size of the labels.
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),

        # Hide major grid.
        panel.grid.major = element_blank(),

        # Rotate the x-axis labels.
        axis.text.x = element_text(angle = 45, hjust = 1),

        # Add some space between the plot and the labels.
        axis.title.y = element_text(
            margin = margin(t = 0, r = 5, b = 0, l = 0)
        ),
        axis.title.x = element_text(
            margin = margin(t = 15, r = 0, b = 0, l = 0)
        )
    )

 # Through this document, we have discussed the general workflow for a Structural
 # Equation Modeling (SEM) simulation, with an accent on sample size analysis. We
 # have also discussed the importance of defining what "good model fit" means,
 # and how to evaluate the model fit based on the heuristic we defined. We have
 # also discussed the importance of replicating our simulation to obtain a more
 # stable estimate of the model fit. Finally, we saw how to plot the results of
 # the simulation to visualize the impact of sample size on the model fit. Of
 # course, we only scratched the surface here, and there are many more aspects of
 # SEM simulations that we could explore, as well as ways to make these
 # simulations execute faster (e.g., using parallelization via the `parabar`
 # package). Additionally, for another interesting approach to calculating sample
 # requirements in SEM, readers may also consider consulting the paper Jak et al.
 # (2021). Nevertheless, I hope this document can serve as a gentle starting
 # point for anyone interested in SEM simulations.


 # References
 #
 # Jak, S., Jorgensen, T. D., Verdam, M. G. E., Oort, F. J., & Elffers, L.
 # (2021). Analytical power calculations for structural equation modeling: A
 # tutorial and Shiny app. Behavior Research Methods, 53(4), 1385–1406.
 # https://doi.org/10.3758/s13428-020-01479-0
 #
 # Rosseel, Y. (2012). Lavaan: An R Package for Structural Equation Modeling.
 # Journal of Statistical Software, 48(2). https://doi.org/10.18637/jss.v048.i02
	# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
	# Example of simple SEM simulation #
	# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
	# #
	# Author(s): #
	# - Mihai Constantin ([email protected]) #
	# #
	# File description: #
	# - This file contains the general workflow for a Structural Equation #
	# Modeling (SEM) simulation, with an accent on sample size analysis. #
	# #
	# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
	# #
	# License: #
	# - The contents of this file (i.e., including the comments, source #
	# code, and other text) by Mihai Constantin are licensed under the #
	# CC BY 4.0 license. See https://creativecommons.org/licenses/by/4.0. #
	# #
	# You can provide attribution by citing as follows: #
	# #
	# Constantin, M. (2025, March 5). Example of simple SEM simulation in `R` #
	# with an accent on the sample size. #
	# https://gist.github.com/mihaiconstantin/bd22dbb482457caabe2b0299d188e680 #
	# #
	# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #


	# Start by loading the necessary libraries.
	library(lavaan)
	library(ggplot2)
	library(progress)


	# Before starting any simulation, it helps to think of what the goal is. What do
	# we want to achieve with the simulation? In this case, we want to evaluate the
	# impact of sample size on the model fit. Note that using causal language is
	# warranted, since that is exactly what a simulation is, an experiment where we
	# manipulate the sample size to see how it affects the model fit. Therefore, we
	# our dependent variable is the model fit, and the independent variable is the
	# sample size.

	# That being said, let's start by determining what "good model fit" means. This
	# is a subjective decision, but it is important to make it explicit before
	# running the simulation. For instance, we could say that a good model fit is a
	# heuristic where the CFI (Comparative Fit Index) is greater than 0.95 and the
	# RMSEA (Root Mean Square Error of Approximation) is less than 0.05. It is
	# important here to let the research question be your guide, and the literature
	# your support.

	# Let us define a function that evaluates the fit based on the CFI and RMSEA.
	# This function takes in the `fit_measures` object obtained from the
	# `lavaan::fitMeasures` function and return one if the model has good fit
	# according to our heuristic, and zero otherwise.
	has_good_fit <- function(fit_measures) {
	# Extract the CFI.
	cfi <- fit_measures["cfi"]

	# Extract the RMSEA.
	rmsea <- fit_measures["rmsea"]

	# If both CFI and RMSEA are good.
	if (cfi >= 0.95 && rmsea < 0.05) {
	# Return `1`.
	return(1)
	}

	# Otherwise, return `0`.
	return(0)
	}

	# At this point, we are ready to specify the model we want to fit. For this
	# example, we will use a simple model full SEM with four latent variables. For
	# the measurement part of the model, we have four latent variables, each measured
	# by four indicators. The structural part of the model consists of two latent
	# variables, with the third regressed on the first two, and the fourth regressed
	# on the first three. The model is specified below.

	# Specify the model.
	model <- "
	# Measurement model.

	# Define the first latent variable.
	eta1 =~ y1 + y2 + y3 + y4

	# Define the second latent variable.
	eta2 =~ y5 + y6 + y7 + y8

	# Define the third latent variable.
	eta3 =~ y9 + y10 + y11 + y12

	# Define the fourth latent variable.
	eta4 =~ y13 + y14 + y15 + y16

	# Structural model.

	# Latent regressions.
	eta3 ~ eta1 + eta2
	eta4 ~ eta1 + eta2 + eta3
	"

	# Arguably the hardest part of the SEM simulation is specifying the population
	# model. You can think of the population model as the "true" model that you are
	# trying to estimate. In other words, if you had collected an infinite sample
	# size, then the model you would fit would be the population model. What makes
	# specifying the population model difficult is that you need to know the true
	# values of the parameters. In practice, this is rarely the case, and the
	# population model is often a best guess based on theory and previous research.

	# Specify the population model.
	model_population <- "
	# Measurement model.

	# Define the first latent variable.
	eta1 =~ 1 * y1 + 0.70 * y2 + 0.80 * y3 + 0.65 * y4

	# Define the second latent variable.
	eta2 =~ 1 * y5 + 0.65 * y6 + 0.75 * y7 + 0.80 * y8

	# Define the third latent variable.
	eta3 =~ 1 * y9 + 0.75 * y10 + 0.70 * y11 + 0.65 * y12

	# Define the fourth latent variable.
	eta4 =~ 1 * y13 + 0.75 * y14 + 0.65 * y15 + 0.80 * y16

	# Residual variances.
	y1 ~~ y1
	y2 ~~ (1 - 0.70^2) * y2
	y3 ~~ (1 - 0.80^2) * y3
	y4 ~~ (1 - 0.65^2) * y4

	y5 ~~ y5
	y6 ~~ (1 - 0.65^2) * y6
	y7 ~~ (1 - 0.75^2) * y7
	y8 ~~ (1 - 0.80^2) * y8

	y9 ~~ y9
	y10 ~~ (1 - 0.75^2) * y10
	y11 ~~ (1 - 0.70^2) * y11
	y12 ~~ (1 - 0.65^2) * y12

	y13 ~~ y13
	y14 ~~ (1 - 0.75^2) * y14
	y15 ~~ (1 - 0.65^2) * y15
	y16 ~~ (1 - 0.80^2) * y16

	# Structural model.

	# Latent regressions.
	eta3 ~ 0.50 * eta1 + 0.30 * eta2
	eta4 ~ 0.40 * eta1 + 0.25 * eta2 + 0.35 * eta3

	# Latent variances.

	# Fix exogenous factor variances to 1.
	eta1 ~~ 1 * eta1
	eta2 ~~ 1 * eta2

	# Fix (i.e., or freely estimate the endogenous factor variances if you may).
	eta3 ~~ 1 * eta3
	eta4 ~~ 1 * eta4

	# Latent covariances.
	eta1 ~~ 0.30 * eta2
	"

	# We need two more ingredients before we can inquire about the sample size.
	# First, we need to be able to generate data from our population model. In
	# simple terms, we need a way to construct the (population) model-implied
	# covariance matrix. Then, we can use this covariance matrix to draw samples
	# from a multivariate normal distribution. In less simple terms, recall that our
	# model for for the vector `y` of random variables is:
	#
	# y = Λη + ε,
	#
	# where η is further modeled as:
	#
	# η = Bη + ζ
	# = (I - B)⁻¹ ζ,
	#
	# with assumptions:
	#
	# y ~ 𝒩(0, Σ)
	# ζ ~ 𝒩(0, Ψ)
	# ε ~ 𝒩(0, Θ).
	#
	# Then, our model-implied covariance matrix is the variance of `y`, which is
	# obtained by:
	#
	# Σ = Var(y)
	# = Var(Λη + ε)
	# = Var(Λη) + Var(ε)
	# = Λ Var(η) Λᵀ + Θ
	# = Λ Var((I - B)⁻¹ ζ) Λᵀ + Θ
	# = Λ (I - B)⁻¹ Var(ζ) (I - B)⁻¹ᵀ Λᵀ + Θ
	# = Λ (I - B)⁻¹ Ψ (I - B)⁻¹ᵀ Λᵀ + Θ.
	#
	# Therefore, through the `lavaan` syntax we first indicated which specific
	# parameters in the matrices Λ, Ψ, and Θ we want to estimate. Then, using
	# `lavaan` syntax, we selected values for those parameters in line with our
	# hypothesis about what the values of those parameters are in the population.
	# Please note that this latter step is likely the part we are most wrong about.
	# Finally, using the values will pick, we can calculate the model-implied
	# covariance matrix Σ as shown above, the we can use it to take samples from a
	# multivariate normal distribution. All these steps are very conveniently
	# abstracted away in the `lavaan::simulateData` function (see Rosseel 2012).
	# Suppose we are interested in generating 1000 samples from the population
	# model. We can do this as indicated below.

	# Define the sample size.
	sample_size <- 200

	# Generate data from the population model.
	data <- simulateData(model = model_population, sample.nobs = sample_size)

	# View the first few rows of the data.
	head(data)

	# The second ingredient we need is a way we fit our model to the data. In simple
	# terms, the process of fitting a SEM model is a numerical optimization problem.
	# The gist (i.e., pun intended) is that we select different values for our model
	# parameters, recompute the model-implied covariance matrix Σ, and compare it to
	# the observed covariance matrix S (i.e., the one observed in the data). We then
	# adjust our model parameters to minimize the discrepancy between Σ and S. In
	# most cases, when using Maximum Likelihood Estimation, this discrepancy is
	# operationalized by a fit function defined as follows:
	#
	# F_ML = trace(S Σ(θ)⁻¹) - ln \|S Σ(θ)⁻¹\| - p,
	#
	# where:
	# • trace(·) calculates the sum of the main diagonal of a matrix
	# • ln is the natural logarithm and \| · \| indicates the matrix determinant
	# • S is the sample (observed) covariance matrix
	# • Σ(θ) is the model-implied covariance matrix
	# • θ are the model parameters used to calculate Σ(θ)
	# • p is the number of observed variables
	#
	# This expression is derived based on the assumption of multivariate normality:
	#
	# y ~ 𝒩(0, Σ).
	#
	# In the less useful case that our model fits the data perfectly (i.e., when we
	# have zero degrees of freedom), we observe that S = Σ(θ) and the entire
	# expression above simplifies to zero. In practice, we are not interested in
	# such saturated models, but rather in models that are parsimonious and
	# generalizable.

	# This entire process of "trying out" different values for the model parameters
	# are calculating the fit function is repeated many times until we are
	# "satisfied" with the outcome (e.g., when further changes no longer yield
	# meaningful improvements). This process is in fact a numerical optimization,
	# problem and it is, again, conveniently abstracted away in the `lavaan::sem`
	# function. We can fit our model to the data as indicated below.

	# Fit the model to the simulated data.
	fit <- sem(model, data = data)

	# At this point, we can calculate a plethora of fit indices to evaluate how well
	# our model fits the data. The `lavaan::fitMeasures` function provides a
	# convenient way to calculate these fit indices. For instance, we can print the
	# estimated model parameters and inspect the fit indices as indicated below.

	# Summarize the model fit.
	summary(fit, fit.measures = TRUE, standardized = TRUE)

	# Provided that the sample size was "sufficiently large", we should expect to
	# see that the estimated model parameters are close to the true values specified
	# in the population model.

	# Let us also evaluate our model fit based on the heuristic we defined earlier.
	# We can do this by passing the fit measures to the `has_good_fit` function we
	# defined earlier.

	# First, extract the fit measures.
	fit_measures <- fitMeasures(fit)

	# Evaluate the model fit.
	has_good_fit(fit_measures)

	# We can already see from the `summary` output that the model fit is good.
	# Therefore, we can reasonably expect the `has_good_fit` function to return `1`.

	# Let us now try to repeat the process of another sample size. It would be a bit
	# error-prone to do this manually, so we can leverage `R` to automate this
	# process. To this, we can define a function. In general, you think of a
	# function as a process that converts an input into an output. In our case, the
	# input is the sample size, and the output is the result of the model fit
	# evaluation according to our heuristic. The process is the exact steps we
	# discussed above. The function does not change the process in any way, it
	# merely makes it more convenient to repeat.

	# Note. For the sake of clarity, we assume some arguments are defined in the
	# global environment. This is not a good practice, and you should always aim to
	# pass all necessary arguments to the function explicitly. However, this is but
	# a demonstration aimed for educational purposes.

	# Define a function to run the simulation for a single sample size.
	run_simulation <- function(sample_size) {
	# Attempt to fit the model.
	output <- tryCatch(
	expr = {
	# Simulate data.
	data <- lavaan::simulateData(model_population, sample.nobs = sample_size)

	# Fit the model.
	fit <- sem(model, data = data)

	# Extract the fit measures.
	fit_measures <- fitMeasures(fit)

	# Evaluate the model fit.
	result <- has_good_fit(fit_measures)

	# Return the result.
	return(result)
	},

	# In case of during the data generation process, estimation, or evaluation.
	error = function(e) {
	# Return `NA` (i.e., a missing value).
	return(NA)
	}
	)

	# Return the result.
	return(output)
	}

	# Run the simulation function for several sample sizes.
	run_simulation(50)
	run_simulation(100)
	run_simulation(150)


	# With this simulation function in hand, we can now run the simulation for
	# multiple sample sizes. Suppose, we are interested in evaluating the model fit
	# for sample sizes ranging from 50 to 200 in increments of 5. We can obtain
	# those sample sizes as indicated below.

	# Define a range of sample sizes.
	sample_sizes <- seq(50, 200, by = 5)

	# Print the sample sizes.
	print(sample_sizes)

	# In general, it is a good practice to run the simulation multiple times for
	# each sample size. This is because our simulation process is stochastic (i.e.,
	# our generated data is a sample from a random process). Therefore, the model
	# fit outcome is subject to random fluctuations. By running the simulation
	# multiple times, we can obtain a more stable estimate of the model fit for each
	# sample size. In this case, say we run the simulation 100 times for each sample
	# size.

	# Define the number of replications.
	replications <- 100

	# While at it, let us also define a progress bar to gauge the progress of the
	# simulation. This may come in handy, especially when running simulations that
	# take a long time to complete.

	# Define a progress bar.
	bar <- progress_bar$new(total = length(sample_sizes) * replications)

	# Run the simulation for each sample size selected
	results <- sapply(sample_sizes, function(sample_size) {
	# For each sample size, replicate the simulation.
	result <- replicate(
	# How many times to replicate the simulation.
	n = replications,

	# The expression to replicate (i.e., our simulation function)
	expr = {
	# Increment the progress bar.
	bar$tick()

	# Run the simulation function.
	run_simulation(sample_size)
	}
	)
	# Return the result.
	return(result)
	})

	# Once we are done with the simulation, also remove the progress bar.
	rm(bar)

	# Add the column names to the results for clarity.
	colnames(results) <- sample_sizes

	# Add the row names to the results for clarity.
	rownames(results) <- 1:replications

	# If we print the results, we can see that we have a matrix where each row
	# corresponds to a replication and each column corresponds to a sample size. The
	# values in the matrix are the results of the model fit evaluation according to
	# our heuristic. If we go ahead and calculate the means of each column, we can
	# obtain the proportion of good model fit for each sample size.

	# Take the column means.
	results_means <- colMeans(results, na.rm = TRUE)

	# Finally, we can wrap up the simulation by plotting the results. We can plot the
	# proportion of good model fit as a function of the sample size. We can also add
	# a horizontal line at 0.8 to indicate some threshold of interest.

	# Prepare the data.
	data <- data.frame(
	sample_size = sample_sizes,
	proportion_good_fit = results_means
	)

	# Plot the results.
	ggplot(data, aes(x = sample_size, y = proportion_good_fit)) +
	# Plot the line.
	geom_line(
	color = "gray",
	size = 1.5
	) +

	# Add vertical lines from the points to each sample size on the x-axis.
	geom_segment(
	aes(xend = sample_size, yend = 0),
	color = "black",
	linetype = "dotted",
	size = 0.3
	) +

	# Add the point values.
	geom_point(
	size = 2.5
	) +

	# Add a horizontal line at 0.8.
	geom_hline(
	yintercept = 0.8,
	linetype = "dashed",
	color = "darkred"
	) +

	# Restrict the y-axis.
	scale_y_continuous(
	limits = c(0, 1),
	breaks = seq(0, 1, by = 0.1)
	) +

	# Add x-axis labels.
	scale_x_continuous(
	breaks = sample_sizes
	) +

	# Lab names.
	labs(
	title = paste0("Proportion of \"Good Fit\" by Sample Size (based on ", replications, " replications)"),
	x = "Sample Size",
	y = "Proportion of Good Fit"
	) +

	# Set the main theme.
	theme_bw() +

	# Adjust the theme.
	theme(
	# Increase font size of the labels.
	axis.text = element_text(size = 12),
	axis.title = element_text(size = 14),

	# Hide major grid.
	panel.grid.major = element_blank(),

	# Rotate the x-axis labels.
	axis.text.x = element_text(angle = 45, hjust = 1),

	# Add some space between the plot and the labels.
	axis.title.y = element_text(
	margin = margin(t = 0, r = 5, b = 0, l = 0)
	),
	axis.title.x = element_text(
	margin = margin(t = 15, r = 0, b = 0, l = 0)
	)
	)

	# Through this document, we have discussed the general workflow for a Structural
	# Equation Modeling (SEM) simulation, with an accent on sample size analysis. We
	# have also discussed the importance of defining what "good model fit" means,
	# and how to evaluate the model fit based on the heuristic we defined. We have
	# also discussed the importance of replicating our simulation to obtain a more
	# stable estimate of the model fit. Finally, we saw how to plot the results of
	# the simulation to visualize the impact of sample size on the model fit. Of
	# course, we only scratched the surface here, and there are many more aspects of
	# SEM simulations that we could explore, as well as ways to make these
	# simulations execute faster (e.g., using parallelization via the `parabar`
	# package). Additionally, for another interesting approach to calculating sample
	# requirements in SEM, readers may also consider consulting the paper Jak et al.
	# (2021). Nevertheless, I hope this document can serve as a gentle starting
	# point for anyone interested in SEM simulations.


	# References
	#
	# Jak, S., Jorgensen, T. D., Verdam, M. G. E., Oort, F. J., & Elffers, L.
	# (2021). Analytical power calculations for structural equation modeling: A
	# tutorial and Shiny app. Behavior Research Methods, 53(4), 1385–1406.
	# https://doi.org/10.3758/s13428-020-01479-0
	#
	# Rosseel, Y. (2012). Lavaan: An R Package for Structural Equation Modeling.
	# Journal of Statistical Software, 48(2). https://doi.org/10.18637/jss.v048.i02