Although finite mixture models have received considerable attention, particularly in the social and behavioral sciences, an alternative method for creating homogeneous groups, structural equation model trees (Brandmaier, von Oertzen, McArdle, & Lindenberger, 2013), is a recent development that has received much less application and consideration. It is our aim to compare and contrast these methods for uncovering sample heterogeneity. We illustrate the use of these methods with longitudinal reading achievement data collected as part of the Early Childhood Longitudinal Study–Kindergarten Cohort. We present the use of structural equation model trees as an alternative framework that does not assume the classes are latent and uses observed covariates to derive their structure. We consider these methods as complementary and discuss their respective strengths and limitations for creating homogeneous groups.
This paper proposes a novel exploratory approach for assessing how the effects of level-2 predictors differ across level-1 units. Multilevel regression mixture models are used to identify latent classes at level 1 that differ in the effect of one or more level-2 predictors. Monte Carlo simulations are used to demonstrate the approach with different sample sizes and to demonstrate the consequences of constraining one of the random effects to zero. An application of the method to evaluating heterogeneity in the effects of classroom practices on students is used to show the types of research questions that can be answered with this method and the issues faced when estimating multilevel regression mixtures.
We investigate a method to estimate the combined effect of multiple continuous/ordinal mediators on a binary outcome: 1) fit a structural equation model with probit link for the outcome and identity/probit link for continuous/ordinal mediators, 2) predict potential outcome probabilities, and 3) compute natural direct and indirect effects. Step 2 involves rescaling the latent continuous variable underlying the outcome to address residual mediator variance/covariance. We evaluate the estimation of risk-difference- and risk-ratio-based effects (RDs, RRs) using the ML, WLSMV and Bayes estimators in Mplus. Across most variations in path-coefficient and mediator-residual-correlation signs and strengths, and confounding situations investigated, the method performs well with all estimators, but favors ML/WLSMV for RDs with continuous mediators, and Bayes for RRs with ordinal mediators. Bayes outperforms WLSMV/ML regardless of mediator type when estimating RRs with small potential outcome probabilities and in two other special cases. An adolescent alcohol prevention study is used for illustration.
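The three steps above can be sketched numerically. The snippet below is a minimal illustration, not the study's actual estimation: all path coefficients are hypothetical, and only a single continuous mediator is used. It shows how a potential outcome probability is obtained on the probit scale, with the rescaling in step 2 handled by integrating out the mediator's residual variance.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical path coefficients (illustration only, not from the study):
a  = 0.5   # exposure -> mediator
b  = 0.4   # mediator -> latent outcome (probit scale)
cp = 0.2   # direct exposure -> latent outcome path
b0 = -0.5  # outcome intercept on the probit scale
m0 = 0.0   # mediator intercept
sd_m = 1.0 # residual SD of the mediator

def pot_prob(x, x_star):
    """P(Y(x, M(x_star)) = 1): exposure set to x on the outcome path while
    the mediator follows its distribution under x_star. Dividing by
    sqrt(1 + b^2 * sd_m^2) integrates out the mediator's residual
    variance, in the spirit of the rescaling described in step 2."""
    num = b0 + cp * x + b * (m0 + a * x_star)
    return phi(num / sqrt(1.0 + b**2 * sd_m**2))

p11, p10, p00 = pot_prob(1, 1), pot_prob(1, 0), pot_prob(0, 0)
nie_rd = p11 - p10   # natural indirect effect, risk difference
nde_rd = p10 - p00   # natural direct effect, risk difference
nie_rr = p11 / p10   # natural indirect effect, risk ratio
```

With these hypothetical values, the direct and indirect risk differences sum to the total risk difference p11 - p00, as the effect decomposition requires.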
The scientific literature consistently supports a negative relationship between adolescent depression and educational achievement, but we are far less certain about the causal determinants of this robust association. In this paper we present multivariate data from a longitudinal cohort-sequential study of high school students in Hawai‘i (following McArdle, 2009; McArdle, Johnson, Hishinuma, Miyamoto, & Andrade, 2001). We first describe the full set of data on academic achievements and self-reported depression. We then carry out and present a progression of analyses in an effort to determine the accuracy, size, and direction of the dynamic relationships between depression and academic achievement, including gender and ethnic group differences. We apply three recently available forms of longitudinal data analysis: (1) Dealing with incomplete data: we apply these methods to cohort-sequential data with relatively large blocks of data that are incomplete for a variety of reasons (Little & Rubin, 1987; McArdle & Hamagami, 1992). (2) Ordinal measurement models (Muthén & Muthén, 2006): we use a variety of statistical and psychometric measurement models, including ordinal measurement models, to help clarify the strongest patterns of influence. (3) Dynamic structural equation models (DSEMs; McArdle, 2009). We found that the DSEM approach taken here was viable for a large amount of data, that the assumption of an invariant metric over time was reasonable for ordinal estimates, and that there were very few group differences in dynamic systems. We conclude that our dynamic evidence suggests that depression affects academic achievement, and not the other way around. We further discuss the methodological implications of the study.
The latent growth curve modeling (LGCM) approach has been increasingly utilized to investigate longitudinal mediation. However, little is known about the accuracy of the estimates and statistical power when mediation is evaluated in the LGCM framework. A simulation study was conducted to address these issues under various conditions, including sample size, effect size of the mediated effect, number of measurement occasions, and R² of the measured variables. In general, the results showed that relatively large samples were needed to accurately estimate mediated effects and to attain adequate statistical power when testing mediation in the LGCM framework. Guidelines for designing studies to examine longitudinal mediation and ways to improve the accuracy of the estimates and statistical power are discussed.
A new method is proposed that extends the use of regularization in both lasso and ridge regression to structural equation models. The method is termed regularized structural equation modeling (RegSEM). RegSEM penalizes specific parameters in structural equation models, with the goal of creating simpler, easier-to-understand models. Although regularization has gained wide adoption in regression, very little has transferred to models with latent variables. By adding penalties to specific parameters in a structural equation model, researchers gain a high level of flexibility in reducing model complexity, overcoming poorly fitting models, and creating models that are more likely to generalize to new samples. The proposed method was evaluated through a simulation study, two illustrative examples involving a measurement model, and one empirical example involving the structural part of the model to demonstrate RegSEM’s utility.
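The core of the lasso penalty that RegSEM carries over from regression can be illustrated with the soft-thresholding operator, which shrinks small estimates exactly to zero. This is a minimal sketch of that operator applied to a vector of hypothetical parameter estimates (e.g., cross-loadings one might penalize); it is not the RegSEM fitting algorithm itself, which penalizes parameters inside the SEM fit function.

```python
def soft_threshold(b, lam):
    """Lasso-style shrinkage: pull an estimate toward zero by lam,
    setting it exactly to zero when |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

# Hypothetical unpenalized parameter estimates (illustration only):
estimates = [2.0, 0.3, -1.2, 0.05, -0.4]
lam = 0.5
penalized = [soft_threshold(b, lam) for b in estimates]
# Small estimates are zeroed out; large ones are shrunk toward zero.
```

Zeroing small parameters is what produces the simpler, more interpretable models the abstract describes.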
In a recent article, Castro-Schilo, Widaman, and Grimm (2013) compared different approaches for relating multitrait-multimethod (MTMM) data to external variables. Castro-Schilo et al. reported that estimated associations with external variables were in part biased when either the Correlated Traits-Correlated Uniqueness (CT-CU) or Correlated Traits-Correlated (Methods – 1) [CT-C(M – 1)] models were fit to data generated from the Correlated Traits-Correlated Methods (CT-CM) model, whereas the data-generating CT-CM model accurately reproduced these associations. Castro-Schilo et al. argued that the CT-CM model adequately represents the data-generating mechanism in MTMM studies, whereas the CT-CU and CT-C(M – 1) models do not fully represent the MTMM structure. In this comment, we question whether the CT-CM model is more plausible as a data-generating model for MTMM data than the CT-C(M – 1) model. We show that the CT-C(M – 1) model can be formulated as a reparameterization of a basic MTMM true score model that leads to a meaningful and parsimonious representation of MTMM data. We advocate the use of CFA-MTMM models in which latent trait, method, and error variables are explicitly and constructively defined based on psychometric theory.
Stage-sequential (or multiphase) growth mixture models are useful for delineating potentially different growth processes across multiple phases over time and for determining whether latent subgroups exist within a population. These models are increasingly important as social and behavioral scientists are interested in better understanding change processes across distinctively different phases, such as before and after an intervention. One of the less understood issues related to the use of growth mixture models is how to decide on the optimal number of latent classes. The performance of several traditionally used information criteria for determining the number of classes is examined through a Monte Carlo simulation study in single- and multi-phase growth mixture models. For a thorough examination, the simulation was carried out from two perspectives: the models and the factors. The simulation in terms of the models examined the overall performance of the information criteria within and across the models, while the simulation in terms of the factors examined the effect of each simulation factor on the performance of the information criteria, holding the other factors constant. The findings not only support the sample size adjusted BIC (ADBIC) as a good choice under more realistic conditions, such as low class separation, smaller sample size, and/or missing data, but also increase understanding of the performance of information criteria in single- and multi-phase growth mixture models.
A two-stage procedure for estimation and testing of observed measure correlations in the presence of missing data is discussed. The approach uses maximum likelihood for estimation and the false discovery rate concept for correlation testing. The method can be utilized in initial, exploration-oriented empirical studies with missing data, where it is of interest to estimate manifest variable interrelationship indexes and test hypotheses about their population values. The procedure is also applicable when the underlying missing at random assumption is violated, via inclusion of auxiliary variables. The outlined approach is illustrated with data from an aging research study.
Despite recent methodological advances in latent class analysis (LCA) and a rapid increase in its application in behavioral research, complex research questions that include latent class variables often must be addressed by classifying individuals into latent classes and treating class membership as known in a subsequent analysis. Traditional approaches to classifying individuals based on posterior probabilities are known to produce attenuated estimates in the analytic model. We propose the use of a more inclusive LCA to generate posterior probabilities; this LCA includes additional variables present in the analytic model. A motivating empirical demonstration is presented, followed by a simulation study to assess the performance of the proposed strategy. Results show that with sufficient measurement quality or sample size, the proposed strategy reduces or eliminates bias.
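The posterior probabilities at the center of this classify-analyze problem follow directly from Bayes' rule under the LCA assumption of local independence. The sketch below uses a hypothetical two-class model with three binary items (all parameter values invented for illustration); an inclusive LCA, as proposed, would additionally bring variables from the analytic model into this measurement model before computing the posteriors.

```python
from math import prod

# Hypothetical 2-class LCA with three binary items (illustration only):
prevalence = [0.6, 0.4]                  # P(class)
# P(item = 1 | class): rows = classes, columns = items
item_probs = [[0.9, 0.8, 0.85],
              [0.2, 0.3, 0.25]]

def posterior(responses):
    """Posterior class probabilities for one response pattern,
    assuming local independence of items within class."""
    joint = []
    for c, pi in enumerate(prevalence):
        lik = prod(p if r == 1 else 1 - p
                   for p, r in zip(item_probs[c], responses))
        joint.append(pi * lik)
    total = sum(joint)
    return [j / total for j in joint]

post = posterior([1, 1, 1])   # all-"yes" pattern favors class 1
```

Assigning each individual to the class with the largest posterior is the traditional approach whose attenuation the abstract describes.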
In longitudinal research, interest often centers on individual trajectories of change over time. When data are missing, a concern is whether they are systematically missing as a function of the individual trajectories. Such a missing data process, termed random coefficient-dependent missingness, is statistically non-ignorable and can bias parameter estimates obtained from conventional growth models that assume missing data are missing at random. This paper describes a shared-parameter mixture model (SPMM) for testing the sensitivity of growth model parameter estimates to a random coefficient-dependent missingness mechanism. Simulations show that the SPMM recovers trajectory estimates as well as or better than a standard growth model across a range of missing data conditions. The paper concludes with practical advice for longitudinal data analysts.
Little research has examined factors influencing statistical power to detect the correct number of latent classes using latent profile analysis (LPA). This simulation study examined power related to inter-class distance between latent classes given the true number of classes, sample size, and number of indicators. Seven model selection methods were evaluated. None had adequate power to select the correct number of classes with a small (Cohen’s d = .2) or medium (d = .5) degree of separation. With a very large degree of separation (d = 1.5), the Lo-Mendell-Rubin test (LMR), adjusted LMR, bootstrap likelihood-ratio test, BIC, and sample-size adjusted BIC were good at selecting the correct number of classes. However, with a large degree of separation (d = .8), power depended on number of indicators and sample size. The AIC and entropy poorly selected the correct number of classes, regardless of degree of separation, number of indicators, or sample size.
The factor mixture model (FMM) uses a hybrid of both categorical and continuous latent variables. The FMM is a good model for the underlying structure of psychopathology because the use of both categorical and continuous latent variables allows the structure to be simultaneously categorical and dimensional. This is useful because both diagnostic class membership and the range of severity within and across diagnostic classes can be modeled concurrently. While the conceptualization of the FMM has been explained in the literature, the use of the FMM is still not prevalent. One reason is that there is little research about how such models should be applied in practice and, once a well-fitting model is obtained, how it should be interpreted. In this paper, the FMM is explored by studying a real data example on conduct disorder. Through this example, the paper aims to explain the different formulations of the FMM, the various steps in building an FMM, and how to decide between an FMM and alternative models.
Missing data are common in studies that rely on multiple informant data to evaluate relationships among variables for distinguishable individuals clustered within groups. Estimation of structural equation models using raw data allows for incomplete data, and so all groups may be retained even if only one member of a group contributes data. Statistical inference is based on the assumption that data are missing completely at random or missing at random; importantly, whether or not data are missing is assumed to be independent of the missing values themselves. A saturated correlates model, which incorporates correlates of the missingness or of the missing data into an analysis, and multiple imputation, which may also use such correlates, offer advantages over the standard implementation of SEM when data are not missing at random, because these approaches may yield a data analysis problem for which the missingness is ignorable. This paper considers these approaches in an analysis of family data to assess the sensitivity of parameter estimates to assumptions about missing data, a strategy that may be easily implemented using SEM software.
First-order latent growth curve models (FGMs) estimate change based on a single observed variable and are widely used in longitudinal research. Despite significant advantages, second-order latent growth curve models (SGMs), which use multiple indicators, are rarely used in practice, and not all aspects of these models are widely understood. In this article, our goal is to contribute to a deeper understanding of theoretical and practical differences between FGMs and SGMs. We define the latent variables in FGMs and SGMs explicitly on the basis of latent state-trait (LST) theory and discuss insights that arise from this approach. We show that FGMs imply a strict trait-like conception of the construct under study, whereas SGMs allow for both trait and state components. Based on a simulation study and empirical applications to the CES-D depression scale (Radloff, 1977), we illustrate that, as an important practical consequence, FGMs yield biased reliability estimates whenever constructs contain state components, whereas reliability estimates based on SGMs were found to be accurate. Implications of the state-trait distinction for the measurement of change via latent growth curve models are discussed.
Selecting the number of different classes that will be assumed to exist in the population is an important step in latent class analysis (LCA). The bootstrap likelihood ratio test (BLRT) provides a data-driven way to evaluate the relative adequacy of a (K − 1)-class model compared to a K-class model. However, very little is known about how to predict the power or the required sample size for the BLRT in LCA. Based on extensive Monte Carlo simulations, we provide practical effect size measures and power curves that can be used to predict power for the BLRT in LCA given a proposed sample size and a set of hypothesized population parameters. Estimated power curves and tables provide guidance for researchers wishing to size a study to have sufficient power to detect hypothesized underlying latent classes.
The analysis of longitudinal data collected from non-exchangeable dyads presents a challenge for applied researchers for various reasons. This paper introduces the Dyadic Curve-of-Factors Model (D-COFM), which extends the Curve-of-Factors Model (COFM) proposed by McArdle (1988) for use with non-exchangeable dyadic data. The D-COFM overcomes problems with modeling composite scores across time and instead permits examination of the growth in latent constructs over time. The D-COFM also appropriately models the interdependency among non-exchangeable dyads. Different parameterizations of the D-COFM are illustrated and discussed using a real dataset to aid applied researchers when analyzing dyadic longitudinal data.
The integration of modern methods for causal inference with latent class analysis (LCA) allows social, behavioral, and health researchers to address important questions about the determinants of latent class membership. In the present article, two propensity score techniques, matching and inverse propensity weighting, are demonstrated for conducting causal inference in LCA. The different causal questions that can be addressed with these techniques are carefully delineated. An empirical analysis based on data from the National Longitudinal Survey of Youth 1979 is presented, where college enrollment is examined as the exposure (i.e., treatment) variable and its causal effect on adult substance use latent class membership is estimated. A step-by-step procedure for conducting causal inference in LCA, including multiple imputation of missing data on the confounders, exposure variable, and multivariate outcome, is included. Sample syntax for carrying out the analysis using SAS and R is given in an appendix.
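The logic of inverse propensity weighting can be sketched on simulated toy data. In this illustration, all data-generating values are hypothetical and the true propensity score is used directly for simplicity; in practice, as in the article, propensities would be estimated from the confounders (e.g., via logistic regression), and the outcome here is a simple continuous variable rather than latent class membership.

```python
import random

rng = random.Random(1)

# Toy data (hypothetical): one confounder z affects both the exposure t
# and the outcome y; the true causal effect of t on y is 2.0.
n = 5000
records = []
for _ in range(n):
    z = rng.gauss(0.0, 1.0)
    e = min(max(0.5 + 0.15 * z, 0.1), 0.9)   # true propensity P(t=1 | z)
    t = 1 if rng.random() < e else 0
    y = 2.0 * t + 1.5 * z + rng.gauss(0.0, 1.0)
    records.append((t, y, e))

# Inverse propensity weighting: weight each unit by the inverse
# probability of the exposure it actually received, then compare
# weighted outcome means across exposure groups.
treated = sum(t * y / e for t, y, e in records) / n
control = sum((1 - t) * y / (1 - e) for t, y, e in records) / n
ate_ipw = treated - control

# Unadjusted comparison, confounded by z:
naive = (sum(y for t, y, _ in records if t) /
         sum(1 for t, y, _ in records if t) -
         sum(y for t, y, _ in records if not t) /
         sum(1 for t, y, _ in records if not t))
```

Weighting removes the confounding that biases the naive group comparison, which is the role inverse propensity weighting plays in the LCA analysis described above.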
Although prediction of class membership from observed variables in latent class analysis is well understood, predicting an observed distal outcome from latent class membership is more complicated. A flexible model-based approach is proposed to empirically derive and summarize the class-dependent density functions of distal outcomes with categorical, continuous, or count distributions. A Monte Carlo simulation study is conducted to compare the performance of the new technique to two commonly used classify-analyze techniques: maximum-probability assignment and multiple pseudo-class draws. Simulation results show that the model-based approach produces substantially less biased estimates of the effect compared to either classify-analyze technique, particularly when the association between the latent class variable and the distal outcome is strong. In addition, we show that only the model-based approach is consistent. The approach is demonstrated empirically: latent classes of adolescent depression are used to predict smoking, grades, and delinquency. SAS syntax for implementing this approach using PROC LCA and a corresponding macro are provided.
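The multiple pseudo-class draws comparator can be illustrated on simulated toy data. Everything below is hypothetical (class separation, posterior quality, number of draws) and is meant only to show the mechanics: class labels are drawn repeatedly from each unit's posterior, class-specific outcome means are computed per draw, and results are averaged over draws.

```python
import random
from statistics import mean

rng = random.Random(7)

# Hypothetical toy data: units with a posterior probability of class 1
# (vs. class 0) and an observed continuous distal outcome.
n = 60
units = []
for _ in range(n):
    in_class1 = rng.random() < 0.5
    p1 = 0.9 if in_class1 else 0.1          # informative but imperfect
    y = rng.gauss(5.0 if in_class1 else 0.0, 0.5)
    units.append((p1, y))

def pseudo_class_draws(units, n_draws=20):
    """Draw a class label for every unit from its posterior, compute
    class-specific outcome means per draw, then average over draws."""
    means0, means1 = [], []
    for _ in range(n_draws):
        y0, y1 = [], []
        for p1, y in units:
            (y1 if rng.random() < p1 else y0).append(y)
        means0.append(mean(y0))
        means1.append(mean(y1))
    return mean(means0), mean(means1)

m0, m1 = pseudo_class_draws(units)
# Misclassified draws pull the class means toward each other, so the
# recovered gap understates the true class difference of 5.0.
```

This attenuation of the class difference is the bias the simulation study documents for classify-analyze techniques when class-outcome associations are strong.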
In recent years the use of the Latent Curve Model (LCM) among researchers in the social sciences has increased noticeably, probably thanks to contemporary software developments and to the availability of specialized literature. Extensions of the LCM, like the Latent Change Score Model (LCSM), have also increased in popularity. At the same time, the R statistical language and environment, which is open source and runs on several operating systems, is becoming a leading software for applied statistics. We show how to estimate both the LCM and LCSM with the sem, lavaan, and OpenMx packages of the R software. We also illustrate how to read in, summarize, and plot data prior to analyses. Examples are provided using data previously analyzed by Ferrer, Hamagami, and McArdle (2004). The data and all scripts used here are available on the first author’s website.
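The quantity an LCM recovers, the mean intercept and slope of individual trajectories, can be sketched without SEM software. The snippet below is a stdlib Python analogue rather than the R/lavaan estimation the tutorial covers: it simulates linear growth with hypothetical parameter values and recovers the mean trajectory by two-stage per-person least squares instead of maximum likelihood.

```python
import random
from statistics import mean

rng = random.Random(42)

# Simulate linear latent growth (all parameter values hypothetical):
# y_it = intercept_i + slope_i * t + noise, for occasions t = 0..4
true_int, true_slope = 10.0, 2.0
n, times = 200, [0, 1, 2, 3, 4]
t_bar = mean(times)
sxx = sum((t - t_bar) ** 2 for t in times)

intercepts, slopes = [], []
for _ in range(n):
    bi = true_int + rng.gauss(0.0, 1.0)     # person-specific intercept
    si = true_slope + rng.gauss(0.0, 0.3)   # person-specific slope
    ys = [bi + si * t + rng.gauss(0.0, 0.5) for t in times]
    # Per-person OLS fit of this trajectory:
    y_bar = mean(ys)
    slope_i = sum((t - t_bar) * (y - y_bar)
                  for t, y in zip(times, ys)) / sxx
    slopes.append(slope_i)
    intercepts.append(y_bar - slope_i * t_bar)

est_int, est_slope = mean(intercepts), mean(slopes)
```

An LCM fit in lavaan or OpenMx estimates these same growth-factor means (plus their variances and covariance) in one step by maximum likelihood, which is what makes the SEM formulation attractive.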