Overview
Partial least squares (PLS) regression/path analysis is an alternative to OLS regression, canonical correlation, or structural equation modeling (SEM) for analysis of systems of independent and response variables. In fact, PLS is sometimes called "component-based SEM," in contrast to the usual covariance-based structural equation modeling. PLS is a predictive technique which can handle many independent variables, even when predictors display multicollinearity. Like canonical correlation or multivariate GLM, it can also relate the set of independent variables to a set of multiple dependent (response) variables. However, PLS is less than satisfactory as an explanatory technique because it is low in power to filter out variables of minor causal importance (Tobias, 1997: 1). The advantages of PLS include the ability to model multiple dependents as well as multiple independents; the ability to handle multicollinearity among the independents; robustness in the face of data noise and missing data; and the creation of independent latents directly on the basis of crossproducts involving the response variable(s), making for stronger predictions. Disadvantages of PLS include greater difficulty in interpreting the loadings of the independent latent variables (which are based on crossproduct relations with the response variables, not, as in common factor analysis, on covariances among the manifest independents) and the fact that, because the distributional properties of estimates are not known, the researcher cannot assess significance except through bootstrap induction. Overall, the mix of advantages and disadvantages means PLS is favored as a predictive technique and not as an interpretive technique, except for exploratory analysis as a prelude to an interpretive technique such as multiple linear regression or covariance-based structural equation modeling.
Henseler, Ringle, and Sinkovics (2009: 282) thus state, "PLS path modeling is recommended in an early stage of theoretical development in order to test and validate exploratory models." Though developed by Herman Wold (Wold, 1981, 1985) for econometrics, PLS first gained popularity in chemometric research and later industrial applications. It has since spread to research in education (ex., Campbell & Yates, 2011), marketing (ex., Albers, 2009, cites PLS as the method of choice in success factors marketing research), and the social sciences (ex., Jacobs et al., 2011). PLS may be implemented as a regression model, predicting one or more dependents from a set of one or more independents; or it can be implemented as a path model, akin to structural equation modeling. PLS is implemented as a regression model by SPSS and by SAS's PROC PLS. SmartPLS is the most prevalent implementation as a path model.
Note that PLS factors are not the same as the latent variables in common factor analysis in the usual covariance-based structural equation modeling (SEM). Where SEM is based on common (principal) factor analysis, PLS is based on principal component analysis (PCA; see the comparison in the section on types of factoring on the factor analysis page). One school of thought reserves the term "latent variable" for those created based on covariances, as in SEM, referring to PLS factors as "weighted composites." The term "composite" refers to the fact that PLS factors are estimated as exact linear combinations of their indicators. True latent variables in SEM, in contrast, are computed in a manner which reflects the covariation of their indicators (McDonald 1996). While a PLS path model of causal relations among composites may approximate a SEM path model of causal relations among latent variables (McDonald, 1996) the two are not equivalent and under certain circumstances may diverge considerably. Only when the PLS weight vector is proportional to the SEM common factor loading vector will a SEM and PLS factor be similar (see Schneeweiss 1993).
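To make the "weighted composite" idea concrete, the following minimal sketch computes a composite score as an exact linear combination of indicator values. The weights and data are hypothetical and chosen only for illustration; they are not taken from any PLS output in this document.

```python
def composite_scores(rows, weights):
    """Return one composite score per case: the weighted sum of its indicators."""
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

# Hypothetical standardized indicator values for three cases (two indicators each).
cases = [
    [1.0, -0.5],
    [0.0, 2.0],
    [-1.0, 0.5],
]
# Hypothetical outer weights for the two indicators.
weights = [0.6, 0.4]

scores = composite_scores(cases, weights)
print(scores)
```

Because each score is fully determined by the indicators and weights, no error term is involved, which is the sense in which a composite differs from a common-factor latent variable.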
Categorical variable coding. Both nominal and ordinal variables are treated the same, as categorical variables, by SPSS algorithms. Dummy variable coding is used. For a categorical variable with c categories, the first is coded (1, 0, 0,...0), where the last 0 is for the cth category. The last category is coded (0, 0, 0, .... 1). In the PLS dialog, the researcher specifies which dummy variable, representing the desired reference category, is to be omitted from the model.
When prompted at the start of the PLS run, click the "Define Variable Properties" button. This first opens a dialog in which the user enters the variables to be used, and then proceeds to the "Define Variable Properties" dialog, shown above. SPSS scans the first 200 (default) cases and makes estimates of the measurement level, classifying variables as nominal, ordinal, or scale (interval or ratio). Symbols in front of variable names in the "Scanned variable list" on the left show the assigned measurement levels, though these initial assignments can be changed in the main dialog, using the drop-down menu for "Measurement Level". It is a good idea to check proper assignment of missing value codes and other settings in this dialog also. Clicking the "Help" button explains the many options available in the "Define Variable Properties" dialog.
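The dummy coding scheme just described can be sketched in a few lines. This is only an illustration of the coding logic, not SPSS's own implementation; the function name and the category labels are assumptions made for the example.

```python
def dummy_code(values, categories, reference):
    """Code each value as 0/1 indicators, omitting the reference category."""
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows

# Hypothetical three-category variable with "Other" chosen as the reference.
categories = ["White", "Black", "Other"]
columns, rows = dummy_code(["White", "Other", "Black"], categories, reference="Other")

print(columns)  # ['White', 'Black']
print(rows)     # [[1, 0], [0, 0], [0, 1]]
```

A case in the reference category is coded 0 on every retained dummy, so c categories yield c - 1 columns in the model.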
The cross-validation coefficient, r2cv, is the percent of variance explained in the dependent variate by the predictions from the leave-one-out process (see Wakeling & Morris, 2005: 294). That is,
r2cv = (RSS - PRESS) / RSS
where RSS is the initial sum of squares for the dependent variable and PRESS is the PRESS statistic (discussed below). Wakeling & Morris (2005: 298-300), using Monte Carlo simulation methods, have developed tables of critical values of r2cv for one-, two-, and three-dimensional models, for datasets with given numbers of rows and columns. Thus r2cv greater than the critical value may be taken as significant, and the researcher may select the model with the fewest dimensions that has a significant cross-validation statistic as being the most parsimonious and therefore optimal model.
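As a rough illustration of the leave-one-out logic, the sketch below computes r2cv under the assumption that it takes the usual form (RSS - PRESS) / RSS. To stay self-contained it uses a one-predictor least-squares line as a stand-in for the PLS model, so it shows the cross-validation arithmetic only, not PLS estimation. The data are made up.

```python
def loo_predictions(x, y):
    """Predict each y[i] from a least-squares line fitted without case i."""
    preds = []
    n = len(x)
    for i in range(n):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        mx = sum(xs) / (n - 1)
        my = sum(ys) / (n - 1)
        sxx = sum((a - mx) ** 2 for a in xs)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        slope = sxy / sxx
        preds.append(my + slope * (x[i] - mx))
    return preds

def r2cv(y, preds):
    """Cross-validated r-squared: (RSS - PRESS) / RSS."""
    mean_y = sum(y) / len(y)
    rss = sum((v - mean_y) ** 2 for v in y)               # initial sum of squares
    press = sum((v - p) ** 2 for v, p in zip(y, preds))   # leave-one-out squared errors
    return (rss - press) / rss

# Hypothetical data: y rises roughly linearly with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
score = r2cv(y, loo_predictions(x, y))
print(round(score, 3))
```

Note that r2cv can be negative when the leave-one-out predictions are worse than simply predicting the mean, which is one reason it is a stricter criterion than ordinary R-square.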
The more of the variation in the Y variables a factor explains, the more powerful it is apt to be in explaining the variation in a new sample of dependent values. The more of the variation in the X variables a factor explains, the better it reflects the observed values of the set of independent variables.
| Proportion of Variance Explained |
| Latent Factors | X Variance | Cumulative X Variance | Y Variance | Cumulative Y Variance (R-square) | Adjusted R-square |
| 1 | .307 | .307 | .011 | .011 | .010 |
| 2 | .271 | .578 | .002 | .013 | .011 |
| 3 | .218 | .796 | .000 | .014 | .011 |
| 4 | .079 | .875 | 5.024E-5 | .014 | .010 |
| 5 | .125 | 1.000 | 1.875E-5 | .014 | .010 |
| Weights |
| Variables | Latent Factor 1 | Latent Factor 2 | Latent Factor 3 | Latent Factor 4 | Latent Factor 5 |
| [sex=Male] | .048 | .206 | .708 | .173 | -.457 |
| [race=White] | .297 | .528 | -.076 | .689 | .979 |
| [race=Black] | -.301 | -.472 | .238 | .688 | .978 |
| age | .463 | -.555 | -.482 | .154 | -.409 |
| prestg80 | .778 | -.524 | .459 | -.108 | .273 |
| [happy=Very Happy] | .113 | -.022 | .019 | .004 | .003 |
| [happy=Pretty Happy] | -.059 | .056 | .011 | .024 | .007 |
| Loadings |
| Variables | Latent Factor 1 | Latent Factor 2 | Latent Factor 3 | Latent Factor 4 | Latent Factor 5 |
| [sex=Male] | .065 | .182 | .705 | .871 | -.677 |
| [race=White] | .534 | .615 | -.207 | .461 | .187 |
| [race=Black] | -.531 | -.617 | .208 | .482 | .171 |
| age | .370 | -.380 | -.507 | .896 | -.577 |
| prestg80 | .652 | -.259 | .417 | -.576 | .380 |
| [happy=Very Happy] | .759 | -.713 | 1.504 | -.833 | -.829 |
| [happy=Pretty Happy] | -.707 | .792 | -.578 | 1.150 | 1.371 |
| Variable Importance in the Projection |
| Variables | Latent Factor 1 | Latent Factor 2 | Latent Factor 3 | Latent Factor 4 | Latent Factor 5 |
| [sex=Male] | .108 | .219 | .310 | .310 | .312 |
| [race=White] | .663 | .783 | .775 | .779 | .783 |
| [race=Black] | .674 | .757 | .753 | .758 | .761 |
| age | 1.034 | 1.075 | 1.075 | 1.073 | 1.073 |
| prestg80 | 1.739 | 1.651 | 1.641 | 1.638 | 1.637 |
| Cumulative Variable Importance | |||||
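The variable importance in the projection (VIP) values in the table above can be approximated from the weights. The sketch below assumes the commonly used VIP definition, in which each predictor's squared, normalized weight on a factor is weighted by the Y variance that factor explains. Whether SPSS computes VIP in exactly this way is an assumption. Only the one-factor case is checked here, using the first column of the Weights table for the five predictors and the Y variance of .011 for factor 1.

```python
import math

def vip(weights, ssy):
    """VIP scores.

    weights[a][j] is the weight of predictor j on factor a;
    ssy[a] is the Y variance explained by factor a.
    """
    p = len(weights[0])
    total = sum(ssy)
    scores = []
    for j in range(p):
        acc = 0.0
        for w_a, s_a in zip(weights, ssy):
            norm_sq = sum(w * w for w in w_a)      # squared length of the weight vector
            acc += s_a * (w_a[j] ** 2) / norm_sq
        scores.append(math.sqrt(p * acc / total))
    return scores

names = ["[sex=Male]", "[race=White]", "[race=Black]", "age", "prestg80"]
factor1_weights = [0.048, 0.297, -0.301, 0.463, 0.778]  # first column of the Weights table
vip1 = vip([factor1_weights], [0.011])

for name, value in zip(names, vip1):
    print(name, round(value, 3))
```

Under these assumptions the results agree with the first column of the VIP table to about two decimal places, with the small differences attributable to rounding in the printed weights. A common rule of thumb treats predictors with VIP below about 0.8 as weak contributors.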
| Parameters |
| Independent Variables | Dependent: [happy=Very Happy] | Dependent: [happy=Pretty Happy] |
| (Constant) | .049 | .724 |
| [sex=Male] | .013 | .017 |
| [race=White] | .034 | .048 |
| [race=Black] | -.020 | .027 |
| age | .001 | -.002 |
| prestg80 | .004 | -.003 |
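The Parameters table can be read as a prediction equation for each dependent dummy variable. The sketch below assumes the parameters are unstandardized coefficients applied to the raw predictors. The example case (a 40-year-old white male with occupational prestige 50) is hypothetical.

```python
# Coefficients copied from the Parameters table above.
coefs = {
    "[happy=Very Happy]": {
        "(Constant)": 0.049, "[sex=Male]": 0.013, "[race=White]": 0.034,
        "[race=Black]": -0.020, "age": 0.001, "prestg80": 0.004,
    },
    "[happy=Pretty Happy]": {
        "(Constant)": 0.724, "[sex=Male]": 0.017, "[race=White]": 0.048,
        "[race=Black]": 0.027, "age": -0.002, "prestg80": -0.003,
    },
}

def predict(b, case):
    """Constant plus the sum of coefficient times predictor value."""
    return b["(Constant)"] + sum(b[name] * value for name, value in case.items())

# Hypothetical case, assumed to be on the raw (unstandardized) scale.
case = {"[sex=Male]": 1, "[race=White]": 1, "[race=Black]": 0, "age": 40, "prestg80": 50}

predictions = {dep: predict(b, case) for dep, b in coefs.items()}
for dep, value in predictions.items():
    print(dep, round(value, 3))
```

Because the dependents are 0/1 dummies, the predicted values behave like linear-probability estimates and are not constrained to lie between 0 and 1.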






Warning! SmartPLS works with standardized data. If one's data are not already standardized, it is essential that the researcher specify "Mean 0, Var 1" as the data metric when selecting one of the Calculate menu choices, causing SmartPLS to standardize the data (see the figure below). If data are already standardized, specify "Original" at this step.
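For readers who prefer to standardize before importing, the "Mean 0, Var 1" metric corresponds to converting each variable to z-scores. The sketch below is a minimal illustration. It uses the sample standard deviation (n - 1 denominator), and whether SmartPLS uses n or n - 1 internally is an assumption not confirmed here.

```python
import statistics

def standardize(values):
    """Rescale a variable to mean 0 and (sample) variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD; assumed, not confirmed for SmartPLS
    return [(v - mean) / sd for v in values]

# Hypothetical raw scores for one indicator.
raw = [12.0, 15.0, 9.0, 20.0, 14.0]
z = standardize(raw)

print([round(v, 3) for v in z])
```

Each column of the data file would be standardized separately in this way, after which "Original" would be the appropriate choice in SmartPLS.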

The SmartPLS data screen shows the first several lines of raw data at the top and then in the window below, the converted data. Data are usually entered in comma-delimited text format (.csv). Data may also be delimited by tabs, spaces, or semi-colons. Click on the appropriate "Choose delimiter:" choice, which will cause data to appear in the Preview window below.

Note the user interface allows switching among multiple projects and datasets, according to the tab pressed on the Projects row at the top. Not shown in the illustration above, there is also a Help window in a pane to the right of the screen.

In this example we create a simple regression model in which OccStat and Incent1 predict Motive1. We do this by first selecting the Insert tool, enlarged below, to drag and draw the three ellipses. We right-click to rename the ellipses as Incentives, SES, and Motivation. We then drag the indicators (Incent1, OccStat, and Motive1) to their respective ellipses. We then use the Connection tool, enlarged below, to draw the arrows connecting the ellipses (the arrows connecting the indicators are added automatically). The Select tool can be used to move objects on the diagram. Right-clicking in the Projects pane in the upper left also allows the researcher to copy models or data from one project to another.

Thus PLS regression can be accomplished by creating single-indicator latent factors (the ellipses), though there is little point to that.
SmartPLS requires explicit specification of latent factors prior to analysis. In the model below, four predictor variables are specified as indicators for the latent factor Predictors, and two others for the latent factor Motivation, which represents the dependents. (SmartPLS output shows Cronbach's alpha for the two latents. Alpha for Predictors is only -.15, which is unacceptably low and demonstrates that the indicators cannot be construed to be measures of the same latent factor.)
SPSS PLS takes an exploratory rather than confirmatory approach. That is, there is no requirement to specify the number of latent dimensions a priori nor to associate indicators with particular dimensions. Rather, the researcher can accept the default (up to 5 dimensions) as described earlier in this section. Even if the researcher constrains the model to the same number of dimensions as in the SmartPLS solution, the SPSS-computed latent factors will be data-driven dimensions rather than the theory-driven dimensions required by SmartPLS modeling. Therefore the coefficients will be different and have a different meaning. Also, model fit will differ. The SmartPLS model below has the Predictors set explaining 42.6% of the variance in the dependent Motivation set. The corresponding SPSS PLS solution puts R-square at 35.1% explained for the solution constrained to one X-variable factor, rising to 37.2% for the four-factor solution. However, these percentages are not comparable as the factors have different meanings.
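A negative Cronbach's alpha, as reported for Predictors, arises when indicators covary negatively on balance. The sketch below applies the standard alpha formula to two hypothetical item sets, one consistent and one not. The data are invented and unrelated to the example model.

```python
import statistics

def cronbach_alpha(items):
    """Standard Cronbach's alpha; items is a list of indicator columns."""
    k = len(items)
    item_variances = sum(statistics.variance(col) for col in items)
    totals = [sum(row) for row in zip(*items)]  # per-case sum across indicators
    return k / (k - 1) * (1 - item_variances / statistics.variance(totals))

# Hypothetical indicators that move together across four cases.
consistent = [[1, 2, 3, 4], [1, 2, 3, 4]]
# Hypothetical indicators that tend to move in opposite directions.
inconsistent = [[1, 2, 3, 4], [3, 4, 1, 2]]

alpha_consistent = cronbach_alpha(consistent)
alpha_inconsistent = cronbach_alpha(inconsistent)
print(round(alpha_consistent, 2), round(alpha_inconsistent, 2))
```

When the variance of the summed scale is smaller than the sum of the item variances, alpha falls below zero, signaling that the items do not form a coherent scale.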
All of which is to say that SmartPLS users would rarely do PLS regression modeling at all. Rather they would do PLS path modeling as described below, specifying the number of predictor latent factors (X variable dimensions) in advance and associating specified indicators with each. Path modeling and the details of SmartPLS output are described below.


where ...
Note that by using t-tests, this procedure reintroduces distributional assumptions into PLS, which otherwise is a distribution-free procedure. However, Henseler, Ringle, & Sinkovics (2009: 309-310) have suggested a new distribution-free procedure, not outlined here, for testing differences in b coefficients across groups.

AVE may also be used to establish discriminant validity by the Fornell–Larcker criterion: for any latent variable, its AVE should be higher than its squared correlation with any other latent variable.
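The Fornell-Larcker comparison can be sketched as follows, taking AVE to be the mean of the squared standardized loadings of a construct's indicators (the usual definition). The loadings and the latent correlation below are hypothetical, and the function names are chosen only for this illustration.

```python
def ave(loadings):
    """Average variance extracted: mean of squared standardized loadings."""
    return sum(l * l for l in loadings) / len(loadings)

def fornell_larcker(loadings_by_construct, correlations):
    """Return True per construct if its AVE exceeds every squared correlation."""
    results = {}
    for name, loads in loadings_by_construct.items():
        max_sq = max(r * r for pair, r in correlations.items() if name in pair)
        results[name] = ave(loads) > max_sq
    return results

# Hypothetical standardized loadings and latent correlation.
loadings = {"Incentives": [0.80, 0.75, 0.70], "Motivation": [0.85, 0.80]}
correlations = {("Incentives", "Motivation"): 0.60}

check = fornell_larcker(loadings, correlations)
print(check)
```

With these made-up values both constructs pass, since each AVE exceeds the squared correlation of .36. Raising the correlation to .80 would cause Incentives to fail.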

An ideal model would have strong expected loadings and weak cross-loadings. Here, the expected loadings are strong but cross-loadings are greater than in a model with simple factor structure. Lack of simple factor structure diminishes the meaningfulness of factor labels (ex., the Incentives factor here still has substantial cross-loadings with the indicators for Motivation).




1. Install SPSS (SPSS CD)
2. Install Python (SPSS CD)
3. Install SPSS-Python Integration Plug-in (from SPSS CD)
4. Install NumPy and SciPy (From SPSS CD under Python and Additional Modules; Note this option installs Python, NumPy, and SciPy in order if they are not already present)
5. Install PLS Extension available at: https://www.ibm.com/developerworks/mydeveloperworks/files/app/person/270002VCWN/file/33319ac0-6f93-4040-9094-f40e7da9e7a8. You can log in as Guest/Guest. After unzipping, copy plscommand.xml and PLS.py to the extensions subdirectory under the SPSS Statistics installation directory.
Note, however, that this does not mean that multicollinearity just "goes away." Multicollinearity of the factor indicators in the measurement model (the outer model) is still problematic. To the extent that the original X variables are multicollinear, PLS will lack a simple factor structure and the factor cross-loadings will mean PLS factors will be difficult to label, interpret, and distinguish.
However, Marcoulides and Saunders (2006, p. vi) have noted that even moderate non-normality of data will require a markedly larger sample size in PLS, even if indicators are highly reliable. Based on simulation studies, Qureshi & Compeau (2009) found neither PLS nor SEM could consistently detect differences across groups when the dependent variable was highly skewed or kurtotic, though both PLS and SEM detected inter-group differences in other paths in the model not involving the dependent. However, Hsu, Chen, & Hsieh (2006), using simulation studies to compare PLS, SEM, and neural networks for moderate skewness, found that "all of the SEM techniques are quite robust against the skewness scenario" (pp. 368-369).
The appropriate sample size choice is more complex than any rule of thumb. The appropriate size depends in part both on how well defined the factor structure is (ex., are weights > .70?) and on how small the path coefficients are that the researcher seeks to establish as different from 0 (ex., a much larger sample is needed to establish path coefficients of .1 than .7). Marcoulides & Saunders (2006), based on simulation studies, have published a table (p. vii) addressing the question of "what sample sizes would be needed to achieve a sufficient level of power, say equal to .80 (considered by most researchers as acceptable power) to reject the hypothesis that the factor correlation in the population is zero." Their Monte Carlo results show that while, indeed, PLS estimates may be reliable for very small samples (ex., 17), this is true only when factor loadings are large and the researcher is examining high factor correlations. On the other hand, their experiments suggested, for example, that using indicators with 0.7 factor loadings and examining factor correlations of .2, a sample of size 1,261 would be required to achieve the .80 power level. A sample size of 98 was sufficient for loadings equal to or greater than .6 when establishing correlations equal to or greater than .4. A sample size of 23 was sufficient for loadings equal to or greater than .7 when establishing correlations equal to or greater than .6.
Qureshi & Compeau (2009) also used Monte Carlo simulation to research related issues. They found PLS better than SEM when data were normally distributed, with a small sample size and correlated exogenous variables. They also found, however, that with large sample sizes and normally distributed data, both approaches consistently detected differences across groups. Neither PLS nor SEM performed well for datasets where the dependent was non-normal, though for smaller samples at moderate effect sizes, PLS outperformed SEM in detecting intergroup differences in other paths in the model (paths not involving the dependent).

This figure highlights that PLS models indicators without error whereas SEM is useful in modeling error. In addition, PLS models endogenous latent factors without disturbance terms.
PLS generally yields the most accurate predictions and therefore has been much more widely used than PCR. PLS may also be more parsimonious than PCR. In a chemistry setting, Wentzell & Vega (2003: 257) conducted simulations to compare PLS and PCR, finding "In all cases, except when artificial constraints were placed on the number of latent variables retained, no significant differences were reported in the prediction errors reported by PCR and PLS. PLS almost always required fewer latent variables than PCR, but this did not appear to influence predictive ability."
Attempts have been made to improve the predictive power of PCR. Traditional PCR methods use the first k components (first by having the highest eigenvalues) to predict the response variable, Y. Hwang & Nettleton (2003: 71) note, "Restricting attention to principal components with the largest eigenvalues helps to control variance inflation but can introduce high bias by discarding components with small eigenvalues that may be most associated with Y. Jolliffe (1982) provided several real-life examples where the principal components corresponding to small eigenvalues had high correlation with Y. Hadi and Ling (1998) provided an example where only the principal component associated with the smallest eigenvalue was correlated with Y." Recall that variance inflation (measured in regression by the variance inflation factor, VIF) indicates multicollinearity: a multicollinear model may explain a high proportion of variance in Y, but redundancy among the X variables leads to inflated standard errors and unstable parameter estimates. Minimizing variance inflation may not minimize mean square error (MSE). To deal with the tradeoff between variance inflation and MSE, some researchers employ an "inferential approach", which uses only components whose regression coefficients significantly differ from zero (Mason & Gunst, 1985). More recently, Hwang & Nettleton (2003) have proposed a PCR selection strategy which selects the components that minimize mean square error (MSE), demonstrating through simulation studies that their estimator outperformed traditional PCR, inferential PCR, and even traditional PLS (which ranked second in the simulation, among many variants tested). However, it appears that Hwang-Nettleton estimators are not employed by current software.
PCR. With METHOD=PCR one is asking for principal components regression, which predicts response variables from factors underlying the predictor variables. Latents created with PCR may not predict Y-scores as well as latents created by PLS or SIMPLS.
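As a reminder of how the variance inflation factor behaves, the sketch below computes VIF for the simplest case of two predictors, where VIF = 1 / (1 - r^2) and r is the correlation between them. With more predictors, r^2 is replaced by the R-square from regressing each predictor on all the others. That general case is not shown, and the data are hypothetical.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor regression."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical, moderately correlated predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 5.0, 4.0, 5.0]

vif_value = vif_two_predictors(x1, x2)
print(round(vif_value, 2))
```

As r approaches 1 the denominator approaches zero and VIF grows without bound, which is the variance inflation that component-based methods such as PCR and PLS are meant to sidestep.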
While CTA-PLS generally follows the same principles and procedures as CTA-SEM, there are some differences. Because PLS does not conform to distributional assumptions required by conventional significance testing, tetrads are tested using bootstrap methods.
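To illustrate the general idea of bootstrapping a tetrad, the sketch below computes one tetrad (the difference between two products of covariances among four indicators) and a simple percentile bootstrap interval for it. This is a simplified stand-in: the plain percentile interval, the simulated one-factor data, and all function names are assumptions for illustration, not the procedure of any particular CTA-PLS implementation.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)

def tetrad(rows):
    """Tetrad cov(1,2)*cov(3,4) - cov(1,3)*cov(2,4) for four indicators."""
    x1, x2, x3, x4 = zip(*rows)
    return cov(x1, x2) * cov(x3, x4) - cov(x1, x3) * cov(x2, x4)

def bootstrap_tetrad_ci(rows, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the tetrad, resampling cases."""
    rng = random.Random(seed)
    stats = sorted(
        tetrad([rng.choice(rows) for _ in rows]) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Simulated data: four indicators driven by one common factor plus noise,
# a situation in which the population tetrad is zero.
gen = random.Random(42)
data = []
for _ in range(60):
    f = gen.gauss(0, 1)
    data.append(tuple(0.8 * f + gen.gauss(0, 0.6) for _ in range(4)))

interval = bootstrap_tetrad_ci(data)
print(round(tetrad(data), 3), tuple(round(v, 3) for v in interval))
```

If the interval excludes zero, the tetrad would be judged non-vanishing. In practice several tetrads are tested together, so some adjustment for multiple comparisons is normally needed.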
Copyright 1998, 2008, 2009, 2010, 2011 by G. David Garson.
Do not post on other servers, even for educational use.
Last update 4/16/2011.