Overview
This module discusses the most common data assumptions found in statistical research, as listed below.
Additivity; Data level; Equality of means; Homogeneity of variances; Homogeneity of variance-covariance matrices; Homogeneity of regressions; Homoscedasticity; Independence; Linearity; Multicollinearity; Multivariate normality; Normality, skew, & transformations; Normally distributed error; Outliers; Proper model specification; Randomness; Sound measurement; Sphericity; Unidimensionality
For instance, parametric statistics are those which assume a certain distribution of the data (usually the normal distribution), assume an interval level of measurement, and assume homogeneity of variances when two or more samples are being compared. Most common significance tests (z tests, t-tests, and F tests) are parametric. However, it has long been established that moderate violations of parametric assumptions have little or no effect on substantive conclusions in most instances (e.g., Cohen, 1969: 266-267).
As a rule of thumb, the lower the overall effect (e.g., R-squared in multiple regression, goodness of fit in logistic regression), the more likely it is that important variables have been omitted from the model and that existing interpretations of the model will change when the model is correctly specified. The specification problem is lessened when the research task is simply to compare models to see which has a better fit to the data, as opposed to the purpose being to justify one model and assess the relative importance of the independent variables.
Note that a set of items may be considered to be unidimensional using one of the methods above, even when another method would fail to find statistical justification in considering the items to measure a single construct. For instance, it would be quite common to find that items in a large Guttman scale would fail to load on a single factor using factor analysis as a method. Finding satisfactory model fit in SEM would not assure that the Cronbach's alpha criterion was met. The researcher must decide on theoretical grounds what definition and criterion for unidimensionality best serves his or her research purpose. However, some method must always be used before proceeding to use multiple indicators to measure a concept.
Normal distributions take the form of a symmetric bell-shaped curve. The standard normal distribution is one with a mean of 0 and a standard deviation of 1. Standard scores, also called z-scores or standardized data, are scores which have had the mean subtracted and which have been divided by the standard deviation to yield scores which have a mean of 0 and a standard deviation of 1. Normality can be visually assessed by looking at a histogram of frequencies, or by looking at a normal probability plot output by most computer programs.
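The standardization just described can be sketched in a few lines of Python. This is a minimal illustration, not the routine any particular package uses: the function name z_scores and the sample data are hypothetical, and the population standard deviation is assumed (many packages divide by n - 1 instead).

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values: subtract the mean, then divide by the s.d."""
    m = mean(values)
    s = pstdev(values)  # assumption: population s.d. (divides by n)
    return [(v - m) / s for v in values]

# Hypothetical raw scores with mean 5 and population s.d. 2.
scores = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(scores)
print(z)
```

The resulting z list has a mean of 0 and a standard deviation of 1, which is what makes standardized variables comparable across different original scales.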
The area under the normal curve represents probability: 68.26% of cases will lie within 1 standard deviation of the mean, 95.44% within 2 standard deviations, and 99.74% within 3 standard deviations. Often this is simplified by rounding to say that 1 s.d. corresponds to 2/3 of the cases, 2 s.d. to 95%, and 3 s.d. to 99%. Another way to put this is to say there is less than a .05 chance that a sampled case will lie outside 2 standard deviations of the mean, and less than .01 chance that it will lie outside 3 standard deviations. This statement is analogous to statements pertaining to significance levels of .05 and .01, for two-tailed tests.
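These areas can be checked directly from the error function, since the probability that a normal variate lies within k standard deviations of its mean is erf(k / sqrt(2)). The sketch below uses only Python's standard library; the function name coverage_within is illustrative. Computed this way, the percentages may differ from table-based figures by about 0.01 in the last digit because printed tables round each tail separately.

```python
import math

def coverage_within(k):
    """Probability that a normal variate lies within k s.d. of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    inside = coverage_within(k)
    # Print the percent inside and the two-tailed probability outside.
    print(k, round(100 * inside, 2), round(1 - inside, 4))
```

The second printed figure for k = 2 and k = 3 is the two-tailed probability, which is the quantity compared against the .05 and .01 significance levels.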
Negative skew is right-leaning, positive skew is left-leaning. For each type of skew, the mean, median, and mode diverge, so all three measures of central tendency should be reported for skewed data. Box-Cox transformation may normalize skew. Right-skewed distributions may fit power, lognormal, gamma, Weibull, or chi-square distributions. Left-skewed distributions may be recoded to be right-skewed. (Note: there is confusion in the literature about what is "right" or "left" skew, but the foregoing is the most widely accepted labeling.)
Various transformations are used to correct kurtosis: cube roots and sine transforms may correct negative kurtosis. In SPSS, one of the places kurtosis is reported is under Analyze, Descriptive Statistics, Descriptives; click Options; select kurtosis.
For a given variable, W should not be significant if the variable's distribution is not significantly different from normal, as is the case for StdEduc in the illustration above. W may be thought of as the correlation between given data and their corresponding normal scores, with W = 1 when the given data are perfectly normal in distribution. When W is significantly smaller than 1, the assumption of normality is not met. Shapiro-Wilk's W is recommended for small and medium samples up to n = 2000. For larger samples, the Kolmogorov-Smirnov test is recommended by SAS and others.
To take a more complex example, a box plot can also be a chart in which categories of a categorical independent or of multiple independents are arrayed on the X axis and values of an interval dependent are arrayed on the Y axis. In the example below there are two categorical independents (country of car manufacture, number of cylinders) predicting the continuous dependent variable horsepower.
Inside the graph, for each X category, will be a rectangle indicating the spread of the dependent's values for that category. If these rectangles are roughly at the same Y elevation for all categories, this indicates little difference among groups. Within each rectangle is a horizontal dark line, indicating the median. If most of the rectangle is on one side or the other of the median line, this indicates the dependent is skewed (not normal) for that group (category). Whiskers and outliers are as described above for the one-variable case.
Warnings. Transformations should make theoretical sense. Often, normalizing a dichotomy such as gender will not make theoretical sense. Also note that as the log of zero is undefined and leads to error messages, researchers often add some arbitrary small value such as .001 to all values in order to remove zeros from the dataset. However, the choice of the constant can affect the significance levels of the computed coefficients for the logged variables. If this strategy is pursued, the researcher should employ sensitivity analysis with different constants to note effects which might change conclusions.
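The sensitivity to the added constant can be seen in a small sketch. The data and the function name log_with_offset are hypothetical; the point is only that the distance between a zero and its neighboring values after logging depends heavily on the constant chosen, which is why coefficients for the logged variable can shift.

```python
import math

def log_with_offset(values, c):
    """Natural log after adding an arbitrary constant c to every value."""
    return [math.log(v + c) for v in values]

counts = [0, 1, 5, 20]  # hypothetical variable containing a zero

gaps = {}
for c in (0.001, 0.5, 1.0):
    logged = log_with_offset(counts, c)
    # Distance between the original values 0 and 1 once logged.
    gaps[c] = logged[1] - logged[0]
    print(c, round(gaps[c], 3))
```

With a very small constant the zero cases are pushed far from the rest of the distribution and can act like outliers, whereas a constant of 1 keeps them close. Rerunning the substantive analysis under each constant is one simple form of the sensitivity analysis recommended above.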
In general, the Box-Cox procedure is to (1) divide the independent variable into 10 or so regions; (2) calculate the mean and s.d. of the dependent variable for each region; (3) plot log(s.d.) vs. log(mean) for the set of regions; (4) if the plot is a straight line, note its slope, b, then transform the dependent variable by raising it to the power (1 - b), or, if b = 1, take the log of the dependent variable; and (5) if there are multiple independents, repeat steps 1 - 4 for each independent variable and pick a b which is within the range of b's you get.
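Steps 1 through 4 can be sketched for a single independent variable as follows. This is an illustration under stated assumptions, not a packaged routine: the function names are hypothetical, regions are formed as equal-count slices of the cases sorted by the independent, the dependent is assumed positive with nonzero spread in every region, and the slope is obtained by ordinary least squares in place of reading it off a plot.

```python
import math
from statistics import mean, stdev

def spread_vs_level_slope(x, y, n_regions=10):
    """Slope b of log(s.d.) on log(mean) of y across regions of x."""
    pairs = sorted(zip(x, y))          # order cases by the independent
    size = len(pairs) // n_regions
    log_means, log_sds = [], []
    for i in range(n_regions):
        start = i * size
        # The last region absorbs any leftover cases.
        end = len(pairs) if i == n_regions - 1 else start + size
        ys = [p[1] for p in pairs[start:end]]
        log_means.append(math.log(mean(ys)))
        log_sds.append(math.log(stdev(ys)))
    # Least-squares slope of log(s.d.) on log(mean).
    mx, my = mean(log_means), mean(log_sds)
    num = sum((a - mx) * (c - my) for a, c in zip(log_means, log_sds))
    den = sum((a - mx) ** 2 for a in log_means)
    return num / den

def suggested_power(b):
    """Power for the dependent: 1 - b (a result of 0 means take the log)."""
    return 1.0 - b

# Hypothetical data in which the s.d. of y grows in proportion to its mean.
x = list(range(100))
pattern = [0.8, 0.9, 1.0, 1.1, 1.2]
y = [(10 + (i // 10) * 10) * pattern[i % 5] for i in x]
b = spread_vs_level_slope(x, y)
print(round(b, 6), round(suggested_power(b), 6))
```

In this constructed example the spread is exactly proportional to the level, so the slope is 1 and the suggested power is 0, which by step 4 points to a log transform of the dependent.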
In practice, computer packages apply an iterative maximum-likelihood algorithm to compute lambda, a Box-Cox parameter used to determine the exact power transformation which will best de-correlate the variances and means of the groups formed by the independent variables. As a rule of thumb, if lambda is 1.0, no transformation is needed. A lambda of +.5 corresponds to a square root transform of the dependent variable; lambda of 0 corresponds to a natural log transform; -.5 corresponds to a reciprocal square root transform; and a lambda of -1.0 corresponds to a reciprocal transform. The Box-Cox transformation is not yet supported in SPSS. However, Michael Speed of Texas A&M University has made available SPSS syntax code for Box-Cox transformation at http://www.stat.tamu.edu/ftp/pub/mspeed/stat653/spss/, with tutorials at http://www.stat.tamu.edu/spss.php.
See Box, G. E. P. and D. R. Cox (1964). An analysis of transformations. Journal of the Royal Statistical Society, B, 26, 211-234; see also Maddala, G. S. (1977). Econometrics. New York: McGraw-Hill (pp. 315-317); or Mason, R. L., R. F. Gunst, and J. L. Hess (1989). Statistical design and analysis of experiments with applications to engineering and science. New York: Wiley.
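The correspondence between lambda and the familiar transforms follows from the usual form of the Box-Cox family, (y^lambda - 1) / lambda for nonzero lambda and ln(y) for lambda = 0. The sketch below applies that formula for a given lambda; it does not estimate lambda by maximum likelihood, and the function name box_cox is illustrative.

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform of a single positive value y."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)             # limiting case as lambda -> 0
    return (y ** lam - 1.0) / lam

y = 4.0
for lam in (1.0, 0.5, 0.0, -0.5, -1.0):
    print(lam, round(box_cox(y, lam), 4))
```

Because subtracting 1 and dividing by lambda are only a shift and a rescaling, a lambda of .5 is equivalent to a square root transform for modeling purposes, a lambda of -1 to a reciprocal, and so on. A lambda of 1 merely shifts the variable by 1, which is why no transformation is needed in that case.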
In SPSS, select Analyze, Regression, Linear; click the Save button; check Cook's, Mahalanobis, and/or leverage values.
Bollinger & Chandra (2005) and others have found that while trimming or winsorizing data can increase power without significantly increasing Type I errors in some circumstances, by pulling observations toward the mean it is also possible to introduce bias.
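The difference between trimming and winsorizing can be sketched as follows. The function names and data are hypothetical, and k here is a count of cases in each tail, whereas packages often specify a percentage instead.

```python
def winsorize(values, k):
    """Pull the k lowest and k highest values in to the nearest retained value."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one case untouched")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

def trim(values, k):
    """Drop the k lowest and k highest values entirely."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one case")
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 100]  # hypothetical, two extremes
print(winsorize(data, 1))
print(trim(data, 1))
```

Winsorizing keeps the sample size but replaces the extremes, while trimming discards them. Both reduce the influence of the tails, which is the source of both the gain in power and the possible bias noted above.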
In the example below, Score is predicted from Seniority. In this example, graphical inspection leaves doubt whether error is normally distributed.
When the residuals for the example above are saved, the SPSS Analyze, Descriptive Statistics, Explore menu choices will generate, among other statistics, the skew and kurtosis, which are .436 and -.840 respectively for these data - within normal bounds. However, as the detrended Q-Q plot for these residuals shows (below), residuals are below normal expectations for middle ranges and above for high and low ranges. This is indicative of a bimodal rather than normal distribution of error. It is also an example of where the skew and kurtosis "rules of thumb" give misleading average values.
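For readers who want to see how such summary figures arise, the following sketch computes moment-based skewness and excess kurtosis. It is an assumption-laden illustration: statistical packages typically apply small-sample adjustments, so these simple formulas will not reproduce package output exactly, and the residuals shown are hypothetical rather than those of the example.

```python
from statistics import mean

def skew_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample adjustment)."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Hypothetical residuals forming two clusters that are symmetric about zero.
residuals = [-2.2, -2.0, -1.8, -0.2, 0.2, 1.8, 2.0, 2.2]
skew, kurt = skew_and_kurtosis(residuals)
print(round(skew, 3), round(kurt, 3))
```

A symmetric two-cluster distribution of this kind has essentially zero skew, so the skew coefficient alone gives no hint of the bimodality. This is one reason a plot of the residuals should accompany the summary coefficients.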
In the example below, Zodiac (Zodiac sign) is used to predict Polviews (liberal or conservative). As expected, the ANOVA is non-significant, indicating that Zodiac does not predict Polviews. Because the Levene statistic is not significant, the researcher fails to reject the null hypothesis that the groups have equal variances. Frequencies for Zodiac, not shown here, show group sizes not to be markedly different, so the results of the Levene test are accepted. However, were the group sizes markedly different, the Brown & Forsythe test would be used. For these data, the Brown & Forsythe test is also non-significant and thus not different in inference from Levene's test.
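The logic of the Levene statistic can be sketched as a one-way analysis of variance on absolute deviations from each group's center. The sketch below is an illustration under assumptions, not SPSS's implementation: the function name and data are hypothetical, centering on the group mean gives Levene's original form, centering on the group median gives the variant usually attributed to Brown and Forsythe, and only the test statistic is computed (a p-value would require the F distribution with k - 1 and N - k degrees of freedom).

```python
from statistics import mean, median

def levene_statistic(groups, center=mean):
    """W statistic for equality of variances across groups of cases."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each case from its own group's center.
    z = []
    for g in groups:
        c = center(g)
        z.append([abs(v - c) for v in g])
    z_means = [mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

same_spread = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]    # hypothetical groups
diff_spread = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50]]
print(levene_statistic(same_spread))
print(round(levene_statistic(diff_spread), 4))
print(round(levene_statistic(diff_spread, center=median), 4))
```

Groups with identical spread yield a statistic of zero regardless of how far apart their means are, while groups with very different spread yield a large statistic. This is why a non-significant result supports the homogeneity of variances assumption.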
If ZPR_1, ZRE_1, or other needed variables have been saved, you can also use Graphs, Legacy Dialogs, Scatter/Dot. In the output below, for instance, education was used to predict income and the standardized predicted and residual values were saved. The plot is largely a cloud (indicating homoscedasticity) but there is some pattern showing that higher predicted values have lower residuals (lack of homoscedasticity).
An example of multicollinearity occurred in a Bureau of Labor Statistics study of the price of camcorders. The initial model included the dummy variables Sony and 8mm, both of which corresponded to high price. However, since Sony was the only manufacturer of 8mm camcorders at the time, the Sony and 8mm dummy variables were multicollinear. A similar multicollinearity occurred in a BLS study of washing machines, where it was found that "capacity" and "number of cycles" were multicollinear. In each study, one of the collinear variables had to be dropped from the model.
Whereas perfect multicollinearity leads to infinite standard errors and indeterminate coefficients, the more common situation of high multicollinearity leads to large standard errors, large confidence intervals, and diminished power. That is, the chance of Type II error is high: the researcher fails to reject the null hypothesis that the coefficients are not different from zero, and so concludes there is no relationship when in fact one exists. R-square can nonetheless be high even though individual coefficients are non-significant. The coefficients and their standard errors will be sensitive to changes in just a few observations.
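The overlap between two such dummy variables can be quantified with a short sketch. The data below are hypothetical and are not the Bureau of Labor Statistics data; the variance inflation factor shown is a standard collinearity diagnostic which, for a model with only two predictors, reduces to 1 / (1 - r squared).

```python
from statistics import mean

def pearson_r(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical dummies: every 8mm model is a Sony; one Sony model is not 8mm.
sony = [1, 1, 1, 1, 0, 0, 0, 0]
eight_mm = [1, 1, 1, 0, 0, 0, 0, 0]

r = pearson_r(sony, eight_mm)
vif = 1.0 / (1.0 - r ** 2)  # two-predictor case only
print(round(r, 3), round(vif, 2))
```

As the two dummies come closer to coinciding, r approaches 1 and the inflation factor grows without bound, which corresponds to the infinite standard errors of perfect multicollinearity.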
Methods of handling high multicollinearity are discussed elsewhere.
Copyright 1998, 2008, 2009, 2010 by G. David Garson.
Last update 1/25/2010.