- A normal distribution is assumed by many statistical procedures. Various transformations are used to correct non-normally distributed data. Correlation, least-squares regression, factor analysis, and related linear techniques are relatively robust against non-extreme deviations from normality provided errors are not severely asymmetric (Vasu, 1979). Severe asymmetry might arise due to strong outliers. Log-linear analysis, logistic regression, and related techniques using maximum likelihood estimation are even more robust against moderate departures from normality (cf. Steenkamp & van Trijp, 1991: 285). Likewise, Monte Carlo simulations show the t-test is robust against moderate violations of normality (Boneau, 1960).
Normal distributions take the form of a symmetric bell-shaped curve. The standard normal distribution is one with a mean of 0 and a standard deviation of 1. Standard scores, also called z-scores or standardized data, are scores which have had the mean subtracted and which have been divided by the standard deviation to yield scores which have a mean of 0 and a standard deviation of 1. Normality can be visually assessed by looking at a histogram of frequencies, or by looking at a normal probability plot output by most computer programs.
The area under the normal curve represents probability: 68.27% of cases will lie within 1 standard deviation of the mean, 95.45% within 2 standard deviations, and 99.73% within 3 standard deviations. Often this is simplified by rounding to say that 1 s.d. corresponds to 2/3 of the cases, 2 s.d. to 95%, and 3 s.d. to 99%. Another way to put this is to say there is less than a .05 chance that a sampled case will lie outside 2 standard deviations of the mean, and less than a .01 chance that it will lie outside 3 standard deviations. This statement is analogous to statements pertaining to significance levels of .05 and .01 for two-tailed tests.
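As a minimal sketch of the two ideas above, the following Python (standard library only) standardizes a set of scores and computes the share of a normal distribution lying within k standard deviations of the mean. The function names are illustrative, not part of any package, and the population standard deviation is an assumption made here for simplicity.

```python
import math
from statistics import mean, pstdev


def z_scores(values):
    """Standardize: subtract the mean, then divide by the standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population s.d. (assumed here); statistics.stdev gives the sample version
    return [(v - m) / s for v in values]


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))
```

The share outside k standard deviations is simply 1 - share_within(k), which is the two-tailed probability referred to above.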
- Skew is the tilt (or lack of it) in a distribution. The more common type is right skew, where the tail points to the right. Less common is left skew, where the tail points left. A common rule-of-thumb test for normality is to run descriptive statistics to get skewness and kurtosis, then divide these by the standard errors. Skew should be within the +2 to -2 range when the data are normally distributed. Some authors use +1 to -1 as a more stringent criterion when normality is critical. In SPSS, one of the places skew is reported is under Analyze, Descriptive Statistics, Descriptives; click Options; select skew.
Negative (left) skew is right-leaning: the bulk of the cases sits to the right and the tail points left. Positive (right) skew is left-leaning. For each type of skew, the mean, median, and mode diverge, so all three measures of central tendency should be reported for skewed data. Box-Cox transformation may normalize skew. Right-skewed distributions may fit power, lognormal, gamma, Weibull, or chi-square distributions. Left-skewed distributions may be recoded to be right-skewed. (Note: there is confusion in the literature about what is "right" or "left" skew, but the foregoing is the most widely accepted labeling.)
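The rule of thumb for skew can be sketched in Python as follows. This is an illustration, not SPSS's own code: it assumes the bias-adjusted sample skewness and the usual large-sample standard error formula found in many packages, and the function name is hypothetical.

```python
import math


def skewness(values):
    """Return (skewness, standard error) using the adjusted sample formulas (assumed)."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / (n - 1))  # sample s.d.
    g1 = n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)
    se = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return g1, se


skew, se = skewness([1, 1, 2, 2, 3, 3, 4, 20])
print(skew, se, skew / se)  # the ratio is what is compared with the +2 to -2 range
```

A positive value indicates right skew (tail to the right), a negative value left skew.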
- Kurtosis is the peakedness of a distribution. A common rule-of-thumb test for normality is to run descriptive statistics to get skewness and kurtosis, then use the criterion that kurtosis should be within the +2 to -2 range when the data are normally distributed (a few authors use the more lenient +3 to -3, while other authors use +1 to -1 as a more stringent criterion when normality is critical). Positive kurtosis indicates too many cases in the tails of the distribution. Negative kurtosis indicates too few cases in the tails. Note that the kurtosis of a normal distribution is 3 as originally computed, and a few statistical packages center on 3, but the foregoing discussion assumes that 3 has been subtracted to center on 0, as is done in SPSS and LISREL. The version with the normal distribution centered at 0 is Fisher kurtosis, while the version centered at 3 is Pearson kurtosis. SPSS uses Fisher kurtosis. Leptokurtosis is a peaked distribution with "fat tails", indicated by kurtosis > 0 (for Fisher kurtosis, or > 3 for Pearson kurtosis). Platykurtosis is a less peaked, "thin tails" distribution, with a kurtosis value < 0 (for Fisher kurtosis, or < 3 for Pearson kurtosis).
Various transformations are used to correct kurtosis: cube roots and sine transforms may correct negative kurtosis. In SPSS, one of the places kurtosis is reported is under Analyze, Descriptive Statistics, Descriptives; click Options; select kurtosis.
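A comparable sketch for kurtosis follows, again in standard-library Python. It assumes the bias-adjusted sample formula for Fisher (centered on 0) kurtosis and the usual standard error; the function names are illustrative only.

```python
import math


def fisher_kurtosis(values):
    """Return (Fisher kurtosis, standard error); a normal distribution scores about 0."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((v - m) ** 2 for v in values) / (n - 1)  # sample variance
    z4 = sum((v - m) ** 4 for v in values) / s2 ** 2
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se = math.sqrt(24 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))
    return g2, se


def to_pearson_scale(fisher_value):
    """Shift a Fisher (centered on 0) value to the Pearson (centered on 3) scale."""
    return fisher_value + 3


kurt, se = fisher_kurtosis([1, 2, 3, 4, 5])
print(kurt, se, to_pearson_scale(kurt))
```

Evenly spread values such as 1 through 5 have thin tails relative to a normal distribution, so the Fisher value is negative.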
- Dichotomies. By definition, a dichotomy is not normally distributed. Many researchers will use dichotomies for procedures requiring a normal distribution as long as the split is less than 90:10. Dichotomies should not be used as dependents in procedures, such as OLS regression, which assume a normally distributed dependent variable.
- Shapiro-Wilk's W test is a formal test of normality offered in the SPSS EXAMINE module or the SAS UNIVARIATE procedure. This is the standard test for normality.
In SPSS, select Analyze, Descriptive statistics, Explore. Note the Explore menu choice pastes the EXAMINE code. Click the Plots button and check "Normality plots with tests." Output like that below is generated:
For a given variable, W should not be significant if the variable's distribution is not significantly different from normal, as is the case for StdEduc in the illustration above. W may be thought of as the correlation between given data and their corresponding normal scores, with W = 1 when the given data are perfectly normal in distribution. When W is significantly smaller than 1, the assumption of normality is not met. Shapiro-Wilk's W is recommended for small and medium samples up to n = 2000. For larger samples, the Kolmogorov-Smirnov test is recommended by SAS and others.
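The idea that W behaves like a correlation between the data and their normal scores can be illustrated with the sketch below. Note this is not the Shapiro-Wilk statistic itself (W uses specially derived weights and its own significance tables); it is only the plain correlation between the sorted data and approximate normal scores, and the Blom plotting positions used are an assumption of this example.

```python
import math
from statistics import NormalDist


def normal_scores_correlation(values):
    """Pearson correlation between sorted data and approximate normal scores."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    # Blom's approximation to expected normal order statistics (assumed convention)
    scores = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx = sum(xs) / n
    ms = sum(scores) / n
    sxy = sum((x - mx) * (q - ms) for x, q in zip(xs, scores))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - ms) ** 2 for q in scores)
    return sxy / math.sqrt(sxx * sqq)
```

Values very close to 1 are consistent with normality; markedly lower values suggest departure from it. No significance level is computed here.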
- Kolmogorov-Smirnov D test, or K-S Lilliefors test, is an alternative test of normality for large samples, available in SPSS EXAMINE and SAS UNIVARIATE. This test is also found in SPSS under Analyze, Descriptive Statistics, Explore, Plots when one checks "Normality plots with tests." Output is illustrated above. Kolmogorov-Smirnov D is sometimes called the Lilliefors test, as a correction to K-S developed by Lilliefors is now normally applied. SPSS (as of Version 9), for instance, automatically applies the Lilliefors correction to the K-S test for normality in the EXAMINE module (but not in the NONPAR module). This test, with the Lilliefors correction, is preferred to the chi-square goodness-of-fit test when data are interval or near-interval. When applied without the Lilliefors correction to data whose mean and standard deviation are estimated from the sample, K-S is very conservative: that is, there is a reduced likelihood of a finding of non-normality. Note the K-S test can test goodness-of-fit against any theoretical distribution, not just the normal distribution. Be aware that when sample size is large, even unimportant deviations from normality may be technically significant by this and other tests. For this reason it is recommended to use other bases of judgment, such as frequency distributions and stem-and-leaf plots.
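The D statistic itself is simple to compute: it is the largest vertical gap between the empirical cumulative distribution and the normal curve fitted with the sample mean and standard deviation. The sketch below computes D only; it does not attempt the Lilliefors significance level, which requires special tables or simulation. The function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def ks_d(values):
    """Largest gap between the empirical cdf and a normal cdf fitted to the sample."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # the empirical cdf jumps from i/n to (i + 1)/n at x, so check both sides
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

Larger D means a poorer fit to the normal curve; whether a given D is significant depends on sample size.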
- Graphical methods.
- A histogram of a variable shows rough normality, and a histogram of residuals, if normally distributed, is often taken as evidence of normality of all the variables.
- A graph of empirical by theoretical cumulative distribution functions (cdf's) simply shows the empirical distribution as, say, a dotted line, and the hypothetical distribution, say the normal curve, as a solid line.
- A P-P plot is found in SPSS under Graphs, P-P plots. One may test if the distribution of a given variable is normal (or beta, chi-square, exponential, gamma, half-normal, Laplace, Logistic, Lognormal, Pareto, Student's t, Weibull, or uniform). The P-P plot plots a variable's cumulative proportions against the cumulative proportions of the test distribution. The straighter the line formed by the P-P plot, the more the variable's distribution conforms to the selected test distribution (ex., normal). Options within this SPSS procedure allow data transforms first (natural log, standardization of values, difference, and seasonally difference).
- A quantile-by-quantile or Q-Q plot forms a 45-degree line when the observed values are in conformity with the hypothetical distribution. Q-Q plots plot the quantiles of a variable's distribution against the quantiles of the test distribution. From the SPSS menu, select Graphs, Q-Q. The SPSS dialog box supports testing the following distributions: beta, chi-square, exponential, gamma, half-normal, Laplace, Logistic, Lognormal, normal, Pareto, Student's t, Weibull, and uniform. Q-Q plots are also produced in SPSS under Analyze, Descriptive Statistics, Explore, Plots when one checks "Normality plots with tests."
- A detrended Q-Q plot, obtained in the same way in SPSS, provides similar information. If a variable is normally distributed, cases in the detrended Q-Q plot should cluster around the horizontal 0 line representing 0 standard deviations from the 45-degree line seen in the non-detrended Q-Q plot above. The detrended Q-Q plot is useful for spotting outliers. For the illustration below, however, there are no outliers in the sense that there are no cases more than +/- .12 standard deviations away. Cases more than +/- 1.96 standard deviations away are outliers at the .95 confidence level.
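The coordinates behind P-P, Q-Q, and detrended Q-Q plots against the normal distribution can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of SPSS output: the plotting position (i - 0.5)/n and the use of standardized values are choices made for this example, and the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_plot_points(values):
    """Return P-P, Q-Q and detrended Q-Q coordinates for a normal test distribution."""
    xs = sorted(values)
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    std_normal = NormalDist()
    pp, qq, detrended = [], [], []
    for i, x in enumerate(xs, start=1):
        p_observed = (i - 0.5) / n               # cumulative proportion (assumed convention)
        z_observed = (x - m) / s                 # observed value in s.d. units
        z_expected = std_normal.inv_cdf(p_observed)
        pp.append((p_observed, std_normal.cdf(z_observed)))
        qq.append((z_observed, z_expected))
        detrended.append(z_observed - z_expected)  # deviation from the 45-degree line
    return {"pp": pp, "qq": qq, "detrended": detrended}
```

For normally distributed data the "pp" and "qq" pairs fall near the 45-degree line and the "detrended" values cluster around 0.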
- Boxplot tests of the normality assumption. Outliers and skewness indicate non-normality, and both can be checked with boxplots. Boxplots are an SPSS output option (also under Analyze, Descriptive Statistics, Explore, Plots button, check "Normality plots with tests"). For a single variable being tested for normality, a box plot is a chart with that variable on the X axis and with the Y axis representing its spread of values (in the illustration below, the values of standardized education, StdEduc).
Inside the graph, for the given variable, the height of the rectangle indicates the spread of the values for the variable. The horizontal dark line within the rectangle indicates the median. In the illustration it is 0, which is also the mean of any standardized variable such as StdEduc. If most of the rectangle is on one side or the other of the median line, this indicates the variable is skewed (not normal). Further out than the rectangle are the "whiskers," which mark the smallest and largest observations which are not outliers (defined as observations greater than 1.5 inter-quartile ranges [IQR's = boxlengths] from the 1st and 3rd quartiles). Outliers are shown as numbered cases beyond the whiskers. (Note you can display boxplots for two factors (two independents) together by selecting Clustered Boxplots from the Boxplot item on the SPSS Graphs menu.)
To take a more complex example, a box plot can also be a chart in which categories of a categorical independent or of multiple independents are arrayed on the X axis and values of an interval dependent are arrayed on the Y axis. In the example below there are two categorical independents (country of car manufacture, number of cylinders) predicting the continuous dependent variable horsepower.
Inside the graph, for each X category, will be a rectangle indicating the spread of the dependent's values for that category. If these rectangles are roughly at the same Y elevation for all categories, this indicates little difference among groups. Within each rectangle is a horizontal dark line, indicating the median. If most of the rectangle is on one side or the other of the median line, this indicates the dependent is skewed (not normal) for that group (category). Whiskers and outliers are as described above for the one-variable case.
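The quantities a boxplot displays can be computed directly, as in the sketch below. It uses the 1.5 IQR rule described above. The quartile method is whatever Python's statistics.quantiles uses by default, which is an assumption of this example and may differ slightly from the hinges SPSS draws; the function name is hypothetical.

```python
from statistics import median, quantiles


def boxplot_summary(values, whisker=1.5):
    """Box edges, center line, whisker ends and outliers under the 1.5 IQR rule."""
    xs = sorted(values)
    q1, _, q3 = quantiles(xs, n=4)  # default quartile method (assumed, see lead-in)
    iqr = q3 - q1                   # the box length
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median(xs),       # the center line of a conventional boxplot
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlying cases
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }


print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

A center line far from the middle of the box, or outliers on one side only, points to skew.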
- Resampling is a way of doing significance testing while avoiding parametric assumptions like multivariate normality. The assumption of multivariate normality is violated when dichotomous, dummy, and other discrete variables are used. In such situations, where significance testing is appropriate, researchers may use a resampling method.
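As one hedged illustration of resampling, the sketch below runs a permutation test for a difference between two group means. The permutation test is only one of several resampling methods and is chosen here as an example; the function name, the number of resamples, and the fixed seed are all assumptions of this sketch.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation p-value for a difference in means (no normality assumed)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign cases to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:
            hits += 1
    # add 1 to numerator and denominator so the estimate is never exactly 0
    return (hits + 1) / (n_resamples + 1)


print(permutation_test([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```

The p-value is the share of random reassignments that produce a difference at least as large as the one observed.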
- Normalizing Transformations. Various transformations are used to correct skew:
- Square roots, logarithmic, and inverse (1/x) transforms "pull in" outliers and normalize right (positive) skew. Inverse (reciprocal) transforms are stronger than logarithmic, which are stronger than roots.
- To correct left (negative) skew, first subtract all values from the highest value plus 1, then apply square root, inverse, or logarithmic transforms.
- For power and root transforms, finer adjustments can be obtained by adding a constant, C, where C is some small positive value such as .5, in the transform of X: X' = (X + C)^P. When the researcher's data contain zero values, the transform using C is strongly recommended over straight transforms (ex., SQRT(X+.5), not SQRT(X) ), but the use of C is standard practice in any event. Values of P less than one (roots) correct right skew, which is the common situation (using a power of 2/3 is common when attempting to normalize). Values of P greater than 1 (powers) correct left skew. For right skew, decreasing P decreases right skew. Too great reduction of P will overcorrect and cause left skew. When the best P is found, further refinements can be made by adjusting C. For right skew, for instance, subtracting C will decrease skew.
- Logs vs. roots: logarithmic transformations are appropriate to achieve symmetry in the central distribution when symmetry of the tails is not important; square root transformations are used when symmetry in the tails is important; when both are important, a fourth root transform may work (fourth roots are used to correct extreme skew).
- Logit and probit transforms. Schumacker & Lomax (2004: 33) recommend probit transforms as a means of dealing with skewness. See Lipsey & Wilson (2001: 56) for discussion of logit and probit transforms as a means of transforming dichotomous data as part of estimating effect sizes.
- Percentages may be normalized by an arcsine transformation, which is recommended when percentages are outside the range 30% - 70%. The more observations outside this range or the closer to 0% and/or 100%, the more normality is violated and the stronger the recommendation to use arcsine transformation. However, arcsine transformation is not effective when a substantial number of observations are 0% or 100%, or when sample size is small. The usual arcsine transformation is p' = arcsin(SQRT(p)), where p is the percentage or proportion.
- Poisson distributions may be normalized by a square root transformation.
- Other strategies to correct for skew include collapsing categories and dropping outliers.
Warnings. Transformations should make theoretical sense. Often, normalizing a dichotomy such as gender will not make theoretical sense. Also note that as the log of zero is undefined and leads to error messages, researchers often add some arbitrary small value such as .001 to all values in order to remove zeros from the dataset. However, the choice of the constant can affect the significance levels of the computed coefficients for the logged variables. If this strategy is pursued, the researcher should employ sensitivity analysis with different constants to note effects which might change conclusions.
- Transforms in SPSS: Select Transform - Compute - Target Variable (input a new variable name) - Numeric Expression (input transform formula)
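Outside SPSS, the same transforms can be written as one-line functions. The Python sketch below covers the power transform with constant C, the reflection step for left skew, the arcsine transform for proportions, and a log with an added constant. All function names are illustrative, and the arcsine function assumes its input is a proportion between 0 and 1 rather than a percentage.

```python
import math


def power_transform(x, p, c=0.0):
    """X' = (X + C)^P; P < 1 pulls in right skew, P > 1 pulls in left skew."""
    return (x + c) ** p


def reflect(values):
    """Subtract every value from (highest value + 1), turning left skew into right skew."""
    top = max(values) + 1
    return [top - v for v in values]


def arcsine_sqrt(proportion):
    """Arcsine transform p' = arcsin(sqrt(p)) for a proportion in [0, 1] (result in radians)."""
    return math.asin(math.sqrt(proportion))


def log_with_constant(values, c):
    """Natural log after adding a small constant so that zeros can be logged."""
    return [math.log(v + c) for v in values]


print(power_transform(0, 0.5, 0.5))           # SQRT(X + .5) for X = 0
print(reflect([1, 2, 5]))                     # smallest reflected value is 1
print(log_with_constant([0, 1, 10], 0.001))
print(log_with_constant([0, 1, 10], 0.5))     # rerun with another constant as a sensitivity check
```

Because the reflected minimum is 1, square root, log, and inverse transforms are all defined after reflection.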
- Box-Cox Transformations of Dependent Variables. Box & Cox proposed a maximum likelihood method in 1964 for determining the optimal power transform for purposes of normalization of data. Power transformations of dependent variables were advanced to remedy model lack of normal distribution, lack of homogeneity of variances, and lack of additivity. In a regression context, Box-Cox transformation addresses the problem of non-normality, indicated by skewed residuals (the transformation pulls in the skew) and/or by lack of homoscedasticity of points about the regression line. In an ANOVA context, the Box-Cox transformation addresses the problem of lack of homogeneity of variances associated with the correlation of variances with means in the groups formed by the independent factors and indicated by skewed distributions within the groups (the transformation reduces the correlation).
In general, the Box-Cox procedure is to (1) divide the independent variable into 10 or so regions; (2) calculate the mean and s.d. of the dependent variable for each region; (3) plot log(s.d.) vs. log(mean) for the set of regions; (4) if the plot is a straight line, note its slope, b, then transform the dependent variable by raising it to the power (1 - b), or, if b = 1, by taking its log; and (5) if there are multiple independents, repeat steps 1 - 4 for each independent variable and pick a b which is within the range of b's you get.
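Steps 2 through 4 of this procedure amount to fitting a straight line to log(s.d.) against log(mean) and reading off its slope. A minimal sketch follows, assuming the dependent-variable values have already been grouped by region of the independent variable and that all group means and standard deviations are positive. The function name is hypothetical.

```python
import math
from statistics import mean, stdev


def power_from_spread(groups):
    """Slope b of log(s.d.) on log(mean) across groups, and the suggested power 1 - b."""
    xs = [math.log(mean(g)) for g in groups]
    ys = [math.log(stdev(g)) for g in groups]
    mx, my = mean(xs), mean(ys)
    # ordinary least-squares slope of ys on xs
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return b, 1 - b


# s.d. proportional to the mean gives b = 1, i.e. take the log of the dependent
print(power_from_spread([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

A suggested power of (approximately) 0 corresponds to the log transform, in line with step 4.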
In practice, computer packages apply an iterative maximum-likelihood algorithm to compute lambda, a Box-Cox parameter used to determine the exact power transformation which will best de-correlate the variances and means of the groups formed by the independent variables. As a rule of thumb, if lambda is 1.0, no transformation is needed. A lambda of +.5 corresponds to a square root transform of the dependent variable; lambda of 0 corresponds to a natural log transform; -.5 corresponds to a reciprocal square root transform; and a lambda of -1.0 corresponds to a reciprocal transform. The Box-Cox transformation is not yet supported in SPSS.
However, Michael Speed of Texas A&M University has made available SPSS syntax code for Box-Cox transformation at http://www.stat.tamu.edu/ftp/pub/mspeed/stat653/spss/, with tutorials at http://www.stat.tamu.edu/spss.php.
See Box, G. E. P. and D. R. Cox (1964). An analysis of transformations. Journal of the Royal Statistical Society, B, 26, 211-252; see also Maddala, G. S. (1977). Econometrics. New York: McGraw-Hill (pp. 315-317); or Mason, R. L., R. F. Gunst, and J. L. Hess (1989). Statistical design and analysis of experiments with applications to engineering and science. New York: Wiley.
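For readers without a package that supports Box-Cox, the maximum-likelihood choice of lambda can be approximated by a simple grid search. The sketch below is a simplified, single-variable version: it treats the variable on its own rather than as the dependent in a regression or ANOVA model (where residual variance would replace the plain variance), requires strictly positive values, and uses a grid from -2 to 2 in steps of .05. The function names and grid are assumptions of this example.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform: (y^lambda - 1) / lambda, or ln(y) when lambda is 0."""
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1) / lam


def profile_log_likelihood(values, lam):
    """Normal profile log-likelihood of lambda (constants dropped), single-variable case."""
    n = len(values)
    t = [box_cox(v, lam) for v in values]
    m = sum(t) / n
    var = sum((u - m) ** 2 for u in t) / n
    return -0.5 * n * math.log(var) + (lam - 1) * sum(math.log(v) for v in values)


def best_lambda(values, lo=-2.0, hi=2.0, steps=81):
    """Grid value of lambda with the highest profile log-likelihood."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


# data that are symmetric after taking logs should yield a lambda near 0
sample = [math.exp(k / 10) for k in range(-20, 21)]
print(best_lambda(sample))
```

The resulting lambda is read as in the rule of thumb above: near 1 means no transformation, near .5 a square root, near 0 a natural log, and so on.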