
Correlation




    Overview

      Correlation is a bivariate measure of association (strength) of the relationship between two variables. It varies from 0 (no linear relationship) to 1 (perfect positive linear relationship) or -1 (perfect negative linear relationship). It is usually reported in terms of its square (r²), interpreted as the percent of variance explained. For instance, if r² is .25, then the independent variable is said to explain 25% of the variance in the dependent variable. In SPSS, select Analyze, Correlate, Bivariate; check Pearson.
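
      As a minimal sketch of the computation outside SPSS, the following Python code calculates Pearson's r and its square from the usual definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name pearson_r and the hours/score data are hypothetical and used for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied and exam score for six cases
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 60, 68, 70]

r = pearson_r(hours, score)
r_squared = r ** 2  # proportion of variance "explained"
print(f"r = {r:.3f}, r squared = {r_squared:.3f}")
```

      Swapping the two arguments gives the same value, which reflects the symmetry of correlation.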

      There are several common pitfalls in using correlation. Correlation is symmetrical and so provides no evidence of which way causation flows. If other variables also cause the dependent variable, then any covariance they share with the given independent variable in a correlation may be falsely attributed to that independent variable. Also, to the extent that there is a nonlinear relationship between the two variables being correlated, correlation will understate the relationship. Correlation will also be attenuated to the extent there is measurement error, including use of sub-interval data or artificial truncation of the range of the data. Correlation can also be a misleading average if the relationship varies depending on the value of the independent variable ("lack of homoscedasticity"). And, of course, atheoretical or post-hoc running of many correlations runs the risk that, at the .05 level, about 5% of the coefficients will be found significant by chance alone.
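
      Two of these pitfalls can be illustrated with a short Python sketch. The first part builds a hypothetical perfectly curvilinear relationship (y equals x squared, with x symmetric around zero) and shows that Pearson's r is essentially zero until x is transformed. The second part computes how many coefficients are expected to reach significance by chance and the probability of at least one such result. The latter calculation assumes the tests are independent, which is a simplification; all names and data are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Pitfall 1: a nonlinear relationship that linear correlation misses.
x = [i / 100 for i in range(-100, 101)]   # -1.00 to 1.00, symmetric around 0
y = [v ** 2 for v in x]                   # y is fully determined by x

r_linear = pearson_r(x, y)                # close to 0 despite perfect dependence
x_squared = [v ** 2 for v in x]
r_transformed = pearson_r(x_squared, y)   # close to 1 once x is squared

# Pitfall 2: running many correlations at a fixed significance level.
def expected_false_positives(n_tests, alpha=0.05):
    """Expected count of chance 'significant' results when every null is true."""
    return n_tests * alpha

def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(f"r (linear) = {r_linear:.3f}, r (after squaring x) = {r_transformed:.3f}")
print(f"expected false positives in 100 tests: {expected_false_positives(100):.1f}")
print(f"P(at least one in 20 tests) = {prob_any_false_positive(20):.3f}")
```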

      Besides Pearsonian correlation (r), the most common type, there are other types of correlation designed to handle the special characteristics of variables such as dichotomies, and there are other measures of association for nominal and ordinal variables. Regression procedures produce multiple correlation, R, which is the correlation of multiple independent variables with a single dependent. Also, there is partial correlation, which is the correlation of one variable with another, controlling both the given variable and the dependent for a third or additional variables. And there is part correlation, which is the correlation of one variable with another, controlling only the given variable for a third or additional variables. Each of these is discussed separately.
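
      The distinction between partial and part correlation can be sketched in Python using regression residuals. This is a minimal illustration that assumes a single control variable z and simulated data; the helper names (residualize, partial_corr, part_corr) are hypothetical. Partial correlation removes the linear effect of z from both x and y, while part correlation removes it only from x.

```python
import math
import random

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residualize(v, control):
    """Residuals of v after a simple least-squares regression on control."""
    n = len(v)
    mean_v = sum(v) / n
    mean_c = sum(control) / n
    slope = (sum((c - mean_c) * (a - mean_v) for c, a in zip(control, v))
             / sum((c - mean_c) ** 2 for c in control))
    intercept = mean_v - slope * mean_c
    return [a - (intercept + slope * c) for a, c in zip(v, control)]

def partial_corr(x, y, z):
    """Correlation of x and y with z controlled in BOTH variables."""
    return pearson_r(residualize(x, z), residualize(y, z))

def part_corr(x, y, z):
    """Correlation of x and y with z controlled in x ONLY."""
    return pearson_r(residualize(x, z), y)

# Simulated data (illustrative): z influences both x and y
rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(500)]
x = [0.6 * zi + rng.gauss(0, 1) for zi in z]
y = [0.5 * xi + 0.4 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

print(f"zero-order r = {pearson_r(x, y):.3f}")
print(f"partial r    = {partial_corr(x, y, z):.3f}")
print(f"part r       = {part_corr(x, y, z):.3f}")
```

      In absolute value the part correlation never exceeds the partial correlation, because the part correlation is computed against all of the variance in y rather than only the variance left after z is removed.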




Contents


Key Concepts and Terms


Assumptions

  1. Interval level data (for Pearsonian correlation).

  2. Linear relationships. It is assumed that the x-y scattergraph of points for the two variables being correlated can be better described by a straight line than by any curvilinear function. To the extent that a curvilinear function would be better, Pearson's r and other linear coefficients of correlation will understate the true correlation, sometimes to the point of being useless or misleading.

    Linearity can be checked visually by plotting the data. In SPSS, select Graphs, Scatter/Dots; select Simple Scatter; click Define; let the independent be the x-axis and the dependent be the y-axis; click OK. One may also view many scatterplots simultaneously by asking for a scatterplot matrix: in SPSS, select Graphs, Scatter/Dots, Matrix, Scatter; click Define; move any variables of interest to the Matrix Variable list; click OK.

  3. Homoscedasticity is assumed. That is, the error variance is assumed to be the same at any point along the linear relationship. Otherwise the correlation coefficient is a misleading average of points of higher and lower correlation.

  4. No outliers. Outlier cases can distort correlation coefficients, attenuating or inflating them. Scatterplots may be used to spot outliers visually (see above). A large difference between Pearsonian correlation and Spearman's rho may also indicate the presence of outliers.

  5. Minimal measurement error is assumed since low reliability attenuates the correlation coefficient. By definition, correlation measures the systematic covariance of two variables. Measurement error usually, with rare chance exceptions, reduces systematic covariance and lowers the correlation coefficient. This lowering is called attenuation. Restricted variance, discussed below, also leads to attenuation.

  6. Unrestricted variance. If variance is truncated or restricted in one or both variables due, for instance, to poor sampling, this can also lead to attenuation of the correlation coefficient. The same happens when the range of a variable is truncated, such as by dichotomizing continuous data or by reducing a 7-point scale to a 3-point scale.

  7. Similar underlying distributions are assumed for purposes of assessing strength of correlation. That is, if two variables come from unlike distributions, their correlation may be well below +1 even when data pairs are matched as perfectly as they can be while still conforming to the underlying distributions. Thus, the larger the difference in the shape of the distribution of the two variables, the more the attenuation of the correlation coefficient and the more the researcher should consider alternatives such as rank correlation. This assumption may well be violated when correlating an interval variable with a dichotomy or even an ordinal variable.

  8. Common underlying normal distributions, for purposes of assessing significance of correlation. Also, for purposes of assessing strength of correlation, note that for non-normal distributions the range of the correlation coefficient may not be from -1 to +1 (see Shih and Huang, 1992, Evaluating correlation with proper bounds, Biometrics, Vol. 48: 1207-1213). The central limit theorem demonstrates, however, that for large samples, indices used in significance testing will be normally distributed even when the variables themselves are not normally distributed, and therefore significance testing may be employed. The researcher may wish to use Spearman or other types of nonparametric rank correlation when there are marked violations of this assumption, though this strategy has the danger of attenuation of correlation.

  9. Normally distributed error terms. Again, the central limit theorem applies.
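
  The attenuation described under assumptions 5 and 6 can be illustrated with a small simulation. The following Python sketch is a demonstration under stated assumptions, not a prescribed procedure: it generates hypothetical normally distributed data with a population correlation of about .8, then recomputes r after adding random measurement error, after restricting the range of x, and after dichotomizing x at its median. The sample size, noise level, and cutoff are arbitrary choices.

```python
import math
import random
import statistics

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)
n = 20000

# Error-free scores; the weights 0.8 and 0.6 give a population r of about .8
x_true = [rng.gauss(0, 1) for _ in range(n)]
y_true = [0.8 * xi + 0.6 * rng.gauss(0, 1) for xi in x_true]
r_full = pearson_r(x_true, y_true)

# Assumption 5: random measurement error added to both variables
x_noisy = [xi + rng.gauss(0, 1) for xi in x_true]
y_noisy = [yi + rng.gauss(0, 1) for yi in y_true]
r_noisy = pearson_r(x_noisy, y_noisy)

# Assumption 6: restricted range, keeping only cases with x above 0.5
kept = [(xi, yi) for xi, yi in zip(x_true, y_true) if xi > 0.5]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

# Assumption 6: dichotomizing x at its median (1 = above median, 0 = otherwise)
median_x = statistics.median(x_true)
x_split = [1.0 if xi > median_x else 0.0 for xi in x_true]
r_dichotomized = pearson_r(x_split, y_true)

print(f"full data:          r = {r_full:.3f}")
print(f"measurement error:  r = {r_noisy:.3f}")
print(f"restricted range:   r = {r_restricted:.3f}")
print(f"dichotomized x:     r = {r_dichotomized:.3f}")
```

  In each of the three altered versions the coefficient falls below the value obtained from the full, error-free data, even though the underlying relationship between the two variables is unchanged.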


Frequently Asked Questions


Bibliography



Copyright 1998, 2008 by G. David Garson.
Last update 3/24/08.