MULTIPLE CORRELATION AND REGRESSION
by G. David Garson
Regression analysis is among the most common of the more
advanced statistical techniques used by political scientists.
Where the correlation coefficient, r, was a measure of the extent
to which data for two variables fell in a straight line when
plotted in a scatter diagram, regression has to do with obtaining
the formula that describes that line.
For example, the formula y = 2x + 1 describes a straight
line which, when plotted on paper, intercepts the y axis at 1 and
increases y by 2 units for every unit increase in x. More generally, a
straight line is described by the formula: y = bx + c, where y is
the dependent variable, x is the independent (predictor)
variable, c is the constant (equal to the y-intercept--where the
line crosses the y axis), and b is the regression coefficient.
In short, a regression coefficient is the slope of the
straight line that best describes a set of points in a scatter
diagram. (Later we will discuss curvilinear regression.) In the
sections which follow we discuss simple regression (one
independent variable), multiple regression (more than one
independent variable) and its cousin, multiple correlation, and
briefly touch on some related topics in regression analysis.
Correlation is enough when we only want to know the level of
association between variables, but when (as in model building) we
want to predict one variable from another, we must have the
actual equation of the line that describes the relationship: the
regression equation.
SIMPLE LINEAR REGRESSION
What criterion should be used to draw the "best" summary
line through points on a scatter diagram? In simple cases, one
might well approximate the best solution by simply inspecting the
diagram visually and drawing a line through it by intuition--the
freehand or "black thread" method--although the subjectivity of
this procedure makes it impossible to test "goodness." The best
line is considered to be the one for which the sum of squared
vertical deviations of the points from the line is minimized.
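This least-squares criterion has a simple closed-form solution in the two-variable case: b is the sum of cross-products of deviations divided by the sum of squared deviations of x, and c follows from the means. A minimal sketch in Python, using made-up data that lie exactly on y = 2x + 1:

```python
# Least-squares simple regression: b = cov(x, y) / var(x),
# c = mean(y) - b * mean(x). A minimal sketch in plain Python.

def simple_regression(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # The slope minimizes the sum of squared vertical deviations.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    c = my - b * mx  # the line passes through (mean of x, mean of y)
    return b, c

# Data generated exactly from y = 2x + 1 recover b = 2 and c = 1.
b, c = simple_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

Real data will not fall exactly on the line; the leftover scatter is what the standard error of estimate, discussed next, summarizes.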
STANDARD ERROR OF ESTIMATE, Syx
Computing the regression equation is not enough. We still
must decide how confident we should be about our predictions
using the equation. The standard error of estimate, Syx, may be
used on normally distributed interval data to answer this
question. If the standard error of estimate is 12.1, for
instance, this means that any y value (predicted dependent value)
estimated by the regression equation will be within plus or minus
1.96 times 12.1, or 24.2 units, of the real value about 95
percent of the time.
Plus or minus about two standard errors may be a large
amount compared to the range of the dependent variable. It is
important to know when this is so, so that we may assess how much
we wish to rely on our prediction equations under regression.
Standard error of estimate must be given along with regression
coefficients.
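The computation behind Syx is straightforward: take the square root of the sum of squared residuals divided by n - 2. A minimal Python sketch with made-up data:

```python
import math

def std_error_of_estimate(xs, ys, b, c):
    # Syx: root of the sum of squared residuals over n - 2 degrees of
    # freedom (two parameters, b and c, were estimated from the data).
    resid_ss = sum((y - (b * x + c)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(resid_ss / (len(xs) - 2))

# With the line y = x and residuals 1, 0, 1, 0: Syx = sqrt(2 / 2) = 1.
syx = std_error_of_estimate([0, 1, 2, 3], [1, 1, 3, 3], 1.0, 0.0)
```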
THE COEFFICIENTS COMPARED
What is the relationship between correlation, regression, and
standard error of estimate?
Comparing purpose, correlation measures the amount
(strength) of association; regression provides a prediction
equation to estimate the dependent variable from data on the
independent variable(s); and standard error of estimate gives us
confidence limits on such estimates.
Comparing magnitude, a high absolute correlation coefficient
means that a high proportion of the variance of the dependent
variable is explained by the independent variable; a high absolute regression
coefficient means that a large amount of change in the dependent
variable occurs when the independent variable changes; and a high
standard error of estimate means that there is much error
incurred in predicting the dependent variable from the inde-
pendent variable.
For simple correlation of two variables, the correlation
coefficient is the standardized regression coefficient, beta (B).
Beta is the regression coefficient, b, computed for standardized
data (which have a mean of 0 and a standard deviation of 1,
achieved by subtracting the mean and dividing by the standard
deviation).
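This identity is easy to verify numerically: standardize both variables, compute the regression slope on the standardized data, and compare it with Pearson's r. A sketch with made-up data:

```python
import math
import statistics

def slope(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def standardize(v):
    # Subtract the mean and divide by the standard deviation.
    m, s = statistics.mean(v), statistics.pstdev(v)
    return [(x - m) / s for x in v]

def pearson_r(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

xs, ys = [1, 2, 4, 7], [2, 3, 9, 10]
beta = slope(standardize(xs), standardize(ys))  # b on standardized data
r = pearson_r(xs, ys)                           # equals beta
```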
Note that the regression coefficient, b, is the slope of the
regression line. Remembering back to high school, the slope is
the amount y changes for each unit change in x. Note also that
there is no assurance that the slope would be the same for values
of x higher or lower than the observed values of x; that is, our
regression analysis pertains to predictions for the observed
range of x; x and y may have a different sort of relationship
outside this range. The range of x, of course, is the values x
may assume between the lowest and highest observed x values
inclusive. The regression coefficient, b, is the rate of change
in y over the observed range of x.
Finally, note that simple regression makes the same
assumptions as does product-moment correlation, r: interval data,
linear relationship, homoscedasticity, common (normal) underlying
distribution and variance. We also assume that the error term is
uncorrelated with the independent variable and is normally
distributed with a mean of zero. (The error term is the observed
value of y minus the value of y predicted by the regression
equation.)
What happens when assumptions are violated? Moderate
violations of the assumptions of homoscedasticity and of
normality of error term distribution appear to have little effect
in most situations, assuming a reasonably large sample size.
Substitution of ordinal for interval data, and other forms of
measurement error, do lead to serious consequences for regression
analysis estimates, however. Nonlinearity also seriously
undermines linear regression analysis, but, as will be discussed
later, curvilinear regression procedures exist to handle such
data.
Since regression estimates may be seriously distorted
because of measurement error, wherever possible, reliability
coefficients pertaining to one's measurement instruments should
be presented when regression analysis is undertaken.
MULTIPLE LINEAR REGRESSION AND MULTIPLE CORRELATION
Multiple regression is a way of predicting the dependent
variable from two or more independent variables. Multiple
correlation is a closely related statistic used in determining
the proportion of variance of the dependent variable explained by
two or more independent variables. Since computation of these
statistics is so laborious that it is always done by computer,
only the three-variable (one dependent, two independent) case
will be discussed.
If x1 is the dependent variable (e.g., number of labor
riots) and x2 and x3 are the independent variables (e.g.,
unemployment and real wages), then the formula for the multiple
regression line looks like this:
x1 = b12.3(x2) + b13.2(x3) + c1.23
The dependent variable equals the partial regression
coefficient of the dependent with the first independent variable,
controlling for the second, times the first independent variable,
plus a similar quantity for the second independent variable, plus
a constant.
The partial b coefficients in such a prediction equation are
called unstandardized regression coefficients. Because these are
not affected by the range of the data, as are correlation
coefficients, some have argued for their use as measures of
association in explaining change. Their disadvantage, however, is
that they cannot be compared to tell the relative importance of
each independent variable in predicting the dependent variable.
Therefore, standardized regression coefficients, also called
beta weights, are obtained by multiplying the partial b
coefficients by the ratio of the standard deviation of the
independent variable to the standard deviation of the dependent
variable.
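A numerical sketch of the three-variable case (the data here are simulated, not the labor-riot example; variable names follow the text's x1, x2, x3):

```python
import numpy as np

# Simulated data: x1 is the dependent variable, x2 and x3 the
# independent variables, with true coefficients 2.0 and -1.0.
rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 2.0 * x2 - 1.0 * x3 + rng.normal(scale=0.1, size=n)

# Unstandardized partial b coefficients and the constant, by least squares.
X = np.column_stack([x2, x3, np.ones(n)])
b12_3, b13_2, c = np.linalg.lstsq(X, x1, rcond=None)[0]

# Beta weights: each partial b times (sd of its predictor / sd of the
# dependent variable).
beta2 = b12_3 * x2.std() / x1.std()
beta3 = b13_2 * x3.std() / x1.std()
```

The b coefficients serve for actual prediction; the beta weights allow the two predictors' relative power to be compared.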
Beta weights and b coefficients are, of course, partial
regression coefficients whenever there are two or more
independent variables. Such partial coefficients "hold constant"
the effect of other independent variables in computing the
regression weight to be assigned to the given independent
variable. That is, such coefficients represent the regression of
the given variable on the residual values of the dependent
variable (residuals equal the original values minus the values
estimated by the regression equation involving the other
independent variables).
The relative importance of the independent (predictor)
variables in a regression equation is the ratio of their beta
weights. We use the b coefficients to make the actual
predictions, but we use the B coefficients to compare the
relative power of the independent variables.
Note that a relatively high beta coefficient does not
necessarily mean that the associated variable is the most
important. This is because regression analysis assumes a causal
model in which all relevant variables have been considered
explicitly. That is, the regression coefficient will be
relatively large if the associated variable has great apparent
effect on the dependent variable, or if unmeasured independent
variables correlated with such a variable have an effect on the
dependent variable.
MULTICOLLINEARITY
When two or more independent variables in regression
analysis are highly correlated (e.g., above .80), the associated
regression coefficients will have a large standard error. Since
we cannot rely on the regression coefficients in such a
situation, it becomes difficult or impossible to make causal
inferences. That is, when multicollinearity exists we cannot use
the ratio of the B weights to assess the relative importance of
the independent variables. When the correlation of two
independent variables is perfect (1.0), it is impossible to
separate their effects on the dependent variable by regression
analysis.
Time series data are particularly susceptible to
multicollinearity since many variables increase over time,
yielding high intercorrelations. It is also a common problem in
research on economic variables. When multicollinearity is
present, the researcher may choose to drop one of the highly
correlated variables from the analysis (but remember the
assumption noted above), or to conduct two or more analyses,
dropping different variables to note the effect.
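One common numerical diagnostic, not named in the text, is the variance inflation factor (VIF), 1/(1 - R2), computed by regressing one independent variable on the other(s). A sketch with simulated data in which x3 nearly duplicates x2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.1, size=n)  # x3 nearly duplicates x2

# With two predictors, R2 from regressing x2 on x3 is just r23 squared.
r23 = np.corrcoef(x2, x3)[0, 1]
vif = 1.0 / (1.0 - r23 ** 2)  # variance inflation factor for x2
```

A VIF far above 1 signals that the coefficient's standard error is being inflated by the intercorrelation; a conventional rule of thumb flags values around 10 or more.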
Neither approach constitutes a real "solution" to the
problem, however; high multicollinearity indicates the use of methods
other than regression when the researcher's purpose is causal
inference (as opposed to simple prediction). When regression
analysis is pursued in spite of high multicollinearity, the
researcher may be tempted into attaching undue theoretical
significance to large variations in regression coefficients that
are based on small differences in correlation.
In sum, regression analysis assumes that the independent
variables are uncorrelated with unmeasured independent variables,
not (or at least not highly) correlated with each other, and not
themselves influenced by two-way causation with the dependent
variable; moreover, these causal relations are assumed to be
linear and additive.
MULTIPLE CORRELATION
Multiple correlation, R, is the square root of the
proportion of the variance in the dependent variable "explained"
by the independent variables. Thus it is very closely related to
standardized multiple regression.
R-square is often called the coefficient of multiple
determination, just as r-squared is termed the coefficient of
determination: the percent of variance in the dependent variable
"explained" by the independent variable(s).
SIGNIFICANCE OF b COEFFICIENTS
Earlier we discussed the significance of b coefficients in
simple linear regression, and above we discussed the significance
test for multiple correlation. The test of significance for the
regression coefficient in multiple regression (i.e., the test for
partial regression coefficients and their corresponding B
weights) is the t-test. The significance of t should be .05 or
lower to retain the variable in the equation. Note: This
procedure is inappropriate when there is pretesting bias (e.g.,
when the b coefficients are from a regression equation from which
some variables have been dropped because of their insignificant t
levels on previous occasions).
STEPWISE REGRESSION
Stepwise regression is a method usually used with computers,
whereby a regression equation is computed with one independent
variable, a second equation is computed with two variables, then
a third variable may be added, etc. First a correlation matrix is
computed. The computer selects as the first independent variable
that variable with the strongest correlation with the dependent
variable and computes the first step regression equation.
It then computes a partial correlation matrix, using the
first independent variable as a control. The second independent
variable chosen is that with the strongest partial correlation
with the dependent variable, and the second step regression
equation is computed. Then another partial correlation matrix is
computed using the first two independent variables as controls;
the third independent variable chosen is the one with the highest
second-order partial correlation with the dependent variable. The
fourth and later independent variables are chosen in an analogous
manner.
TESTING THE SIGNIFICANCE OF ADDITIONAL VARIABLES IN STEPWISE
REGRESSION
Each step in stepwise regression yields a corresponding
coefficient of multiple determination. The step for two
independent variables yields the coefficient R2(1.23); the step
for (m - 1) independent variables yields the coefficient
R2(1.23...m). We may wish to know if adding one or more independent
variables significantly increases R2. For this we use the F
test, as in the F test for multiple correlation (note that that
is the test for the regression line as a whole).
Note that this test is affected by the order in which
variables are entered. The F test may show that adding a given
independent variable may raise the R2 level insignificantly, yet
were it to be entered in an earlier step an opposite finding
might be made. This would happen if the given independent
variable were entered after an independent variable with which it
was highly correlated.
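The F test for an increment in R2 can be written directly as F = ((R2full - R2reduced)/q) / ((1 - R2full)/(n - k - 1)), where q is the number of added predictors and k the number of predictors in the fuller equation. A sketch (the function name is our own):

```python
def incremental_f(r2_full, r2_reduced, n, k_full, k_added):
    # F for the gain in R2 when k_added predictors join a model that
    # then has k_full predictors; df = (k_added, n - k_full - 1).
    gain = (r2_full - r2_reduced) / k_added
    unexplained = (1.0 - r2_full) / (n - k_full - 1)
    return gain / unexplained

# Worked example: R2 rises from .40 to .50 when a second predictor is
# added with n = 23, so F = (0.1 / 1) / (0.5 / 20) = 4.0.
f = incremental_f(0.5, 0.4, 23, 2, 1)
```

The obtained F is compared with the critical value for (k_added, n - k_full - 1) degrees of freedom.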
RELATION OF STEPWISE REGRESSION TO PART CORRELATION
The coefficient of multiple determination, R-square, may be
interpreted as the sum of a series of correlation and part
correlation coefficients.
That is, the coefficient of multiple determination equals
the coefficient of determination between the first independent
variable and the dependent variable, plus the sum of part
correlations squared for each successive independent variable,
controlling for previously entered independent variables.
CORRECTION FOR SMALL SAMPLE SIZE
Correlation is sensitive to small sample size. Correlation
statistics computed on the basis of small samples tend to
overstate the degree of association. For small samples (e.g.,
under 100), the researcher should correct the coefficient of
multiple determination. Most computer programs report adjusted
R-square where appropriate.
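The usual adjustment (assumed here; the text does not give the formula) is adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables:

```python
def adjusted_r2(r2, n, k):
    # Shrinks R2 toward zero when n is small relative to the number
    # of independent variables k.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Worked example: R2 = .50 with n = 11 and two predictors shrinks to
# 1 - 0.5 * 10 / 8 = 0.375.
adj = adjusted_r2(0.5, 11, 2)
```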
REGRESSION ANALYSIS USING DUMMY VARIABLES
If a variable is a dichotomy (assumes only the values 0 and
1), it may be used in regression analysis even though it is not
interval in level (e.g., male = 0, female = 1). A nominal-level
variable that assumes more than two values may be treated as a
set of dichotomies (e.g., region, where "East" = 0 if the unit is
not in the East, 1 if it is; "South" = 0 if the unit is not in
the South, 1 if it is; and so on for as many regions as are
included in the study).
Through use of such dummy variables, regression analysis can
be extended to nominal-level data. Similarly, ordinal data (scale
ranks in a Guttman scale of conservatism, for example) or
interval data (age ranges, such as 10-19, 20-29, etc., for
example) may be converted to sets of dummy variables as well.
Dummy variable analysis assumes that the dummy variables
(e.g., regions, conservatism scale types, age ranges) are
mutually exclusive. The dependent variable is assumed to be
continuous and interval in level. The dummy variables used are
not exhaustive, however (i.e., one region, scale rank, or age
range must be omitted from the analysis).
The set of dummy variables is not exhaustive because of
reasons pertaining to the previous discussion of
multicollinearity. If a set of three dummy variables is
exhaustive (all units must receive a 1 score on one variable and
0 scores on the others), then when we know a unit's score on the
first two variables, we have predetermined its score on the third
variable. In other words, the third variable would be a linear
function of the first two, and this multicollinearity would lead
to a high standard error undermining the interpretability of the
regression coefficients.
Therefore, if there is one set of dummy variables, we must
either drop one to make the set nonexhaustive or set the constant
equal to zero. If there are two sets of dummy variables, we must
drop one variable from each set or drop one from one set and set
the constant equal to zero. For three sets we must drop one from
each, or drop two and set the constant at zero; etc. In general,
the preferred procedure is to drop one variable from each set.
Alternatively, two or more nominal sets can be combined into one
(e.g., sex and party identification can be combined into one set:
male Republican, male Democrat, male Independent, female
Republican, female Democrat, female Independent, from which set
we must drop one).
The omitted value in a dummy variable analysis is called the
reference category. For example, if we are analyzing the nominal
variable "party" in terms of dummy variables, such as Republican
(0 = unit is not Republican, 1 = unit is Republican), Democrat (0
= is not; 1 = is), and Other, we may choose not to include the
"Other" value. In this case "Other" is the reference category.
If the party dummy variables are the only ones in the
regression, then the constant (the y intercept) is the estimated
mean value of the dependent variable for the reference category,
"Other." The regression coefficients for the values Republican
and Democrat in the dummy variable analysis thus represent the
differences from the mean value associated with the reference
category. (That is, where the constant, c, is the estimate for
"other," the estimate for Republican is this constant plus the
regression coefficient for Republican). Dummy variable analysis
therefore includes information on the nature of the "omitted"
value.
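The party example can be checked numerically: with "Other" omitted, the constant should equal the mean of the reference category, and each dummy coefficient the difference of its group mean from that reference mean. A sketch with made-up scores:

```python
import numpy as np

# Made-up scores; "Other" is the omitted reference category.
party = ["Rep", "Rep", "Dem", "Dem", "Other", "Other"]
y = np.array([4.0, 6.0, 8.0, 10.0, 1.0, 3.0])

rep = np.array([1.0 if p == "Rep" else 0.0 for p in party])
dem = np.array([1.0 if p == "Dem" else 0.0 for p in party])

X = np.column_stack([rep, dem, np.ones(len(y))])
b_rep, b_dem, c = np.linalg.lstsq(X, y, rcond=None)[0]
# c is the "Other" mean (2.0); b_rep and b_dem are the differences of
# the Republican mean (5.0) and Democrat mean (9.0) from that mean.
```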
MULTIPLE CORRELATION FOR ORDINAL VARIABLES
In addition to the procedure for breaking an ordinal
variable into a set of dummy variables for regression analysis,
other measures of multiple association exist when we are simply
interested in obtaining a measure for ordinal data akin to R-squared (i.e., without also coming up with prediction equations).
KENDALL'S COEFFICIENT OF CONCORDANCE, W
W is a kind of multiple correlation for ordinal data,
extending the function of rho and tau to more than two variables.
CURVILINEAR REGRESSION
Curvilinear regression can easily involve complexities too
advanced for treatment in this text, particularly when the
researcher wishes to determine the best-fitting curve (regression
line) for a set of data. However, the procedures for checking to
see whether or not certain common curvilinear relationships might
not have more explanatory power than those uncovered by linear
regression are relatively straightforward.
Testing for Curvilinearity
In discussing the correlation ratio, eta, we observed
that curvilinearity existed to the extent that eta-square exceeded the
coefficient of determination, r2. But how much must this
difference be for us to conclude that curvilinear regression
would yield significantly better results than linear regression?
The answer is given by the F test of significance:
F = ((eta2 - r2) / (k - 2)) / ((1 - eta2) / (N - k))
where k is the number of categories (distinct values) of the
independent variable, N is the sample size, and the degrees of
freedom are (k - 2) and (N - k).
Note that the conventional procedure is to test for
curvilinearity of the relationship of the regression of the
dependent variable (y) predicted on the basis of the independent
variable (x), or "y on x" (byx). This procedure uses eta(yx). It
would also be possible, however, to test for the curvilinearity
of "x on y" (using eta(xy)) if this suited our research purposes.
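A numerical sketch of the curvilinearity test (this assumes the standard F formula comparing eta-square with r-square, with k distinct x categories and degrees of freedom (k - 2, N - k); the function name is our own):

```python
def curvilinearity_f(eta2, r2, n, k):
    # F = ((eta2 - r2) / (k - 2)) / ((1 - eta2) / (n - k)),
    # with degrees of freedom (k - 2, n - k).
    return ((eta2 - r2) / (k - 2)) / ((1.0 - eta2) / (n - k))

# Worked example: eta2 = .60, r2 = .40, n = 24, k = 4 categories gives
# F = (0.2 / 2) / (0.4 / 20) = 5.0.
f = curvilinearity_f(0.6, 0.4, 24, 4)
```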
Fitting a Curve
If one or more independent variables shows a significant
curvilinear relationship to the dependent variable in regression
analysis, a curved rather than a straight regression will usually
be preferable. One method for handling curvilinear relations has
already been discussed: the data may be transformed in a
curvilinear fashion, as by the use of logarithms.
An alternative method is to treat exponential and other
transformations of the independent variable(s) as additional
variables in regression analysis. Before discussing this method
it is necessary to describe briefly the nature of common
curvilinear regression equations.
A linear regression equation of the type discussed at the
beginning of the chapter takes the form:
y = bx+c
Here y is the dependent variable, x the independent, b the
regression coefficient, and c the constant.
A quadratic regression equation describes a curve with one
bend and takes the form:
y = b1x + b2x^2 + c
Political scientists would rarely use a higher-order
regression equation, but for illustrative purposes, the
third-degree regression equation, called a cubic equation and
describing a curve with two bends, takes the following form:
y = b1x + b2x^2 + b3x^3 + c
The highest possible order of equation is (k-1), where k is
the number of values assumed by the independent variable, x (in
such an equation, the last term before the constant would be
b(k-1) x^(k-1)). Eta-square may be thought of as the coefficient of
multiple determination, R-square, for the highest-possible-order
curvilinear regression.
How do we know which order polynomial to use? The rule is to
use the lowest one that explains all but an insignificant amount
of the explainable variance in the dependent variable. Ordinarily
this will be the quadratic order polynomial.
Procedure
To compute curvilinear regression, use the observed values
of x to generate x^2, x^3, and x^4 (one could keep going on up
to x^(k-1), but in practice it is almost always wasted effort to
worry about relationships beyond the fourth power and, indeed, it
is rare to use anything beyond the second power in political
science). We may then treat the data for x^2 as the data for a
second variable, the data for x^3 as data for a third variable,
and the data for x^4 as those for a fourth variable. Then stepwise
multiple regression can be performed on these four "variables,"
three of which are functions of the first.
For each step in stepwise regression we will get a value for
R2. When R2 increases only an insignificant amount, we know it is
not worth considering the variable added in that step. The
significance test for difference of R2 values was given earlier
in this chapter.
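A sketch of the procedure for one independent variable, using simulated data generated from a quadratic: the linear step leaves explainable variance behind, while the quadratic step recovers essentially all of it (numpy's polyfit stands in for the stepwise program):

```python
import numpy as np

# Simulated data generated from a quadratic relationship.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x - 0.5 * x ** 2

def r2(y_obs, y_hat):
    # Proportion of variance in y explained by the fitted values.
    ss_res = ((y_obs - y_hat) ** 2).sum()
    ss_tot = ((y_obs - y_obs.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# First step: linear fit; second step: add the x^2 term (degree 2).
r2_lin = r2(y, np.polyval(np.polyfit(x, y, 1), x))
r2_quad = r2(y, np.polyval(np.polyfit(x, y, 2), x))
```

Here the jump from r2_lin to r2_quad is large, so the quadratic term earns its place; a cubic step would add essentially nothing.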
For multiple curvilinear regression, we also compute the
values for the powers of the additional variables. Since
quadratic (second-order) relations are generally the highest
considered in political science, we are usually using stepwise
regression to find the coefficients in the following equation, in
which y is the dependent variable and the xi are the independent
variables:
y = b1x1 + b2x1^2 + b3x2 + b4x2^2 + ... + b(2n-1)xn + b(2n)xn^2 + c
As before, the values for the second power of the
independent variables are treated as additional variables.
Note that in addition to powers of the independent
variables, other transforms such as reciprocals or logarithms
might have been used.
Note also that the curve, whether linear or curvilinear,
will describe a line going through a set of dots representing
units of data in a way that maximizes R2. But such a line will be
more reliable for those portions of the curve that go through
many dots (are based on many units or observations) than for
those portions (usually the extremes) based on few units. The
curve has no necessary reliability for portions based on no units
(i.e., extrapolations outside the range of the observed data).
As mentioned above a major method for curvilinear regression
is to "straighten" the data first by use of a logarithmic
transformation.
MULTICOLLINEARITY IN CURVILINEAR REGRESSION
While by definition a given variable, x, and its powers, x^2,
x^3, . . ., x^n, will not be linearly related, the regression
technique outlined above may nonetheless involve some degree of
multicollinearity. An alternative procedure is to undertake
separate regressions using x, x^2, . . ., x^n each in turn in
order to assess the relative effect of each power function on the
prediction of the dependent variable. Similarly, separate
multiplicative relations, x^a, x^b, . . ., x^z, may be assessed
in separate regressions.
This separate assessment gives us an idea of the relative
predictive importance of the various power functions of a
variable in polynomial regression analysis. To assess whether the
addition of a power of a variable to a regression involving other
powers of that variable is useful, the F test of difference of
coefficients of multiple determination may be used, as discussed
earlier. To the extent that there is multicollinearity, however,
the relative sizes of the beta weights are unreliable guides to the
predictive importance of the various power functions of a given
variable. That must be assessed through separate regressions as
indicated above.
CANONICAL CORRELATION
Multiple regression and multiple correlation involve the
relation of several independent variables to one dependent
variable. Canonical correlation is an extension of multiple
correlation to cover the case of the relation of a set of several
independent variables to another set of several dependent vari-
ables. While its computation is too complex to detail here, its
rationale and interpretation can be set forth.
Using least-squares regression techniques, one can construct
a composite, that is, a linear function of a given set of
variables. The squared canonical correlation, R2c, can be thought
of as the coefficient of multiple determination (R-square)
between two composites. Paul Horst has extended this to multiple
canonical correlation, for the relationship among more than two
composites.
When canonical correlation is computed for a set of
independent and a set of dependent variables, several
coefficients are generated. The first one is the largest,
representing the percent of variance shared by the two
composites. This coefficient would be the same regardless of
which of the two sets of variables is considered the dependent
variable set. A second canonical correlation may then be
computed for the variance remaining after this first step.
Similarly, a third coefficient may
be computed for the variance remaining after the second step, and
so on.
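The text does not give the computation, but one standard way (assumed here) obtains the canonical correlations as the singular values of the product of orthonormal bases for the two centered variable sets. A sketch with simulated data sharing one source of variance:

```python
import numpy as np

def canonical_correlations(X, Y):
    # Center each set, form orthonormal bases for the two column
    # spaces, and take the singular values of their cross-product.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

# Simulated sets: the first dependent measure shares a source of
# variance with the first independent measure; the second is noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
Y = np.column_stack([X[:, 0] + rng.normal(scale=0.1, size=100),
                     rng.normal(size=100)])
r = canonical_correlations(X, Y)  # sorted largest first
```

As the text describes, the first coefficient is large (the shared source of variance) and the second is small, reflecting only chance association.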
Chi-square procedures exist for testing the significance of
each of these successively smaller R2 coefficients. Often only
the first coefficient is found to be significant, and in
political science it would be very rare to find more than the
third such coefficient to be significant.
The first canonical coefficient is interpreted as a
coefficient of shared variance of the independent and dependent
composites on the basis of the first source of variance in the
two sets. The second coefficient has the same interpretation for
the second source of variance, and so on. That is, the several
canonical correlation coefficients can be interpreted as a series
of successively less important dimensions of the relationship of
the two sets of variables.
For example, we might seek to compute canonical correlation
for a set of measures of political participation (the dependent
set: voter turnout by district, number of candidates for given
offices, proportion of citizens contributing to campaign costs,
etc.) and a set of measures of social characteristics (the
independent set: income, education, ethnic composition, etc., for
the districts in the study). We might find two significant
canonical correlation coefficients, the first equal to .60 and
the second equal to .30.
These would be interpreted as two dimensions (two sources)
of the variation between the two sets of variables. To name these
two dimensions we would have to look at the canonical vector
loadings (these correspond roughly to the regression
coefficients, though they must be interpreted with greater
caution). These loadings consist of a list of coefficients for
each of the variables for a given coefficient.
Just as the beta weights can be compared in multiple
regression analysis to assess the relative importance of
predictor variables, so the vector loadings for standardized data
can be compared to assess the relative importance of the
different variables for a given R2. By noting which variables are
loaded most heavily on the first R2 we are given a basis for
labelling this dimension (source) of variance.
Using this reasoning we might find, for instance, that for
the first R2c, the most heavily loaded independent variables were
income and occupational status, and the most heavily loaded
dependent variables were campaign contributions and extent of
political advertising in local media. On a nonmathematical basis
we might infer that this represented the economic dimension of
variance. Similarly, we might use the loadings on the second
canonical coefficient to interpret the second source of variance
for the two sets of variables.
Thus canonical correlation enables the researcher to obtain
a measure of the degree of association between two sets of
variables (usually conceived as an independent and dependent set)
and to do so for each of the significant sources of variance for
the two sets. Uncovering a small number of sources of variation
or dimensions that characterize a large number of variables is a
feature that canonical correlation has in common with factor
analysis, discussed in the next chapter.
Both procedures may be used in many cases. In canonical
correlation the researcher specifies in advance which variables
belong in which sets, and then seeks to measure the correlation
of these sets on one or more dimensions. In factor analysis, in
contrast, the researcher does not specify in advance which
variables belong in which sets.
Factor analysis is a procedure for uncovering the smallest
number of underlying dimensions (factors) that explain all but an
insignificant amount of the explainable variance of all the
variables. In this procedure the factor loadings (which
correspond in a loose way to vector loadings in canonical
correlation and beta weights in multiple regression) may be used as
criteria for grouping the variables into sets, with sets
corresponding to the factors.