In the example below, dropping any item will lower alpha. This is especially true of "RS Highest Degree", which in the table further below is shown correlated at the .872 level with "Highest Year of School Completed." Correlations over .80 may signal multicollinearity, which might wrongly lead the researcher to drop "RS Highest Degree" from the scale and then to conclude that the remaining three items do not constitute a scale suitable even for exploratory purposes (since .539 is below the .60 cutoff level). However, high correlation among the constituent items of a scale is not considered multicollinearity, because the scale score, not the separate item scores, will appear in the regression (or other) analysis.
In SPSS, select Analyze, Scale, Reliability Analysis; list your variables; click Statistics; select Item, Scale, Scale if item deleted; select Split-half from the Model drop-down list. Click OK. SPSS will take the first half of the items, as listed in the dialog box, as the first split form and the second half as the second split form. If there is an odd number of items, the first form will be one item longer than the second form. You can also use the Paste button to call up the Syntax window and alter the /MODEL=SPLIT parameter to be /MODEL=SPLIT n, where n is the number of items in the second form.
In the example above, SPSS has divided the four-item education scale into two subscales. As shown in the table footnotes, the first subscale is Highest Year of School Completed plus Father's Highest Degree. The second subscale is Mother's Highest Degree plus Respondent's Highest Degree. Comparing scores on these two subscales yields a Spearman-Brown reliability coefficient of .915. On a split-half basis, the researcher concludes the four-item education scale is reliable.
The Pearson correlation of split forms estimates the half-test reliability of an instrument or scale. The Spearman-Brown "prophecy formula" predicts what the full-test reliability would be, based on the half-test correlation. For positive correlations below 1.0, this coefficient will be higher than the half-test reliability coefficient. For halves of equal length, it is easily hand-calculated as twice the half-test correlation divided by the quantity 1 plus the half-test correlation. In SPSS, two Spearman-Brown split-half reliability coefficients will appear in the "Reliability Statistics" portion of the output when Split-half is selected from the Model drop-down list: (1) "Equal length" gives the estimate of the reliability if both halves had equal numbers of items, and (2) "Unequal length" gives the reliability estimate assuming unequal numbers.
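The equal-length hand calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic (twice the half-test correlation over one plus that correlation); the function names are hypothetical and the input values are arbitrary, not taken from the SPSS output discussed here.

```python
def spearman_brown(half_r: float) -> float:
    """Equal-length Spearman-Brown estimate of full-test reliability,
    given the correlation between the two half-tests."""
    return 2.0 * half_r / (1.0 + half_r)


def half_r_from_full(full_rel: float) -> float:
    """Inverse of the formula: the half-test correlation implied by a
    given full-test Spearman-Brown coefficient."""
    return full_rel / (2.0 - full_rel)


if __name__ == "__main__":
    # Arbitrary illustrative half-test correlations.
    for r in (0.30, 0.50, 0.70, 0.90):
        print(f"half-test r = {r:.2f} -> full-test estimate = {spearman_brown(r):.3f}")
```

For any half-test correlation strictly between 0 and 1 the estimate exceeds the input, which is why the full-test coefficient is reported as higher than the half-test coefficient.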
DATASET ACTIVATE DataSet1.
RELIABILITY
/VARIABLES=educ padeg madeg degree
/SCALE('ALL VARIABLES') ALL
/MODEL=SPLIT
/STATISTICS=SCALE ANOVA.
By default, the SPSS split half algorithm makes the first two items (educ, padeg) into subscale 1 and makes the last two items (madeg, degree) into subscale 2. To convert to odd-even split half format, the syntax of the /VARIABLES statement must be changed as illustrated below:
/VARIABLES=educ madeg padeg degree
Note that the order of the variables has been rearranged to list the 1st and 3rd variables first, followed by the 2nd and 4th. This will make subscale 1 be the odd items and subscale 2 be the even items.
As a result, the Spearman Brown split half reliability coefficient will differ, but the ANOVA table for the model as a whole will not change.
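To see why the choice of split changes the coefficient, the following Python sketch computes the half-test correlation and the equal-length Spearman-Brown estimate for both a first-half/second-half split and an odd-even split. The data are made-up scores, not General Social Survey values, and the helper names are hypothetical; the point is only that the same items grouped differently give different half-test correlations.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def split_half(rows, first_idx, second_idx):
    """Return (half-test r, equal-length Spearman-Brown) for one split.
    Each form score is the sum of the items assigned to that form."""
    form1 = [sum(row[i] for i in first_idx) for row in rows]
    form2 = [sum(row[i] for i in second_idx) for row in rows]
    r = pearson(form1, form2)
    return r, 2 * r / (1 + r)


# Hypothetical scores: one row per respondent, columns in the order
# educ, padeg, madeg, degree (invented numbers for illustration only).
rows = [
    [12, 1, 1, 1],
    [16, 3, 2, 3],
    [10, 0, 1, 0],
    [14, 1, 0, 2],
    [18, 4, 3, 4],
    [12, 2, 1, 1],
]

first_second = split_half(rows, [0, 1], [2, 3])  # items 1-2 vs items 3-4
odd_even = split_half(rows, [0, 2], [1, 3])      # items 1 and 3 vs items 2 and 4

print("first/second half:", first_second)
print("odd/even:         ", odd_even)
```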
Test-retest methods are disparaged by many researchers as a way of gauging reliability. Among the problems are that short intervals between administrations of the instrument will tend to yield estimates of reliability which are too high. There may be invalidity due to a learning/practice effect (subjects learn from the first administration and adjust their answers on the second). There may be invalidity due to a maturation effect when the interval between administrations is long (the subjects change over time). The bother of having to take a second administration may cause some subjects to drop out of the pool, leading to nonresponse biases. Note, however, that test-retest designs are still widely used and published, and the practice has some support. McKelvie (1992), for instance, reports that reliability estimates under test-retest designs are not inflated due to memory effects. Researchers using test-retest reliability must address the special validity concerns, but may decide to go ahead if warranted.
Counts in diagonal cells will reflect inter-rater agreement and cells off the diagonal will represent disagreements. Kappa is a function of the ratio of agreements to disagreements in relation to expected frequencies. In SPSS it is not available in the Reliability module. Rather, one must obtain it from the Crosstabs procedure (Kappa is a choice under the Statistics button in Crosstabs; it is not a default option). In SAS, weighted and unweighted kappa are computed by the FREQ procedure.
Interpretation. By convention, a Kappa > .70 is considered acceptable inter-rater reliability, but this depends highly on the researcher's purpose. Another rule of thumb is that K = 0.40 to 0.59 is moderate inter-rater reliability, 0.60 to 0.79 substantial, and 0.80 or above outstanding (Landis & Koch, 1977). For inter-rater reliability of a set of items, such as a scale, one would report mean Kappa.
Manual computation: let a = the sum of counts on the diagonal, reflecting agreements. Let e = the sum of expected counts on the diagonal, where expected is calculated as [(row total * column total)/n] for each cell on the diagonal. Let n = the total number of ratings (observations). Kappa then equals the surplus of agreements over expected agreements, divided by the number of expected disagreements. That is, K = (a - e)/(n - e). Fleiss and Cohen (1973) have shown ICC, discussed below, is mathematically equivalent to weighted Kappa when quadratic weights are used.
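The manual computation K = (a - e)/(n - e) can be sketched in Python as follows. The function name and the 2-by-2 table of counts are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def cohen_kappa(table):
    """Unweighted kappa from a square agreement table of counts
    (rows = first rater's categories, columns = second rater's)."""
    k = len(table)
    n = sum(sum(row) for row in table)                      # total ratings
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    a = sum(table[i][i] for i in range(k))                  # observed agreements
    e = sum(row_tot[i] * col_tot[i] / n for i in range(k))  # expected agreements
    return (a - e) / (n - e)


# Made-up counts: a = 35, e = 15 + 10 = 25, n = 50, so K = 10/25.
table = [[20, 5],
         [10, 15]]
print(round(cohen_kappa(table), 3))  # 0.4
```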
Weighted Kappa: For ordinal rankings or better, one can weight each cell in the agreement/disagreement table by a weight between 0 and 1, where 1 corresponds to the row and column categories being the same and 0 corresponds to the categories being maximally dissimilar.
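A minimal sketch of weighted kappa follows. The text above specifies only that weights run from 1 (same category) to 0 (maximally dissimilar categories); the linear weighting scheme used here as the default, 1 - |i - j|/(k - 1), is one common choice and is an assumption of this example, as are the function name and the sample table.

```python
def weighted_kappa(table, weights=None):
    """Weighted kappa from a square table of counts with ordered categories.
    weights[i][j] is an agreement weight in [0, 1]; defaults to linear weights."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    if weights is None:
        # Assumed linear scheme: 1 on the diagonal, 0 for the most distant pair.
        weights = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    # Weighted observed agreement and weighted chance-expected agreement.
    observed = sum(weights[i][j] * table[i][j]
                   for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row_tot[i] * col_tot[j] / n
                   for i in range(k) for j in range(k))
    return (observed - expected) / (n - expected)


# Made-up 3-category ratings from two raters.
table = [[10, 3, 1],
         [2, 12, 3],
         [0, 4, 9]]
print(round(weighted_kappa(table), 3))
```

With only two categories the linear weights reduce to 1 on the diagonal and 0 elsewhere, so the result matches unweighted kappa.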
Sample size: ICC vs. Pearson r: When there are just two ratings, ICC is preferred over Pearson's r only when sample size is small (< 15). Since Pearson's r makes no assumptions about rater means, a t-test of the significance of r reveals if inter-rater means differ. For small samples (< 15), Pearson's r overestimates test-retest correlation, and in this situation intraclass correlation is used instead.
Walter, Eliasziw, & Donner (1998) set optimal sample size for ICC based on desired power level, magnitude of the predicted ICC, and the lower confidence limit. They concluded that if the researcher used the customary .95 confidence level and the .80 power level, and had two ratings per subject, then the sample size needed to show that the estimated ICC was different from 0 would range from 5 when the estimated ICC was .9 to 616 when it was only .1; for three ratings, the corresponding range was 3 to 225; for four ratings, 3 to 123; for five ratings, 3 to 81; for 10 ratings, 2 to 26; for 20 ratings, 2 to 11 (pp. 106-107).
Bonett (2002: 1334) investigated the sample size issue for ICC, concluding that optimum sample size is a function of the size of the intraclass correlation coefficient and the number of ratings per subject, as well as the desired significance level (alpha) and desired width (w) of the confidence interval. For alpha = .05 (the .95 confidence level) and w = .2, Bonett concluded that the optimal sample size for two ratings varied from 15 for ICC = .9 to 378 for ICC = .1; for three ratings, it varied from 13 to 159; for five ratings, 10 to 64; and for 10 ratings, 8 to 29. That is, the fewer the ratings and the smaller the ICC, the larger the needed sample size. For this example, with 906 people rating 7 items, described above, sample size is more than adequate.
Data setup: In using intraclass correlation for inter-rater reliability, one constructs a table in which optionally column 1 is the target id (1, 2, ..., n) and subsequent columns are the raters (A, B, C, ...). It may be necessary to transpose the data (Data, Transpose in the SPSS menus) to make the raters be the columns, as was done below for the example data on 906 respondents rating TV shows on 7 items.
The row variable is the target of the ratings. The target might be attributes (in this example the target is TV show attributes), or it might be persons who are rated (Subject1, Subject2, etc.) or neighborhoods which are rated (E, W, N, S), for instance. The cell entries after the first id column are the raters' ratings of the target on some interval variable or interval-like variable, such as some Likert scale, or, in this example, a binary 0/1 scale. The purpose of ICC is to assess the inter-rater (column) effect in relation to the grouping (row) effect, using two-way ANOVA.
Interpretation: ICC is interpreted similarly to Kappa, discussed above. ICC will approach 1.0 when there is no variance within targets (for any target, all raters give the same ratings), indicating total variation in measurements is due solely to the target (ex., TV attribute) variable. That is, ICC will be high when any given row tends to have the same score across the columns (which are the raters). For instance, one may find all raters rate an item the same way for a given target, indicating total variation in the measure of a variable depends solely on the values of the variable being measured -- that is, there is perfect inter-rater reliability. Put another way, ICC may be thought of as the ratio of variance explained by the independent variable to total variance, where total variance is the explained variance plus variance due to the raters plus residual variance. ICC is 1.0 only when there is no variance due to the raters and no residual variance to explain.
In SPSS, select Analyze, Scale, Reliability Analysis; select your variables; click Statistics; in the Descriptives group, select Item and select Intraclass correlation coefficient; select a model from the Model drop-down list (ex., One-way random); select a type from the Type drop-down list (ex., consistency). Continue. OK. Models and Types are discussed below.
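Single versus average measures: Each model has two versions of the intraclass correlation coefficient: single measure reliability, which applies to the ratings of an individual rater, and average measure reliability, which applies to the mean of the ratings across all raters.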
Average measure reliability is close to Cronbach's alpha. Average measure reliability for either two-way random effects or two-way mixed models will be the same as Cronbach's alpha. In this example, for the one-way random model, the ICC and Cronbach's alpha differ, but not greatly.
Average measure reliability requires a reasonable number of judges to form a stable average. The number of judges required is estimated beforehand as nj = [ICC* × (1 - rl)] / [rl × (1 - ICC*)], where nj is the number of judges needed; rl is the lower bound of the (1 - a)*100% confidence interval around the ICC, discovered in a pilot study; and ICC* is the minimum level of ICC acceptable to the researcher (ex., .80).
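The judges formula is simple enough to check by hand, but a short Python sketch makes the grouping of terms explicit. The function and argument names are hypothetical, and the input values below are arbitrary rather than results from any pilot study.

```python
def judges_needed(icc_target: float, lower_bound: float) -> float:
    """Estimated number of judges: ICC*(1 - rl) / [rl(1 - ICC*)],
    where icc_target is the minimum acceptable ICC (ICC*) and
    lower_bound is the lower confidence limit of the pilot ICC (rl)."""
    return icc_target * (1.0 - lower_bound) / (lower_bound * (1.0 - icc_target))


# Arbitrary illustration: target ICC of .80, pilot lower bound of .50.
print(round(judges_needed(0.80, 0.50), 2))  # 4.0
```

The function returns an unrounded value; in practice one would round up to a whole number of judges, taking care with floating-point results that land a hair above an integer.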
Models: ICC varies depending on whether the judges are all judges of interest or are conceived as a random sample of possible judges, and whether all targets are rated or only a random sample, and whether reliability is to be measured based on individual ratings or mean ratings of all judges. These considerations give rise to six forms of intraclass correlation, described in the classic article by Shrout and Fleiss (1979). In SPSS, these types are selected under the Model button of the Reliability dialog and under the Type drop-down list (3 models times 2 types = the six forms of ICC).
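Types: Under the Model button of the SPSS Reliability dialog, the Type drop-down list allows the researcher to specify one of two types of ICC computation: consistency or absolute agreement. Absolute agreement counts systematic differences between raters as error, whereas consistency does not.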
Use in other contexts. ICC is sometimes used outside the context of inter-rater reliability. In general, ICC is a coefficient which approaches 1.0 as the between-groups effect (the row effect) becomes very large relative to the within-groups effect (the column effect), whatever the rows and columns represent. In this way ICC is a measure of homogeneity: it approaches 1.0 when any given row tends to have the same values for all columns. For instance, let columns be survey respondents and let rows be Census block numbers, and let the attribute measured be white=0/nonwhite=1. If blocks are homogeneous by race, any given row will tend to have mostly 0's or mostly 1's, and ICC will be high and positive. As a rule of thumb, when the row variable is some grouping or clustering variable, such as Census areas, ICC will approach 1.0 more and more closely as the clusters become smaller and more compact (ex., as one goes from metropolitan statistical areas to Census tracts to Census blocks). ICC is 0 when within-groups variance equals between-groups variance, indicative of the grouping variable having no effect. Though less common, note that ICC can become negative when the within-groups variance exceeds the between-groups variance.
If Tukey's test shows multiplicative interaction, any model computing scores for cases based on the scale must include the case main effect, the item main effect, and the case-by-item interaction effect. In a footnote to the Tukey test output, SPSS prints an estimate of the power to which items in a set would need to be raised in order to be additive. (Warning: while transforms may eliminate non-additivity, raising item scores to too high a power will generate large values for all subjects, obscuring differences among subjects.)
In SPSS, select Analyze, Scale, Reliability Analysis; click Statistics; check Tukey's test of additivity. The output below is General Social Survey data for the four education items in illustrations above. Since Tukey's test is significant, multiplicative interaction is indicated for these data.
The Spearman correction for attenuation of a correlation: let rxy* be corrected r for the correlation of x and y; let rxy be the uncorrected correlation; then rxy* is a function of the reliabilities of the two variables, rxx and ryy:
rxy* = rxy / SQRT(rxx × ryy)
This formula will result in an estimated true correlation (rxy*) which is higher than the observed correlation (rxy), and all the more so the lower the reliabilities. Corrected r may be greater than 1.0, in which case it is customarily rounded down to 1.0.
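As a quick numerical illustration, the Spearman correction divides the observed correlation by the square root of the product of the two reliabilities. The Python sketch below applies that formula and the customary cap at 1.0; the function name, the cap argument, and the input values are hypothetical.

```python
from math import sqrt


def correct_for_attenuation(rxy: float, rxx: float, ryy: float,
                            cap: bool = True) -> float:
    """Attenuation-corrected correlation: rxy / sqrt(rxx * ryy).
    If cap is True, values above 1.0 are rounded down to 1.0."""
    corrected = rxy / sqrt(rxx * ryy)
    if cap and corrected > 1.0:
        corrected = 1.0
    return corrected


# Arbitrary values: observed r = .40, both reliabilities = .64.
print(round(correct_for_attenuation(0.40, 0.64, 0.64), 3))  # 0.625
# Low reliabilities can push the uncapped value past 1.0.
print(round(correct_for_attenuation(0.90, 0.50, 0.50, cap=False), 3))  # 1.8
```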
Note that use of attenuation-corrected correlation is the subject of controversy (see, for ex., Winne & Belfry, 1982). Moreover, because corrected r will no longer have the same sampling distribution as r, a conservative approach is to take the upper and lower confidence limits of r and compute corrected r for both, giving a range of attenuation-corrected values for r. However, Muchinsky (1996) has noted that attenuation-corrected correlations, being not directly comparable with uncorrected correlations, are therefore not appropriate for use with inferential statistics in hypothesis testing, and this would include taking confidence limits. Still, Muchinsky and others acknowledge that the difference between a correlation and attenuation-corrected correlation may be useful, at least for exploratory purposes, in assessing whether a low correlation is low because of unreliability of the measures or because the measures are actually uncorrelated.
One situation in which negative reliability might occur is when the scale items represent more than one dimension of meaning, and these dimensions are negatively correlated, and one split half test is more representative of one dimension while the other split half is more representative of another dimension. As Krus & Helmstadter point out, factor analyzing the entire set of items first would reveal if the set of items is plausibly conceptualized as unidimensional.
A second scenario for negative reliability is discussed by Magnusson (1966: 67), who notes that when true reliability approaches zero and sample size is small, random disturbance in the data may yield a small negative reliability coefficient.
In the case of Cronbach's alpha, Nichols (1999) notes that values less than 0 or greater than 1.0 may occur, especially when the number of cases and/or items is small. Negative alpha indicates negative average covariance among items, and when sample size is small, misleading samples and/or measurement error may generate a negative rather than positive average covariance. The more the items measure different rather than the same dimension, the greater the possibility of negative average covariance among items and hence negative alpha.
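The link between negative average covariance and negative alpha can be seen in a small Python sketch. It uses the standard variance form of Cronbach's alpha, k/(k - 1) times (1 - sum of item variances / variance of the total score), which is not spelled out in the passage above; the function name and both data sets are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of cases, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up data in which the items rise and fall together.
consistent = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 4]]
# Made-up data in which item 2 runs opposite to item 1, so the
# average inter-item covariance is negative.
mixed = [[1, 5, 2], [2, 4, 4], [3, 3, 1], [4, 2, 5], [5, 1, 3]]

print(round(cronbach_alpha(consistent), 3))
print(round(cronbach_alpha(mixed), 3))
```

When the summed item variances exceed the variance of the total score, which requires negative covariance among items, the term in parentheses goes below zero and so does alpha.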
In SPSS, select Analyze, Scale, Reliability Analysis; select your items; click Statistics; in the Descriptives area, select Item, Scale, Scale if item deleted; in Summaries, select summary statistics (Means, Variances, Covariances, Correlations); and in the ANOVA Table group, select Cochran chi-square. Continue. OK.
Cochran's Q is discussed further in the section on significance tests for more than two dependent samples.
Derivation of the ICC formula, following Ebel (1951: 409-411): Let A be the true variance in subjects' ratings due to the normal expectation that different subjects will have different true scores on the rating variable. Let B be the error variance in subjects' ratings attributable to inter-rater unreliability. The intent of ICC is to form the ratio, ICC = A/(A + B). That is, intraclass correlation is to be true inter-subject variance as a percent of total variance, where total variance is true variance plus variance attributable to inter-rater error in classification. B is simply the mean-square estimate of within-subjects variance (variance in the ratings for a given subject by a group of raters), computed in ANOVA. The mean-square estimate of between-subjects variance equals k times A (the true component), where k is the number of raters, plus B (the inter-rater error component), since each mean contains a true component and an error component.
Given B = mswithin, and given msbetween = kA + B, so that A = (msbetween - mswithin)/k, substituting these equalities into the intended equation (ICC = A/[A + B]) reduces the equation for ICC to the formula for the most-used version of intraclass correlation (Haggard, 1958: 60):
ICC = (msbetween - mswithin) / [msbetween + (k - 1) × mswithin]
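The derivation can be checked numerically. Substituting A = (msbetween - mswithin)/k and B = mswithin into A/(A + B) gives (msbetween - mswithin)/[msbetween + (k - 1)mswithin], and the Python sketch below computes that quantity from a one-way ANOVA decomposition. The function name and the tiny rating tables are hypothetical, and the sketch assumes every subject is rated by the same number of raters.

```python
def icc_oneway(ratings):
    """One-way intraclass correlation from mean squares.
    ratings: one row per subject, one column per rater (k ratings each)."""
    n = len(ratings)          # number of subjects (rows)
    k = len(ratings[0])       # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-subjects and within-subjects sums of squares.
    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(ratings, row_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Raters agree perfectly on each subject: no within-subject variance.
print(icc_oneway([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))   # 1.0
# Subjects are indistinguishable and all variance is among raters.
print(icc_oneway([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))   # -0.5
```

The second call illustrates how the coefficient turns negative when the within-subjects mean square exceeds the between-subjects mean square.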
Copyright 1998, 2008, 2009, 2010 by G. David Garson.
Last updated 1/30/2010.