Overview
Other techniques, such as Q-mode factor analysis, multidimensional scaling, and latent class analysis, also perform clustering and are discussed separately. SPSS offers three general approaches to cluster analysis:
Failure to meet these criteria may indicate the researcher has requested too many or too few clusters, or possibly that an inappropriate distance measure (discussed below) has been selected. It is also possible that the hypothesized conceptual basis for clustering does not exist, resulting in arbitrary clusters.
One may wish to use the hierarchical cluster procedure on a sample of cases (ex., 200) to inspect results for different numbers of clusters. The optimum number of clusters depends on the research purpose. Identifying "typical" types may call for few clusters and identifying "exceptional" types may call for many clusters. After using hierarchical clustering to determine the desired number of clusters, the researcher may wish then to analyze the entire dataset with k-means clustering (aka, the Quick Cluster procedure: Analyze, Cluster, K-Means Cluster Analysis), specifying that number of clusters.
Move the variables desired to the variable list box. This example uses the SPSS example file judges.sav, where columns (variables) are judges from eight countries and rows are 300 fictional cases of gymnasts being rated on a 0-10 scale. To cluster judges, check Variables in the cluster group. Check if Statistics and/or Plots are desired.
Under the Methods button, one may request the cluster (linkage) method and the distance measure to be used. The distance measure choices will depend on the level of measurement specified: interval, count, or binary. It is also possible to standardize and transform variables at this point, though in the current example that is not needed as all variables are of the same 0 - 10 scale. When scale differs among variables, standardization is recommended.
In the figure above, points A, B, and C are cases 1, 2, and 3 in the tables below. Note B is 3 love units and 5 happiness units from A, and C is the same with respect to B.
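As a minimal sketch of how the two most common interval distance measures behave for these three points, the snippet below assumes that A sits at the origin, with love on the first axis and happiness on the second, so that B = (3, 5) and C = (6, 10). The coordinates and function names are illustrative assumptions, not SPSS output.

```python
import math

# Assumed coordinates (love, happiness); placing A at the origin is an
# illustrative assumption.
points = {"A": (0, 0), "B": (3, 5), "C": (6, 10)}

def squared_euclidean(p, q):
    """Sum of squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance: square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(p, q))

for first, second in [("A", "B"), ("B", "C"), ("A", "C")]:
    p, q = points[first], points[second]
    print(first, second, squared_euclidean(p, q), round(euclidean(p, q), 4))
```

Under these assumptions A-B and B-C are the same distance, A-C is twice as far in Euclidean terms, and the squared Euclidean measure makes A-C four times as far (136 versus 34), which is why the squared measure gives more weight to distant cases.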
Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.
The proximity/distance/agglomeration coefficient in the "Coefficients" column indicates how far the agglomeration algorithm has to reach to combine an existing cluster with the next closest cluster or variable (judge). In this example there is a large jump between stages 5 and 6, the latter corresponding to combining cluster 1 (judges 3, 5, 7, and 1), formed at stage 5, with cluster 2 (judges 2, 4, and 6). A large agglomeration coefficient corresponds to a long distance in the dendrogram discussed below. When there are relatively few cases, icicle plots or dendrograms provide the same linkage information in an easier format.
In the figure above on 8 judges rating 300 objects, the agglomeration schedule shows, for instance, that judges 3 and 5 are combined in a cluster first (the cluster is labeled 3). Judges 2 and 4 then become cluster 2, and judge 6 is added to cluster 2. At stage 4, the cluster 3 formed at stage 1 is combined with judge 7 to form a larger cluster, also labeled 3. Cluster 3 is then joined to judge 1 and is labeled cluster 1. Next, cluster 2 is joined to cluster 1 and is labeled cluster 1. Finally, judge 8 (the "enthusiast" judge, who is most different from the others) is joined to cluster 1, which is then the only remaining cluster.
In the figure above, from hierarchical cluster analysis on 8 judges who rated 300 objects, the dendrogram shows judges 3 & 5 (these were Romania and China respectively) to be in one of the two earliest clusters, with judge 7 (Russia) affiliated with cluster 3 & 5 only at a greater distance. In general, the dendrogram shows the pattern of clustering among the judges, with connecting lines further to the right indicating more distance between judges and clusters. The final linkage to judge 8 ("Enthusiast") shows this judge to be least like the others, but the real jump occurs a step earlier, as noted in the section above regarding the agglomeration schedule.
One can also cluster cases. The dendrogram below is for the clustering of 50 objects by the 8 judges, with objects 10, 38, 17, 16, 18, 43, 2, 46, and 27 forming one of the first clusters:
In the figure above, from hierarchical cluster analysis on 8 judges who rated 300 objects, the vertical icicle plot shows what happens when there are the following number of clusters:
K-means cluster analysis uses Euclidean distance. The researcher must specify in advance the desired number of clusters, K. Initial cluster centers are chosen randomly in a first pass of the data (note different initial values may affect the solution: see Assumptions section on randomization), then each additional iteration groups observations based on nearest Euclidean distance to the mean of the cluster. That is, the algorithm seeks to minimize within-cluster variance and maximize variability between clusters in an ANOVA-like fashion. Cluster centers change at each pass. The process continues until cluster means do not shift more than a given cut-off value or the iteration limit is reached.
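The iteration described above can be sketched in a few lines. This is a generic Lloyd-style illustration under stated assumptions, not the SPSS Quick Cluster implementation: the function name `kmeans`, its parameters, and the six two-variable cases are all hypothetical.

```python
import random

def kmeans(rows, k, max_iter=100, tol=0.0, seed=0):
    """Lloyd-style k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # random initial centers drawn from the cases
    labels = [0] * len(rows)
    for _ in range(max_iter):
        # Assignment step: each case joins the cluster with the nearest mean
        # (squared Euclidean distance gives the same ordering as Euclidean).
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
            for row in rows
        ]
        # Update step: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center unchanged
        # Largest Euclidean shift of any center in this pass.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift <= tol:  # stop once no center moves more than the cut-off
            break
    return labels, centers

# Hypothetical ratings by two judges for six athletes.
rows = [(7.0, 7.2), (7.1, 7.0), (6.9, 7.3), (9.3, 9.7), (9.5, 9.6), (9.4, 9.8)]
labels, centers = kmeans(rows, k=2)
print(labels)
print(centers)
```

Which cluster receives label 0 depends on the random starting centers, so only the grouping, not the numbering, is meaningful.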
Cluster relationship with other variables. The relationship of any variable in the dataset with the clusters formed by the clustering variables can be viewed (among other ways) by selecting Analyze, Descriptive Statistics, Crosstabs, with QCL_1 as rows and that variable as columns. Needless to say, that variable need not have been one of the clustering variables.
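For readers who want to see what such a crosstabulation computes, here is a minimal sketch that counts cases in each combination of cluster membership and a second variable. The data and the variable `region` are hypothetical; only the name QCL_1 comes from the text above.

```python
from collections import Counter

# Hypothetical data: saved cluster membership (QCL_1) and a second variable.
qcl_1 = [1, 1, 2, 2, 2, 1, 2, 1]
region = ["east", "west", "east", "east", "west", "west", "east", "west"]

# Count how many cases fall in each (cluster, category) cell.
counts = Counter(zip(qcl_1, region))
row_labels = sorted(set(qcl_1))
col_labels = sorted(set(region))

# Print clusters as rows and categories of the other variable as columns.
print("QCL_1".ljust(8) + "".join(c.rjust(8) for c in col_labels))
for r in row_labels:
    print(str(r).ljust(8) + "".join(str(counts[(r, c)]).rjust(8) for c in col_labels))
```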
There are three statistics options:
In addition, there are two missing values options: listwise (the default) and pairwise deletion of cases with missing values.
In the figure above, for the 8 judges (7 nations plus "Enthusiast") treated as the "variables" rating 300 athletes, the ANOVA table shows the largest error associated with the "Enthusiast" judge, meaning that judge (variable) is least helpful in forming and differentiating the clusters. All judges/variables are significant, but this is largely meaningless because the clusters were formed to maximize differences on these same variables. The ANOVA table is used mainly to look at the size of the mean square errors.
Getting different clusters. Sometimes the researcher wishes to experiment to get different clusters, as when the "Number of cases in each cluster" table shows highly imbalanced clusters and/or clusters with very few members. Different results may occur by setting different initial cluster centers from file (see above), by changing the number of clusters requested, or even by presenting the data file in different case order.
Warning! The CF tree and hence the clustering solution will be affected by the order of the data. Randomizing the case order is strongly recommended; see the Assumptions section on randomization.
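One way to act on this warning outside SPSS is to shuffle the case order reproducibly before clustering and compare solutions across several orderings. The sketch below assumes hypothetical case IDs and a helper named `randomize_case_order`; it illustrates the idea only.

```python
import random

def randomize_case_order(cases, seed):
    """Return a shuffled copy of the cases; the seed makes the order reproducible."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    return shuffled

cases = list(range(1, 11))  # hypothetical case IDs
order_a = randomize_case_order(cases, seed=1)
order_b = randomize_case_order(cases, seed=2)
print(order_a)
print(order_b)
```

If the clusters obtained from differently ordered copies of the file disagree substantially, the solution is sensitive to case order and should be interpreted with caution.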
In the example above, by the BIC criterion alone one would select 4 clusters as being optimal, since the lowest BIC coefficient is the best model. By the SPSS default algorithm, 4 clusters are also selected because this yields a large BIC ratio of change and a large ratio of distances. Note the SPSS algorithm need not agree with the BIC criterion used alone, though it does in this example. When it differs, in essence the SPSS algorithm judges that the gain in information from having more than the number of clusters specified by BIC alone is not worth the increased complexity (diminution of parsimony) of the model. The researcher has the option to override this default and specify 6 or some other number of clusters.
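The BIC-only part of this reasoning can be sketched as follows. The BIC values are hypothetical (they are not taken from the example output), and the snippet does not reproduce the ratio-of-distances step that the SPSS default algorithm also uses.

```python
# Hypothetical BIC values for 1 to 6 clusters (not from the example output).
bic = {1: 1500.0, 2: 1200.0, 3: 1050.0, 4: 1000.0, 5: 1020.0, 6: 1060.0}

# By the BIC criterion alone, the lowest BIC marks the preferred model.
best_k = min(bic, key=bic.get)

# Change in BIC from the previous number of clusters.
change = {k: bic[k] - bic[k - 1] for k in bic if k - 1 in bic}

# Ratio of each change to the first change (from 1 to 2 clusters).
ratio = {k: change[k] / change[2] for k in change}

print("Lowest BIC at", best_k, "clusters")
for k in sorted(ratio):
    print(k, change[k], round(ratio[k], 3))
```

In this made-up series the improvement shrinks quickly after two clusters and turns into a worsening after four, which is the kind of pattern that leads a parsimony-minded rule to stop at or before the BIC minimum.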
In the example above, automobiles from America, Europe, and Japan were clustered on various attributes (ex., engine size), deriving two clusters. US cars with large engine size dominate the first cluster.
The examples below show variablewise importance plots for the cars example, which included both continuous (top figure) and categorical (bottom figure) variables. The top figure shows that cluster 2, the smaller cluster, made up predominantly of European and Japanese cars, is differentiated by the top three variables in a negative direction and by the bottom three variables in a positive direction. The negative factors contribute more to differentiating cluster 2 than the positive ones.
The second plot, below, shows that both categorical variables, country and number of cylinders, differentiate the cars in Cluster 2.
SAS syntax for hierarchical cluster analysis of the example dataset is shown below:
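/* Read the SPSS file into the temporary dataset WORK.cluster1 */
PROC IMPORT OUT= WORK.cluster1
DATAFILE= "path to judges_flipped.sav goes here"
DBMS=SPSS REPLACE;
RUN;
TITLE "PROC CLUSTER with Average Distance Linking" JUSTIFY=CENTER;
/* Average linkage on the imported data; OUTTREE= saves the tree for PROC TREE */
PROC CLUSTER DATA=cluster1 METHOD=AVERAGE OUTTREE=tree;
RUN;
/* Draw the dendrogram and save the 2-cluster memberships in "outfile" */
PROC TREE DATA=tree NCLUSTERS=2 HORIZONTAL OUT=outfile;
RUN;
/* Print the cluster assignments */
PROC PRINT DATA=outfile;
RUN;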
The syntax above is interpreted as follows:
Cluster History
NCL   Clusters Joined   FREQ   Norm RMS Dist   Tie
7     OB3    OB5        2      0.4027
6     OB2    OB4        2      0.43
5     CL6    OB6        3      0.5433
4     CL7    OB7        3      0.5619
3     OB1    CL4        4      0.7125
2     CL3    CL5        7      1.1219
1     CL2    OB8        8      1.1835
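To make the "large jump" in the Cluster History table above concrete, the following sketch computes the increase in normalized RMS distance at each merge, using the values exactly as printed. The variable names are illustrative.

```python
# Normalized RMS distances from the Cluster History table, keyed by the
# number of clusters (NCL) remaining after each merge.
history = {7: 0.4027, 6: 0.43, 5: 0.5433, 4: 0.5619, 3: 0.7125, 2: 1.1219, 1: 1.1835}

# Increase in merge distance relative to the preceding merge.
jumps = {ncl: history[ncl] - history[ncl + 1] for ncl in history if ncl + 1 in history}

largest = max(jumps, key=jumps.get)
print(f"Largest jump: {jumps[largest]:.4f}, at the merge that leaves {largest} clusters")
```

The largest increase occurs at the merge of CL3 with CL5, that is, when the two multi-judge clusters are combined, one step before judge 8 (OB8) joins.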
NCLUSTERS=2 specifies the number of clusters to which to assign cases. HORIZONTAL in the PROC TREE statement overrides the default vertical positioning of dendrogram bars. OUT=outfile creates a working dataset called "outfile," the printout of which is shown below.
Obs _NAME_ CLUSTER CLUSNAME
1 OB3 1 CL2
2 OB5 1 CL2
3 OB2 1 CL2
4 OB4 1 CL2
5 OB6 1 CL2
6 OB7 1 CL2
7 OB1 1 CL2
8 OB8 2 OB8
DATA 'c:\data\judges_clusters'; set outfile; RUN;
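/* Read the SPSS file into the temporary dataset WORK.fastclus1 */
PROC IMPORT OUT= WORK.fastclus1
DATAFILE= "\\tsclient\C\Docs\David\Courses\PA765\Datasets\SPSS Samples\judges.sav"
DBMS=SPSS REPLACE;
RUN;
TITLE "PROC FASTCLUS Example" JUSTIFY=CENTER;
/* K-means with at most 2 clusters and 100 iterations; LIST prints each case's */
/* cluster and distance from its seed; OUT= saves CLUSTER and DISTANCE variables */
PROC FASTCLUS DATA=fastclus1 OUT=clusters MAXCLUSTERS=2 MAXITER=100 LIST;
VAR judge1-judge8;
RUN;
/* Print the cases with their cluster assignments */
PROC PRINT DATA=clusters;
RUN;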
Explanation of FASTCLUS syntax:
Cluster Listing
Obs   Cluster   Distance from Seed
_____________________________
1 1 1.8672
2 2 0.9011
3 1 2.1870
4 1 1.8102
5 2 1.9251
... ... ...
296 2 1.3450
297 2 1.2042
298 1 0.5907
299 1 1.7255
300 2 1.1923
Cluster Summary
                                  Maximum Distance
                      RMS Std     from Seed          Radius     Nearest   Distance Between
Cluster   Frequency   Deviation   to Observation     Exceeded   Cluster   Cluster Centroids
__________________________________________________________________________________________________
1         147         0.5559      2.5209                        2         3.8989
2         153         0.5435      2.4249                        1         3.8989
Statistics for Variables
Variable Total STD Within STD R-Square RSQ/(1-RSQ)
__________________________________________________________________
judge1 0.87867 0.50550 0.670132 2.031520
judge2 0.86534 0.51947 0.640834 1.784231
judge3 0.84560 0.48751 0.668728 2.018671
judge4 0.71206 0.44652 0.608088 1.551593
judge5 0.69246 0.41578 0.640676 1.783004
judge6 0.99648 0.57462 0.668583 2.017351
judge7 0.99411 0.55932 0.684499 2.169563
judge8 1.00803 0.79870 0.374305 0.598222
OVER-ALL 0.88175 0.54957 0.612821 1.582786
Pseudo F Statistic = 471.67
Approximate Expected Over-All R-Squared = 0.12777
Cubic Clustering Criterion = 116.387
WARNING: The two values above are invalid for correlated variables.
If variables are not correlated, the approximate expected overall R-squared and the cubic clustering criterion (CCC) coefficient may be used when comparing the 2-cluster solution with other solutions, with higher coefficients being better. By rule of thumb, CCC > 3 (some say 2) indicates a well-fitting cluster model. Negative CCC usually indicates the presence of outliers. For this example, however, variables are correlated and so over-all R-Squared and CCC would not be reported.
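The RSQ/(1-RSQ) column in the Statistics for Variables table above is the ratio of between-cluster to within-cluster variance for each variable and can be recomputed directly from the R-Square column. The sketch below copies the printed R-Square values; small differences in the last decimals are due to rounding in the printout.

```python
# R-Square values copied from the Statistics for Variables table.
r_square = {
    "judge1": 0.670132, "judge2": 0.640834, "judge3": 0.668728, "judge4": 0.608088,
    "judge5": 0.640676, "judge6": 0.668583, "judge7": 0.684499, "judge8": 0.374305,
}

# Ratio of between-cluster to within-cluster variance for each variable.
ratio = {name: rsq / (1 - rsq) for name, rsq in r_square.items()}

# The variable with the smallest ratio separates the clusters least well.
weakest = min(ratio, key=ratio.get)

for name, value in ratio.items():
    print(f"{name}: {value:.4f}")
print("Weakest separator:", weakest)
```

Judge 8, the enthusiast, has by far the lowest ratio, consistent with that judge contributing least to distinguishing the two clusters.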
Obs judge1 judge2 judge3 judge4 judge5 judge6
1 7.10 7.20 7.00 7.70 7.10 7.10
2 9.30 9.70 8.90 9.60 8.60 9.50
3 8.90 8.80 8.10 9.30 8.50 8.10
...
Obs judge7 judge8 CLUSTER DISTANCE
1 7.00 7.30 1 1.86724
2 9.60 9.70 2 0.90105
3 7.60 8.70 1 2.18702
...
PROC FREQ DATA=clusters;
TABLES Anyvarname*Cluster;
RUN;
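/* Read the SPSS file into the temporary dataset WORK.varclus1 */
PROC IMPORT OUT= WORK.varclus1
DATAFILE= "C:\Data\GSS93 subset.sav"
DBMS=SPSS REPLACE;
RUN;
TITLE "PROC VARCLUS Example" JUSTIFY=CENTER;
/* Cluster variables rather than cases: MAXEIGEN= is the largest second eigenvalue */
/* allowed in a cluster, MAXCLUSTERS= caps the number of clusters, TRACE lists */
/* cluster assignments during iteration, OUTTREE= saves the tree for PROC TREE */
PROC VARCLUS DATA=varclus1 MAXEIGEN = .7 TRACE OUTTREE=tree MAXCLUSTERS=4;
/* All variables from bigband through hvymetal, in dataset order */
VAR bigband--hvymetal;
RUN;
PROC TREE DATA=tree HORIZONTAL NCLUSTERS=4 OUT=outfile;
RUN;
PROC PRINT DATA=outfile;
RUN;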
The syntax above is interpreted as follows (note PROC VARCLUS has many more options than utilized in this example):
Cluster Summary for 4 Clusters
Cluster   Members   Cluster Variation   Variation Explained   Proportion Explained   Second Eigenvalue
________________________________________________________________________
1         5         5                   2.786607              0.5573                 0.7235
2         2         2                   1.559927              0.7800                 0.4401
3         2         2                   1.425733              0.7129                 0.5743
4         2         2                   1.350464              0.6752                 0.6495
Total variation explained = 7.122731 Proportion = 0.6475
R-squared with 4 Clusters
Cluster   Variable   R-squared with Own Cluster   R-squared with Next Closest   1-R**2 Ratio   Variable Label
_______________________________________________________________________________
Cluster 1 bigband 0.5049 0.1077 0.5549 Bigband Music
musicals 0.6333 0.0799 0.3985 Broadway Musicals
classicl 0.6511 0.0817 0.3799 Classical Music
folk 0.4123 0.1270 0.6731 Folk Music
opera 0.5850 0.0714 0.4469 Opera
-------------------------------------------------------------------------------
Cluster 2 blues 0.7800 0.0919 0.2423 Blues or R & B Music
jazz 0.7800 0.1068 0.2463 Jazz Music
-------------------------------------------------------------------------------
Cluster 3 blugrass 0.7129 0.0786 0.3116 Bluegrass Music
country 0.7129 0.0047 0.2885 Country Western Music
-------------------------------------------------------------------------------
Cluster 4 rap 0.6752 0.0362 0.3370 Rap Music
hvymetal 0.6752 0.0117 0.3286 Heavy Metal Music
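The 1-R**2 Ratio column in the table above can be recomputed from the two R-squared columns as (1 - R-squared with own cluster) / (1 - R-squared with next closest cluster). The sketch below uses a few rows copied from the printout; because the inputs are rounded to four decimals, the results agree with the printed ratios only to about three decimals.

```python
# (R-squared with own cluster, R-squared with next closest cluster),
# copied from the "R-squared with 4 Clusters" table.
r2 = {
    "bigband": (0.5049, 0.1077),
    "folk": (0.4123, 0.1270),
    "blues": (0.7800, 0.0919),
    "country": (0.7129, 0.0047),
    "rap": (0.6752, 0.0362),
}

def one_minus_r2_ratio(own, nearest):
    """(1 - R**2 own) / (1 - R**2 next closest); smaller means a cleaner fit."""
    return (1 - own) / (1 - nearest)

for name, (own, nearest) in r2.items():
    print(f"{name}: {one_minus_r2_ratio(own, nearest):.4f}")
```

A small ratio means the variable is well explained by its own cluster and poorly explained by the nearest alternative, so folk, with the largest ratio among these rows, is the least cleanly assigned.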
Standardized Scoring Coefficients
Cluster 1 2 3 4
___________________________________________________________________________________________
bigband Bigband Music 0.254980 0.000000 0.000000 0.000000
blugrass Bluegrass Music 0.000000 0.000000 0.592197 0.000000
country Country Western Music 0.000000 0.000000 0.592197 0.000000
blues Blues or R & B Music 0.000000 0.566152 0.000000 0.000000
musicals Broadway Musicals 0.285581 0.000000 0.000000 0.000000
classicl Classical Music 0.289571 0.000000 0.000000 0.000000
folk Folk Music 0.230437 0.000000 0.000000 0.000000
jazz Jazz Music 0.000000 0.566152 0.000000 0.000000
opera Opera 0.274473 0.000000 0.000000 0.000000
rap Rap Music 0.000000 0.000000 0.000000 0.608476
hvymetal Heavy Metal Music 0.000000 0.000000 0.000000 0.608476
Cluster Structure
Cluster 1 2 3 4
___________________________________________________________________________________________
bigband Bigband Music 0.710528 0.328180 0.229415 -.079317
blugrass Bluegrass Music 0.280356 0.169543 0.844314 -.012270
country Country Western Music 0.062101 -.028955 0.844314 -.068802
blues Blues or R & B Music 0.303215 0.883155 0.150752 0.157381
musicals Broadway Musicals 0.795803 0.282606 0.099232 -.034073
classicl Classical Music 0.806920 0.285795 0.022910 0.011338
folk Folk Music 0.642139 0.157035 0.356342 -.058354
jazz Jazz Music 0.326757 0.883155 -.003696 0.163230
opera Opera 0.764848 0.267163 0.099164 0.051174
rap Rap Music 0.016384 0.190224 -.029399 0.821725
hvymetal Heavy Metal Music -.059234 0.108086 -.049504 0.821725
Inter-Cluster Correlations
Cluster 1 2 3 4
1 1.00000 0.35666 0.20280 -0.02607
2 0.35666 1.00000 0.08326 0.18151
3 0.20280 0.08326 1.00000 -0.04801
4 -0.02607 0.18151 -0.04801 1.00000
Number     Total         Proportion     Minimum        Maximum        Minimum     Maximum
of         Variation     of Variation   Proportion     Second         R-squared   1-R**2 Ratio
Clusters   Explained     Explained      Explained      Eigenvalue     for a       for a
           by Clusters   by Clusters    by a Cluster   in a Cluster   Variable    Variable
______________________________________________________________________________________________
1          3.275665      0.2978         0.2978         1.660727       0.0003
2          4.669804      0.4245         0.4183         1.456046       0.0384      0.9645
3          5.954356      0.5413         0.4355         1.174827       0.2198      0.7829
4          7.122731      0.6475         0.5573         0.723516       0.4123      0.6731
Obs _NAME_ CLUSTER CLUSNAME
1 bigband 1 Clus4
2 blugrass 2 Clus5
3 country 2 Clus5
4 blues 3 Clus6
5 musicals 1 Clus4
6 classicl 1 Clus4
7 folk 1 Clus4
8 jazz 3 Clus6
9 opera 1 Clus4
10 rap 4 Clus7
11 hvymetal 4 Clus7
This assumption can be tested by creating a bivariate correlation matrix of the continuous variables to see if the correlations are not significantly different from zero. Likewise, crosstabulation can establish if any pair of categorical variables is not significantly related (chi-square test). To establish the independence of a continuous and a categorical variable, the SPSS Means procedure may be used. Two-step clustering is fairly robust even when the assumption of independence is violated.
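As an illustration of the chi-square check for a pair of categorical variables, the sketch below computes the Pearson chi-square statistic for a hypothetical 2x2 crosstab and compares it with the conventional .05 critical value for one degree of freedom. The counts are invented for the example.

```python
# Hypothetical 2x2 crosstab of two categorical clustering variables.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / n.
chi_square = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)

CRITICAL_05_DF1 = 3.841  # chi-square critical value for df = 1 at alpha = .05
independent = chi_square < CRITICAL_05_DF1
print(chi_square, independent)
```

Here the statistic exceeds the critical value, so these two hypothetical variables would not be treated as independent.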
In the example, judge 8 was not a country representative like the other seven but instead was an enthusiast/fan. With unstandardized data, the enthusiast judge stood out in a separate cluster by himself, reflecting reality. When data are standardized so the enthusiast judge has the same mean ratings as other judges and the same variance, the enthusiast judge is not in a separate cluster of his own but is clustered with a number of country judges in the two-cluster solution. This might lead a naive analyst to think the enthusiast judge was similar in athletic ratings to the other judges in the cluster (judges 1, 3, 5, and 7). This would not reflect reality but rather what would happen in a hypothetical world in which all judges had the same mean ratings and the same variances. If the researcher wants the effect of each judge to be equal, standardize. If the researcher wants results to reflect real-world differences in means and variances, do not standardize.
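The effect of standardization described above can be seen in a small sketch: converting each judge's ratings to z-scores removes differences in mean and variance between judges. The ratings below are hypothetical, and the sample standard deviation is an assumed choice for the scaling.

```python
from statistics import mean, stdev

# Hypothetical ratings: a country judge and a more generous "enthusiast" judge.
ratings = {
    "judge1": [7.1, 9.3, 8.9, 8.0],
    "judge8": [7.9, 9.9, 9.7, 8.9],
}

def z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

standardized = {judge: z_scores(vals) for judge, vals in ratings.items()}

for judge in ratings:
    print(judge, round(mean(ratings[judge]), 3), [round(z, 3) for z in standardized[judge]])
```

Before standardization the enthusiast's higher average sets him apart; afterwards both judges have mean 0 and standard deviation 1, so only the pattern of their ratings across athletes can distinguish them.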
INTERVAL
Here the cosine distance of points B and C (which are cases 2 and 3) with respect to the origin (point A) is 0, since they are on a line from the origin. However, the vectors from point B to point C, and from point C to point B, are in opposite directions and hence have a cosine distance of 1.
COUNTS
BINARY
Distance is based on the means of the variables used to cluster the cases. If all variables are entered as continuous, then Euclidean (straight line) distance is the distance measure used. If there are one or more categorical variables, then in SPSS autoclustering, likelihood distance (also called log-likelihood or maximum likelihood distance) is used. Likelihood distance reflects the drop in the log likelihood statistic when clusters are combined. Likelihood distance assumes a normal distribution for continuous variables and a multinomial distribution for categorical variables. See SPSS Inc. (2001) for further discussion.
Copyright 1998, 2008, 2009, 2010, 2012 by G. David Garson.
Last updated 1/12/2012.