Overview
Other techniques, such as Q-mode factor analysis, multidimensional scaling, and latent class analysis, also perform clustering and are discussed separately. SPSS offers three general approaches to cluster analysis:
Failure to meet these criteria may indicate the researcher has requested too many or too few clusters, or possibly that an inappropriate distance measure (discussed below) has been selected. It is also possible that the hypothesized conceptual basis for clustering does not exist, resulting in arbitrary clusters.
One may wish to use the hierarchical cluster procedure on a sample of cases (e.g., 200) to inspect results for different numbers of clusters. The optimum number of clusters depends on the research purpose. Identifying "typical" types may call for few clusters and identifying "exceptional" types may call for many clusters. After using hierarchical clustering to determine the desired number of clusters, the researcher may then wish to analyze the entire dataset with k-means clustering (aka the Quick Cluster procedure: Analyze, Cluster, K-Means Cluster Analysis), specifying that number of clusters.
Move the variables desired to the variable list box. This example uses the SPSS example file judges.sav, where columns (variables) are judges from eight countries and rows are 300 fictional cases of gymnasts being rated on a 0-10 scale. To cluster judges, check Variables in the cluster group. Check if Statistics and/or Plots are desired.
Under the Methods button, one may request the cluster (linkage) method and the distance measure to be used. The distance measure choices will depend on the level of measurement specified: interval, count, or binary. It is also possible to standardize and transform variables at this point, though in the current example that is not needed as all variables are on the same 0-10 scale. When scale differs among variables, standardization is recommended. Linkage method and distance measure options are discussed in the introductory section above.
There are a variety of different measures of inter-observation distances and inter-cluster similarities and distances to use as criteria when merging nearest clusters into broader groups or when considering the relation of a point to a cluster. Distance measures how far apart two observations are. Cases which are alike share a low distance. Similarity measures how alike two cases are. However, it is common to refer to all measures as "distance" measures since the same function is served. Note that when two or more variables are used to define distance, the one with the larger magnitude will dominate. To avoid this it is common to first standardize all variables. SPSS hierarchical clustering supports these types of measures:
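The effect of variable magnitude on distance can be illustrated with a minimal sketch. The data, variable names (income, age), and helper functions below are hypothetical and are not taken from any SPSS example file; the sketch only shows why z-score standardization is commonly applied before computing Euclidean distances.

```python
import math
import statistics

def standardize(column):
    """Convert a list of values to z-scores (mean 0, sample SD 1)."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]

def euclidean(a, b):
    """Euclidean distance between two cases given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: three cases measured on income (dollars) and age (years).
income = [30000.0, 32000.0, 90000.0]
age = [25.0, 60.0, 27.0]

raw_cases = list(zip(income, age))
z_cases = list(zip(standardize(income), standardize(age)))

# Distances from case 0 to cases 1 and 2, before and after standardizing.
raw_d01 = euclidean(raw_cases[0], raw_cases[1])
raw_d02 = euclidean(raw_cases[0], raw_cases[2])
z_d01 = euclidean(z_cases[0], z_cases[1])
z_d02 = euclidean(z_cases[0], z_cases[2])

print(f"raw:          d(0,1)={raw_d01:.1f}  d(0,2)={raw_d02:.1f}")
print(f"standardized: d(0,1)={z_d01:.3f}  d(0,2)={z_d02:.3f}")
```

On the raw scale the income difference swamps the age difference, so case 0 looks far closer to case 1 than to case 2. After standardization both variables contribute comparably and the two distances are of similar size.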
Available alternatives are Euclidean distance, squared Euclidean distance, size difference, pattern difference, variance, dispersion, shape, simple matching, phi 4-point correlation, lambda, Anderberg's D, dice, Hamann, Jaccard, Kulczynski 1, Kulczynski 2, Lance and Williams, Ochiai, Rogers and Tanimoto, Russel and Rao, Sokal and Sneath 1, Sokal and Sneath 2, Sokal and Sneath 3, Sokal and Sneath 4, Sokal and Sneath 5, Yule's Y, and Yule's Q.
INTERVAL
BINARY
Kulczynski 1 is the ratio of joint presences to all nonmatches. Its lower bound is 0 and it is unbounded above. It is theoretically undefined when there are no nonmatches; however, the software assigns an arbitrary value of 9999.999 when the value is undefined or is greater than this value.
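As a minimal sketch of the definition just given, Kulczynski 1 can be computed for two binary (0/1) vectors as follows. The function name and the `cap` parameter are illustrative choices, not SPSS identifiers; the cap value of 9999.999 is the one stated in the text.

```python
def kulczynski1(x, y, cap=9999.999):
    """Kulczynski 1 for two binary (0/1) vectors: joint presences / nonmatches."""
    # Joint presences: positions where both vectors are 1.
    joint_presences = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    # Nonmatches: positions where the two vectors disagree.
    nonmatches = sum(1 for xi, yi in zip(x, y) if xi != yi)
    if nonmatches == 0:
        # Theoretically undefined; return the arbitrary cap described in the text.
        return cap
    return min(joint_presences / nonmatches, cap)

print(kulczynski1([1, 1, 0, 1, 0], [1, 0, 0, 1, 1]))  # 2 joint presences / 2 nonmatches
print(kulczynski1([1, 0, 1], [1, 0, 1]))              # no nonmatches, so the cap is returned
```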
The proximity/distance/agglomeration coefficient in the "Coefficients" column is an indicator of how far the agglomeration algorithm has to reach to combine an existing cluster with the next closest cluster or variable (judge). For this example one can see that there is a large jump between stages 5 and 6, corresponding to combining cluster 1 (judges 3, 5, 7, and 1) with cluster 2 (judges 2, 4, and 6). A large agglomeration coefficient will correspond with a long distance in the dendrogram discussed below. When there are relatively few cases, icicle plots or dendrograms provide the same linkage information in an easier format.
In the figure above on 8 judges rating 300 objects, the agglomeration schedule shows, for instance, that judges 3 and 5 are combined in a cluster first (the cluster is labeled 3). Judges 2 and 4 become cluster 2. Then judge 6 is added to cluster 2. Then at stage 4, the new cluster 3 formed at stage 1 is combined with judge 7 to form a larger cluster, also now labeled 3. Then cluster 3 is joined to judge 1 and is labeled cluster 1. Then cluster 2 is joined to cluster 1 and is labeled cluster 1. Finally, judge 8 (the "enthusiast" judge, who is most different from others) is joined to cluster 1, which then is the only remaining cluster.
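The logic of an agglomeration schedule can be sketched in a few lines. This is not the SPSS implementation: the five two-dimensional points are hypothetical, average (between-groups) linkage with Euclidean distance is assumed for illustration, and merged clusters are labeled by their lowest member number, as in the example above.

```python
import math
from itertools import combinations

def agglomeration_schedule(points):
    """Return [(stage, cluster_a, cluster_b, coefficient), ...] using average linkage."""
    # Each cluster is keyed by the lowest 1-based item number it contains.
    clusters = {i + 1: [i] for i in range(len(points))}
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        for a, b in combinations(sorted(clusters), 2):
            # Average of all pairwise distances between members of a and b.
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            d = sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Merge b into a; since a < b, the merged cluster keeps the lower label.
        clusters[a] = clusters[a] + clusters.pop(b)
        stage += 1
        schedule.append((stage, a, b, d))
    return schedule

# Hypothetical items: two tight pairs and one outlier (item 5).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 20.0)]
schedule = agglomeration_schedule(points)
for stage, a, b, coef in schedule:
    print(f"stage {stage}: combine {a} and {b}, coefficient {coef:.2f}")

# Jumps in the coefficient between successive stages; a large jump suggests
# that dissimilar clusters are being forced together at the later stage.
jumps = [schedule[i][3] - schedule[i - 1][3] for i in range(1, len(schedule))]
```

In this toy data the outlier is joined last and the largest jump in the coefficient occurs at that final stage, which is the same pattern one looks for when reading the "Coefficients" column.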
In the figure above, from hierarchical cluster analysis on 8 judges who rated 300 objects, the dendrogram shows judges 3 & 5 (Romania and China, respectively) to be in one of the two earliest clusters, with judge 7 (Russia) joining the cluster of judges 3 & 5 only at a greater distance. In general, the dendrogram shows the pattern of clustering among the judges, with connecting lines further to the right indicating more distance between judges and clusters. The final linkage to judge 8 ("Enthusiast") shows this judge to be least like the others, but the real jump occurs a step earlier, as noted in the section above regarding the agglomeration schedule.
One can also cluster cases. The dendrogram below is for the clustering of 50 objects by the 8 judges, with objects 10, 38, 17, 16, 18, 43, 2, 46, and 27 forming one of the first clusters:
In the figure above, from hierarchical cluster analysis on 8 judges who rated 300 objects, the vertical icicle plot shows what happens when there are the following number of clusters:
K-means cluster analysis uses Euclidean distance. The researcher must specify in advance the desired number of clusters, K. Initial cluster centers are chosen randomly in a first pass of the data (note different initial values may affect the solution: see Assumptions section on randomization), then each additional iteration groups observations based on nearest Euclidean distance to the mean of the cluster. That is, the algorithm seeks to minimize within-cluster variance and maximize variability between clusters in an ANOVA-like fashion. Cluster centers change at each pass. The process continues until cluster means do not shift more than a given cut-off value or the iteration limit is reached.
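The iteration just described can be sketched as follows. This is a minimal illustration rather than the SPSS Quick Cluster algorithm: the function name, its parameters (`max_iter`, `tol`, `seed`), and the six data points are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=10, tol=0.0, seed=0):
    """Basic k-means on a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers drawn from the data

    def assign(current_centers):
        # Each case goes to the cluster whose center is nearest (Euclidean distance).
        return [min(range(k), key=lambda c: math.dist(p, current_centers[c]))
                for p in points]

    for _ in range(max_iter):
        labels = assign(centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New center is the mean of the cluster's members on each variable.
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center unchanged
        # Largest movement of any center during this pass.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break  # centers no longer move more than the cut-off
    return centers, assign(centers)

# Hypothetical data: two well-separated groups of three cases each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, 2)
print(labels)
print(centers)
```

With data this cleanly separated the same partition is recovered regardless of the seed; with real data, different starting centers can lead to different solutions, which is why re-running with different seeds or case orders is worthwhile.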
In SPSS, select Analyze, Cluster, K-Means Cluster Analysis; enter variables in the Variables: area; optionally, enter a variable in the "Label cases by:" area; enter "Number of clusters:"; and choose Method: Iterate and classify, or just Classify. Unlike hierarchical clustering, there is no option for "Range of solutions"; instead the researcher desiring to do so must re-run K-means clustering, asking for a different number of clusters.
Cluster relationship with other variables. The relationship of any variable in the dataset with the clusters formed by the clustering variables can be viewed (among other ways) by selecting Analyze, Descriptive Statistics, Crosstabs, with QCL_1 as rows and that variable as columns. Needless to say, that variable need not have been one of the clustering variables.
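Outside SPSS, the same kind of crosstabulation can be sketched by counting combinations of cluster membership and the other variable. The membership values and the `region` variable below are hypothetical; `qcl_1` simply mirrors the name of the saved cluster-membership variable mentioned above.

```python
from collections import Counter

# Hypothetical saved cluster memberships and a second categorical variable.
qcl_1 = [1, 1, 2, 2, 2, 1, 2, 1]
region = ["east", "east", "west", "west", "east", "east", "west", "west"]

# Count each (cluster, category) combination.
table = Counter(zip(qcl_1, region))

clusters = sorted(set(qcl_1))
categories = sorted(set(region))

# Print a simple crosstab with clusters as rows and categories as columns.
print("cluster  " + "  ".join(f"{c:>5}" for c in categories))
for cl in clusters:
    print(f"{cl:>7}  " + "  ".join(f"{table[(cl, c)]:>5}" for c in categories))
```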
There are three statistics options:
In addition, there are two missing values options: listwise (the default) and pairwise deletion of cases with missing values.
In the figure above, for the 8 judges rating 300 objects (the 7 nations plus "Enthusiast" are the "variables"), the ANOVA table shows the largest error mean square associated with the "Enthusiast" judge, meaning that judge (variable) is least helpful in forming and differentiating the clusters. All judges/variables are significant, but this is largely meaningless because the clusters were formed precisely to maximize differences among cases in different clusters. The ANOVA table is used mainly to look at the size of the mean square errors.
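The quantities in such an ANOVA table can be sketched for a single variable as a one-way ANOVA with cluster membership as the grouping factor. The function name and the ratings for two judges below are hypothetical and are not drawn from judges.sav.

```python
def anova_mean_squares(values, labels):
    """Return (cluster_mean_square, error_mean_square, F) for one variable."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n
    # Between-cluster sum of squares: cluster means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-cluster (error) sum of squares: cases around their own cluster mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    ms_between = ss_between / (k - 1)
    ms_error = ss_within / (n - k)
    return ms_between, ms_error, ms_between / ms_error

# Hypothetical cluster memberships and ratings from two judges.
labels = [1, 1, 1, 2, 2, 2]
judge_a = [7.0, 7.5, 8.0, 9.0, 9.5, 10.0]   # separates the two clusters cleanly
judge_b = [7.0, 9.5, 8.0, 7.5, 10.0, 9.0]   # overlaps heavily across clusters

result_a = anova_mean_squares(judge_a, labels)
result_b = anova_mean_squares(judge_b, labels)
print("judge_a: cluster MS=%.3f  error MS=%.3f  F=%.3f" % result_a)
print("judge_b: cluster MS=%.3f  error MS=%.3f  F=%.3f" % result_b)
```

The judge with the larger error mean square (judge_b here) contributes less to differentiating the clusters, which is the comparison the ANOVA table is mainly used for.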
Getting different clusters. Sometimes the researcher wishes to experiment to get different clusters, as when the "Number of cases in each cluster" table shows highly imbalanced clusters and/or clusters with very few members. Different results may occur by setting different initial cluster centers from file (see above), by changing the number of clusters requested, or even by presenting the data file in different case order.
Warning! The CF tree and hence the clustering solution will be affected by the order of the data. See the Assumptions section on randomization, which is strongly recommended.
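One way to randomize case order before clustering is to shuffle the cases with a recorded seed, so that the ordering is arbitrary but reproducible. This is a minimal sketch with hypothetical case identifiers and a hypothetical helper name; within SPSS one would typically sort the cases on a random variable instead.

```python
import random

def shuffled_cases(cases, seed):
    """Return a new list with the cases in random order (reproducible via seed)."""
    rng = random.Random(seed)
    order = list(cases)   # copy so the original ordering is left untouched
    rng.shuffle(order)
    return order

# Hypothetical case identifiers in their original file order.
cases = list(range(1, 11))
order_a = shuffled_cases(cases, seed=1)
order_b = shuffled_cases(cases, seed=2)
print(order_a)
print(order_b)
```

Running the clustering on several such orderings and comparing the solutions gives a rough check on how sensitive the results are to case order.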
In the example above, by the BIC criterion alone one would select 4 clusters as being optimal, since the lowest BIC coefficient is the best model. By the SPSS default algorithm, 4 clusters are also selected because this yields a large BIC ratio of change and a large ratio of distances. Note the SPSS algorithm need not agree with the BIC criterion used alone, though it does in this example. When it differs, in essence the SPSS algorithm judges that the gain in information from having more than the number of clusters specified by BIC alone is not worth the increased complexity (diminution of parsimony) of the model. The researcher has the option to override this default and specify 6 or some other number of clusters.
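Selection by the BIC criterion alone amounts to choosing the number of clusters with the lowest BIC. The sketch below uses hypothetical BIC values, not those from the example. The ratio shown is each BIC change expressed relative to the first change, which is an assumption about how such a ratio might be defined and does not reproduce the full SPSS auto-clustering rule (which also uses a ratio of distance measures).

```python
# Hypothetical BIC values for solutions with 1 to 6 clusters (illustrative only).
bic = {1: 1450.0, 2: 1210.0, 3: 1120.0, 4: 1085.0, 5: 1098.0, 6: 1121.0}

# BIC criterion alone: the lowest BIC indicates the preferred model.
best_k = min(bic, key=bic.get)

# Change in BIC when moving from k-1 to k clusters (negative means improvement).
bic_change = {k: bic[k] - bic[k - 1] for k in bic if k - 1 in bic}

# Each change relative to the first change (from 1 to 2 clusters): an assumed definition.
first_change = bic_change[min(bic_change)]
change_ratio = {k: change / first_change for k, change in bic_change.items()}

print("lowest BIC at k =", best_k)
for k in sorted(bic_change):
    print(f"k={k}: change={bic_change[k]:8.1f}  ratio={change_ratio[k]:6.3f}")
```

Small or negative ratios at higher numbers of clusters indicate that additional clusters buy little or no improvement in BIC, which is the parsimony trade-off described above.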
In the example above, automobiles from America, Europe, and Japan were clustered on various attributes (e.g., engine size), deriving two clusters. US cars with large engine size dominate the first cluster.
The examples below show variablewise importance plots for the cars example, which included both continuous (top figure) and categorical (bottom figure) variables. The top figure, below, shows that cluster 2, the smaller cluster consisting predominantly of European and Japanese cars, is differentiated by the top three variables in a negative direction and by the bottom three variables in a positive direction. The negative factors contribute more to differentiating cluster 2 than the positive ones.
The second plot, below, shows that both categorical variables, country and number of cylinders, differentiate the cars in Cluster 2.
This assumption can be tested by creating a bivariate correlation matrix of the continuous variables to see if the correlations are not significantly different from zero. Likewise, crosstabulation can establish if any pair of categorical variables is not significantly related (chi-square test). To establish the independence of a continuous and a categorical variable, the SPSS Means procedure may be used. Two-step clustering is fairly robust even when the assumption of independence is violated.
Copyright 1998, 2008, 2009 by G. David Garson.
Last updated 2/13/2009.