1.Cluster analysis Chong Ho Yu
2.Why do we look at grouping (cluster) patterns? This regression model explains 21% of the variance, but the p value is not significant (p = .0598). Remember: we must look at (visualize) the data pattern rather than merely report the numbers.
3.These are the data!
4.Regression by cluster
5.Regression by cluster
6.Netflix original How is “House of Cards” related to cluster analysis?
7.Crime hot spots How can criminologists find the hot spots?
8.Data reduction Group variables into factors or components based on people’s response patterns PCA Factor analysis Group people into groups or clusters based on variable patterns Cluster analysis
9.CA: ANOVA in reverse In ANOVA, participants are assigned to known groups. In cluster analysis, groups are created from attitudinal or behavioral patterns with reference to certain independent variables.
10.Discriminant analysis (DA) There is a procedure similar to cluster analysis: discriminant analysis (DA). In DA, however, both the number of groups (clusters) and their membership are known. Based on that known information (examples), you assign new or unknown observations to the existing groups.
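Discriminant analysis proper estimates discriminant functions from the known groups; as an illustration of the general idea only (assigning a new observation to an existing, known group), here is a minimal nearest-centroid sketch. The groups and data points are hypothetical.

```python
def centroid(group):
    """Mean point of a group of equal-length tuples."""
    n = len(group)
    return tuple(sum(xs) / n for xs in zip(*group))

def classify(point, known_groups):
    """Assign a new observation to the known group with the closest centroid."""
    best_label, best_d = None, None
    for label, members in known_groups.items():
        c = centroid(members)
        d = sum((a - b) ** 2 for a, b in zip(point, c))
        if best_d is None or d < best_d:
            best_label, best_d = label, d
    return best_label

# Hypothetical known groups (the "examples" the slide mentions).
known = {"A": [(0, 0), (1, 0), (0, 1)], "B": [(8, 8), (9, 8), (8, 9)]}
label = classify((7.5, 8.2), known)  # a new, unlabeled observation
```

Note the contrast with clustering: here the group structure is given in advance, and only the new point's membership is unknown.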
11.Cluster analysis Types: K-means clustering (SAS, JMP, SPSS); density-based clustering (SAS); hierarchical clustering (SAS, JMP, SPSS); two-step clustering (SPSS). Warning: if there is too much missing data, no clustering algorithm can yield good results.
12.Eye-balling? In a two-dimensional data set (X and Y only), you can “eye-ball” the graph to assign clusters, but doing so may be subjective. When there are more than two dimensions, assigning clusters by looking is almost impossible.
13.K-means 1. Select K points as the initial centroids. 2. Assign each point to the nearest centroid. 3. Re-evaluate (recompute) the centroid of each group. 4. Repeat Steps 2 and 3 until the best solution emerges (the centroids are stable).
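The four steps above can be sketched as a minimal Lloyd's-algorithm implementation; the toy data are hypothetical, not the study's.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means sketch following the slide's four steps."""
    rng = random.Random(seed)
    # Step 1: select K points as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each point to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 3: re-evaluate (recompute) the centroid of each group.
        new = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        # Step 4: stop when the centroids are stable.
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

# Two well-separated toy groups.
data = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8)]
cents, cls = kmeans(data, 2)
```

With clearly separated groups like these, the centroids stabilize after a few iterations regardless of which points are drawn as the initial centroids.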
14.Sometimes it doesn’t make sense
16.Do these 2 groups make sense?
17.Neither does this make sense: Johnson transformation; within-cluster SD.
18.Density-based Spatial Clustering of Applications with Noise (DBSCAN) Groups nearest neighbors together. Available in SAS/STAT. Introduced in 1996. In 2014 the algorithm won the Test of Time Award at the Knowledge Discovery and Data Mining (KDD) conference.
19.Density-based Spatial Clustering of Applications with Noise (DBSCAN) Unlike K-means, DBSCAN need not form an ellipse around a centroid; a cluster can be string-shaped. Outliers/noise are excluded.
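The two DBSCAN properties above (string-shaped clusters, outliers left out) can be seen in a minimal sketch of the algorithm; the parameters `eps` and `min_pts` and the toy data are illustrative assumptions.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: label each point with a cluster id, or -1 for noise."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # noise/outlier: excluded from clusters
            continue
        labels[i] = cid                 # i is a core point: start a cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid         # former noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:    # j is itself a core point: keep expanding
                queue.extend(nbrs)
        cid += 1
    return labels

# A string-shaped cluster plus one isolated outlier.
string = [(x * 0.5, 0.0) for x in range(10)]
pts = string + [(20.0, 20.0)]
labels = dbscan(pts, eps=0.6, min_pts=2)
```

The ten points along the line fall into one elongated cluster that no single centroid could describe, while the isolated point is labeled noise.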
20.Hierarchical clustering Grouping/matching people, as eHarmony and ChristianMingle do: Who is the best match? Who is the second best? The third? And so on.
21.Hierarchical clustering Top-down (divisive): start with one group and then partition the data step by step according to the distance matrix. Bottom-up (agglomerative): start with single data points and then merge them with others to form larger groups.
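The bottom-up (agglomerative) variant can be sketched in a few lines: every point starts as its own cluster, and the two closest clusters are merged repeatedly. This sketch assumes single linkage (closest pair of members defines cluster distance) and hypothetical toy data.

```python
def agglomerative(points, n_clusters):
    """Bottom-up sketch: merge the two closest clusters until n_clusters remain."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    # Start with each data point as its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # merge the closest pair
    return clusters

people = [(1, 1), (1.2, 1.1), (5, 5), (5.1, 4.9), (9, 1)]
groups = agglomerative(people, 3)
```

Recording the order of merges (rather than stopping at a fixed count) yields the familiar dendrogram: the first merge is the "best match," the second merge the next best, and so on.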
22.Example: Clustering recovering mental patients What are the relationships between subjective and objective measures of mental illness recovery? What are the profiles of those recovered people in terms of their demographic and clinical attributes based on different configurations of the subjective and objective measures of recovery?
23.Subjective recovery scale (E2 Stage model)
24.Subjective recovery scale
25.Subjective recovery scale
26.Objective recovery scale 1: Vocational status The numbers on the right are the original codes. They were recoded to six levels so that the scale is ordinal; e.g., employed full time at the expected level is better than below the expected level.
27.Objective recovery scale 2: Living status The numbers on the right are the original categories. They were collapsed and recoded so that the scale is converted from nominal to ordinal; e.g., head of household is better than living with family under supervision.
28.Participants 150 recovering or recovered patients (e.g., with bipolar disorder or schizophrenia) in Hong Kong who had not been hospitalized in the past six months.
29.Analysis: Correlations among the scales The Spearman correlation coefficients are small but significant at the .05 or .01 level. However, the numbers alone might be misleading, and further insight could be unveiled via data visualization.
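Spearman's rho is simply the Pearson correlation computed on ranks, which is why it suits the ordinal recovery scales above. A minimal sketch, using hypothetical toy scores rather than the study's data:

```python
def ranks(xs):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

subjective = [1, 2, 3, 4, 5, 6]   # hypothetical subjective-recovery scores
vocational = [2, 1, 4, 3, 6, 5]   # hypothetical vocational-status levels
rho = spearman(subjective, vocational)
```

A significant but modest rho like this is exactly the situation the slide warns about: the single number can hide distinct subgroup patterns that only a plot (or a cluster analysis) reveals.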