PCA applies to continuous variables; its aim is to summarize variance (squared deviations). Nominal variables can be recoded as binary, but squared deviations are not meaningful for binary variables.
For categorical variables, MCA (multiple correspondence analysis) is better suited.
PCA can be applied to ordinal data (e.g., Likert scales), but the correlations tend to be weak.
For classification problems, therefore, we use cluster analysis.
Cluster Analysis
Statistical method of partitioning a sample into homogeneous classes
Purpose
- Sort observations into groups (or clusters) such that the degree of association is:
- Strong between members of the same cluster
- Weak between members of different clusters
- Define a formal classification scheme that was not previously evident
Supervised vs unsupervised learning
- Supervised
Can train your model and use it on "new" data with some accuracy
Initial model: use a portion of the data to "train" the model and "test" it on the remaining portion
e.g., linear and logistic regression, negative binomial regression
- Unsupervised
Does not use output data for further learning
e.g., cluster analysis
Classification
- The classification produced depends strongly on the particular method used; cluster analysis lacks an underlying body of statistical theory (it is heuristic in nature)
- It is possible to measure similarity and dissimilarity in a number of ways
- No such thing as a single correct classification
- Requires decisions by the user that can strongly influence the results
Hierarchical or non-hierarchical
Hierarchical: Resultant classification has an increasing number of nested classes
Non-hierarchical: There is no hierarchy and the data are partitioned
- Have a pre-determined number of cluster groups
- k-means clustering
Divisive or agglomerative
Divisive (top-down): begins with all cases in one cluster, which is gradually broken down into smaller clusters
Agglomerative (bottom-up; more commonly used): starts with single-member clusters that are gradually fused until one large cluster is formed
Classification Scheme: Monothetic or Polythetic
Monothetic: cluster membership is based on the presence or absence of a single characteristic, i.e., observations are classified by whether or not they have that one feature
Polythetic: uses more than one characteristic (variable); more commonly used
Polythetic, agglomerative classification steps
- Distance measures
Distances may be one-dimensional or multi-dimensional, and may be actual distances or derived ones
- Example distance measures (written out compactly below)
-- Euclidean distance (most common): the square root of the sum of squared differences between two individuals across the variables
-- Squared Euclidean distance: the sum of squared differences across the variables
-- City-block (Manhattan) distance: the sum of absolute differences; cannot go along a diagonal
-- Chebychev distance: defines two objects as different if they differ on ANY dimension; the maximum absolute difference across the variables
-- Power distance: allows progressive weighting of individual dimensions
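For two observations $x$ and $y$ measured on $n$ variables, these measures can be written as follows (a reference sketch; the exponents $p$ and $r$ in the power distance are user-chosen parameters, and the power-distance form is one common convention):

```latex
d_{\text{Euclidean}}(x, y)   = \sqrt{\textstyle\sum_{i=1}^{n} (x_i - y_i)^2}
d_{\text{SqEuclidean}}(x, y) = \textstyle\sum_{i=1}^{n} (x_i - y_i)^2
d_{\text{CityBlock}}(x, y)   = \textstyle\sum_{i=1}^{n} |x_i - y_i|
d_{\text{Chebychev}}(x, y)   = \max_{i} |x_i - y_i|
d_{\text{Power}}(x, y)       = \left( \textstyle\sum_{i=1}^{n} |x_i - y_i|^{p} \right)^{1/r}
```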
- Clustering methods: example linkages
In agglomerative hierarchical clustering, the distance between clusters is defined by the chosen linkage method (see the R sketch after this list)
-- Single (simple) linkage: nearest-neighbor distance (two clusters are close if ANY object in one cluster is close to ANY object in the other)
-- Complete linkage: measures distance between furthest objects
-- Average linkage: based on distance from all objects in a group
-- Centroid linkage: uses group centroids
-- Ward’s method: uses sum of squares (variances)
-- Density linkage
-- Maximum likelihood...
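A minimal R sketch of how the linkage choice enters `hclust()` (the data set `mtcars` and the chosen variables are just a hypothetical example):

```r
# Agglomerative hierarchical clustering under different linkage methods.
df <- scale(mtcars[, c("mpg", "hp", "wt")])  # standardize the variables first
d  <- dist(df, method = "euclidean")         # pairwise distance matrix

hc_single   <- hclust(d, method = "single")    # nearest neighbour
hc_complete <- hclust(d, method = "complete")  # furthest neighbour
hc_average  <- hclust(d, method = "average")   # mean pairwise distance
hc_ward     <- hclust(d, method = "ward.D2")   # Ward's minimum-variance method

plot(hc_ward, main = "Ward's method")          # dendrogram for one of them
```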
How many clusters do we need?
You can decide yourself
You can also use AIC = -2logL + 2k or BIC = -2logL + k·log(n)
and analyze the result with a dendrogram
In R, use a silhouette plot to determine the number of clusters; preferably do not use a scree plot
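A sketch of such a silhouette check in R, assuming the `cluster` package and using `USArrests` as a stand-in data set:

```r
library(cluster)  # provides silhouette()

d  <- dist(scale(USArrests))         # hypothetical example data
hc <- hclust(d, method = "ward.D2")

# Compare mean silhouette width across candidate numbers of clusters.
for (k in 2:6) {
  sil <- silhouette(cutree(hc, k = k), d)
  cat("k =", k, " mean silhouette width:", round(mean(sil[, "sil_width"]), 3), "\n")
}

plot(silhouette(cutree(hc, k = 3), d))  # silhouette plot for a 3-cluster cut
```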
What counts as a successful classification?
Large differences between clusters, small differences within clusters:
- Each cluster is very different from the other clusters (between-cluster heterogeneity)
- Individuals within a cluster are as similar as possible (within-cluster homogeneity)
Variance measures
- Root Mean Square Standard Deviation (RMSSTD): measures homogeneity within clusters; smaller is better
- Semi-partial R^2: measures the loss of homogeneity due to merging (the smaller the value, the more similar the two merged clusters); useful when deciding merges
- Centroid distance: measures the heterogeneity of the clusters merged (the larger the value, the more different the two clusters)
- R^2 (RS): the extent to which clusters differ from each other; large if they are very different
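Two of these can be written in terms of the total sum of squares $SS_T$ and the pooled within-cluster sum of squares $SS_W$ (a sketch; exact definitions vary slightly across texts, and $n_j$ denotes the size of cluster $j$):

```latex
R^2 = \frac{SS_T - SS_W}{SS_T},
\qquad
\text{RMSSTD} = \sqrt{\frac{SS_W}{\sum_j (n_j - 1)}}
```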
There are many algorithms for optimizing the number of clusters
- Cubic Clustering Criterion (CCC) (Sarle, 1983): comparing the R2 for a cluster number with a default cluster
- Dynamic Local Search solves the number and location of the clusters jointly (Kärkkäinen and Fränti, ICPR 2002)
Hierarchical clustering
- Set up the distance matrix (use a non-Euclidean distance for categorical variables)
- Use the agglomerative approach (at this point every observation is its own cluster)
- Choose the linkage method
- Cut the tree to three clusters
- Check the data
- Plot the groups (the whole workflow is sketched below)
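The steps above as a minimal R sketch (hypothetical numeric data; three clusters as in the notes):

```r
df <- scale(USArrests)               # hypothetical numeric data, standardized
d  <- dist(df)                       # 1. set up the distance matrix
hc <- hclust(d, method = "average")  # 2-3. agglomerative approach + linkage method

plot(hc)                             # inspect the dendrogram
groups <- cutree(hc, k = 3)          # 4. cut the tree to three clusters
table(groups)                        # 5. check the data: cluster sizes
pairs(USArrests, col = groups)       # 6. plot the groups
```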
Computing a dissimilarity matrix with categorical data
Using Gower distance
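A sketch using `daisy()` from the `cluster` package, which computes Gower dissimilarities for categorical or mixed data (the toy data frame is hypothetical):

```r
library(cluster)  # provides daisy()

# Hypothetical mixed data: two categorical variables and one numeric variable.
df <- data.frame(
  colour = factor(c("red", "red", "blue", "green", "blue")),
  size   = factor(c("S", "M", "M", "L", "S"), ordered = TRUE),
  score  = c(1.2, 0.8, 2.5, 2.7, 1.9)
)

d_gower <- daisy(df, metric = "gower")     # Gower dissimilarity matrix
hc <- hclust(d_gower, method = "average")  # feed it into hierarchical clustering
plot(hc)
```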
Non-hierarchical
- No hierarchical structure
- The number of groups is decided in advance (the key difference from hierarchical clustering)
- Three major approaches:
(1) sequential threshold: forms one cluster at a time, processing all objects before starting the next cluster
(2) parallel threshold: forms several clusters simultaneously, updating assignments as it goes; the membership threshold distance is adjusted along the way
(3) optimizing: reassigns objects in order to optimize an overall criterion
- Setting a seed
Partitioning begins from randomly chosen centroids (seeds), so results depend on the random start
k-means clustering, a typical non-hierarchical clustering method
sequential threshold
Step 1: choose k initial cluster centers
Step 2: consider each observation and assign it to a cluster
Step 3: recompute the cluster centers after each assignment
Strengths: very efficient on large data sets; note also that it typically terminates at a local optimum and that the resulting clusters are convex
Weaknesses: 1. applies only to numeric data; 2. the number of clusters must be specified in advance; 3. does not handle noisy data and outliers well; 4. unsuitable for non-convex clusters
k-means results can differ depending on:
1. the initial choice of the k centers; 2. how similarity is computed; 3. the strategy for computing group means (hill-climbing)
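A minimal k-means sketch in R; because the result depends on the initial centers, the seed is fixed and several random starts are used (`USArrests` is just a hypothetical example):

```r
set.seed(123)                                # fix the random start for reproducibility

df <- scale(USArrests)                       # k-means needs numeric data
km <- kmeans(df, centers = 3, nstart = 25)   # 25 random starts to avoid a poor local optimum

km$size                                      # observations per cluster
km$centers                                   # final cluster centers
plot(df[, 1:2], col = km$cluster, pch = 19)  # quick look at the grouping
```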
To cluster categorical data we use k-modes; k-prototypes handles both categorical and numeric data
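A k-modes sketch, assuming the `klaR` package (which provides a `kmodes()` implementation) is installed; the data frame is hypothetical:

```r
library(klaR)  # provides kmodes() (assumed installed)

# Hypothetical categorical data frame.
df <- data.frame(
  colour = c("red", "red", "blue", "green", "blue", "red"),
  shape  = c("circle", "square", "square", "circle", "circle", "square")
)

km <- kmodes(df, modes = 2)  # cluster the categorical data into 2 groups
km$cluster                   # cluster membership
```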
Finally, adding an analysis makes it clear that the corona observations are classified mainly into the second cluster.