Calculating distance between observations
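lims(x = c(-30, 30), y = c(-20, 20))  # added to a ggplot to set the axis ranges
dist(two_players)  # dist() on a data frame computes the pairwise distances between all of its rows (observations)
Calling scale() on the data frame before dist() removes the effect of variables measured on very different scales, for example one in kilometres and another in millilitres. scale() standardizes each column: it centers the column to mean 0 and divides it by its standard deviation.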
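A minimal sketch of the scale-then-dist step, using only base R. The data frame trips and its columns km and ml are made up for illustration.

```r
# Hypothetical data: one column in kilometres, one in millilitres.
trips <- data.frame(
  km = c(1, 2, 3),
  ml = c(1000, 5000, 1200)
)

# Raw distances are dominated by the ml column because its values are larger.
dist_raw <- dist(trips)

# scale() centers each column to mean 0 and divides by its standard deviation,
# so both columns contribute on a comparable footing.
trips_scaled <- scale(trips)
dist_scaled <- dist(trips_scaled)

print(dist_raw)
print(dist_scaled)
```

Comparing the two printed matrices shows how much of the raw distance came from the ml column alone.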
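How can dist() be used when the data frame holds categorical values such as YES/NO or LOW/MIDDLE/HIGH?
library(dummies)  # first, load the dummies package
dummy_survey <- dummy.data.frame(job_survey)  # convert each categorical column into 0/1 dummy (indicator) columns
dist_survey <- dist(dummy_survey, method = 'binary')  # then compute distances with the binary method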
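The same idea can be sketched without the dummies package. The example below builds the indicator columns by hand in base R, which is a stand-in for dummy.data.frame() and not that function itself; survey_demo and its columns are hypothetical.

```r
# Hypothetical survey with two categorical columns.
survey_demo <- data.frame(
  remote = c("YES", "NO", "YES"),
  level  = c("LOW", "HIGH", "LOW")
)

# Build one 0/1 indicator column per category level (hand-rolled one-hot encoding).
one_hot <- do.call(cbind, lapply(names(survey_demo), function(col) {
  lv <- unique(survey_demo[[col]])
  m <- sapply(lv, function(l) as.integer(survey_demo[[col]] == l))
  colnames(m) <- paste(col, lv, sep = "_")
  m
}))

# Binary distance: among the columns where at least one row has a 1,
# the share of columns where only one of the two rows has a 1.
dist_demo <- dist(one_hot, method = "binary")
print(dist_demo)
```

Rows 1 and 3 give identical answers, so their distance is 0; rows 1 and 2 share no answer, so their distance is 1.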
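Possible values of method:
euclidean  Euclidean distance: the square root of the sum of squared differences.
maximum  Chebyshev distance: the largest absolute difference in any coordinate.
manhattan  Manhattan distance: the sum of absolute differences.
canberra  Canberra (Lance-Williams) distance.
minkowski  Minkowski distance; the power p must be specified.
binary  Distance for binary (0/1) variables, i.e. the Jaccard distance.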
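To make the list concrete, here is a small sketch that applies several of these methods to the same two points. The points (0, 0) and (3, 4) are chosen only because the arithmetic is easy to check by hand.

```r
# Two illustrative points, one per row.
pts <- rbind(a = c(0, 0), b = c(3, 4))

d_euclidean <- dist(pts, method = "euclidean")         # sqrt(3^2 + 4^2)
d_maximum   <- dist(pts, method = "maximum")           # max(|3|, |4|)
d_manhattan <- dist(pts, method = "manhattan")         # |3| + |4|
d_minkowski <- dist(pts, method = "minkowski", p = 3)  # (3^3 + 4^3)^(1/3)

print(c(
  euclidean = as.numeric(d_euclidean),
  maximum   = as.numeric(d_maximum),
  manhattan = as.numeric(d_manhattan),
  minkowski = as.numeric(d_minkowski)
))
```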
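The distance matrix gives the pairwise distances between observations. The distance from another observation to a group of observations (a cluster) depends on the linkage criterion, which has three common choices:
- Complete: the resulting distance is based on the maximum, max()
- Single: the resulting distance is based on the minimum, min()
- Average: the resulting distance is based on the average, mean()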
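The three linkage rules can be checked by hand on a tiny example. The sketch below assumes three made-up points on a line (A = 0, B = 1, C = 4): A and B merge first, and the height of the second merge is the linkage distance from the group {A, B} to C.

```r
# Three illustrative 1-D points.
pts <- matrix(c(0, 1, 4), ncol = 1, dimnames = list(c("A", "B", "C"), "x"))
d <- dist(pts)
d_mat <- as.matrix(d)

# Distances from each member of the group {A, B} to C.
to_c <- d_mat[c("A", "B"), "C"]

# Linkage distances computed by hand.
linkage_by_hand <- c(
  complete = max(to_c),
  single   = min(to_c),
  average  = mean(to_c)
)

# The same values appear as the height of the second merge in hclust().
linkage_hclust <- sapply(c("complete", "single", "average"), function(m) {
  hclust(d, method = m)$height[2]
})

print(linkage_by_hand)
print(linkage_hclust)
```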
hc_players <- hclust(dist_players, method = "complete")
clusters_k2 <- cutree(hc_players, k = 2)
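hclust() performs hierarchical clustering.
cutree(k = ) cuts the tree into k clusters and returns the cluster assignment of each observation.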
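A small end-to-end sketch of dist(), hclust() and cutree(), assuming a made-up data frame lineup_demo with two visibly separate groups of points:

```r
# Hypothetical positions: three points near (0, 0) and three near (10, 10).
lineup_demo <- data.frame(
  x = c(0, 1, 0, 10, 11, 10),
  y = c(0, 0, 1, 10, 10, 11)
)

# Hierarchical clustering on Euclidean distances with complete linkage.
hc_demo <- hclust(dist(lineup_demo), method = "complete")

# cutree() returns one cluster label per row of the original data.
clusters_demo <- cutree(hc_demo, k = 2)
print(clusters_demo)
```

The result is a vector with one entry per observation, which is why it can later be attached to the data frame as a new column.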
library(dendextend)
dist_players <- dist(lineup, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")
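dend_players <- as.dendrogram(hc_players)  # converts the hclust object into a "dendrogram" object, the class that dendextend functions work on
plot(dend_players)  # draws the dendrogram (tree diagram)
dend_20 <- color_branches(dend_players, h = 20)  # color_branches() colors the branches by cluster; h is the height at which the tree is cut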
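dist_customers <- dist(customers_spend)  # pairwise distances between customers
hc_customers <- hclust(dist_customers, method = "complete")  # cluster with hclust()
plot(hc_customers)  # plot the resulting dendrogram
clust_customers <- cutree(hc_customers, h = 15000)  # cut the tree at height h; with complete linkage, all members of a cluster are then at most h (here 15000) apart
segment_customers <- mutate(customers_spend, cluster = clust_customers)  # add the cluster assignments from cutree() to the original data frame as a new column, cluster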
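To see what the h argument does, the sketch below cuts one small tree at three different heights. The data frame spend_demo is made up, and the cluster column is attached with base R assignment instead of mutate() so that the example needs no extra packages.

```r
# Hypothetical spending amounts for three customers.
spend_demo <- data.frame(amount = c(0, 1, 4))

hc_spend <- hclust(dist(spend_demo), method = "complete")

# Merge heights: customers 1 and 2 join at 1, customer 3 joins at 4.
print(hc_spend$height)

# Cutting below the first merge, between the merges, and above the last merge.
n_clusters <- sapply(c(0.5, 2, 5), function(h) {
  length(unique(cutree(hc_spend, h = h)))
})
print(n_clusters)

# Attach the assignment for h = 2 as a new column (base R stand-in for mutate()).
spend_demo$cluster <- cutree(hc_spend, h = 2)
print(spend_demo)
```

A lower h leaves more, tighter clusters; a higher h merges everything into fewer clusters.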
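ifelse in ggplot
K-means clustering
model_km2 <- kmeans(lineup, centers = 2)  # fit a k-means model; centers is k, the number of clusters (shown as colors in the plot below)
clust_km2 <- model_km2$cluster  # extract the cluster assignment vector from the model
lineup_km2 <- mutate(lineup, cluster = clust_km2)  # add the assigned clusters to the original data frame
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
  geom_point()  # scatter plot that shows the grouping
Why factor()? The cluster column is stored as integers, which ggplot treats as a continuous variable and maps to a color gradient. Converting it to a factor makes it discrete, so the legend changes from a continuous 1~2 scale to the separate categories 1 and 2.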
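A minimal base R sketch of the k-means steps above, without the plot. The data frame lineup_demo is hypothetical; set.seed() and nstart = 10 are additions made here so that the random starting centers give a repeatable result.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Hypothetical positions: three points near (0, 0) and three near (10, 10).
lineup_demo <- data.frame(
  x = c(0, 1, 0, 10, 11, 10),
  y = c(0, 0, 1, 10, 10, 11)
)

# nstart = 10 tries ten random starts and keeps the best one.
model_demo <- kmeans(lineup_demo, centers = 2, nstart = 10)

# The cluster component holds one numeric label per row.
lineup_demo$cluster <- model_demo$cluster

# As a factor, the labels become discrete categories instead of numbers.
cluster_factor <- factor(lineup_demo$cluster)
print(levels(cluster_factor))
print(table(cluster_factor))
```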
library(purrr)
tot_withinss <- map_dbl(1:10, function(k){
  model <- kmeans(x = lineup, centers = k)
  model$tot.withinss
})
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)
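Fit a model for each k from 1 to 10 and record its total within-cluster sum of squares (tot.withinss); elbow_df can then be plotted as an elbow plot to choose k.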
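The same loop can be sketched in base R with sapply() in place of map_dbl(). The data frame pts_demo is made up (two tight groups of five points), and set.seed() and nstart are assumptions added for repeatability.

```r
set.seed(1)

# Hypothetical data: five points near (0.5, 0.5) and five near (10.5, 10.5).
pts_demo <- data.frame(
  x = c(0, 0, 1, 1, 0.5, 10, 10, 11, 11, 10.5),
  y = c(0, 1, 0, 1, 0.5, 10, 11, 10, 11, 10.5)
)

# Total within-cluster sum of squares for k = 1 to 5.
tot_withinss_demo <- sapply(1:5, function(k) {
  kmeans(pts_demo, centers = k, nstart = 10)$tot.withinss
})

elbow_demo <- data.frame(k = 1:5, tot_withinss = tot_withinss_demo)
print(elbow_demo)

# With k = 1 the within-cluster sum of squares equals the total sum of squares.
total_ss <- sum(scale(pts_demo, scale = FALSE)^2)
```

For data like this, the value drops sharply from k = 1 to k = 2 and only slightly afterwards; that bend is the "elbow" that suggests k = 2.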
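library(cluster)
pam_k2 <- pam(lineup, k = 2)  # pam() plays a similar role to kmeans(): both fit a clustering model
kmeans partitions around cluster means and is therefore sensitive to outliers. pam is more robust: it partitions around medoids, which are actual observations.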
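The difference between a mean and a medoid shows up clearly with a single outlier. The sketch below uses a made-up column of five values and a single cluster (k = 1) purely to compare the two centers.

```r
library(cluster)  # provides pam()

# Hypothetical values with one extreme outlier.
values_demo <- data.frame(value = c(1, 2, 3, 4, 100))

# k-means center: the mean, pulled toward the outlier.
km_center <- as.numeric(kmeans(values_demo, centers = 1)$centers)

# pam center: the medoid, an actual observation that minimizes
# the total distance to all other observations.
pam_center <- as.numeric(pam(values_demo, k = 1)$medoids)

print(c(kmeans_center = km_center, pam_medoid = pam_center))
```

The mean lands at 22, a value that resembles none of the observations, while the medoid stays at 3.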
silhouette()
plot(silhouette(pam_k2))  # draws the silhouette plot, a bar chart of each observation's silhouette width
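sil_width <- map_dbl(2:10, function(k){
  model <- pam(x = customers_spend, k = k)
  model$silinfo$avg.width
})
sil_df <- data.frame(
  k = 2:10,
  sil_width = sil_width
)
ggplot(sil_df, aes(x = k, y = sil_width)) +
  geom_line() +
  scale_x_continuous(breaks = 2:10)
Fit pam() for each k from 2 to 10, then plot the average silhouette width against k as a line chart to choose k (a higher average width indicates better-separated clusters).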
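A compact version of this search, sketched with sapply() in place of map_dbl() and without the plot. The data frame pts_demo is hypothetical. The range starts at k = 2 because the silhouette width is not defined for a single cluster.

```r
library(cluster)  # provides pam()

# Hypothetical data: five points near (0.5, 0.5) and five near (10.5, 10.5).
pts_demo <- data.frame(
  x = c(0, 0, 1, 1, 0.5, 10, 10, 11, 11, 10.5),
  y = c(0, 1, 0, 1, 0.5, 10, 11, 10, 11, 10.5)
)

k_values <- 2:5

# Average silhouette width of the pam() model for each k.
sil_width_demo <- sapply(k_values, function(k) {
  pam(pts_demo, k = k)$silinfo$avg.width
})

sil_demo <- data.frame(k = k_values, sil_width = sil_width_demo)
print(sil_demo)

# Pick the k with the highest average silhouette width.
best_k <- k_values[which.max(sil_width_demo)]
print(best_k)
```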
segment_customers %>%
  group_by(cluster) %>%
  summarise_all(mean)
Summarise by cluster to review the earlier result: the mean of every variable within each cluster.
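The per-cluster summary can also be sketched in base R with aggregate(), shown here as a stand-in for the dplyr pipeline. The data frame segments_demo and its columns milk and grocery are made up.

```r
# Hypothetical customers that already carry a cluster label.
segments_demo <- data.frame(
  milk    = c(10, 20, 100, 120),
  grocery = c(5, 15, 50, 70),
  cluster = c(1, 1, 2, 2)
)

# Mean of every other column within each cluster.
cluster_means <- aggregate(. ~ cluster, data = segments_demo, FUN = mean)
print(cluster_means)
```

Each row of the result describes one cluster, which makes it easy to see how the segments differ.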
Case Study: National Occupational mean wage
library(tibble)
rownames_to_column(as.data.frame(oes), var = 'occupation')  # moves the row names of the data frame into a regular column; var = '...' gives the name of that new column
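What rownames_to_column() does can be sketched with a base R equivalent. The matrix oes_demo, its occupations and its wage figures are invented for illustration and are not taken from the real oes data.

```r
# Hypothetical wage matrix with occupations stored as row names.
oes_demo <- matrix(
  c(50000, 52000,
    30000, 31000),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Managers", "Cooks"), c("wage_2001", "wage_2002"))
)

oes_df <- as.data.frame(oes_demo)

# Base R stand-in for rownames_to_column(oes_df, var = "occupation"):
# copy the row names into a first column, then drop the row names.
oes_with_names <- cbind(occupation = rownames(oes_df), oes_df)
rownames(oes_with_names) <- NULL

print(oes_with_names)
```

Once the occupation is a regular column, it can be used in reshaping or plotting like any other variable.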