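Venn diagrams work well for a small number of sets, but with more than about five sets they become hard to read. For those cases, UpSetR is a better choice for plotting set intersections.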
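To see why Venn diagrams stop scaling, it helps to count regions: n sets can have up to 2^n - 1 non-empty membership combinations. A quick base R check (the variable names are illustrative only):

```r
# Number of possible non-empty set combinations for n sets: 2^n - 1
n_sets <- 2:8
n_regions <- 2^n_sets - 1

# Show the growth side by side
print(data.frame(sets = n_sets, regions = n_regions))
```

Five sets already allow 31 combinations, which is roughly where a Venn diagram stops being readable.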
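1. Installing the R package and loading the example file
install.packages("UpSetR") # install the release version from CRAN
devtools::install_github("hms-dbmi/UpSetR") # or install the development version from GitHub
library(UpSetR)
setwd("path/to/working/directory") # set to your own working directory
library(ggplot2)
library(plyr)
library(gridExtra)
library(grid)
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), header = TRUE, sep = ";")
View(movies) # inspect the example file (note the capital V)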
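The example file bundled with the package is shown in the figure. The first column is the movie title and the second is the release year; the columns after that classify each movie by genre (Action, Comedy, and so on), and the AvgRating and Watches attributes used later are also included. It is worth a quick look at the data before plotting.
![Viewing the example file](https://upload-images.jianshu.io/upload_images/28604302-a7f6f1f7fa188835.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)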
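To make the input layout concrete, here is a minimal sketch with a hypothetical toy data frame (`toy` and its values are invented for illustration, not taken from movies.csv). It assumes the same shape as the example file: descriptive columns first, then one 0/1 column per set.

```r
# Hypothetical toy data in the same layout as the example file:
# descriptive columns first, then one 0/1 membership column per set.
toy <- data.frame(
  Name        = c("m1", "m2", "m3", "m4", "m5"),
  ReleaseDate = c(1980, 1990, 1990, 2000, 2000),
  Action      = c(1, 0, 1, 1, 0),
  Comedy      = c(0, 1, 1, 0, 0),
  Drama       = c(1, 1, 1, 0, 1)
)

set_cols <- c("Action", "Comedy", "Drama")

# Size of each set = number of rows with a 1 in that column
set_sizes <- colSums(toy[, set_cols])
print(set_sizes)
```

With UpSetR loaded, a data frame of this shape could be passed on as `upset(toy, sets = set_cols)`.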
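2. Basic arguments of the upset function
upset(movies,
      order.by = "freq", # sort intersections by size ("freq") or by number of sets involved ("degree")
      nsets = 5, # number of sets to show; the largest sets are chosen
      # sets = c("Drama","Comedy","Action","Thriller","Western","Documentary"), # alternatively, name the sets explicitly with sets
      nintersects = 30, # number of intersections to show
      mb.ratio = c(0.55, 0.45), # relative heights of the bar chart and the matrix
      number.angles = 30, # angle of the numbers above the bars
      point.size = 3, # size of the points in the matrix
      line.size = 1.2, # width of the lines in the matrix
      mainbar.y.label = "size of intersection", # y-axis title of the top bar chart
      sets.x.label = "the number of each sets", # x-axis title of the set size bar chart on the left
      text.scale = c(1.2, 1.3, 1, 1, 2, 1.2), # text scaling: intersection size title, intersection size tick labels, set size title, set size tick labels, set names, numbers above bars
      matrix.color = "firebrick", # color of the points and lines in the matrix
      main.bar.color = "steelblue", # color of the top bar chart
      sets.bar.color = "grey70" # color of the set size bars
)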
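The basic plot produced by this call reads as follows:
1). In the matrix, a red dot means the set takes part in that intersection and a gray dot means it does not; a red line connects the sets that make up one intersection.
2). The blue bars at the top show the size of each intersection.
3). The Set Size bars on the left show the sets used in the plot and the size of each.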
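The bar heights can be reproduced by hand, which may help when reading the plot. The sketch below uses a hypothetical toy data frame and base R only; it assumes each bar counts the rows that belong to exactly that combination of sets.

```r
# Hypothetical toy membership data: one 0/1 column per set
toy <- data.frame(
  Action = c(1, 0, 1, 1, 0, 0),
  Comedy = c(0, 1, 1, 0, 0, 1),
  Drama  = c(1, 1, 1, 0, 1, 1)
)

# Label each row with the exact combination of sets it belongs to
membership <- apply(toy, 1, function(r) paste(names(toy)[r == 1], collapse = "&"))

# Count rows per combination, largest first (analogous to order.by = "freq")
intersection_sizes <- sort(table(membership), decreasing = TRUE)
print(intersection_sizes)
```

Each entry of `intersection_sizes` corresponds to one column of the matrix and the bar above it.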
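3. Advanced usage: queries
The main parameters of a query:
query: what to search for (for example, intersections or elements)
params: the list of parameters the query operates on
color: the color used to show the query on the plot; if none is given, one is picked from the UpSetR default palette
active: if TRUE, the intersection size bar is overlaid by a bar representing the query; if FALSE, the bar is not overlaid and the query is marked by a jitter point on the bar instead.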
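Example 1: Highlighting intersections
upset(movies,
      queries = list(
        list(query = intersects, # look for an intersection
             params = list("Drama", "Comedy", "Action"), # the intersection of "Drama", "Comedy" and "Action"
             color = "orange", # shown in orange
             active = T), # overlay the query on the bar
        list(query = intersects,
             params = list("Drama"), # a single set: highlights the movies that are only in "Drama"
             color = "red", # shown in red
             active = F), # do not overlay the bar; the query is still marked by a point
        list(query = intersects,
             params = list("Action", "Drama"), # the intersection of "Action" and "Drama"
             active = T))) # no color given, so one is taken from the UpSetR default palette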
Example 2: Querying specific elements
upset(movies,
      queries = list(
        list(query = elements, # look for matching elements in the data
             params = list("AvgRating", 3.5, 4.1), # attribute name, followed by the values to match
             color = "blue",
             active = T),
        list(query = elements,
             params = list("ReleaseDate", 1980, 1990, 2000),
             color = "red", active = F)))
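As a rough base R analogue of what an element query selects, the sketch below filters a hypothetical toy data frame. It assumes the query keeps the rows whose attribute equals one of the listed values; this is an illustration, not UpSetR's internal code.

```r
# Hypothetical toy data with one attribute column and one set column
toy <- data.frame(
  AvgRating = c(3.5, 4.1, 2.0, 3.5, 4.8),
  Drama     = c(1, 1, 0, 0, 1)
)

# Same shape as params = list("AvgRating", 3.5, 4.1):
# the attribute name first, then the values to match
attribute <- "AvgRating"
values <- c(3.5, 4.1)

# Rows whose attribute equals one of the listed values
matched <- toy[toy[[attribute]] %in% values, ]
print(matched)
```

On the plot, the matched rows would then be highlighted within whichever intersections they fall into.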
Example 3: Using the expression argument to subset element and intersection queries
upset(movies, queries = list(
        list(
          query = intersects,
          params = list("Action", "Drama"),
          active = T),
        list(
          query = elements,
          params = list("ReleaseDate", 1980, 1990, 2000),
          color = "red",
          active = F)),
      expression = "AvgRating > 3 & Watches > 100") # keep only query results with AvgRating above 3 and Watches above 100
This is a little hard to grasp at first; comparing the plots with and without the expression argument makes it clear.
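One way to build intuition for the expression string is to evaluate the same condition against a small data frame in base R. This is only a sketch with hypothetical toy data; it does not claim to mirror how UpSetR evaluates the string internally.

```r
# Hypothetical toy data with the two attributes used in the expression
toy <- data.frame(
  Name      = c("m1", "m2", "m3", "m4"),
  AvgRating = c(2.5, 3.4, 4.0, 3.9),
  Watches   = c(500, 80, 150, 1200)
)

# The same condition, written as a string
expr <- "AvgRating > 3 & Watches > 100"

# Evaluate the string with the data frame columns in scope
keep <- eval(parse(text = expr), envir = toy)

# Only rows satisfying both conditions remain
filtered <- toy[keep, ]
print(filtered)
```

Rows failing either condition are dropped, so the highlighted part of each query shrinks accordingly.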
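Example 4: Custom queries
You can define your own query functions as needed. Two examples follow:
Myfunc <- function(row, release, rating) {
  data <- (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}
# The function takes three arguments: row, release and rating
# It selects the rows whose ReleaseDate is in release and whose AvgRating is greater than rating
# release and rating are supplied through params below: c(1970, 1980, 1990, 1999, 2000) and 2.5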
upset(movies,
      queries = list(
        list(
          query = Myfunc,
          params = list(c(1970, 1980, 1990, 1999, 2000), 2.5),
          color = "blue",
          active = T)))
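A custom query function can be checked on its own before handing it to upset. The sketch below applies a function equivalent to Myfunc row by row with base R `apply` on a hypothetical, numeric-only toy data frame. Note that `apply` coerces mixed-type data frames to character, so the toy data deliberately has no text columns.

```r
# Hypothetical numeric-only toy data
toy <- data.frame(
  ReleaseDate = c(1970, 1985, 1990, 2000),
  AvgRating   = c(3.0, 4.0, 2.0, 2.6)
)

# Same logic as Myfunc above: TRUE when the release year is listed
# and the rating exceeds the threshold
Myfunc <- function(row, release, rating) {
  (row["ReleaseDate"] %in% release) & (row["AvgRating"] > rating)
}

# Apply the function to every row, passing the extra arguments by name
selected <- apply(toy, 1, Myfunc,
                  release = c(1970, 1980, 1990, 1999, 2000),
                  rating = 2.5)
print(selected)
```

A result of one TRUE or FALSE per row is a good sign that the function is written on a row-wise basis.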
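The next example works the same way and can be read like the one above.
between <- function(row, min, max){
  newData <- (row["ReleaseDate"] < max) & (row["ReleaseDate"] > min)
} # TRUE for rows whose ReleaseDate lies strictly between min and max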
upset(movies,
      sets = c("Drama", "Comedy", "Action", "Thriller", "Western", "Documentary"),
      queries = list(
        list(
          query = intersects,
          params = list("Drama", "Thriller")),
        list(query = between,
             params = list(1970, 1980),
             color = "red",
             active = TRUE)))
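Example 5: Adding a legend
Highlighting alone is not always clear enough on the plot, so a legend for the queries is important:
upset(movies,
      query.legend = "top", # position of the legend
      queries = list(
        list(query = intersects,
             params = list("Drama", "Comedy", "Action"),
             color = "orange",
             active = T,
             query.name = "Funny action"), # name shown in the legend
        list(query = intersects,
             params = list("Action", "Drama"),
             active = T,
             query.name = "Emotional action"), # name shown in the legend
        list(query = intersects,
             params = list("Drama"),
             color = "red",
             active = F))) # no query.name given, so a default name based on the query's position is used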
Example 6: Combining several queries
upset(movies, query.legend = "bottom",
      queries = list(
        list(query = Myfunc, # the custom function defined above
             params = list(c(1970, 1980, 1990, 1999, 2000), 2.5),
             color = "orange",
             active = T),
        list(query = intersects, # an intersection query
             params = list("Action", "Drama"),
             active = T),
        list(query = elements, # highlight the specified elements
             params = list("ReleaseDate", 1980, 1990, 2000),
             color = "red",
             active = F,
             query.name = "Decades")),
      expression = "AvgRating > 3 & Watches > 100")
Example 7: Showing attribute distributions below the UpSet plot
upset(movies,
      attribute.plots = list(
        gridrows = 60, # extra grid rows below the UpSet plot to make room for the attribute plots
        plots = list(
          list(plot = scatter_plot, # scatter plot
               x = "ReleaseDate",
               y = "AvgRating"),
          list(plot = scatter_plot,
               x = "ReleaseDate",
               y = "Watches"),
          list(plot = scatter_plot,
               x = "Watches",
               y = "AvgRating"),
          list(plot = histogram, # histogram
               x = "ReleaseDate")),
        ncols = 2))