stat_summary is flexible enough to save a lot of extra coding when used well.
Once the main ggplot() call has defined the aesthetic mapping, you can use stat_summary directly to draw the plot.
ggplot(., aes(x = weight, y = species.coverage, fill = weight)) +
  # geom_boxplot(outlier.size = 1) +
  ## Draw the bars; their heights are the computed group means
  stat_summary(fun = "mean", size = 2, geom = "bar", position = position_dodge(0.75)) +
  ## Add a confidence interval for each bar; "mean_cl_boot" bootstraps it, so the values need not be normally distributed
  stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", width = .15, position = position_dodge(0.75))
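These summary functions come from the Hmisc package; ggplot2's mean_cl_normal, mean_sdl, mean_cl_boot, and median_hilow are wrappers around the Hmisc functions described below.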
smean.cl.normal computes 3 summary variables: the sample mean and lower and upper Gaussian confidence limits based on the t-distribution.
smean.sd computes the mean and standard deviation.
smean.sdl computes the mean plus or minus a constant times the standard deviation.
smean.cl.boot is a very fast implementation of the basic nonparametric bootstrap for obtaining confidence limits for the population mean without assuming normality.
smedian.hilow computes the sample median and a selected pair of outer quantiles having equal tail areas.
These functions all delete NAs automatically.
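To see what these helpers return, here is a minimal sketch that calls the ggplot2 wrappers directly on a numeric vector. It assumes the Hmisc package is installed (the wrappers call it internally); the vector x and the seed are made up for illustration.

```r
library(ggplot2)

set.seed(1)                          # illustrative seed, keeps the bootstrap reproducible
x <- rnorm(50, mean = 10, sd = 2)    # made-up sample

ci_normal <- mean_cl_normal(x)       # mean with t-based confidence limits
ci_boot   <- mean_cl_boot(x)         # mean with bootstrap confidence limits
sd_band   <- mean_sdl(x, mult = 1)   # mean plus or minus 1 standard deviation
med_band  <- median_hilow(x)         # median with a pair of outer quantiles

# Each helper returns a one-row data frame with columns y, ymin, ymax,
# which is the shape stat_summary(fun.data = ...) expects.
print(rbind(ci_normal, ci_boot, sd_band, med_band))
```

Any function that returns the same three columns can be passed to fun.data in the same way.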
With stat_summary, the bar chart, the bootstrap, and the confidence interval are all computed in a single step, which is much simpler than calculating these values yourself before plotting.
Without stat_summary, you have to compute the CI yourself with group_by and summarise, which is more work:
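library(dplyr)
library(tidyr)
library(ggplot2)
df <- data.frame(A = rnorm(2000, mean = 15, sd = 18),
                 B = rnorm(2000, mean = 25, sd = 17)) %>%
  pivot_longer(cols = c(A, B), names_to = "group", values_to = "time") %>%
  # Replace values below 2 with positive ones (rnorm(1, ...) draws a single offset shared by all rows)
  mutate(time = ifelse(time < 2, abs(time) + rnorm(1, 15, 7), time))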
# 95% CI per group using the normal approximation (mean plus or minus 1.96 standard errors)
my_cis <- df %>%
  group_by(group) %>%
  summarize(M = mean(time),
            lwr = M - sd(time) / sqrt(length(time)) * 1.96,
            upr = M + sd(time) / sqrt(length(time)) * 1.96)
df %>%
  ggplot(aes(x = group)) +
  geom_jitter(aes(y = time), width = .1, alpha = .2, color = "pink") +
  geom_errorbar(aes(ymin = lwr, ymax = upr), data = my_cis, width = .13, color = "gray25") +
  geom_point(aes(y = M), data = my_cis, shape = 18, size = 2)
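For comparison, the same kind of plot can be sketched with stat_summary doing the summarising. This is an illustration on simplified made-up data (df2 and p are hypothetical names), and it uses mean_cl_normal, whose t-based limits are close to, but not identical with, the 1.96 normal approximation used in the manual version. It assumes Hmisc is installed.

```r
library(ggplot2)

set.seed(42)  # illustrative seed
# Simplified stand-in for the data above: two groups in long format
df2 <- data.frame(
  group = rep(c("A", "B"), each = 2000),
  time  = c(rnorm(2000, mean = 15, sd = 18), rnorm(2000, mean = 25, sd = 17))
)

p <- ggplot(df2, aes(x = group, y = time)) +
  geom_jitter(width = .1, alpha = .2, color = "pink") +
  # stat_summary computes the mean and its CI per group, no summarise step needed
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
               width = .13, color = "gray25") +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 2)

print(p)
```

The summary table disappears entirely; the trade-off is that the CI method is whatever the chosen fun.data helper implements.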
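Of course, you can also retrieve these CI values from ggplot's stat_summary by using the ggplot_build(g) function, which exposes the data computed for each layer, including the stat_summary layers.
First, store the ggplot call in an object: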
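g <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_jitter(width = 0.5) +
  # fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
  stat_summary(fun = mean, geom = "point", color = "red") +
  # fun.args must be a list of arguments passed on to mean_cl_boot
  stat_summary(fun.data = mean_cl_boot, fun.args = list(conf.int = 0.9999), geom = "errorbar", width = 0.4)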
Then use
ggplot_build(g)$data[[3]]
to get the values computed by mean_cl_boot (the error bars are the third layer):
  x group     y     ymin     ymax PANEL xmin xmax colour size linetype width alpha
1 1     1 1.462 1.386000 1.543501     1  0.8  1.2  black  0.5        1   0.4    NA
2 2     2 4.260 4.024899 4.462202     1  1.8  2.2  black  0.5        1   0.4    NA
3 3     3 5.552 5.337199 5.798202     1  2.8  3.2  black  0.5        1   0.4    NA
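As a sketch of how the extracted data might be tidied up, the snippet below keeps only the summary columns and maps the numeric x positions back to the species names. The object names (g2, ci) are illustrative, the plot contains only the error bar layer so its data sits at index 1, and Hmisc is assumed to be installed.

```r
library(ggplot2)

g2 <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.4)

set.seed(123)  # the bootstrap runs when the plot is built, so fix the seed here
ci <- ggplot_build(g2)$data[[1]][, c("x", "y", "ymin", "ymax")]

# x holds the position of each discrete level; translate it back to its label
ci$Species <- levels(iris$Species)[as.integer(ci$x)]
print(ci)
```

Because the limits are bootstrapped, ymin and ymax vary slightly between runs unless the seed is fixed before ggplot_build is called.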
References:
r - Getting the values computed by stat_summary with mean_cl_boot - Stack Overflow中文网 (Chinese mirror of Stack Overflow)
r - What do ggplot's stat_summary errorbars mean? - Cross Validated (stackexchange.com)
smean.sd: Compute Summary Statistics on a Vector in Hmisc: Harrell Miscellaneous (rdrr.io)
Adding annotations such as the mean, median, and sample size to bar charts and box plots with a custom function
The custom function:
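# Assumes df is a data frame with mpg and cyl columns (e.g. mtcars with cyl converted to a factor)
# Receives the y values of one group and returns the label text plus the height at which to draw it
get_box_stats <- function(y, upper_limit = max(df$mpg) * 1.15) {
  return(data.frame(
    y = 0.95 * upper_limit,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n",
      "Median =", round(median(y), 2), "\n"
    )
  ))
}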
Then pass this function to stat_summary:
ggplot(df, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  scale_fill_manual(values = c("#0099f8", "#e74c3c", "#2ecc71")) +
  stat_summary(fun.data = get_box_stats, geom = "text", hjust = 0.5, vjust = 0.9) +
  theme_classic()
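The contract for such a custom function is small: it receives the y values of one group and returns a one-row data frame whose columns become aesthetics. A minimal self-contained sketch follows, assuming the data is mtcars with cyl converted to a factor; the names cars, n_label, and p are made up for this illustration.

```r
library(ggplot2)

# Assumed setup: mtcars with cyl converted to a factor
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)

# Hypothetical helper: label each box with its sample size,
# positioned at the largest y value in that group
n_label <- function(y) {
  data.frame(y = max(y), label = paste0("n = ", length(y)))
}

p <- ggplot(cars, aes(x = cyl, y = mpg, fill = cyl)) +
  geom_boxplot() +
  stat_summary(fun.data = n_label, geom = "text", vjust = -0.5)

# The computed labels can be inspected without drawing the plot
labels <- ggplot_build(p)$data[[2]]$label
print(labels)
```

Placing each label relative to its own group's maximum, rather than a global upper limit, avoids the dependency on an external data frame inside the function.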