应用R语言完成相关性检验,相关性矩阵及相关性可视化
首先安装相应的R包
require(ggpubr)
## Loading required package: ggpubr
## Loading required package: ggplot2
## Loading required package: magrittr
require(tidyverse)
## Loading required package: tidyverse
## -- Attaching packages ---------------------------------------------------- tidyverse 1.2.1 --
## √ tibble 2.0.1 √ purrr 0.3.0
## √ tidyr 0.8.2 √ dplyr 0.8.0.1
## √ readr 1.3.1 √ stringr 1.4.0
## √ tibble 2.0.1 √ forcats 0.4.0
## -- Conflicts ------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
require(Hmisc)#相关性矩阵计算
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
require(corrplot)#相关性图
## Loading required package: corrplot
## corrplot 0.84 loaded
相关性分析的方法
Pearson correlation,属于最常用的一种,用于计算两个变量之间的线下相关系数,应用于正态分布的数据,可以绘制线下回归曲线 Kendall 以及 Spearman法,是一只能怪基于秩的相关性检验,也称非参数
在R中的计算
相关系数可以使用函数计算
cor函数计算相关系数 cor.test 计算相关系数,返回结果包括相关系数,统计量等
举个简单示例
当x,y是相同的数值向量时,很简单
x=rnorm(10)
y=1:10
cor(x,y,method = "pearson")
## [1] -0.5518297
cor(x, y, method=c("pearson", "kendall", "spearman"))
## [1] -0.5518297
##如果存在缺失值
cor(x, y, method = "pearson", use = "complete.obs")
## [1] -0.5518297
cor.test(x, y, method = "pearson", use = "complete.obs")##处理缺失值
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -1.8716, df = 8, p-value = 0.09817
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8768111 0.1192188
## sample estimates:
## cor
## -0.5518297
示例分析
使用大家熟悉的mtcars作为示例数据集
class(mtcars)
## [1] "data.frame"
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
dim(mtcars)#是一个32行,11列的数据框
## [1] 32 11
先调整下数据格式
my_data <- mtcars
my_data$cyl <- factor(my_data$cyl)##转换位因子作分类
str(my_data)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
下面我们来计算下mpg与drat的相关系数,并且可视化
library(ggpubr)
ggscatter(my_data,
x = "drat", #x变量
y = "mpg",#y变量
add = "reg.line",##拟合曲线
conf.int = TRUE,##置信区间阴影带
cor.coef = TRUE, ##系数
cor.method = "pearson",#方法
xlab = "drat", ## x轴
ylab = "mg")## y轴
预先测试数据是否符合正态分布
Shapiro-Wilk,零检验:正态分布, p>0.05,认为与正态分布没有差异
如果检验不符合正态分布或接近,建议使用非参数检验,spearman等
Shapiro-Wilk 变量正态分布检验
shapiro.test(my_data$mpg) # => p = 0.1229
##
## Shapiro-Wilk normality test
##
## data: my_data$mpg
## W = 0.94756, p-value = 0.1229
shapiro.test(my_data$drat)# => p = 0.1101
##
## Shapiro-Wilk normality test
##
## data: my_data$drat
## W = 0.94588, p-value = 0.1101
# Q-Q图绘制给定样本与理论正态分布之间的相关性,非常相符
library("ggpubr")
# Check for the normality of "mpg""
ggqqplot(my_data$mpg, ylab = "MPG")
res <- cor.test(my_data$drat, my_data$mpg, method = "pearson")
res
##
## Pearson's product-moment correlation
##
## data: my_data$drat and my_data$mpg
## t = 5.096, df = 30, p-value = 1.776e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4360484 0.8322010
## sample estimates:
## cor
## 0.6811719
t值是t检验的统计量
df指自由度
p-value指t检验的检验水平
conf.int指confidence interval相关系数的95%可信区间
sample estimates 是相关系数的值
该结果可以解释为mpg与draft有相关性,相关系数为0.68,p<0.05
考虑到有些新手同学不会获取cor.test的结果
实际上就是一个列表数据的获取
str(res)##发现其实检验结果是返回的列表
## List of 9
## $ statistic : Named num 5.1
## ..- attr(*, "names")= chr "t"
## $ parameter : Named int 30
## ..- attr(*, "names")= chr "df"
## $ p.value : num 1.78e-05
## $ estimate : Named num 0.681
## ..- attr(*, "names")= chr "cor"
## $ null.value : Named num 0
## ..- attr(*, "names")= chr "correlation"
## $ alternative: chr "two.sided"
## $ method : chr "Pearson's product-moment correlation"
## $ data.name : chr "my_data$drat and my_data$mpg"
## $ conf.int : num [1:2] 0.436 0.832
## ..- attr(*, "conf.level")= num 0.95
## - attr(*, "class")= chr "htest"
res$p.value##获取pvalue
## [1] 1.77624e-05
res$conf.int## 95% CI
## [1] 0.4360484 0.8322010
## attr(,"conf.level")
## [1] 0.95
res$estimate## cor
## cor
## 0.6811719
Kendall法相关性检验
注意其中tau是相关系数的值
res2 <- cor.test(my_data$mpg, my_data$wt, method = "kendall")
## Warning in cor.test.default(my_data$mpg, my_data$wt, method = "kendall"):
## Cannot compute exact p-value with ties
res2
##
## Kendall's rank correlation tau
##
## data: my_data$mpg and my_data$wt
## z = -5.7981, p-value = 6.706e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.7278321
Spearman法相关性系数-rho是其相关性系数
res3 <- cor.test(my_data$wt, my_data$mpg, method = "spearman")
## Warning in cor.test.default(my_data$wt, my_data$mpg, method = "spearman"):
## Cannot compute exact p-value with ties
res3
##
## Spearman's rank correlation rho
##
## data: my_data$wt and my_data$mpg
## S = 10292, p-value = 1.488e-11
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.886422