Using R packages (dplyr as an example)
Load the package and data
library(dplyr)
test <- iris[c(1:2,51:52,101:102),]
Five basic dplyr functions
1. mutate(): add a new column
mutate(test,new = Sepal.Length * Sepal.Width)
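A minimal sketch of two points the call above leaves implicit: mutate() can add several columns in one call, and it returns a modified copy rather than changing `test` in place. The column names `area` and `ratio` are hypothetical, chosen only for illustration.

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# mutate() returns a new data frame; `test` stays unchanged
# unless the result is assigned back to it.
test_new <- mutate(test,
                   area  = Sepal.Length * Sepal.Width,  # hypothetical column name
                   ratio = Sepal.Length / Sepal.Width)  # hypothetical column name

ncol(test)      # 5
ncol(test_new)  # 7
```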
2. select(): select columns
2.1 By column number
select(test, 1)
select(test, c(1, 5))
2.2 By column name
select(test, Sepal.Length)
select(test, Petal.Length, Petal.Width)
vars <- c("Petal.Length", "Petal.Width")
select(test, all_of(vars))  # all_of() supersedes one_of() for character vectors
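Beyond column numbers and names, select() also accepts selection helpers, ranges, and negation. The sketch below shows a few common forms on the same `test` subset; it is an illustration, not an exhaustive list.

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# Columns whose names start with a prefix
petal_cols <- select(test, starts_with("Petal"))

# A contiguous range of columns, given by name
range_cols <- select(test, Sepal.Length:Petal.Length)

# Everything except one column
no_species <- select(test, -Species)

names(petal_cols)  # "Petal.Length" "Petal.Width"
names(range_cols)  # "Sepal.Length" "Sepal.Width" "Petal.Length"
names(no_species)  # the four measurement columns
```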
3. filter(): filter rows
filter(test, Species == "setosa")
filter(test, Species == "setosa" & Sepal.Length > 5)
filter(test, Species %in% c("setosa", "versicolor"))
4. arrange(): sort the whole table by one or more columns
arrange(test, Sepal.Length)        # ascending by default
arrange(test, desc(Sepal.Length))  # use desc() to sort in descending order
5. summarise(): compute summary statistics
summarise(test,mean(Sepal.Length),sd(Sepal.Length))
group_by(test,Species)
summarise(group_by(test,Species),mean(Sepal.Length),sd(Sepal.Length))
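The unnamed summaries above produce columns with awkward names such as `mean(Sepal.Length)`. A sketch of the same grouped summary with explicit names follows; `mean_sl`, `sd_sl`, and `n` are hypothetical column names, and n() counts the rows in each group.

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

res <- summarise(group_by(test, Species),
                 mean_sl = mean(Sepal.Length),  # hypothetical column name
                 sd_sl   = sd(Sepal.Length),    # hypothetical column name
                 n       = n())                 # rows per group

res$mean_sl  # 5.00 6.70 6.05
res$n        # 2 2 2
```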
Two handy dplyr techniques
1. The pipe operator %>% (RStudio shortcut: Cmd/Ctrl + Shift + M)
test %>%
  group_by(Species) %>%
  summarise(mean(Sepal.Length), sd(Sepal.Length))
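The pipe passes the left-hand result as the first argument of the next call, so a piped chain and the equivalent nested call should give the same table. A small sketch to check this, with `mean_sl` as a hypothetical column name:

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

# Piped form: read top to bottom
piped <- test %>%
  group_by(Species) %>%
  summarise(mean_sl = mean(Sepal.Length))

# Nested form: read inside out
nested <- summarise(group_by(test, Species),
                    mean_sl = mean(Sepal.Length))

# Compare the two results as plain data frames
isTRUE(all.equal(as.data.frame(piped), as.data.frame(nested)))
```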
2. count(): count the occurrences of each unique value in a column
count(test,Species)
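count() can be read as shorthand for grouping and then counting rows. The sketch below compares it with a hand-written group_by() plus summarise(n = n()) on the same data.

```r
library(dplyr)

test <- iris[c(1:2, 51:52, 101:102), ]

by_count <- count(test, Species)

# The longer equivalent: group, then count rows per group
by_hand <- test %>%
  group_by(Species) %>%
  summarise(n = n())

by_count$n  # 2 2 2
```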
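Handling relational data with dplyr (joining two tables without introducing factors)
options(stringsAsFactors = FALSE)  # already the default since R 4.0.0
test1 <- data.frame(x = c("b", "e", "f", "x"),
                    z = c("A", "B", "C", "D"),
                    stringsAsFactors = FALSE)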
test1
test2 <- data.frame(x = c("a", "b", "c", "d", "e", "f"),
                    y = c(1, 2, 3, 4, 5, 6),
                    stringsAsFactors = FALSE)
test2
1. inner_join(): inner join, keeps only the keys present in both tables
inner_join(test1, test2, by = "x")
2. left_join(): left join, keeps every row of the first table
left_join(test1, test2, by = "x")
left_join(test2, test1, by = "x")
3. full_join(): full join, keeps every row of both tables
full_join(test1, test2, by = "x")
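To make the difference between the three joins concrete, the sketch below compares their row counts on the same two tables and shows where NA appears for unmatched keys.

```r
library(dplyr)

test1 <- data.frame(x = c("b", "e", "f", "x"),
                    z = c("A", "B", "C", "D"),
                    stringsAsFactors = FALSE)
test2 <- data.frame(x = c("a", "b", "c", "d", "e", "f"),
                    y = c(1, 2, 3, 4, 5, 6),
                    stringsAsFactors = FALSE)

inner <- inner_join(test1, test2, by = "x")
left  <- left_join(test1, test2, by = "x")
full  <- full_join(test1, test2, by = "x")

nrow(inner)  # 3: keys b, e, f appear in both tables
nrow(left)   # 4: every row of test1 is kept
nrow(full)   # 7: 4 rows from test1 plus a, c, d from test2
left$y       # 2 5 6 NA: key "x" has no match in test2
```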
4. semi_join(): semi join (returns all rows of x that have a match in y)
semi_join(x = test1, y = test2, by = "x")
5. anti_join(): anti join (returns all rows of x that have no match in y)
anti_join(x = test2, y = test1, by = "x")
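Semi and anti joins are filtering joins: they only keep or drop rows of x and never add columns from y. A sketch on the same tables, including a rough equivalent of semi_join() written with filter() and %in%:

```r
library(dplyr)

test1 <- data.frame(x = c("b", "e", "f", "x"),
                    z = c("A", "B", "C", "D"),
                    stringsAsFactors = FALSE)
test2 <- data.frame(x = c("a", "b", "c", "d", "e", "f"),
                    y = c(1, 2, 3, 4, 5, 6),
                    stringsAsFactors = FALSE)

semi <- semi_join(test1, test2, by = "x")
anti <- anti_join(test2, test1, by = "x")

names(semi)  # "x" "z": no columns from test2 are added
semi$x       # "b" "e" "f": rows of test1 with a match in test2
anti$x       # "a" "c" "d": rows of test2 with no match in test1

# For a single key column, semi_join() selects the same rows as this filter
same <- filter(test1, x %in% test2$x)
```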
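6. Simple binding
bind_rows() and bind_cols() correspond to rbind() and cbind() in base R, respectively.
Note: bind_rows() matches columns by name and fills columns missing from one table with NA, so the tables need not have the same columns; bind_cols() requires both data frames to have the same number of rows.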
test1 <- data.frame(x = c(1,2,3,4), y = c(10,20,30,40))
test1
test2 <- data.frame(x = c(5,6), y = c(50,60))
test2
test3 <- data.frame(z = c(100, 200, 300, 400))  # data_frame() is deprecated
test3
bind_rows(test1,test2)
bind_cols(test1,test3)
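A sketch of how the two functions behave when the tables do not line up. The data frames `a`, `b`, and `d` and the column `w` are hypothetical and exist only for this illustration.

```r
library(dplyr)

a <- data.frame(x = c(1, 2), y = c(10, 20))
b <- data.frame(x = 3, w = "extra")  # hypothetical: has w but no y

# bind_rows() matches columns by name and fills the gaps with NA
combined <- bind_rows(a, b)
names(combined)  # "x" "y" "w"
combined$y       # 10 20 NA

# bind_cols() needs matching row counts: 2 rows vs 3 rows fails
d <- data.frame(z = c(100, 200, 300))  # hypothetical
res <- try(bind_cols(a, d), silent = TRUE)
inherits(res, "try-error")  # TRUE
```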