主要用到dplyr
和tidyr
数据处理和分析以及数据的理解占据一个数据科学项目大量的时间。这一部分工作非常繁琐,但是这一个步骤又是非常有必要以及重要的。
恐怖大师--斯蒂芬·金写的一本写作指南《写作这回事》里面提到了作家的一个工具箱,好的作家需要有一个工具箱,里面放着自己非常顺手的工具。
好的数据科学家和好的作家一样,需要有自己的熟练的工具,市面上的任何工具都可以,只要你能很好的使用它。吹毛求疵的人希望制造自己的工具,当然,有能力有资源这样做是无伤大雅。
但是,一般而言,过于吹毛求疵只会得不偿失。
我这里讲一下我用的某些工具。
tidyr. packages
数据重塑
1. 将宽数据转换成为短数据
> table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
> gather(table4a, `1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
2. 将长数据转换成为宽数据
table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
> spread(table2, type, count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
3. 分割某一列
> table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
> separate(table3, rate, into = c("cases", "pop"))
# A tibble: 6 x 4
country year cases pop
* <chr> <int> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
4. 分割某一列然后变成多行
> table3
# A tibble: 6 x 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
> separate_rows(table3, rate)
# A tibble: 12 x 3
country year rate
<chr> <int> <chr>
1 Afghanistan 1999 745
2 Afghanistan 1999 19987071
3 Afghanistan 2000 2666
4 Afghanistan 2000 20595360
5 Brazil 1999 37737
6 Brazil 1999 172006362
7 Brazil 2000 80488
8 Brazil 2000 174504898
9 China 1999 212258
10 China 1999 1272915272
11 China 2000 213766
12 China 2000 1280428583
5. 合并两列
> table5
# A tibble: 6 x 4
country century year rate
* <chr> <chr> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
> unite(table5, century, year, col = "year", sep = "")
# A tibble: 6 x 3
country year rate
<chr> <chr> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
处理缺失值
drop_na
fill
replace_na
扩充数据
> head(mtcars)
mpg cyl disp hp drat wt
Mazda RX4 21.0 6 160 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
Datsun 710 22.8 4 108 93 3.85 2.320
Hornet 4 Drive 21.4 6 258 110 3.08 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440
Valiant 18.1 6 225 105 2.76 3.460
qsec vs am gear carb
Mazda RX4 16.46 0 1 4 4
Mazda RX4 Wag 17.02 0 1 4 4
Datsun 710 18.61 1 1 4 1
Hornet 4 Drive 19.44 1 0 3 1
Hornet Sportabout 17.02 0 0 3 2
Valiant 20.22 1 0 3 1
> complete(mtcars, cyl, gear, carb)
# A tibble: 74 x 11
cyl gear carb mpg disp hp drat
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 3 1 21.5 120. 97 3.7
2 4 3 2 NA NA NA NA
3 4 3 3 NA NA NA NA
4 4 3 4 NA NA NA NA
5 4 3 6 NA NA NA NA
6 4 3 8 NA NA NA NA
7 4 4 1 22.8 108 93 3.85
8 4 4 1 32.4 78.7 66 4.08
9 4 4 1 33.9 71.1 65 4.22
10 4 4 1 27.3 79 66 4.08
# ... with 64 more rows, and 4 more variables:
# wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>
只扩充选中好的列
> head(mtcars)
mpg cyl disp hp drat wt
Mazda RX4 21.0 6 160 110 3.90 2.620
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875
Datsun 710 22.8 4 108 93 3.85 2.320
Hornet 4 Drive 21.4 6 258 110 3.08 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440
Valiant 18.1 6 225 105 2.76 3.460
qsec vs am gear carb
Mazda RX4 16.46 0 1 4 4
Mazda RX4 Wag 17.02 0 1 4 4
Datsun 710 18.61 1 1 4 1
Hornet 4 Drive 19.44 1 0 3 1
Hornet Sportabout 17.02 0 0 3 2
Valiant 20.22 1 0 3 1
> expand(mtcars, cyl, gear, carb)
# A tibble: 54 x 3
cyl gear carb
<dbl> <dbl> <dbl>
1 4 3 1
2 4 3 2
3 4 3 3
4 4 3 4
5 4 3 6
6 4 3 8
7 4 4 1
8 4 4 2
9 4 4 3
10 4 4 4
# ... with 44 more rows
dplyr.packages
1. group by
head(iris)
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
Petal.Width Species
1 0.2 setosa
2 0.2 setosa
3 0.2 setosa
4 0.2 setosa
5 0.2 setosa
6 0.4 setosa
> group_by(iris, Species)
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length
<dbl> <dbl> <dbl>
1 5.1 3.5 1.4
2 4.9 3 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5 3.6 1.4
6 5.4 3.9 1.7
7 4.6 3.4 1.4
8 5 3.4 1.5
9 4.4 2.9 1.4
10 4.9 3.1 1.5
# ... with 140 more rows, and 2 more variables:
# Petal.Width <dbl>, Species <fct>
配合summarise
进行一系列聚合操作,比如group by 取平均值.
group_by(iris, Species) %>% summarise(MEAN=mean(Sepal.Length))
# A tibble: 3 x 2
Species MEAN
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
2.筛选数据的行
1.filter
> starwars
# A tibble: 87 x 13
name height mass hair_color skin_color
<chr> <int> <dbl> <chr> <chr>
1 Luke… 172 77 blond fair
2 C-3PO 167 75 NA gold
3 R2-D2 96 32 NA white, bl…
4 Dart… 202 136 none white
5 Leia… 150 49 brown light
6 Owen… 178 120 brown, gr… light
7 Beru… 165 75 brown light
8 R5-D4 97 32 NA white, red
9 Bigg… 183 84 black light
10 Obi-… 182 77 auburn, w… fair
# ... with 77 more rows, and 8 more variables:
# eye_color <chr>, birth_year <dbl>,
# gender <chr>, homeworld <chr>,
# species <chr>, films <list>,
# vehicles <list>, starships <list>
> filter(starwars, mass > 1000)
# A tibble: 1 x 13
name height mass hair_color skin_color
<chr> <int> <dbl> <chr> <chr>
1 Jabb… 175 1358 NA green-tan…
# ... with 8 more variables: eye_color <chr>,
# birth_year <dbl>, gender <chr>,
# homeworld <chr>, species <chr>,
# films <list>, vehicles <list>,
# starships <list>
- 去重
distinct(.data, ..., .keep_all = FALSE)
可以选择根据某一行去重 - 抽样
sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = parent.frame())
- 通过具体的行数进行选择变量
slice(.data, ...)
排序
arrange(mtcars, mpg)
降序排序
arrange(mtcars, desc(mpg))
添加一行
add_row(faithful, eruptions = 1, waiting = 1)
对列进行操作
选择某些列
select()
生成一个新列
mutate(mtcars, gpm = 1/mpg)
生成一个新列,去除老的列
transmute(mtcars, gpm = 1/mpg)
4.对所有的行进行转换mutate_if(iris, is.numeric, funs(log(.)))
连接
left_join
right_join
inner_join
full_join
合并数据
bind_rows
集合运算
- 交集
intersect
- 补集
setdiff
- 并集
union
还有很多细节,太繁琐没有写下来。rstudio官网里面有很多学习资料。链接如下
https://www.rstudio.com/resources/cheatsheets/