Hassan Kibirige开发的plydata包借鉴了R dplyr包的语法以及实现了管道函数%>%,并且与该作者另一个包plotnine(类ggplot2语法绘图工具包),本文在于对plydata包语法的入门以及与dplyr包的对比,让对R熟悉的同学(比如我)更好的在python进行数据处理工作。
首先导入包以及数据集,数据集使用经典的nycflights13中的flights,这个数据集在python和R中均有包直接调用
# in R
library(dplyr)
library(nycflights13)
> head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58
# ... with 1 more variable: time_hour <dttm>
# in Python
import pandas as pd
import numpy as np
from plydata import *
from nycflights13 import flights
flights.head(6)
Out[]:
year month day dep_time ... distance hour minute time_hour
0 2013 1 1 517.0 ... 1400 5 15 2013-01-01T10:00:00Z
1 2013 1 1 533.0 ... 1416 5 29 2013-01-01T10:00:00Z
2 2013 1 1 542.0 ... 1089 5 40 2013-01-01T10:00:00Z
3 2013 1 1 544.0 ... 1576 5 45 2013-01-01T10:00:00Z
4 2013 1 1 554.0 ... 762 6 0 2013-01-01T11:00:00Z
5 2013 1 1 554.0 ... 719 5 58 2013-01-01T10:00:00Z
[6 rows x 19 columns]
Single Table Verbs
dplyr与plydata的数据处理逻辑为提供一个函数,仅仅执行一种Verb数据处理(data manipulation)。所有的Verbs可以归于以下三类:
- 行
-
filter()
根据条件筛选行 注:python中为query()
-
slice()
根据位置选择行 -
arrange()
排列行
-
- 列
-
select()
选择制定列 -
rename()
修改列名 -
mutate()
创建新列 -
relocate()
修改列顺序
-
- 行聚合
-
summarise()
根据运算方式聚合为一行
-
Pipe - 管道操作
所有的dplyr与plydata函数均取data frame为第一个参数,所以与之每个函数中放入data frame,管道操作提供了一个数据处理的流水线,通过%>%
和>>
符号实现。
filter()与query()
- 选取数据集中飞行距离小于50英里的航班
# in R
> flights %>% filter(distance < 50)
# A tibble: 1 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 7 27 NA 106 NA NA 245 NA US 1632 NA EWR LGA NA 17 1 6
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query('distance < 50')
Out[]:
year month day ... hour minute time_hour
275945 2013 7 27 ... 1 6 2013-07-27T05:00:00Z
[1 rows x 19 columns]
由结果可知,2013年唯一一个飞行距离低于50英里的航班是从纽约NWR纽瓦克机场飞到纽约LGA拉瓜地机场
- 选取所有1月1日从LGA机场起飞至DTW机场的航班
# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW')
# A tibble: 12 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 602 605 -3 821 805 16 MQ 4401 N730MQ LGA DTW 105 502 6 5
2 2013 1 1 659 705 -6 907 913 -6 DL 831 N998DL LGA DTW 105 502 7 5
3 2013 1 1 857 905 -8 1107 1120 -13 DL 181 N321NB LGA DTW 110 502 9 5
4 2013 1 1 858 900 -2 1102 1110 -8 MQ 4478 N737MQ LGA DTW 103 502 9 0
5 2013 1 1 1048 1050 -2 1302 1250 12 MQ 4589 N537MQ LGA DTW 112 502 10 50
6 2013 1 1 1113 1115 -2 1318 1315 3 DL 1031 N320NB LGA DTW 104 502 11 15
7 2013 1 1 1231 1238 -7 1449 1446 3 DL 1131 N920DL LGA DTW 108 502 12 38
8 2013 1 1 1305 1315 -10 1523 1520 3 MQ 4564 N725MQ LGA DTW 102 502 13 15
9 2013 1 1 1440 1440 0 1658 1643 15 DL 1231 N926DL LGA DTW 94 502 14 40
10 2013 1 1 1640 1640 0 1859 1850 9 MQ 4540 N723MQ LGA DTW 106 502 16 40
11 2013 1 1 1739 1745 -6 1956 1953 3 DL 2331 N965DL LGA DTW 93 502 17 45
12 2013 1 1 2053 2055 -2 2254 2250 4 MQ 4573 N730MQ LGA DTW 102 502 20 55
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'")
Out[490]:
year month day dep_time ... distance hour minute time_hour
21 2013 1 1 602.0 ... 502 6 5 2013-01-01T11:00:00Z
66 2013 1 1 659.0 ... 502 7 5 2013-01-01T12:00:00Z
163 2013 1 1 857.0 ... 502 9 5 2013-01-01T14:00:00Z
166 2013 1 1 858.0 ... 502 9 0 2013-01-01T14:00:00Z
251 2013 1 1 1048.0 ... 502 10 50 2013-01-01T15:00:00Z
267 2013 1 1 1113.0 ... 502 11 15 2013-01-01T16:00:00Z
324 2013 1 1 1231.0 ... 502 12 38 2013-01-01T17:00:00Z
355 2013 1 1 1305.0 ... 502 13 15 2013-01-01T18:00:00Z
419 2013 1 1 1440.0 ... 502 14 40 2013-01-01T19:00:00Z
558 2013 1 1 1640.0 ... 502 16 40 2013-01-01T21:00:00Z
612 2013 1 1 1739.0 ... 502 17 45 2013-01-01T22:00:00Z
782 2013 1 1 2053.0 ... 502 20 55 2013-01-02T01:00:00Z
[12 rows x 19 columns]
...
[44 rows x 19 columns]
python的plydata包目前无法使用
,
作为多个条件中的分割,但是可以使用&
and
代替
arrange()
arrange()可以通过一个或多个字段进行排序
- 对所有1月1日从LGA机场起飞至DTW机场的航班,按照延误时间排序,延误时间相同的话按照起飞时间逆序排序
# in R
> flights %>% filter(month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW') %>% arrange(arr_delay, dep_time)
# A tibble: 12 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 857 905 -8 1107 1120 -13 DL 181 N321NB LGA DTW 110 502 9 5
2 2013 1 1 858 900 -2 1102 1110 -8 MQ 4478 N737MQ LGA DTW 103 502 9 0
3 2013 1 1 659 705 -6 907 913 -6 DL 831 N998DL LGA DTW 105 502 7 5
4 2013 1 1 1113 1115 -2 1318 1315 3 DL 1031 N320NB LGA DTW 104 502 11 15
5 2013 1 1 1231 1238 -7 1449 1446 3 DL 1131 N920DL LGA DTW 108 502 12 38
6 2013 1 1 1305 1315 -10 1523 1520 3 MQ 4564 N725MQ LGA DTW 102 502 13 15
7 2013 1 1 1739 1745 -6 1956 1953 3 DL 2331 N965DL LGA DTW 93 502 17 45
8 2013 1 1 2053 2055 -2 2254 2250 4 MQ 4573 N730MQ LGA DTW 102 502 20 55
9 2013 1 1 1640 1640 0 1859 1850 9 MQ 4540 N723MQ LGA DTW 106 502 16 40
10 2013 1 1 1048 1050 -2 1302 1250 12 MQ 4589 N537MQ LGA DTW 112 502 10 50
11 2013 1 1 1440 1440 0 1658 1643 15 DL 1231 N926DL LGA DTW 94 502 14 40
12 2013 1 1 602 605 -3 821 805 16 MQ 4401 N730MQ LGA DTW 105 502 6 5
# ... with 1 more variable: time_hour <dttm>
# in Python
flights >> query("month == 1 & day == 1 & origin == 'LGA' & dest == 'DTW'") >> arrange("arr_delay", "-dep_time")
Out[]:
year month day dep_time ... distance hour minute time_hour
163 2013 1 1 857.0 ... 502 9 5 2013-01-01T14:00:00Z
166 2013 1 1 858.0 ... 502 9 0 2013-01-01T14:00:00Z
66 2013 1 1 659.0 ... 502 7 5 2013-01-01T12:00:00Z
612 2013 1 1 1739.0 ... 502 17 45 2013-01-01T22:00:00Z
355 2013 1 1 1305.0 ... 502 13 15 2013-01-01T18:00:00Z
324 2013 1 1 1231.0 ... 502 12 38 2013-01-01T17:00:00Z
267 2013 1 1 1113.0 ... 502 11 15 2013-01-01T16:00:00Z
782 2013 1 1 2053.0 ... 502 20 55 2013-01-02T01:00:00Z
558 2013 1 1 1640.0 ... 502 16 40 2013-01-01T21:00:00Z
251 2013 1 1 1048.0 ... 502 10 50 2013-01-01T15:00:00Z
419 2013 1 1 1440.0 ... 502 14 40 2013-01-01T19:00:00Z
21 2013 1 1 602.0 ... 502 6 5 2013-01-01T11:00:00Z
[12 rows x 19 columns]
slice()与slice_rows()
slice()与slice_rows()可以通过行号来选取行
# in R
> flights %>% slice(1:5)
# A tibble: 5 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0
# ... with 1 more variable: time_hour <dttm>
# in Python
> flights %>% slice(1:5)
# A tibble: 5 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0
# ... with 1 more variable: time_hour <dttm>
python中的计数项是从0开始,R中是从1开始
与dplyr中slice_head()和slice_tail()对应的函数是head()和tai()
select()
select()提供选取某列的功能,基本等同于sql中的select
# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance)
# A tibble: 5 x 3
origin dest distance
<chr> <chr> <dbl>
1 EWR IAH 1400
2 LGA IAH 1416
3 JFK MIA 1089
4 JFK BQN 1576
5 LGA ATL 762
# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance')
Out[509]:
origin dest distance
0 EWR IAH 1400
1 LGA IAH 1416
2 JFK MIA 1089
3 JFK BQN 1576
4 LGA ATL 762
添加end_with/endswith可以截取以特定字符结尾的列
# in R
> flights %>% slice(1:5) %>% select(ends_with('time'))
# A tibble: 5 x 5
dep_time sched_dep_time arr_time sched_arr_time air_time
<int> <int> <int> <int> <dbl>
1 517 515 830 819 227
2 533 529 850 830 227
3 542 540 923 850 160
4 544 545 1004 1022 183
5 554 600 812 837 116
# in Python
flights >> slice_rows(0, 5) >> select(endswith = 'time')
Out[511]:
dep_time sched_dep_time arr_time sched_arr_time air_time
0 517.0 515 830.0 819 227.0
1 533.0 529 850.0 830 227.0
2 542.0 540 923.0 850 160.0
3 544.0 545 1004.0 1022 183.0
4 554.0 600 812.0 837 116.0
mutate()
mutate()可以通过运算增加新列
# in R
> flights %>% slice(1:5) %>% select(origin, dest, distance) %>% mutate(distance_in_km = distance/0.6213712 )
# A tibble: 5 x 4
origin dest distance distance_in_km
<chr> <chr> <dbl> <dbl>
1 EWR IAH 1400 2253.
2 LGA IAH 1416 2279.
3 JFK MIA 1089 1753.
4 JFK BQN 1576 2536.
5 LGA ATL 762 1226.
# in Python
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'distance/0.6213712')
Out[]:
origin dest distance distance_in_km
0 EWR IAH 1400 2253.081572
1 LGA IAH 1416 2278.831076
2 JFK MIA 1089 1752.575594
3 JFK BQN 1576 2536.326112
4 LGA ATL 762 1226.320113
plydata包同样支持使用自定义函数、以及Numpy、Pandas函数来创建新列
miles_to_km = lambda x: np.ceil(x/0.6213712)
flights >> slice_rows(0, 5) >> select('origin', 'dest', 'distance') >> mutate(distance_in_km = 'miles_to_km(distance)')
Out[519]:
origin dest distance distance_in_km
0 EWR IAH 1400 2254.0
1 LGA IAH 1416 2279.0
2 JFK MIA 1089 1753.0
3 JFK BQN 1576 2537.0
4 LGA ATL 762 1227.0
summarise() 和 group_by()
summarise()和group_by()可以实现分组聚合
- 选取1月1日的各机场起飞的航班数
# in R
> flights %>% filter(month == 1 & day == 1) %>% group_by(origin) %>% summarise(count = n())
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
origin count
<chr> <int>
1 EWR 305
2 JFK 297
3 LGA 240
# in Python
flights >> query("month == 1 & day == 1") >> group_by('origin') >> summarise(count = 'n()')
Out[]:
origin count
0 EWR 305
1 LGA 240
2 JFK 297
n() 返回值为该类的长度
总结
plydata的完成度目前来说已经很高了,大部分dplyr的功能都被完美移植了过来,并且与另外一个R移植包plotnine(ggplot2移植)很高的联动,应该会是笔者日后Python常用的包之一。
在这里感谢作者Hassan Kibirige,下篇文章会讲解他的另外一个包,也就是刚才提到大名鼎鼎的ggplot2在python中的移植版 - plotnine。
reference
Hassan Kibirige github主页 - https://github.com/has2k1
plydata github主页 - https://github.com/has2k1/plydata
plydata完整文档 - https://plydata.readthedocs.io/en/latest/index.html
dplyr github主页 - https://github.com/tidyverse/dplyr