0. Define a function
Load the line_profiler extension in Jupyter (it provides %lprun)
%load_ext line_profiler
### Note: install line_profiler with pip, not conda
import numpy as np
import pandas as pd
### Load my data
df = pd.read_csv('new_york_hotels.csv', encoding='cp1252')
def haversine(lat1, lon1, lat2, lon2):
    # Approximate radius of the Earth in miles
    miles_constant = 3959
    # Convert all coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    mi = miles_constant * c
    return mi
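The haversine function computes the great-circle distance, in miles, between two points on the Earth's surface given as latitude and longitude in degrees.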
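Before timing anything, it can help to check that the function behaves as expected. The sketch below repeats the definition so it runs on its own and tries three cases that are easy to reason about; the London-like coordinates are just an illustrative assumption, not part of the hotel data.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between points given in degrees."""
    miles_constant = 3959  # approximate Earth radius in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return miles_constant * c

# A point compared with itself should be at distance zero.
same_point = haversine(40.671, -73.985, 40.671, -73.985)

# Equator to North Pole is a quarter of a great circle: radius * pi / 2.
quarter = haversine(0.0, 0.0, 90.0, 0.0)

# Swapping the two points should not change the distance.
forward = haversine(40.671, -73.985, 51.5, -0.12)
backward = haversine(51.5, -0.12, 40.671, -73.985)

print(same_point, quarter, forward, backward)
```

Because every step is a NumPy ufunc or plain arithmetic, the same function accepts scalars, pandas Series, or NumPy arrays, which is what the later sections rely on.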
1. First, try the pandas iterrows() method
%%timeit
# Haversine applied on rows via iteration
haversine_series = []
for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985,
                                      row['latitude'], row['longitude']))
df['distance'] = haversine_series
Result:
The iterrows approach took 35 ms
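One likely contributor to this cost, although the timing above does not isolate it, is that iterrows() hands back every row as a newly built pandas Series. A minimal sketch on a made-up two-row frame (the values are placeholders, not the hotel data) makes that visible:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the hotel data.
toy = pd.DataFrame({"latitude": [40.70, 40.75],
                    "longitude": [-74.00, -73.98]})

rows = list(toy.iterrows())
first_index, first_row = rows[0]

# Each item is an (index, Series) pair, so the loop builds one Series
# per row before haversine is even called.
row_type = type(first_row).__name__
print(first_index, row_type)
print(first_row["latitude"], first_row["longitude"])
```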
2. Next, try the apply method
%timeit df['distance'] =\
    df.apply(lambda row: haversine(40.671, -73.985,\
                                   row['latitude'], row['longitude']), axis=1)
Result:
The apply approach took 18.8 ms, cutting the time nearly in half
Profile the apply approach
# Haversine applied on rows
%lprun -f haversine \
    df.apply(lambda row: haversine(40.671, -73.985,\
                                   row['latitude'], row['longitude']), axis=1)
Result:
In this run, the NumPy operations inside haversine account for the largest share of the time
3. Vectorized operations on pandas Series
# Vectorized implementation of Haversine applied on Pandas series
%timeit df['distance'] = haversine(40.671, -73.985,\
                                   df['latitude'], df['longitude'])
Result:
483 µs
Profile it:
# Vectorized implementation profile
%lprun -f haversine haversine(40.671, -73.985,\
                              df['latitude'], df['longitude'])
Result:
The vectorized operations still take the most time
4. Vectorized operations on NumPy arrays
# Vectorized implementation of Haversine applied on NumPy arrays
%timeit df['distance'] = haversine(40.671, -73.985,\
                                   df['latitude'].values, df['longitude'].values)
Result:
64.3 µs
Profile it:
%lprun -f haversine df['distance'] = haversine(40.671, -73.985,\
                                               df['latitude'].values, df['longitude'].values)
Result:
The vectorized operations still take the largest share of the time, but the absolute time of each operation drops sharply.
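The only change from section 3 is the `.values` accessor, so the gap plausibly comes from what pandas does around each operation (wrapping results in a Series and carrying the index along) rather than from the arithmetic itself. The sketch below, on a made-up two-row frame, shows what each call actually receives and returns; `to_numpy()` is included because it is the accessor the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for the hotel data.
toy = pd.DataFrame({"latitude": [40.70, 40.75],
                    "longitude": [-74.00, -73.98]})

# A NumPy ufunc applied to a Series returns a Series (index preserved).
as_series = np.deg2rad(toy["latitude"])

# The same ufunc applied to the underlying array returns a plain ndarray.
as_array = np.deg2rad(toy["latitude"].values)

# to_numpy() yields the same data as .values for this float column.
via_to_numpy = toy["latitude"].to_numpy()

print(type(as_series).__name__, type(as_array).__name__)
```

Assigning the ndarray result back with `df['distance'] = ...` still works, since pandas aligns it by position.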
5. Use Cython to compile the function to C
Load Cython
%load_ext cython
### I installed cython with conda; the pip install did not work for me
Redefine the function with cpdef
%%cython -a
# Haversine cythonized (no other edits)
import numpy as np
cpdef haversine_cy(lat1, lon1, lat2, lon2):
    miles_constant = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    mi = miles_constant * c
    return mi
Compilation result:
5.1 apply with the Cython function:
%timeit df['distance'] =\
    df.apply(lambda row: haversine_cy(40.671, -73.985,\
                                      row['latitude'], row['longitude']), axis=1)
Result:
19 ms
5.2 NumPy vectorized operations with the Cython function
# Vectorized implementation of Haversine applied on NumPy arrays
%timeit df['distance'] = haversine_cy(40.671, -73.985,\
                                      df['latitude'].values, df['longitude'].values)
6. Conclusion
Method | Without Cython | With Cython |
---|---|---|
pandas iterrows | 35 ms | |
apply | 18.8 ms | 19 ms |
pandas vectorized | 483 µs | |
NumPy vectorized | 64.3 µs | 64.8 µs |
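So the speed ranking is: NumPy vectorized > pandas vectorized > apply > iterrows
Cython does not necessarily speed things up. Here the Cython timings (19 ms and 64.8 µs) are essentially the same as the plain Python ones, most likely because the untyped function body still spends its time inside NumPy calls that are already compiled. It is worth trying in more complex cases; for simple ones, NumPy vectorized operations are enough.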
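To reproduce the comparison outside Jupyter, where %timeit is not available, something like the following sketch could be used. It relies on the standard-library timeit module and on randomly generated coordinates (an assumption, since new_york_hotels.csv is not included here), so the absolute numbers will differ from the table above; only the ordering is expected to be similar.

```python
import timeit

import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    miles_constant = 3959  # approximate Earth radius in miles
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    return miles_constant * 2 * np.arcsin(np.sqrt(a))

# Synthetic stand-in for the hotel data (assumption): 500 random points
# in a box roughly around New York City.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"latitude": rng.uniform(40.5, 40.9, n),
                   "longitude": rng.uniform(-74.2, -73.7, n)})

def with_iterrows():
    return [haversine(40.671, -73.985, row["latitude"], row["longitude"])
            for _, row in df.iterrows()]

def with_apply():
    return df.apply(lambda row: haversine(40.671, -73.985,
                                          row["latitude"], row["longitude"]),
                    axis=1)

def with_series():
    return haversine(40.671, -73.985, df["latitude"], df["longitude"])

def with_arrays():
    return haversine(40.671, -73.985,
                     df["latitude"].values, df["longitude"].values)

candidates = {"iterrows": with_iterrows, "apply": with_apply,
              "series": with_series, "arrays": with_arrays}

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats of 3 calls each, stored as seconds per call.
    timings[name] = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{name:>8}: {timings[name] * 1e3:.3f} ms per call")
```

Taking the minimum over repeats is a common choice because it is least affected by other activity on the machine.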
Ref:
Umich MADS 515 Efficient Data Processing