Reading the CSV file
import pandas as pd
csv_path = 'gun_deaths_in_america.csv'
data_csv = pd.read_csv(csv_path,header=0)
data_csv.head()
data_csv.shape
(100798, 10)
%timeit pd.read_csv(csv_path,header=0)
114 ms ± 5.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Checking the file size
Checking the size on disk
import os
os.stat('gun_deaths_in_america.csv').st_size  # size in bytes
4824404
Checking memory usage
data_csv.memory_usage(deep=True).sum()
30368107
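Checking per-column memory usage
- object columns take up a lot of memory (roughly 6.2 to 6.5 MB each here)
- int64/float64 columns take up little: 8 bytes × 100,798 rows = 806,384 bytes each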
data_csv.memory_usage(deep=True)
Index 80
year 806384
month 806384
intent 6495168
police 806384
sex 6249476
age 806384
race 6322009
hispanic 806384
place 6463070
education 806384
dtype: int64
data_csv.dtypes
year int64
month int64
intent object
police int64
sex object
age float64
race object
hispanic int64
place object
education float64
dtype: object
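To see where the gap between object and numeric columns comes from, here is a minimal sketch on a small synthetic frame. The frame `demo`, its size, and its values are made up for illustration and are not taken from the dataset. A numeric column costs a fixed 8 bytes per row, while an object column holds an 8-byte pointer per row plus the Python string it points to, which `deep=True` counts as well.

```python
import pandas as pd

n_rows = 1000  # illustrative size, not the real row count
demo = pd.DataFrame({
    "year": pd.Series([2012] * n_rows, dtype="int64"),
    "intent": pd.Series(["Suicide"] * n_rows, dtype=object),
})

# deep=True also counts the Python string objects behind an object column
per_column = demo.memory_usage(deep=True, index=False)
int_bytes = int(per_column["year"])
obj_bytes = int(per_column["intent"])

print(per_column)
print("object / int64 ratio:", obj_bytes / int_bytes)
```

The exact ratio depends on the string lengths and the Python build, so treat the printed number as indicative only.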
Saving as a Pickle file
Saving directly as a Pickle file
Once saved to disk, the pickle file is larger than the original CSV (5,656,925 vs. 4,824,404 bytes).
pkl_path_before = 'gun_deaths_in_america_before_transform.pkl'
data_csv.to_pickle(pkl_path_before)
os.stat(pkl_path_before).st_size
5656925
Comparing read speed
Reading the pickle file is about 3 times faster than reading the CSV file (32.4 ms vs. 102 ms).
%timeit pd.read_csv(csv_path,header=0)
102 ms ± 7.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_before)
32.4 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
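The `%timeit` magic only works inside IPython/Jupyter. As a rough sketch, the same comparison can be run from a plain script with the standard `timeit` module. The synthetic frame, the temporary file names, and the repeat counts below are assumptions chosen for illustration; absolute timings will differ by machine and pandas version.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "year": rng.integers(2012, 2015, size=20_000),
    "intent": rng.choice(["Suicide", "Homicide", "Accidental"], size=20_000),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv = os.path.join(tmp_dir, "demo.csv")
    demo_pkl = os.path.join(tmp_dir, "demo.pkl")
    demo.to_csv(demo_csv, index=False)
    demo.to_pickle(demo_pkl)

    # Best of 3 repeats, 5 reads per repeat, reported as seconds per read
    csv_seconds = min(timeit.repeat(lambda: pd.read_csv(demo_csv), number=5, repeat=3)) / 5
    pkl_seconds = min(timeit.repeat(lambda: pd.read_pickle(demo_pkl), number=5, repeat=3)) / 5

    csv_shape = pd.read_csv(demo_csv).shape
    pkl_shape = pd.read_pickle(demo_pkl).shape

print("read_csv    : {:.2f} ms".format(csv_seconds * 1000))
print("read_pickle : {:.2f} ms".format(pkl_seconds * 1000))
```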
Saving as a Pickle file after type conversion
As seen above, object columns use a lot of memory; they can be converted to the category dtype.
data_csv.intent.astype('category').head()
0 Suicide
1 Suicide
2 Suicide
3 Suicide
4 Suicide
Name: intent, dtype: category
Categories (4, object): [Accidental, Homicide, Suicide, Undetermined]
Convert the intent column first: compared with 6,495,168 bytes as object, the category version takes 101,303 bytes, about 1/64 of the size.
data_csv.intent.astype('category').memory_usage(deep=True)
101303
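The saving comes from how a categorical column is stored: the unique labels are kept once, and each row holds only a small integer code pointing at its label. A minimal sketch on a made-up series (the values and length are illustrative, not the real data):

```python
import pandas as pd

intent_obj = pd.Series(["Suicide", "Homicide", "Suicide", "Accidental"] * 2500, dtype=object)
intent_cat = intent_obj.astype("category")

categories = intent_cat.cat.categories.tolist()  # unique labels, stored once
codes_dtype = str(intent_cat.cat.codes.dtype)    # one small integer code per row

obj_bytes = intent_obj.memory_usage(deep=True, index=False)
cat_bytes = intent_cat.memory_usage(deep=True, index=False)

print(categories, codes_dtype)
print("object:", obj_bytes, "bytes; category:", cat_bytes, "bytes")
```

The fewer distinct values a column has relative to its length, the larger the saving.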
Convert every column to the category dtype
for col in data_csv.columns:
    data_csv[col] = data_csv[col].astype('category')
Check memory usage after conversion: compared with 30,368,107 bytes before, memory usage drops by a factor of about 30.
data_csv.memory_usage(deep=True).sum()
1018587
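One caveat worth keeping in mind: a numeric column converted to category no longer supports numeric reductions such as `mean()` directly. If columns like age are still needed for arithmetic, a possible alternative (not the approach used in this walkthrough) is to convert only the object columns. A small sketch on made-up data:

```python
import pandas as pd

# Hypothetical mini-frame; values are invented for illustration
demo = pd.DataFrame({
    "age": [34.0, 21.0, 60.0, 21.0],
    "sex": pd.Series(["M", "F", "M", "M"], dtype=object),
    "race": pd.Series(["White", "Black", "White", "Hispanic"], dtype=object),
})

# Convert only the object columns; numeric columns keep their dtype
text_cols = demo.select_dtypes(include="object").columns
demo = demo.astype({col: "category" for col in text_cols})

mean_age = demo["age"].mean()  # still works because age stayed float64
print(demo.dtypes)
print("mean age:", mean_age)
```

This gives up some of the memory saving on the numeric columns in exchange for keeping them usable in calculations.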
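Save the converted data as a pickle file and check its size on disk. Compared with the 4,824,404-byte CSV, the converted file is about 4.8 times smaller (about 5.6 times smaller than the 5,656,925-byte unconverted pickle).
pkl_path_after = 'gun_deaths_in_america_after_transform.pkl'
data_csv.to_pickle(pkl_path_after)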
os.stat(pkl_path_after).st_size
1012643
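The effect on pickle size can be reproduced on synthetic data. The sketch below is only an illustration under assumed inputs (a single made-up column of 20,000 labels and temporary file names of my choosing); the real ratio depends on the data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = rng.choice(["Suicide", "Homicide", "Accidental", "Undetermined"], size=20_000)

before = pd.DataFrame({"intent": pd.Series(labels).astype(object)})  # object column
after = before.astype("category")                                    # category column

with tempfile.TemporaryDirectory() as tmp_dir:
    before_path = os.path.join(tmp_dir, "demo_before_transform.pkl")
    after_path = os.path.join(tmp_dir, "demo_after_transform.pkl")
    before.to_pickle(before_path)
    after.to_pickle(after_path)
    # os.path.getsize returns the same value as os.stat(path).st_size
    before_size = os.path.getsize(before_path)
    after_size = os.path.getsize(after_path)

print("object pickle  :", before_size, "bytes")
print("category pickle:", after_size, "bytes")
```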
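Compare read speed: reading the converted pickle is about 41 times faster than reading the CSV (2.57 ms vs. 106 ms), and about 13 times faster than the unconverted pickle (32.4 ms).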
%timeit pd.read_pickle(pkl_path_after)
2.57 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.read_csv(csv_path,header=0)
106 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Overall comparison
files = [csv_path,pkl_path_before,pkl_path_after]
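Comparing size on disk
The converted file takes the least disk space, about 4.8 times smaller than the original CSV, which is very useful when storing large amounts of data.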
for file in files:
    print('File size of the {0} is: {1}'.format(file, os.stat(file).st_size))
File size of the gun_deaths_in_america.csv is: 4824404
File size of the gun_deaths_in_america_before_transform.pkl is: 5656925
File size of the gun_deaths_in_america_after_transform.pkl is: 1012643
Comparing read speed
Reading the converted pickle is about 45 times faster than reading the plain CSV file (2.18 ms vs. 97.5 ms).
%timeit pd.read_csv(csv_path,header=0)
97.5 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_before)
28.5 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.read_pickle(pkl_path_after)
2.18 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
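The speedup factors follow directly from the three mean timings above. A short worked calculation (the numbers are the measured means from this run and will vary from machine to machine):

```python
# Mean read times from the %timeit runs above, in milliseconds
csv_ms = 97.5
pkl_before_ms = 28.5
pkl_after_ms = 2.18

speedup_before = csv_ms / pkl_before_ms          # plain pickle vs. CSV
speedup_after = csv_ms / pkl_after_ms            # category pickle vs. CSV
speedup_between = pkl_before_ms / pkl_after_ms   # category pickle vs. plain pickle

print("plain pickle vs CSV       : {:.1f}x".format(speedup_before))
print("category pickle vs CSV    : {:.1f}x".format(speedup_after))
print("category vs plain pickle  : {:.1f}x".format(speedup_between))
```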
Comparing memory usage
After conversion, memory usage is about 30 times smaller than before.
for file in files:
    # Pick the reader that matches the file extension
    if os.path.splitext(file)[1] == '.csv':
        df = pd.read_csv(file, header=0)
    else:
        df = pd.read_pickle(file)
    print('memory_usage of the {0} is : {1}'.format(file, df.memory_usage(deep=True).sum()))
memory_usage of the gun_deaths_in_america.csv is : 30368107
memory_usage of the gun_deaths_in_america_before_transform.pkl is : 30368107
memory_usage of the gun_deaths_in_america_after_transform.pkl is : 1010827
The data read back is the same in every case; only the dtypes differ.
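One way to check this claim is to round-trip a converted frame through pickle and compare it with the original. The sketch below uses a tiny made-up frame and a temporary file name of my choosing: the dtypes come back as category, and casting each column back to its original dtype recovers the same values.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({
    "intent": pd.Series(["Suicide", "Homicide", "Suicide"], dtype=object),
    "year": pd.Series([2012, 2013, 2014], dtype="int64"),
})
converted = original.astype("category")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_after_transform.pkl")
    converted.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle keeps the category dtype, unlike a CSV round trip
all_category = all(str(dtype) == "category" for dtype in restored.dtypes)

# Cast each column back to its original dtype and compare values
values_match = all(
    restored[col].astype(original[col].dtype).equals(original[col])
    for col in original.columns
)

print("all columns category:", all_category)
print("values match original:", values_match)
```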