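Pythonic Code & Performance Optimization

1. The simplest approach

Use Python's standard library and third-party packages instead of reinventing the wheel: rolling your own wastes time and will most likely run slower.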

The ones I reach for most at work (leaving NumPy and pandas aside) cover queues and sorting: deque (double-ended queue), heapq, and bisect.
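A minimal sketch of those three modules in action. The variable names and sample values are made up for illustration; only the standard-library calls themselves come from Python.

```python
import bisect
import heapq
from collections import deque

# deque: O(1) appends and pops at both ends (list.pop(0) is O(n)).
queue = deque([1, 2, 3])
queue.appendleft(0)        # deque([0, 1, 2, 3])
queue.append(4)            # deque([0, 1, 2, 3, 4])
first = queue.popleft()    # 0

# heapq: a binary min-heap stored in a plain list.
heap = [5, 1, 4, 2]
heapq.heapify(heap)
smallest = heapq.heappop(heap)               # 1
top_two = heapq.nlargest(2, [5, 1, 4, 2])    # [5, 4]

# bisect: binary search and insertion on an already-sorted list.
sorted_scores = [10, 20, 30, 40]
position = bisect.bisect_left(sorted_scores, 25)   # 2
bisect.insort(sorted_scores, 25)                   # [10, 20, 25, 30, 40]
```

bisect only gives correct answers when the list is already sorted, so it pairs naturally with insort for keeping a list ordered as items arrive.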

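Three very handy standard-library modules: itertools (iterator functions), functools (partial functions and other tools for working with functions), and collections (for example, creating an ordered dict).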
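Here is a small tour of a few functions from each module. The sample data and helper names (words, fib, int_from_binary) are hypothetical and only serve the illustration.

```python
from collections import Counter, OrderedDict, defaultdict
from functools import lru_cache, partial
from itertools import chain, groupby, islice

# itertools: lazy building blocks for iteration.
merged = list(chain([1, 2], [3, 4]))          # [1, 2, 3, 4]
first_three = list(islice(range(100), 3))     # [0, 1, 2]
words = ["apple", "avocado", "banana", "blueberry"]
# groupby only groups consecutive items, so the input must be sorted by the key.
by_initial = {k: list(g) for k, g in groupby(words, key=lambda w: w[0])}

# functools: partial pre-fills arguments, lru_cache memoizes results.
int_from_binary = partial(int, base=2)
value = int_from_binary("1010")               # 10

@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# collections: specialised containers.
counts = Counter("mississippi")               # counts["s"] == 4
groups = defaultdict(list)                    # missing keys start as []
for word in words:
    groups[len(word)].append(word)
ordered = OrderedDict(a=1, b=2)
ordered.move_to_end("a")                      # key order is now b, a
```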

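For large or semi-structured data that does not fit in memory, some people online recommend Blaze for I/O; I have not tried it myself.

Other ideas: GPU computing (PyCUDA, GPULib), and translating Python code into C, C++, or LLVM code (Numba; NumbaPro also supports GPUs).

2. Pythonic

String operations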

## Not recommended

colors = ['red', 'blue', 'green', 'yellow']
result = ''
for s in colors:
    result += s  # each assignment discards the previous string object and creates a new one

## Recommended

colors = ['red', 'blue', 'green', 'yellow']
result = ''.join(colors)  # the result is built once, with no intermediate strings
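The same idea extends to separators, non-string items, and strings assembled inside a loop. A short sketch, with made-up variable names:

```python
colors = ['red', 'blue', 'green', 'yellow']

# join with a separator
csv_line = ', '.join(colors)                      # 'red, blue, green, yellow'

# join only accepts strings, so convert other types first
numbers = [1, 2, 3]
number_line = '-'.join(str(n) for n in numbers)   # '1-2-3'

# when a loop is unavoidable, collect the pieces in a list and join once at the end
parts = []
for index, color in enumerate(colors, start=1):
    parts.append(f"{index}:{color}")
report = ' '.join(parts)                          # '1:red 2:blue 3:green 4:yellow'
```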

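Categoricals

pandas stores text columns with the object dtype, which holds ordinary Python str objects. This is a common cause of slow code, because operations on object columns run at Python-object speed rather than at C speed.

Categoricals are a newer, powerful feature: they encode categorical data as integer codes, so text data can be processed at C speed.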

df['gender'] = df['gender'].astype('category') # Categorize!
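To see what the conversion buys you, here is a rough sketch that compares memory use before and after. The DataFrame is invented for the demo, and the exact byte counts depend on your pandas version, so none are shown.

```python
import pandas as pd

# Hypothetical data: 100,000 rows containing only two distinct strings.
df = pd.DataFrame({"gender": ["female", "male", "female", "male"] * 25_000})

text_bytes = df["gender"].memory_usage(deep=True)       # stored as text
df["gender"] = df["gender"].astype("category")
category_bytes = df["gender"].memory_usage(deep=True)   # stored as integer codes

print(df["gender"].dtype)                        # category
print(list(df["gender"].cat.categories))         # ['female', 'male']
print(df["gender"].cat.codes.head(4).tolist())   # [0, 1, 0, 1]
print(text_bytes, category_bytes)                # the category column is much smaller
```

The gain is largest when a column has few distinct values relative to its length; a column of mostly unique strings benefits little.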

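Besides the static approach, Categoricals can also be created dynamically:

d = pd.Series(scores).describe()  # scores: the raw score values, assumed to match df['score']
score_ranges = [d['min'] - 1, d['mean'], d['max'] + 1]
score_labels = ['Role', 'Star']
# pd.cut(ori_data, bins, labels) assigns each value in ori_data to an interval
# delimited by the edges in bins, then names each interval with the matching entry in labels
df['level'] = pd.cut(df['score'], score_ranges, labels=score_labels)
print(df['level'])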
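A self-contained version with concrete numbers makes the binning easier to follow. The five scores are invented for illustration; the bin edges and labels follow the snippet above.

```python
import pandas as pd

# Hypothetical scores; their mean is exactly 74.
df = pd.DataFrame({"score": [55, 62, 70, 88, 95]})

d = df["score"].describe()
score_ranges = [d["min"] - 1, d["mean"], d["max"] + 1]   # [54, 74, 96]
score_labels = ["Role", "Star"]

# Bins are right-closed by default: (54, 74] -> 'Role', (74, 96] -> 'Star'.
df["level"] = pd.cut(df["score"], score_ranges, labels=score_labels)
print(df["level"].tolist())   # ['Role', 'Role', 'Role', 'Star', 'Star']
print(df["level"].dtype)      # category
```

The -1 on the lower edge matters: because the left edge of each interval is open, the minimum score would otherwise fall outside every bin and come back as NaN.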

3. What if it is still not fast enough

Dig through the NumPy documentation, search Google, and give PyPy a try.
