Sklearn-Pandas-Numpy Mini Lessons

This post is just a record of small functions I have used in day-to-day work with sklearn, numpy, and pandas, kept here for easy reuse later.

Disabling scientific notation in numpy and pandas print output

import pandas as pd
import numpy as np


np.set_printoptions(precision=3, suppress=True)
np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
pd.set_option('display.precision', 5) # set the display precision
pd.set_option('display.float_format', lambda x: '%.5f' % x) # show numbers plainly instead of in scientific notation

# In a Jupyter notebook, display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Show all columns
pd.set_option('display.max_columns', None)
# Show up to 500 rows
pd.set_option('display.max_rows', 500)
# Set the maximum displayed width of a value to 100 characters (default 50)
pd.set_option('display.max_colwidth', 100)
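
The calls above change global state for the rest of the session. When the formatting is only needed for one print, a scoped variant may be more convenient. The following is a minimal sketch, assuming numpy 1.15 or later (for np.printoptions); the array and DataFrame are made-up sample data.

```python
import numpy as np
import pandas as pd

# Made-up sample values that would normally print in scientific notation.
arr = np.array([0.000012345, 123456.789])
df = pd.DataFrame({"x": [0.000012345, 123456.789]})

# The options apply only inside each with-block and are restored afterwards.
with np.printoptions(precision=3, suppress=True):
    arr_text = str(arr)

with pd.option_context("display.float_format", "{:.5f}".format):
    df_text = df.to_string()

print(arr_text)
print(df_text)
```

Outside the with-blocks, numpy and pandas fall back to whatever options were active before.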

Non-scientific-notation output in PyTorch

# Print PyTorch tensors without scientific notation
import torch
torch.set_printoptions(
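    precision=2,    # number of digits after the decimal point, default 4
    threshold=1000, # number of elements above which the tensor is summarized, default 1000
    edgeitems=3,    # items shown at each end of a dimension when summarizing, default 3
    linewidth=150,  # maximum characters per line before wrapping, default 80
    profile=None,
    sci_mode=False  # False disables scientific notation; with the default (None) PyTorch picks the format itself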
)

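1. How do you combine a sparse matrix with a dense matrix?

Background: the TF-IDF features that sklearn produces form a sparse matrix, but you often also want to add other features, and those are usually dense (for example, other pandas columns).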

from scipy import sparse
import numpy as np

A = sparse.csr_matrix([[1,0,0],[0,1,0]])
B = np.array([1,2])

# Combine into a dense matrix
np.column_stack((A.toarray(), B))
# Output
array([[1, 0, 0, 1],
       [0, 1, 0, 2]], dtype=int64)

# Combine into a sparse matrix
sparse.hstack((A,sparse.csr_matrix(B).T))
# Output
<2x4 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in COOrdinate format>

sparse.hstack((A,sparse.csr_matrix(B).T)).toarray()
# Output
array([[1, 0, 0, 1],
       [0, 1, 0, 2]], dtype=int64)
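
To tie this back to the TF-IDF use case, here is a minimal sketch that appends dense pandas columns to a TF-IDF matrix. The DataFrame, its column names, and the values are hypothetical; only TfidfVectorizer and sparse.hstack are taken as given.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: one text column plus two numeric columns.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "red car"],
    "price": [1.5, 2.0, 300.0],
    "count": [3, 1, 2],
})

tfidf = TfidfVectorizer().fit_transform(df["text"])    # sparse, one column per token
dense = df[["price", "count"]].to_numpy(dtype=float)   # dense, shape (3, 2)

# Wrap the dense block as sparse and stack column-wise.
# format="csr" returns CSR directly instead of the default COO format.
features = sparse.hstack([tfidf, sparse.csr_matrix(dense)], format="csr")
print(features.shape)
```

Keeping the result sparse avoids materializing the full TF-IDF matrix, which matters once the vocabulary is large.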

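2. How can sklearn's LabelEncoder handle the OOV problem?

In pyspark, StringIndexer handles OOV (out-of-vocabulary) values very conveniently: set handleInvalid to 'skip' or 'keep'.
sklearn's LabelEncoder has no such option, so we have to handle OOV values ourselves.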

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(X)

le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
df[your_col].apply(lambda x: le_dict.get(x, <unknown_value>))

Reference: https://stackoverflow.com/questions/21057621/sklearn-labelencoder-with-never-seen-before-values
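
A runnable version of the dictionary lookup might look like the sketch below. The sample labels and the sentinel -1 for unseen values are assumptions chosen for illustration; pick whatever unknown value suits the downstream model.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: "fish" never appears in the training labels.
train = pd.Series(["cat", "dog", "bird"])
test = pd.Series(["dog", "fish", "cat"])

le = LabelEncoder().fit(train)

# classes_ is sorted, and each label's code is its position in classes_.
le_dict = {label: code for code, label in enumerate(le.classes_)}

UNKNOWN = -1  # assumed sentinel for labels not seen during fit
encoded = test.map(lambda x: le_dict.get(x, UNKNOWN))
print(encoded.tolist())
```

For encoding feature columns rather than targets, OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), available since scikit-learn 0.24, may be a simpler option.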

3. After a pandas groupby, how do you sort the rows within each group by a column and take the top n rows?

An example on the sample data df1:

df1.groupby(["平台","站点","媒体"]).apply(lambda x : 
  x.sort_values(by = "计费收入(精度保留)", ascending = False).head(5).reset_index(drop = True))

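
An alternative that avoids apply is to sort once and then take the head of each group. This is a sketch on hypothetical data; the column names platform, site, and revenue are placeholders, not the columns of df1.

```python
import pandas as pd

# Hypothetical data standing in for the real table.
df = pd.DataFrame({
    "platform": ["A", "A", "A", "B", "B"],
    "site": ["s1", "s1", "s1", "s2", "s2"],
    "revenue": [10.0, 30.0, 20.0, 5.0, 7.0],
})

top2 = (
    df.sort_values("revenue", ascending=False)   # sort once, globally
      .groupby(["platform", "site"])
      .head(2)                                   # first 2 rows of each group in sorted order
      .sort_values(["platform", "site", "revenue"], ascending=[True, True, False])
      .reset_index(drop=True)
)
print(top2)
```

Unlike the apply version, the group keys stay as ordinary columns instead of moving into the index.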

Filtering DataFrame rows by multiple strings

Reference: https://stackoverflow.com/questions/43389163/apply-multiple-string-containment-filters-to-pandas-dataframe-using-dictionary

# Values in the column contain any one of several strings
df[df["col"].str.contains("a|b|c")]

# Values in the column belong to a set of strings
df[df["col"].isin(["a", "b", "c"])]
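
Because str.contains treats its pattern as a regular expression by default, keywords containing characters such as "+" or "." need escaping. A minimal sketch, with a made-up DataFrame and keyword list:

```python
import re
import pandas as pd

df = pd.DataFrame({"col": ["apple pie", "c++ guide", "banana", "a.b"]})
keywords = ["apple", "c++", "a.b"]   # hypothetical search terms

# Escape each keyword so "+" and "." match literally, then join with "|".
pattern = "|".join(re.escape(k) for k in keywords)
contains_any = df[df["col"].str.contains(pattern, na=False)]

# isin compares whole values, so "apple pie" is not matched by "apple".
exact = df[df["col"].isin(keywords)]

print(contains_any["col"].tolist())
print(exact["col"].tolist())
```

Passing na=False treats missing values as non-matches, so the boolean mask never contains NaN.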

Filtering rows by index when the index is a MultiIndex

Filter on df.index.get_level_values(n), for example:

df[df.index.get_level_values(0).isin(MEDIA_TAGS)]

Reference: https://stackoverflow.com/questions/25224545/filtering-multiple-items-in-a-multi-index-python-panda-dataframe
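
A small self-contained sketch of the same filter follows. The index levels, their names, and the contents of MEDIA_TAGS are hypothetical.

```python
import pandas as pd

# Hypothetical two-level index: (media, site).
index = pd.MultiIndex.from_tuples(
    [("news", "s1"), ("video", "s1"), ("news", "s2"), ("music", "s3")],
    names=["media", "site"],
)
df = pd.DataFrame({"revenue": [1, 2, 3, 4]}, index=index)

MEDIA_TAGS = ["news", "music"]   # assumed list of values to keep

# Select the level by position ...
by_position = df[df.index.get_level_values(0).isin(MEDIA_TAGS)]
# ... or by name, which stays correct if the level order changes.
by_name = df[df.index.get_level_values("media").isin(MEDIA_TAGS)]

print(by_position)
```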

Simple, faster replacements for pandas apply

%timeit list(map(divide, df['A'], df['B']))                                   # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B'])                                # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])]                      # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]     # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True)                  # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1)              # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()]  # 11.6 s

Reference: https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c
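
The timings above rely on a divide function and a DataFrame that are not shown here. The sketch below assumes a simple divide that returns 0.0 for a zero denominator, and a tiny made-up DataFrame, to show that the faster forms give the same result as apply. The last variant, plain column arithmetic, is not among the timed options; it is added only for comparison.

```python
import numpy as np
import pandas as pd

def divide(a, b):
    """Assumed stand-in for the timed function: a / b, or 0.0 when b is 0."""
    return 0.0 if b == 0 else a / b

df = pd.DataFrame({"A": [1.0, 4.0, 9.0], "B": [2.0, 0.0, 3.0]})

# Row-wise apply: builds a Series object for every row.
via_apply = df.apply(lambda row: divide(row["A"], row["B"]), axis=1)

# map over the two columns: same Python function, much less per-row overhead.
via_map = pd.Series(list(map(divide, df["A"], df["B"])), index=df.index)

# Column arithmetic: replace zero denominators first so no division by zero happens.
safe_b = df["B"].where(df["B"] != 0, 1.0)
via_numpy = pd.Series(np.where(df["B"] == 0, 0.0, df["A"] / safe_b), index=df.index)

print(via_map.tolist())
```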

Flattening a MultiIndex to a single level

# Method 1: join the names of the levels
df.columns=["_".join(x) for x in df.columns.to_flat_index()]

# Method 2: drop a level directly; with more than two levels, drop repeatedly
df.columns=df.columns.droplevel(1)
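
Multi-level columns most often come from aggregating with several functions. A minimal sketch of both methods on made-up data (the column names key and val are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# Aggregating with several functions yields two-level columns:
# ("val", "sum") and ("val", "mean").
agg = df.groupby("key").agg({"val": ["sum", "mean"]})

# Method 1: join the level names into one string per column.
flat = agg.copy()
flat.columns = ["_".join(col) for col in flat.columns.to_flat_index()]

# Method 2: drop one level; here level 0 ("val"), keeping the function names.
dropped = agg.copy()
dropped.columns = dropped.columns.droplevel(0)

print(list(flat.columns))
print(list(dropped.columns))
```

Method 1 assumes every level value is a string; with numeric levels, convert with str before joining. Which level to drop in method 2 depends on which names should survive.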

Last edited: 2022.03.29 11:11:49