Implementing SQL GROUP BY with NumPy and pandas

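Today is my first day learning pandas. Until now I have mostly used SQL, but today I need to run an analysis in a different data environment, where an awk script would get rather long and Python one-liners on the command line are not flexible enough. So I need to translate SQL's GROUP BY + CASE WHEN logic into pandas.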

The SQL logic to reproduce:

The pandas implementation is as follows:

import pandas as pd

import numpy as np

import argparse

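# step1: read the data (in_path is the path to the tab-separated input file)

# NOTE: with header=None the columns are labeled 0..N-1, so pass names=[...] with the real column names (or drop header=None if the file has a header row) before selecting columns by name below

df = pd.read_csv(in_path, sep='\t', header=None, na_values='9999')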

# step2.1: bin the score (below 500 -> -1, above 700 -> 999, otherwise buckets of width 20)

df['score'] = df['score'].astype(int)

df['scoreTag'] = np.where(df['score']<500,-1,np.where(df['score']>700,999,np.ceil(df['score']/20)))

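# step2.2: compute the id bin (keep the id only where flag >= 2)

# Series.where leaves NaN where the condition is False; the string 'NA' would survive dropna() in step3 and be counted as one extra id

df['id_dpd'] = df['id'].where(df['flag'] >= 2)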

# step2.3: bin the amount (buckets of 10000) and keep the principal only for overdue rows

df['amt'] = df['amt'].fillna(0).astype(float)

df['amtFlag'] = np.ceil(df['amt']/10000)

df['principal_payable_dpd'] = np.where(df['Tag_Timing']>=2,df['principal_payable_sum'],0)

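# step3: final grouping

df_SAF = df.groupby(['segment', 'amtFlag', 'scoreTag', 'userType']).agg({
    'id': 'nunique',                  # distinct ids per group (NaN ids are not counted)
    'id_dpd': 'nunique',              # distinct ids with flag >= 2 (NaN is skipped)
    'principal_payable_sum': 'sum',   # total principal per group
    'principal_payable_dpd': 'sum'    # principal on rows where Tag_Timing >= 2
})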
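The conditional distinct count in step2.2 and step3 can be tried out on a few rows. The sketch below uses invented toy values and, as a simplifying assumption, reuses the `flag` column for both conditions (the code above uses `Tag_Timing` for the principal). It also uses named aggregation, which needs pandas 0.25 or later, so the output column names are my own choice rather than the post's.

```python
import pandas as pd

# Toy data: column names mirror the post, values are invented for illustration.
toy = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B"],
    "id": [1, 2, 2, 3, 4],
    "flag": [0, 2, 3, 1, 2],
    "principal_payable_sum": [100.0, 200.0, 50.0, 300.0, 400.0],
})

# CASE WHEN flag >= 2 THEN id END: keep id where the condition holds, NaN elsewhere.
toy["id_dpd"] = toy["id"].where(toy["flag"] >= 2)

# CASE WHEN flag >= 2 THEN principal ELSE 0 END.
toy["principal_payable_dpd"] = toy["principal_payable_sum"].where(toy["flag"] >= 2, 0)

# GROUP BY segment with distinct counts and sums; nunique skips NaN by default.
summary = toy.groupby("segment").agg(
    id_cnt=("id", "nunique"),
    id_dpd_cnt=("id_dpd", "nunique"),
    principal_sum=("principal_payable_sum", "sum"),
    principal_dpd_sum=("principal_payable_dpd", "sum"),
)
print(summary)
```

Because the rows that fail the condition hold NaN rather than a placeholder string, they drop out of the distinct count without any extra handling.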

With the steps above, the SQL GROUP BY + CASE WHEN logic is fully reproduced in pandas.
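When a CASE WHEN has more than two branches, nested np.where calls get hard to read. One possible alternative (not what the post uses) is np.select, which takes a list of conditions and a matching list of results, checked in order like WHEN clauses. The sketch below redoes the score binning from step2.1 on a handful of invented scores.

```python
import numpy as np
import pandas as pd

# Invented scores covering each branch of the CASE WHEN.
scores = pd.Series([450, 500, 505, 520, 700, 701])
values = scores.to_numpy()

# WHEN score < 500 THEN -1, WHEN score > 700 THEN 999, ELSE ceil(score / 20).
conditions = [values < 500, values > 700]
choices = [-1, 999]
score_tag = np.select(conditions, choices, default=np.ceil(values / 20))

print(score_tag.tolist())
```

The first matching condition wins, so the order of `conditions` plays the same role as the order of WHEN clauses in SQL.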

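One step in the middle had a bug. My first version was math.ceil(df['amt']/10000), which failed with "cannot convert the series to <type 'float'>". It took a lot of searching before I realized that the math module only operates on a single number and cannot work on a whole column. Changing it to np.ceil(df['amt']/10000) fixed the problem. The root cause was that I was not clear about the data structures involved.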
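The difference is easy to reproduce on a small Series. This is a minimal sketch with invented amounts; the exact wording of the error message varies between Python and pandas versions.

```python
import math

import numpy as np
import pandas as pd

# Invented amounts for illustration.
amt = pd.Series([0.0, 1.0, 10000.0, 10001.0, 25000.0])

# math.ceil expects one number, so a multi-element Series raises TypeError.
try:
    math.ceil(amt / 10000)
    scalar_call_failed = False
except TypeError as exc:
    scalar_call_failed = True
    print("math.ceil failed:", exc)

# np.ceil is vectorized: it is applied element by element and returns a Series.
amt_flag = np.ceil(amt / 10000)
print(amt_flag.tolist())
```

As a rule of thumb, functions from the math module take scalars, while their NumPy counterparts accept whole arrays or Series.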
