Financial Risk Control AI: Scorecard Model Algorithms (3)

4. Model Training

a. WOE Value Substitution

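In the previous article we obtained the binning data and the WOE values for every variable. Now we replace the bin number of each variable with the corresponding WOE value. In other words, the binned data originally recorded which bin each value falls into, and we now substitute the WOE value of that bin.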

The implementation is as follows:

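def replace_data(cut, cut_woe):
    # Sort the distinct bin numbers so the m-th bin lines up with the m-th WOE value
    bins = sorted(cut.unique())
    # Build the bin -> WOE mapping once and apply it in a single pass.
    # Replacing in place one bin at a time would modify the caller's data and could
    # overwrite a value that had already been substituted.
    mapping = dict(zip(bins, cut_woe.values))
    return cut.map(mapping)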

df_new=pd.DataFrame() # new DataFrame that holds the WOE-transformed data

df_new["SeriousDlqin2yrs"]=train["SeriousDlqin2yrs"]

df_new["RevolvingUtilizationOfUnsecuredLines"]=replace_data(cut1,cut1_woe)

df_new["age"]=replace_data(cut2,cut2_woe)

df_new["NumberOfTime30-59DaysPastDueNotWorse"]=replace_data(cut3,cut3_woe)

df_new["DebtRatio"]=replace_data(cut4,cut4_woe)

df_new["MonthlyIncome"]=replace_data(cut5,cut5_woe)

df_new["NumberOfOpenCreditLinesAndLoans"]=replace_data(cut6,cut6_woe)

df_new["NumberOfTimes90DaysLate"]=replace_data(cut7,cut7_woe)

df_new["NumberRealEstateLoansOrLines"]=replace_data(cut8,cut8_woe)

df_new["NumberOfTime60-89DaysPastDueNotWorse"]=replace_data(cut9,cut9_woe)

df_new["NumberOfDependents"]=replace_data(cut10,cut10_woe)

Let's take a look at df_new after the substitution:

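Seen this way, things start to make sense. After the WOE transformation, the modeling problem becomes a logistic regression over real-valued inputs.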
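To make the substitution concrete, here is a minimal sketch in plain Python. The bin numbers and WOE values are made up for illustration, and to_woe is a hypothetical helper, not part of the code above.

```python
# Hypothetical bin number assigned to each of six records for one variable
bin_ids = [0, 2, 1, 0, 3, 2]

# Hypothetical WOE value of each bin, ordered by bin number (made-up numbers)
woe_by_bin = [-1.2, -0.3, 0.4, 1.1]


def to_woe(bin_ids, woe_by_bin):
    """Replace every bin number with the WOE value of that bin."""
    return [woe_by_bin[b] for b in bin_ids]


woe_column = to_woe(bin_ids, woe_by_bin)
print(woe_column)
```

Every record that falls into the same bin ends up with the same real number, which is what lets the logistic regression treat the column as a continuous input.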

We will not go into the theory of logistic regression here. We simply call the statsmodels package to fit it:

import statsmodels.api as sm

from sklearn.metrics import roc_curve, auc

from sklearn.model_selection import train_test_split

x=df_new.iloc[:,1:]

y=df_new['SeriousDlqin2yrs']

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)

X1=sm.add_constant(x_train)

logit=sm.Logit(y_train,X1)

result=logit.fit()

print(result.summary())

Finally, the result is printed:

Logistic regression model results

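The first column, coef, is important: it holds the feature weight of each variable and will be used later when we convert the model into scoring rules.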

5. Model Evaluation

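At this point the model is essentially complete, and we need to check how well it works. The usual way to assess a model's predictive ability is the ROC curve and the AUC. sklearn.metrics makes both easy to compute, and we plot the curve to see the result.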

import matplotlib.pyplot as plt

X3 = sm.add_constant(x_test)

y_pred = result.predict(X3)

fpr, tpr, threshold = roc_curve(y_test, y_pred)

#print(y_pred)

roc_auc = auc(fpr, tpr)

#rocauc = auc(fpr, tpr)

plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)

plt.legend(loc='lower right')

plt.plot([0, 1], [0, 1], 'r--')

plt.xlim([0, 1])

plt.ylim([0, 1])

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.show()

Here is the plot:

The AUC is 83%. In general, anything above 75% is considered fairly good.
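For intuition about what this number measures: the AUC equals the probability that a randomly chosen defaulter gets a higher predicted score than a randomly chosen non-defaulter. Below is a minimal sketch of that pairwise definition in plain Python. auc_by_ranking is a hypothetical helper written for illustration and the four records are made up. In practice roc_curve and auc from sklearn.metrics should be used, since this version compares every pair and is slow on large data.

```python
def auc_by_ranking(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tied pair counts as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = default) and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, y_score))
```

In this toy data three of the four (defaulter, non-defaulter) pairs are ordered correctly, so the function returns 0.75.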

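The versions of this project I found online all report an AUC of 85%, so I wanted to find out which factor has the biggest effect on the AUC.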

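Recall that when selecting variables earlier we required IV > 0.02, which means every variable was kept. If we require IV > 0.1 instead, five variables are dropped. The resulting plot is: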

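The curve looks slightly different, but the AUC does not change. I also tried filling the missing monthly income values with a random forest, and that made no difference either.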

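Finally, I went back and restored the records that had been removed because "RevolvingUtilizationOfUnsecuredLines" was greater than 1, so that they were included in training again.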

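This time the AUC reached 85%. In any case, every factor mentioned in the earlier articles can be adjusted to see how it affects the result.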

6. Building the Scoring System

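We have now finished most of the modeling work and used the ROC curve to verify the model's predictive ability. The next step is to convert the logistic model into a standard scorecard.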

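The problem then becomes score = offset + factor*coe*x. Because factor*coe is a constant for each variable, the score is a linear function of x. Here offset is a constant called the base score. factor is the scaling factor derived from the PDO (points to double the odds), that is, factor = PDO / ln(2). coe is the set of feature weights, the coef column from the logistic regression above, and x is the WOE value of each variable. factor and offset are set by convention; the code below uses a PDO of 20 and a score of 600 at odds of 20. Note that coe also contains a constant term coe[0], for which x is taken to be 1, so it adds a constant bias on top of offset.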

factor = 20 / np.log(2)

offset = 600 - 20 * np.log(20) / np.log(2)

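coe = result.params.values  # assumption: coefficients read from the fitted model; coe[0] is the intercept

# score = offset + factor * sum(coe[i] * x[i]), with x[0] = 1 for the intercept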
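To see what these two constants do, here is a small worked sketch using the same numbers as above (PDO of 20, score of 600 at odds of 20). The names PDO, BASE_SCORE, BASE_ODDS and score_from_odds are introduced only for this illustration, and the sketch assumes the score rises with the log-odds, as in the formula above.

```python
import math

PDO = 20          # points needed to double the odds
BASE_SCORE = 600  # score assigned at the reference odds
BASE_ODDS = 20    # reference odds

factor = PDO / math.log(2)
offset = BASE_SCORE - factor * math.log(BASE_ODDS)


def score_from_odds(odds):
    """Score implied by a given odds value under this scaling."""
    return offset + factor * math.log(odds)


print(round(factor, 2), round(offset, 2))
# Doubling the odds from 20 to 40 should add exactly PDO points
print(round(score_from_odds(20), 2), round(score_from_odds(40), 2))
```

With these settings factor is about 28.85 and offset about 513.56. The reference odds map to 600 points, and each doubling of the odds adds 20 points.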

Now compute the points for each variable:

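def get_score(coe, woe, factor):
    # Points for each bin of one variable: coefficient * WOE * factor, rounded to whole points
    scores = round(coe * woe * factor, 0)
    return scores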

x1 = get_score(coe[1], cut1_woe, factor)

x2 = get_score(coe[2], cut2_woe, factor)

x3 = get_score(coe[3], cut3_woe, factor)

x4 = get_score(coe[4], cut4_woe, factor)

x5 = get_score(coe[5], cut5_woe, factor)

x6 = get_score(coe[6], cut6_woe, factor)

x7 = get_score(coe[7], cut7_woe, factor)

x8 = get_score(coe[8], cut8_woe, factor)

x9 = get_score(coe[9], cut9_woe, factor)

x10 = get_score(coe[10], cut10_woe, factor)

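print("Points for revolving utilization of unsecured lines: {}".format(x1))
print("Points for age: {}".format(x2))
print("Points for number of times 30-59 days past due: {}".format(x3))
print("Points for debt ratio: {}".format(x4))
print("Points for monthly income: {}".format(x5))
print("Points for number of open credit lines and loans: {}".format(x6))
print("Points for number of times 90 days late: {}".format(x7))
print("Points for number of real estate loans or lines: {}".format(x8))
print("Points for number of times 60-89 days past due: {}".format(x9))
print("Points for number of dependents: {}".format(x10))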

The output looks something like this:

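To score a record, take this table, look up the points for the bin each variable falls into, and add them up. The following code converts input test records directly into scores: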
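As a rough illustration of the lookup-and-sum step, here is a minimal sketch in plain Python. Everything in it is hypothetical: the bin edges, the points per bin, BASE_POINTS, and the score_record helper are made up for the example, and only two variables are shown. It also assumes right-closed bins, as pandas produces by default; the real values come from the binning and get_score steps above.

```python
from bisect import bisect_left

# Hypothetical scorecard: per variable, the inner bin edges and the points of each bin.
# A variable with k edges has k + 1 bins. All numbers are made up for illustration.
SCORECARD = {
    "age": {"edges": [30, 45, 60], "points": [-8, -3, 4, 12]},
    "DebtRatio": {"edges": [0.3, 0.6], "points": [5, 0, -9]},
}
BASE_POINTS = 550  # hypothetical base score (offset plus the intercept term)


def score_record(record, scorecard=SCORECARD, base_points=BASE_POINTS):
    """Sum the base points and the points of the bin each variable falls into."""
    total = base_points
    for name, card in scorecard.items():
        # bisect_left counts the edges strictly below the value,
        # so bin k covers edges[k-1] < value <= edges[k] (right-closed bins)
        bin_index = bisect_left(card["edges"], record[name])
        total += card["points"][bin_index]
    return total


print(score_record({"age": 52, "DebtRatio": 0.25}))
```

For this made-up record, age 52 falls into the third age bin (4 points) and a debt ratio of 0.25 into the first bin (5 points), giving 550 + 4 + 5 = 559.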

Here is the result:

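That finally wraps things up, although I still think there are plenty of details worth studying further, such as the number of bins and the binning algorithm.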

The data and code are attached.