Plotting the KS Curve and Computing the KS Value in Python

KS formula

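KS measures how well a binary classifier separates the two classes. At a given score threshold, take the samples selected by that threshold and compute, for each class, the share of that class's total that was selected; the gap between the two shares is the separation at that threshold. The KS value is the largest such gap over all thresholds. A larger KS means the model can capture more of one class while misclassifying as few samples of the other class as possible. With Cum. B and Cum. G denoting the cumulative counts of bad and good samples selected by the threshold, the formula is:
$KS=\max\left(\frac{\text{Cum. B}}{\text{Bad total}} - \frac{\text{Cum. G}}{\text{Good total}}\right)$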
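To make the formula concrete, here is a minimal sketch in plain Python on made-up toy data. The function name `ks_statistic` and the sample scores are illustrative assumptions, not part of the original code. It ranks samples from the highest score down, accumulates the bad and good shares, and keeps the largest gap, evaluating only at the end of each group of tied scores.

```python
def ks_statistic(scores, labels):
    """Return (ks, score_at_ks). Label 1 = bad, label 0 = good."""
    bad_total = sum(1 for y in labels if y == 1)
    good_total = len(labels) - bad_total
    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    cum_bad = cum_good = 0
    best_ks, best_score = 0.0, None
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        # Only evaluate once every sample with this score has been counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie:
            diff = cum_bad / bad_total - cum_good / good_total
            if diff > best_ks:
                best_ks, best_score = diff, score
    return best_ks, best_score


# Toy data (hypothetical): 4 bad and 4 good samples.
scores = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35, 0.25, 0.15]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
ks, score_at_ks = ks_statistic(scores, labels)
print(ks, score_at_ks)  # 0.5 0.85
```

On this toy set, flagging everything scoring 0.85 or higher captures 2 of 4 bad samples and no good ones, which gives the gap of 0.5.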

Input data

Inputs: predictions, labels, cut_point

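predictions: the predicted value for each sample, a probability between 0 and 1
labels: the true value (0 or 1) for each sample; in this example 1 marks a bad customer
cut_point: the number of threshold points at which KS is evaluated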

Data preview: labels in the left column, predictions in the right column

head -4 test_predict_res.txt
0.0 0.831193
0.0 0.088209815
1.0 0.93411493
0.0 0.022157196

Python implementation

import numpy as np
import matplotlib.pyplot as plt


def ks_plot(predictions, labels, cut_point=100):
    good_len = len([x for x in labels if x == 0])  # total number of good customers
    bad_len = len([x for x in labels if x == 1])  # total number of bad customers
    predictions_labels = list(zip(predictions, labels))
    good_point = []  # cumulative good rate at each threshold
    bad_point = []  # cumulative bad rate at each threshold
    diff_point = []  # good rate minus bad rate at each threshold

    x_axis_range = np.linspace(0, 1, cut_point)
    for i in x_axis_range:
        hit_data = [x[1] for x in predictions_labels if x[0] <= i]  # labels of samples scoring at or below the threshold
        good_hit = len([x for x in hit_data if x == 0])  # good customers at or below the threshold
        bad_hit = len([x for x in hit_data if x == 1])  # bad customers at or below the threshold
        good_rate = good_hit / good_len  # share of all good customers
        bad_rate = bad_hit / bad_len  # share of all bad customers
        diff = good_rate - bad_rate  # separation at this threshold
        good_point.append(good_rate)
        bad_point.append(bad_rate)
        diff_point.append(diff)

    ks_value = max(diff_point)  # KS is the largest separation
    ks_x_axis = diff_point.index(ks_value)  # index of the threshold where KS is reached
    ks_good_point, ks_bad_point = good_point[ks_x_axis], bad_point[ks_x_axis]  # good and bad rates at that threshold
    threshold = x_axis_range[ks_x_axis]  # threshold where KS is reached

    plt.plot(x_axis_range, good_point, color="green", label="good rate")
    plt.plot(x_axis_range, bad_point, color="red", label="bad rate")
    plt.plot(x_axis_range, diff_point, color="darkorange", alpha=0.5)
    plt.plot([threshold, threshold], [0, 1], linestyle="--", color="black", alpha=0.3, linewidth=2)
    
    plt.scatter([threshold], [ks_good_point], color="white", edgecolors="green", s=15)
    plt.scatter([threshold], [ks_bad_point], color="white", edgecolors="red", s=15)
    plt.scatter([threshold], [ks_value], color="white", edgecolors="darkorange", s=15)
    plt.title("KS={:.3f} threshold={:.3f}".format(ks_value, threshold))
    
    plt.text(threshold + 0.02, ks_good_point + 0.05, round(ks_good_point, 2))
    plt.text(threshold + 0.02, ks_bad_point + 0.05, round(ks_bad_point, 2))
    plt.text(threshold + 0.02, ks_value + 0.05, round(ks_value, 2))
    
    plt.legend(loc=4)
    plt.grid()
    plt.show()


if __name__ == "__main__":
    # Read the predictions and true labels
    labels = []
    predictions = []
    with open("test_predict_res.txt", "r", encoding="utf8") as f:
        for line in f.readlines():
            labels.append(float(line.strip().split()[0]))
            predictions.append(float(line.strip().split()[1]))

    ks_plot(predictions, labels)

ks_plot.png
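The loop in `ks_plot` rescans every sample for each threshold, which can get slow on large files. As a possible alternative, the sketch below computes the same three curves with NumPy by sorting each class once and counting with `np.searchsorted`. The function name `ks_curve` and the toy data are assumptions made for illustration; plotting is left out.

```python
import numpy as np


def ks_curve(predictions, labels, cut_point=100):
    """Return (thresholds, good_rate, bad_rate, diff) on an evenly spaced grid."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels)
    thresholds = np.linspace(0, 1, cut_point)

    # Sort each class once so counts can be looked up by binary search.
    good_scores = np.sort(predictions[labels == 0])
    bad_scores = np.sort(predictions[labels == 1])

    # side="right" counts the scores that are <= each threshold.
    good_rate = np.searchsorted(good_scores, thresholds, side="right") / good_scores.size
    bad_rate = np.searchsorted(bad_scores, thresholds, side="right") / bad_scores.size
    return thresholds, good_rate, bad_rate, good_rate - bad_rate


# Toy data (hypothetical): 4 bad and 4 good samples, 11 thresholds from 0 to 1.
predictions = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35, 0.25, 0.15]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
thresholds, good_rate, bad_rate, diff = ks_curve(predictions, labels, cut_point=11)
best = int(np.argmax(diff))
ks_value, ks_threshold = float(diff[best]), float(thresholds[best])
```

The returned arrays can be passed to the same `plt.plot` calls used in `ks_plot`. Like the original loop, this assumes both classes are present, otherwise the division fails.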

Interpreting the KS plot

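Take enterprise risk prediction as an example, where a predicted probability closer to 1 indicates a higher-risk enterprise. With 0.121 as the classifier's probability threshold, KS reaches its maximum of 0.526. In other words, if enterprises scoring above 0.121 are flagged as bad, the model catches 70% of the bad enterprises while wrongly flagging 17% of the good ones.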
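As a quick consistency check on these figures: the plot tracks the share of each class scoring at or below the threshold, so 17% of good enterprises above 0.121 means about 83% below it, and 70% of bad enterprises above it means about 30% below. The percentages are rounded, so the match is only approximate.

```latex
KS \approx (1 - 0.17) - (1 - 0.70) = 0.83 - 0.30 = 0.53 \approx 0.526
```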

Computing KS with sklearn.metrics.roc_curve

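roc_curve returns three arrays: fpr, tpr, and the thresholds. The ROC curve plots the (fpr, tpr) pairs obtained at the different thresholds. The call looks like this:

from sklearn.metrics import roc_curve

# val_y: true labels, y_pred: predicted probabilities
fpr, tpr, threshold = roc_curve(val_y, y_pred)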

ROC curve

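where:
fpr: the false alarm rate. Every sample scoring above the threshold is flagged as bad; fpr is the share of all good samples that were flagged as bad even though they are good.
tpr: the hit rate. Every sample scoring above the threshold is flagged as bad; tpr is the share of all bad samples that were flagged as bad and really are bad.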
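The two definitions can be checked by hand at a single threshold. The sketch below is plain Python on made-up data; the helper name `rates_at_threshold` is an assumption for illustration. It flags samples scoring at or above the threshold, which is the convention roc_curve uses.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (fpr, tpr) when samples scoring >= threshold are flagged bad (label 1)."""
    flagged = [y for s, y in zip(scores, labels) if s >= threshold]
    bad_total = sum(1 for y in labels if y == 1)
    good_total = len(labels) - bad_total
    true_pos = sum(1 for y in flagged if y == 1)  # flagged and really bad
    false_pos = len(flagged) - true_pos  # flagged but actually good
    return false_pos / good_total, true_pos / bad_total


# Toy data (hypothetical): 4 bad and 4 good samples.
scores = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35, 0.25, 0.15]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
fpr_at, tpr_at = rates_at_threshold(scores, labels, 0.6)
print(fpr_at, tpr_at)  # 0.25 0.75
```

At threshold 0.6, four samples are flagged: three of the four bad ones (tpr 0.75) and one of the four good ones (fpr 0.25).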

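The ROC curve uses fpr as the horizontal axis and tpr as the vertical axis. Moving from left to right, the threshold falls from 1 toward 0, and samples scoring above the threshold are classified as bad (positive). Ideally fpr is as small and tpr as large as possible, so a better curve bends toward the upper-left corner.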

fpr,tpr

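Now back to KS. At a given threshold, the ROC fpr is exactly the share of all good customers selected by that threshold in the KS plot, and tpr is the share of all bad customers. The difference is only in presentation: the KS plot puts the threshold on the horizontal axis and draws tpr and fpr on the vertical axis, while the ROC curve puts fpr on the horizontal axis and tpr on the vertical axis.
The thresholds returned by roc_curve look like this: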

threshold
Out[107]: 
array([1.91122128, 0.91122128, 0.89363161, ..., 0.25025188, 0.24763843,
       0.23871425])

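The thresholds are sorted in descending order, and each fpr and tpr value is computed over the samples scoring at or above the corresponding threshold.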

threshold[::-1]
Out[108]: 
array([0.23871425, 0.24763843, 0.25025188, ..., 0.89363161, 0.91122128,
       1.91122128])

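The length of threshold is not fixed. Its smallest value is the smallest predicted probability, and its largest value is the largest predicted probability plus 1. That first value is meaningless as a threshold and exists only for implementation reasons (newer scikit-learn releases use inf there instead), so the second-largest value is the actual maximum prediction. To draw the KS plot, the largest threshold therefore has to be replaced. Because both ends are bounded by the scores that actually occur in the prediction set, the curves may also stop short of 0 and 1 and need to be padded by hand. The plotting code is: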

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, threshold = roc_curve(val_y, y_pred)
diff = tpr - fpr  # separation at each threshold
ks_index = int(diff.argmax())  # position of the largest separation
ks = diff[ks_index]
ks_threshold = threshold[ks_index]
ks_good_point = fpr[ks_index]
ks_bad_point = tpr[ks_index]

fpr = fpr.tolist()
tpr = tpr.tolist()
diff = diff.tolist()
threshold = threshold.tolist()

threshold[0] = 1.0  # replace the artificial first threshold with 1
threshold.append(0.0)  # add a minimum threshold of 0
fpr.append(1)  # pad: at threshold 0 every sample is selected
tpr.append(1)  # pad
diff.append(0)  # pad

plt.plot([ks_threshold, ks_threshold], [0, 1], linestyle="--", color="black", alpha=0.3, linewidth=2)

plt.scatter([ks_threshold], [ks_good_point], color="white", edgecolors="green", s=15)
plt.scatter([ks_threshold], [ks_bad_point], color="white", edgecolors="red", s=15)
plt.scatter([ks_threshold], [ks], color="white", edgecolors="darkorange", s=15)
plt.title("KS={:.3f} threshold={:.3f}".format(ks, ks_threshold))

plt.text(ks_threshold + 0.02, ks_good_point + 0.05, round(ks_good_point, 2))
plt.text(ks_threshold + 0.02, ks_bad_point + 0.05, round(ks_bad_point, 2))
plt.text(ks_threshold + 0.02, ks + 0.05, round(ks, 2))

plt.plot(threshold, fpr, c="green", label="good_rate")
plt.plot(threshold, tpr, c="red", label="bad rate")
plt.plot(threshold, diff, c="orange", label="ks")
plt.gca().invert_xaxis()
plt.legend(loc=4)
plt.grid()
plt.show()

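The result is shown below. Compared with the first method the horizontal axis is reversed: this plot accumulates the samples above each threshold, while the first accumulates the samples below it.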

KS plot
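The mirror relationship between the two methods can be written out as follows, as a sketch that ignores samples scoring exactly at the threshold $t$. Here $G_{\le}(t)$ and $B_{\le}(t)$ are shorthand introduced only for this note: the shares of good and bad samples scoring at or below $t$, which is what ks_plot accumulates.

```latex
G_{\le}(t) - B_{\le}(t)
  = \bigl(1 - \mathrm{fpr}(t)\bigr) - \bigl(1 - \mathrm{tpr}(t)\bigr)
  = \mathrm{tpr}(t) - \mathrm{fpr}(t)
```

Both methods therefore trace the same difference curve and should report the same KS, up to the resolution of the evenly spaced threshold grid used by the first method.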