Lending Club Loan Data Analysis

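Background:

Lending Club, founded in 2006 and headquartered in San Francisco, operates a marketplace that acts as an intermediary for P2P loans. In its early years the company offered only personal loans. When a borrower applies for a loan on the platform, Lending Club has the applicant fill in an application form, online or offline, to collect basic information, and supplements it with data from third-party credit bureaus.
These attributes are used to fit a logistic regression model that predicts whether a borrower will default, so that Lending Club can decide whether to grant the loan.

Data source: the Lending Club website, loans issued from 2007 to 2011: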

https://www.lendingclub.com/statistics/additional-statistics?

Import packages and load the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
%matplotlib inline
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
Loandata = pd.read_csv('C:/Users/Jason/Desktop/DAdata/LoanStats3a_securev2.csv',skiprows=1)

I. A first look at the dataset

Loandata.shape

(39786, 150)
Each row is one loan record with 150 fields, described below:

[Figure: field descriptions (字段信息.png)]

Loandata.iloc[0]

Inspect the fields of the first record.

II. Preprocessing before the visual analysis

1. Drop columns that hold only one distinct value

orig_columns = Loandata.columns

drop_columns = []

for col in orig_columns:
    col_series = Loandata[col].dropna().unique()  # distinct non-null values in this column
    if len(col_series) == 1:  # a column with a single distinct value carries no information, so drop it
        drop_columns.append(col)
        
Loandata = Loandata.drop(drop_columns, axis=1)
print(drop_columns)

['pymnt_plan', 'out_prncp', 'next_pymnt_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'application_type', 'verification_status_joint', 'acc_now_delinq', 'bc_util', 'chargeoff_within_12_mths', 'delinq_amnt', 'percent_bc_gt_75', 'tax_liens', 'sec_app_mths_since_last_major_derog', 'hardship_flag', 'hardship_last_payment_amount']
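The loop above can also be written without explicit iteration. The following is a minimal sketch on a small made-up frame (the values are hypothetical, only the column names are borrowed from the dataset); it relies on `DataFrame.nunique()`, which ignores NaN by default just like `dropna().unique()`.

```python
import pandas as pd

# Toy frame standing in for Loandata (hypothetical values).
df = pd.DataFrame({
    "loan_amnt": [1000, 2500, 4000],
    "pymnt_plan": ["n", "n", "n"],        # one distinct value
    "tax_liens": [0.0, None, 0.0],        # one distinct value once NaN is ignored
    "term": ["36 months", "60 months", "36 months"],
})

# Number of distinct non-null values per column.
n_distinct = df.nunique()

# Columns with exactly one distinct value are constant and can be dropped.
drop_columns = n_distinct[n_distinct == 1].index.tolist()
df = df.drop(columns=drop_columns)
print(drop_columns)
```

A column that is entirely NaN has zero distinct values, so neither this sketch nor the original loop removes it; such columns are handled by the missing-value threshold in the next step.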

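2. Drop fields in which more than half of the values are missing

# thresh is the minimum number of non-null values a column needs in order to be kept
half_count = len(Loandata) // 2
Loandata = Loandata.dropna(thresh=half_count, axis=1)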
Loandata.shape

(39786, 50)
50 fields remain.
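The `thresh` argument is easy to misread: it counts non-null values to keep, not missing values to drop. A small sketch on a made-up four-row frame (hypothetical column names) shows which columns survive, next to the same rule expressed as a share of missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with three columns of different completeness (hypothetical data).
df = pd.DataFrame({
    "full": [1, 2, 3, 4],                             # 4 of 4 present
    "half": [1.0, np.nan, 3.0, np.nan],               # 2 of 4 present
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 1 of 4 present
})

# Keep columns that have at least half_count non-null values.
half_count = len(df) // 2
kept_by_thresh = df.dropna(thresh=half_count, axis=1).columns.tolist()

# Same rule for this frame, written as "at most 50% missing".
missing_share = df.isnull().mean()
kept_by_share = missing_share[missing_share <= 0.5].index.tolist()

print(kept_by_thresh)
print(kept_by_share)
```

Note that a column with exactly half of its values missing is kept under this rule.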

Loandata.isnull().sum()

Count the missing values in each field.

id                             0
loan_amnt                      0
funded_amnt                    0
funded_amnt_inv                0
term                           0
int_rate                       0
installment                    0
grade                          0
sub_grade                      0
emp_title                   2467
emp_length                  1078
home_ownership                 0
annual_inc                     0
verification_status            0
issue_d                        0
loan_status                    0
url                            0
desc                       12967
purpose                        0
title                         11
zip_code                       0
addr_state                     0
dti                            0
delinq_2yrs                    0
earliest_cr_line               0
fico_range_low                 0
fico_range_high                0
inq_last_6mths                 0
open_acc                       0
pub_rec                        0
revol_bal                      0
revol_util                    50
total_acc                      0
initial_list_status            0
out_prncp_inv                  0
total_pymnt                    0
total_pymnt_inv                0
total_rec_prncp                0
total_rec_int                  0
total_rec_late_fee             0
recoveries                     0
collection_recovery_fee        0
last_pymnt_d                  71
last_pymnt_amnt                1
last_credit_pull_d             2
last_fico_range_high           0
last_fico_range_low            0
policy_code                    0
pub_rec_bankruptcies         697
debt_settlement_flag           1
dtype: int64

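Columns such as desc and emp_title have many missing values and contribute little to the analysis or the model, so they are dropped. Identifier-like columns such as id, url, title and zip_code are dropped as well.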

Loandata = Loandata.drop(['id','url','desc','title','emp_title','zip_code'],axis=1)

Loandata.isnull().sum()

loan_amnt                     0
funded_amnt                   0
funded_amnt_inv               0
term                          0
int_rate                      0
installment                   0
grade                         0
sub_grade                     0
emp_length                 1078
home_ownership                0
annual_inc                    0
verification_status           0
issue_d                       0
loan_status                   0
purpose                       0
addr_state                    0
dti                           0
delinq_2yrs                   0
earliest_cr_line              0
fico_range_low                0
fico_range_high               0
inq_last_6mths                0
open_acc                      0
pub_rec                       0
revol_bal                     0
revol_util                   50
total_acc                     0
initial_list_status           0
out_prncp_inv                 0
total_pymnt                   0
total_pymnt_inv               0
total_rec_prncp               0
total_rec_int                 0
total_rec_late_fee            0
recoveries                    0
collection_recovery_fee       0
last_pymnt_d                 71
last_pymnt_amnt               1
last_credit_pull_d            2
last_fico_range_high          0
last_fico_range_low           0
policy_code                   0
pub_rec_bankruptcies        697
debt_settlement_flag          1
dtype: int64

# Map emp_length to integers with a replacement dictionary; missing values become 0
label_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        None: 0
    }
}
Loandata = Loandata.replace(label_dict)
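An alternative that touches only the one column is `Series.map` followed by `fillna`. This is a sketch on a made-up Series, not the author's code; the mapping dictionary reproduces the same values as `label_dict` above.

```python
import numpy as np
import pandas as pd

# Same mapping as label_dict, built programmatically for the "N years" entries.
emp_length_map = {"< 1 year": 0, "1 year": 1, "10+ years": 10}
emp_length_map.update({f"{n} years": n for n in range(2, 10)})

# Hypothetical sample of raw emp_length values, including a missing one.
s = pd.Series(["10+ years", "< 1 year", "3 years", np.nan, "1 year"])

# map() returns NaN for missing or unmapped values; fillna(0) then treats them as 0.
encoded = s.map(emp_length_map).fillna(0).astype(int)
print(encoded.tolist())
```

One design caveat: with `map`, any unexpected string also becomes 0 silently, so it can be worth checking `s[~s.isin(emp_length_map) & s.notnull()]` on real data before filling.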

3. Convert issue_d from string to datetime, check for nulls after the conversion, then sort chronologically

Loandata['issue_d'] = pd.to_datetime(Loandata['issue_d'])
Loandata['issue_d'].isnull().any()
-->> False
# Sort by issue date
Loandata = Loandata.sort_values(by=['issue_d'],ascending=True)
Loandata = Loandata.reset_index(drop=True)

Clean up the interest-rate field by stripping the % sign and converting it to float

Loandata["int_rate"] = Loandata["int_rate"].str.rstrip("%").astype("float")

III. A first exploratory analysis

1. States with the most loans:

Loandata.addr_state.value_counts()[:20].plot(kind='bar', figsize=(8, 4), title='State Loan Count')

State.png

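California has far more loans than any other state, probably because Lending Club is headquartered there and has developed its local business more deeply. New York, Florida and Texas follow.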

2. Bad-loan rate

Loandata['loan_status'].value_counts()
-->>Fully Paid     34116
    Charged Off     5670
Name: loan_status, dtype: int64

# Encode repayment status: charged-off loans are bad (0), everything else is good (1)
badloan = ['Charged Off']
def loan_condition(status):
    if status in badloan:
        return 0
    else:
        return 1
Loandata['loan_condition'] = Loandata['loan_status'].apply(loan_condition)
print('goodloan 1: badloan 0')
print(Loandata['loan_condition'].value_counts())
-->>goodloan 1: badloan 0
    1    34116
    0     5670
Name: loan_condition, dtype: int64

loan_condition.png
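The counts above translate directly into a bad-loan rate. The sketch below recomputes it from the two numbers in the `value_counts()` output; on the real frame, `Loandata['loan_status'].value_counts(normalize=True)` should give the same shares in one call.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
status_counts = pd.Series({"Fully Paid": 34116, "Charged Off": 5670})

# Share of each status among all loans.
rates = status_counts / status_counts.sum()
bad_rate = rates["Charged Off"]
print(f"bad-loan rate: {bad_rate:.2%}")
```

Roughly one loan in seven was charged off, which is the class imbalance that matters later when the model is trained.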

3. Loans issued per year

Loandata['year'] = Loandata['issue_d'].dt.year
sns.countplot(x='year', data=Loandata)  # countplot shows the number of loans per year, not the amount lent
plt.title('Loan Count by Year', fontsize=10)

year.png

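The number of loans issued rises year over year. The count plot shows loan counts only; the total amount lent per year requires a separate aggregation of loan_amnt.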
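To look at the amount lent rather than the number of loans, the data can be grouped by year and `loan_amnt` summed. The sketch below uses a made-up five-row frame with the same column names; the figures are hypothetical and only illustrate the aggregation.

```python
import pandas as pd

# Hypothetical loans spread over two years.
df = pd.DataFrame({
    "issue_d": pd.to_datetime(
        ["2007-06-01", "2007-12-01", "2008-03-01", "2008-07-01", "2008-11-01"]
    ),
    "loan_amnt": [5000, 7500, 3000, 12000, 6000],
})
df["year"] = df["issue_d"].dt.year

# Number of loans and total amount lent per year, side by side.
per_year = df.groupby("year")["loan_amnt"].agg(["count", "sum"])
print(per_year)
```

On the real data, `per_year["sum"].plot(kind="bar")` would give the amount-by-year chart that complements the count plot.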

4. Loan amounts and terms chosen by borrowers

plt.hist(Loandata.loan_amnt,bins=10,edgecolor='white',color='dodgerblue')

amount.png

Loandata['term'].value_counts()
-->> 36 months    29096
     60 months    10690
Name: term, dtype: int64

Loans between 4,000 and 12,000 are the most common, and most borrowers (29,096 of 39,786, about 73%) choose a 36-month term.

5. Interest-rate range

print(Loandata.int_rate.describe())
sns.distplot(Loandata.int_rate)
-->>count    39786.000000
mean        12.027873
std          3.727466
min          5.420000
25%          9.250000
50%         11.860000
75%         14.590000
max         24.590000
Name: int_rate, dtype: float64

rate.png

The mean interest rate is about 12%, and rates range from 5.42% to 24.59%.

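IV. With the exploratory analysis done, we move on to modeling. First the data needs more processing: fields that contribute little to the model are dropped to reduce computation, and because scikit-learn estimators do not accept string data, missing values, punctuation, % signs and string-valued fields must be handled as well.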

Loandata.columns
-->>Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'emp_length', 'home_ownership',
       'annual_inc', 'verification_status', 'issue_d', 'loan_status',
       'purpose', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line',
       'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'open_acc',
       'pub_rec', 'revol_bal', 'revol_util', 'total_acc',
       'initial_list_status', 'out_prncp_inv', 'total_pymnt',
       'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'last_pymnt_d', 'last_pymnt_amnt', 'last_credit_pull_d',
       'last_fico_range_high', 'last_fico_range_low', 'policy_code',
       'pub_rec_bankruptcies', 'debt_settlement_flag', 'loan_condition',
       'year'],
      dtype='object')

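Many fields still remain. In real work, deciding which fields to keep and which to drop is a substantial piece of engineering in its own right. Here I simply drop fields that are of no use for modeling, such as the principal received to date and the funded amounts.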

Loandata = Loandata.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "issue_d"], axis=1)
Loandata = Loandata.drop(["out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
Loandata = Loandata.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
Loandata.head(1)
-->>
loan_amnt   term    int_rate    installment emp_length  home_ownership  annual_inc  verification_status loan_status purpose ... total_acc   initial_list_status last_credit_pull_d  last_fico_range_high    last_fico_range_low policy_code pub_rec_bankruptcies    debt_settlement_flag    loan_condition  year
0   7500    36 months   13.75   255.43  0   OWN 22000.0 Not Verified    Fully Paid  debt_consolidation  ... 8   f   20-Jan  719 715 1   NaN N   1   2007

31 fields remain.

null_counts = Loandata.isnull().sum()
null_counts
-->>
loan_amnt                 0
term                      0
int_rate                  0
installment               0
emp_length                0
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
addr_state                0
dti                       0
delinq_2yrs               0
earliest_cr_line          0
fico_range_low            0
fico_range_high           0
inq_last_6mths            0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util               50
total_acc                 0
initial_list_status       0
last_credit_pull_d        2
last_fico_range_high      0
last_fico_range_low       0
policy_code               0
pub_rec_bankruptcies    697
debt_settlement_flag      1
loan_condition            0
year                      0
dtype: int64

Strip the % from revol_util and convert it to float

Loandata["revol_util"] = Loandata["revol_util"].str.rstrip("%").astype("float")

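Few values are missing, so dropping them does little harm; filling them with the maximum, minimum, mean or a similar statistic would also work. pub_rec_bankruptcies, which has the most gaps, is dropped as a column, and the few remaining rows with missing values are dropped.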

Loandata = Loandata.drop("pub_rec_bankruptcies", axis=1)
Loandata = Loandata.dropna(axis=0)
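As an illustration of the fill-in alternative mentioned above, the following sketch imputes missing `revol_util` values with the column median instead of dropping rows. The data is made up, and the choice of the median is an assumption for the example, not something the original analysis does.

```python
import numpy as np
import pandas as pd

# Hypothetical revol_util values with two gaps.
df = pd.DataFrame({"revol_util": [51.5, np.nan, 10.2, 14.0, np.nan, 75.8]})

# median() skips NaN, so it is computed from the four observed values.
median = df["revol_util"].median()

# Fill the gaps and keep every row.
df["revol_util_filled"] = df["revol_util"].fillna(median)
print(median)
print(df["revol_util_filled"].isnull().sum())
```

The median is less sensitive to extreme values than the mean, which is why it is a common default for skewed financial fields.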

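# last_credit_pull_d (a date string) and year are dropped as well, which matches the 22 columns shown in the output below
Loandata = Loandata.drop(['debt_settlement_flag', 'policy_code', 'initial_list_status', 'earliest_cr_line', 'addr_state', 'loan_status', 'last_credit_pull_d', 'year'], axis=1)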

Label-encode the remaining string-typed fields

import sklearn.preprocessing as sp

lbe = sp.LabelEncoder()
Loandata['home_ownership'] = lbe.fit_transform(Loandata['home_ownership'])
lbe = sp.LabelEncoder()
Loandata['verification_status'] = lbe.fit_transform(Loandata['verification_status'])
lbe = sp.LabelEncoder()
Loandata['purpose'] = lbe.fit_transform(Loandata['purpose'])
lbe = sp.LabelEncoder()
Loandata['term'] = lbe.fit_transform(Loandata['term'])
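The four encoding steps above repeat the same pattern, and the fitted encoders are discarded, so the integer codes cannot be mapped back to their labels later. A minimal sketch of the same idea as a loop, on a made-up frame, that keeps each encoder in a dictionary (the `encoders` name is introduced here for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two string columns (hypothetical values).
df = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "term": ["36 months", "60 months", "36 months", "36 months"],
})

encoders = {}
for col in ["home_ownership", "term"]:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc  # keep the fitted encoder so codes can be decoded later

# Codes follow the sorted order of the class labels.
print(df["home_ownership"].tolist())
print(list(encoders["home_ownership"].classes_))
```

Because the codes are assigned in sorted label order, they imply an ordering (MORTGAGE < OWN < RENT) that has no real meaning; for a linear model such as logistic regression, one-hot encoding with `pd.get_dummies` is a common alternative for nominal fields like purpose.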

Convert the remaining numeric fields to int

Loandata['total_acc'] = Loandata['total_acc'].astype('int64')
Loandata['revol_bal'] = Loandata['revol_bal'].astype('int64')
Loandata['delinq_2yrs'] = Loandata['delinq_2yrs'].astype('int64')

Loandata.head()
-->>
loan_amnt   term    int_rate    installment emp_length  home_ownership  annual_inc  verification_status purpose dti ... fico_range_high inq_last_6mths  open_acc    pub_rec revol_bal   revol_util  total_acc   last_fico_range_high    last_fico_range_low loan_condition
0   7500    0   13.75   255.43  0   3   22000.0 0   2   14.29   ... 664 0   7   0   4175    51.5    8   719 715 1
1   3500    0   10.28   113.39  0   4   20000.0 0   8   1.50    ... 684 0   17  0   1882    32.4    18  829 825 1
2   5750    0   7.43    178.69  10  0   125000.0    0   2   0.27    ... 794 0   10  0   2817    10.2    16  799 795 1
3   5000    0   7.43    155.38  6   4   40000.0 0   0   2.55    ... 774 2   4   0   2562    14.0    7   729 725 1
4   1200    0   11.54   39.60   0   4   20000.0 0   1   2.04    ... 664 2   3   0   1153    75.8    4   704 700 1
5 rows × 22 columns

Data cleaning is done; 22 fields remain for model training. Save the cleaned data and read it back in.

Loandata.to_csv("C:/Users/Jason/Desktop/CleanLoanData.csv", index=False)
Loandata=pd.read_csv("C:/Users/Jason/Desktop/CleanLoanData.csv")

V. Predicting default with logistic regression

5.1 A baseline model

import sklearn.linear_model as lm
model = lm.LogisticRegression()
cols = Loandata.columns
train_cols = cols.drop('loan_condition')
x = Loandata[train_cols]
y = Loandata['loan_condition']
model.fit(x,y)
predict = model.predict(x)
predict[:10]
-->>array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int64)

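0 means the loan was not repaid and 1 means it was. Such a uniformly high predicted repayment rate looks suspicious, so let us look at the predicted probabilities.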

model.predict_proba(x)
-->>
array([[0.03725216, 0.96274784],
       [0.00711186, 0.99288814],
       [0.02119685, 0.97880315],
       ...,
       [0.18928953, 0.81071047],
       [0.04177887, 0.95822113],
       [0.06569009, 0.93430991]])
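The model above is fitted and evaluated on the same rows, and the features sit on very different scales (for example annual_inc versus dti), which can make logistic regression slow to converge. The sketch below shows one common setup, a held-out test split plus standardization in a pipeline. It runs on synthetic data from `make_classification` with a class balance chosen to resemble this dataset; none of it comes from the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary data: about 15% class 0, 85% class 1.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.15, 0.85], random_state=0
)

# Hold out 25% of the rows; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling is fitted on the training rows only, then applied to the test rows.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

test_pred = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```

With the real data, `x` and `y` from the cell above would take the place of `X` and `y`.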

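5.2 How should we judge the model? Consider the business setting: we lend to people who are able to repay and earn a 10% profit on each loan, but when a borrower defaults we lose 100% of that loan. It takes ten correctly approved loans to make up for one wrongly approved defaulter, so accuracy is clearly not a suitable metric for this model. To maximize profit we want to approve as many good borrowers as possible while approving as few defaulters as possible, so we track two metrics: a higher TPR (True Positive Rate, i.e. recall) and a lower FPR (False Positive Rate).

The table below takes a loan of 1,000 as the example:

Actual    Predicted    Profit/Loss    Outcome
0         1            -1000          FP
1         1              100          TP
1         0                0          FN
0         0                0          TN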

fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
-->>
4414
33118
962
1239
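The four counts can also be obtained in one call with `sklearn.metrics.confusion_matrix`. The sketch below uses two short made-up label arrays; with `labels=[0, 1]`, flattening the matrix yields the counts in the order tn, fp, fn, tp.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = repaid, 0 = charged off).
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 1])

# Rows are actual classes, columns are predicted classes, both ordered [0, 1].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)  # share of good loans that were approved
fpr = fp / (fp + tn)  # share of bad loans that were approved
print(tn, fp, fn, tp)
print(tpr, fpr)
```

On the real data the call would be `confusion_matrix(Loandata['loan_condition'], predict, labels=[0, 1])`.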

5.3 Build the confusion matrix from cross-validated predictions

import sklearn.model_selection as sm
model = lm.LogisticRegression()
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
predict[:100]
-->>
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
11    1
12    1
13    1
14    1
15    1
16    1
17    1
18    1
19    1
20    1
21    1
22    1
23    1
24    1
25    1
26    1
27    1
28    1
29    1
     ..
70    1
71    1
72    1
73    1
74    1
75    1
76    1
77    1
78    1
79    1
80    1
81    1
82    0
83    0
84    1
85    1
86    1
87    1
88    1
89    1
90    1
91    1
92    1
93    1
94    1
95    0
96    1
97    1
98    1
99    1
Length: 100, dtype: int64

fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
--->>
4420
33127
953
1233

tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
0.9720363849765258
0.781885724394127

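5.4 Both TPR and FPR are very high, which is clearly not what we want: the model approves almost everyone, including most defaulters. Since the two classes are heavily imbalanced, the next step is to retrain with adjusted class weights, starting with scikit-learn's built-in class_weight='balanced' option.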

model = lm.LogisticRegression(class_weight='balanced')
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)
-->>
1517
26393
4136
7687
0.7744424882629108
0.26835308685653636
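It helps to see what `class_weight='balanced'` actually does. scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that with the class sizes implied by the counts above (fp + tn = 5653 bad loans, tp + fn = 34080 good loans); the label array is rebuilt from those two numbers rather than read from the data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class sizes derived from the confusion counts above.
n_bad, n_good = 5653, 34080
y = np.array([0] * n_bad + [1] * n_good)

# Weights that scikit-learn applies for class_weight='balanced'.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

# The same values from the documented formula.
manual = len(y) / (2 * np.bincount(y))

# How much more a bad loan counts than a good loan during training.
ratio = weights[0] / weights[1]
print(weights, ratio)
```

The ratio works out to roughly 6 to 1, which is close to the hand-picked weights in 5.5 and may be why the two runs give nearly identical results.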

5.5 Custom class weights

penalty = {
    0:6,
    1:1
}
model = lm.LogisticRegression(class_weight=penalty)
predict = sm.cross_val_predict(model,x,y,cv=10)
predict = pd.Series(predict)
fp_ = (predict ==1) & (Loandata['loan_condition']==0)
fp = len(predict[fp_])
print(fp)
tp_ = (predict ==1) & (Loandata['loan_condition']==1)
tp = len(predict[tp_])
print(tp)
tn_ = (predict ==0) & (Loandata['loan_condition']==0)
tn = len(predict[tn_])
print(tn)
fn_ = (predict == 0) & (Loandata["loan_condition"] == 1)
fn = len(predict[fn_])
print(fn)
tpr = tp/float((tp+fn))
fpr = fp/float((fp+tn))
print(tpr)
print(fpr)

1521
26382
4132
7698
0.7741197183098592
0.2690606757473908
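To tie the results back to the profit table in 5.2, the sketch below applies its payoffs (+100 per true positive, -1000 per false positive, 0 otherwise) to the cross-validated counts of the three models. The payoffs are the illustrative figures from that table, so the totals are only a rough comparison, not a real profit estimate.

```python
# Payoffs per approved loan, taken from the table in 5.2.
GAIN_PER_TP = 100     # good loan approved
LOSS_PER_FP = -1000   # bad loan approved


def profit(tp, fp):
    """Total payoff; rejected loans (FN, TN) contribute nothing."""
    return tp * GAIN_PER_TP + fp * LOSS_PER_FP


# Cross-validated counts printed in 5.3, 5.4 and 5.5.
runs = {
    "default": {"tp": 33127, "fp": 4420},
    "balanced": {"tp": 26393, "fp": 1517},
    "custom 6:1": {"tp": 26382, "fp": 1521},
}

profits = {name: profit(**counts) for name, counts in runs.items()}
for name, value in profits.items():
    print(f"{name:>10}: {value:>10,}")
```

Under these assumed payoffs the unweighted model loses money despite its high TPR, while both weighted models come out positive, which is the trade-off the TPR/FPR discussion is pointing at.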
