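Employee Attrition Prediction
Many IT companies struggle to retain staff or suffer high turnover, a problem many HR departments find hard to solve. This article analyzes the employee data in the IBM dataset, identifies which fields contribute to attrition, and finally builds models to predict which employees will leave.
1. Import the required libraries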
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
data_train = pd.read_csv('pfm_train.csv')
# The training set has 1,100 records
data_test = pd.read_csv('pfm_test.csv')
# The test set has 350 records
data = pd.concat([data_train,data_test],axis = 0)
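Merge the training and test sets so that cleaning and preprocessing are applied to both at once.
2. Inspect the data and run exploratory analysis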
data.info()
Int64Index: 1450 entries, 0 to 349
Data columns (total 31 columns):
Age 1450 non-null int64
Attrition 1100 non-null float64
BusinessTravel 1450 non-null object
Department 1450 non-null object
DistanceFromHome 1450 non-null int64
Education 1450 non-null int64
EducationField 1450 non-null object
EmployeeNumber 1450 non-null int64
EnvironmentSatisfaction 1450 non-null int64
Gender 1450 non-null object
JobInvolvement 1450 non-null int64
JobLevel 1450 non-null int64
JobRole 1450 non-null object
JobSatisfaction 1450 non-null int64
MaritalStatus 1450 non-null object
MonthlyIncome 1450 non-null int64
NumCompaniesWorked 1450 non-null int64
Over18 1450 non-null object
OverTime 1450 non-null object
PercentSalaryHike 1450 non-null int64
PerformanceRating 1450 non-null int64
RelationshipSatisfaction 1450 non-null int64
StandardHours 1450 non-null int64
StockOptionLevel 1450 non-null int64
TotalWorkingYears 1450 non-null int64
TrainingTimesLastYear 1450 non-null int64
WorkLifeBalance 1450 non-null int64
YearsAtCompany 1450 non-null int64
YearsInCurrentRole 1450 non-null int64
YearsSinceLastPromotion 1450 non-null int64
YearsWithCurrManager 1450 non-null int64
dtypes: float64(1), int64(22), object(8)
memory usage: 362.5+ KB
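The combined data contains 1,450 records (1,100 training and 350 test) and 31 fields. The main fields are:
(1) Age: employee age;
(2) Attrition: whether the employee has left; 1 means left, 0 means still employed. This is the prediction target, so it is missing for the 350 test records;
(3) BusinessTravel: business travel frequency; Non-Travel means no travel, Travel_Rarely means infrequent travel, Travel_Frequently means frequent travel;
(4) Department: the employee's department: Sales, Research & Development, or Human Resources;
(5) DistanceFromHome: distance between the company and home, from 1 to 29; 1 is the nearest, 29 the farthest;
(6) Education: education level, from 1 to 5; 5 is the highest;
(7) EducationField: the employee's field of study: Life Sciences, Medical, Marketing, Technical Degree, Human Resources, or Other;
(8) EmployeeNumber: employee ID number;
(9) EnvironmentSatisfaction: satisfaction with the work environment, from 1 to 4; 1 is the lowest, 4 the highest;
(10) Gender: employee gender, Male or Female;
(11) JobInvolvement: job involvement, from 1 to 4; 1 is the lowest, 4 the highest;
(12) JobLevel: job level, from 1 to 5; 1 is the lowest, 5 the highest;
(13) JobRole: job role: Sales Executive, Research Scientist, Laboratory Technician, Manufacturing Director, Healthcare Representative, Manager, Sales Representative, Research Director, or Human Resources;
(14) JobSatisfaction: job satisfaction, from 1 to 4; 1 is the lowest, 4 the highest;
(15) MaritalStatus: marital status: Single, Married, or Divorced;
(16) MonthlyIncome: monthly income, ranging from 1009 to 19999;
(17) NumCompaniesWorked: number of companies the employee has worked for;
(18) Over18: whether the employee is over 18;
(19) OverTime: whether the employee works overtime, Yes or No;
(20) PercentSalaryHike: percentage salary increase;
(21) PerformanceRating: performance rating;
(22) RelationshipSatisfaction: relationship satisfaction, from 1 to 4; 1 is the lowest, 4 the highest;
(23) StandardHours: standard working hours;
(24) StockOptionLevel: stock option level;
(25) TotalWorkingYears: total years of work experience;
(26) TrainingTimesLastYear: training in the previous year, from 0 to 6; 0 means no training, 6 means the most training;
(27) WorkLifeBalance: work-life balance, from 1 to 4; 1 is the lowest, 4 the highest;
(28) YearsAtCompany: years at the current company;
(29) YearsInCurrentRole: years in the current role;
(30) YearsSinceLastPromotion: years since the last promotion;
(31) YearsWithCurrManager: years working with the current manager;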
data.describe()
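The values are fairly smooth overall, although some fields are affected by extreme values. Several fields can already be judged uninformative: EmployeeNumber, StandardHours, and Over18.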
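The judgment that some fields carry no information can also be checked mechanically. The sketch below uses a hypothetical helper, `uninformative_columns`, and a small made-up frame rather than the competition data. It flags columns with a single distinct value (like StandardHours and Over18) and columns where every row is distinct (like EmployeeNumber).

```python
import pandas as pd

def uninformative_columns(df):
    """Return (constant, id_like) lists of column names.

    constant: columns with exactly one distinct value.
    id_like:  columns where every row has a different value.
    """
    n_unique = df.nunique()
    constant = n_unique[n_unique == 1].index.tolist()
    id_like = n_unique[n_unique == len(df)].index.tolist()
    return constant, id_like

# Made-up rows that only mimic the shape of the real fields
toy = pd.DataFrame({
    "EmployeeNumber": [101, 102, 103, 104],
    "StandardHours": [80, 80, 80, 80],
    "Over18": ["Y", "Y", "Y", "Y"],
    "Age": [30, 41, 30, 25],
})
constant_cols, id_cols = uninformative_columns(toy)
print(constant_cols, id_cols)
```

The all-distinct check is only a heuristic: a continuous measurement can also be unique per row, so flagged columns still need a quick look before being dropped.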
# Attrition rate for each value of every integer column
for i in data_train.columns:
    if data_train[i].dtype == 'int64':
        print(i + ':')
        print((data_train[data_train['Attrition'] == 1.0][i].value_counts()/data_train[i].value_counts()).sort_values(ascending = False))
        print('-----------------------')
Age:
21 0.714286
19 0.625000
20 0.500000
58 0.428571
22 0.416667
23 0.400000
26 0.322581
28 0.285714
29 0.272727
31 0.270833
33 0.255319
25 0.250000
24 0.222222
30 0.187500
44 0.181818
55 0.176471
32 0.170213
39 0.166667
52 0.166667
41 0.161290
56 0.153846
53 0.153846
34 0.132075
47 0.125000
51 0.125000
46 0.120000
35 0.118644
37 0.108108
49 0.090909
36 0.072727
45 0.066667
40 0.063830
42 0.058824
27 0.052632
38 0.051282
50 0.043478
43 0.040000
18 NaN
48 NaN
54 NaN
57 NaN
59 NaN
60 NaN
Name: Age, dtype: float64
-----------------------
Attrition:
1 1.0
0 NaN
Name: Attrition, dtype: float64
-----------------------
DistanceFromHome:
12 0.428571
24 0.400000
22 0.333333
13 0.294118
27 0.272727
25 0.263158
16 0.230769
29 0.217391
20 0.210526
17 0.200000
23 0.200000
9 0.189655
11 0.181818
3 0.174603
19 0.166667
21 0.166667
10 0.147059
2 0.144654
4 0.142857
18 0.142857
6 0.127660
14 0.125000
5 0.122449
1 0.118056
8 0.114754
26 0.105263
15 0.100000
7 0.090909
28 0.066667
Name: DistanceFromHome, dtype: float64
-----------------------
Education:
1 0.214286
3 0.167053
4 0.156146
2 0.145631
5 0.055556
Name: Education, dtype: float64
-----------------------
EnvironmentSatisfaction:
1 0.246512
2 0.157143
4 0.139053
3 0.133531
Name: EnvironmentSatisfaction, dtype: float64
-----------------------
JobInvolvement:
1 0.380952
2 0.168498
3 0.146747
4 0.106796
Name: JobInvolvement, dtype: float64
-----------------------
JobLevel:
1 0.259709
3 0.127389
2 0.107769
5 0.098039
4 0.037037
Name: JobLevel, dtype: float64
-----------------------
JobSatisfaction:
1 0.242009
2 0.174757
3 0.156923
4 0.108571
Name: JobSatisfaction, dtype: float64
-----------------------
NumCompaniesWorked:
5 0.244444
7 0.232143
6 0.211538
1 0.194872
9 0.189189
4 0.148515
8 0.121951
0 0.119205
2 0.115044
3 0.078947
Name: NumCompaniesWorked, dtype: float64
-----------------------
PercentSalaryHike:
24 0.416667
23 0.277778
22 0.244444
15 0.205479
12 0.179856
11 0.179487
17 0.173913
16 0.166667
13 0.160494
20 0.159091
18 0.154930
21 0.142857
19 0.120690
14 0.073333
25 0.071429
Name: PercentSalaryHike, dtype: float64
-----------------------
PerformanceRating:
4 0.202381
3 0.154506
Name: PerformanceRating, dtype: float64
-----------------------
RelationshipSatisfaction:
1 0.204545
3 0.158824
2 0.152074
4 0.142415
Name: RelationshipSatisfaction, dtype: float64
-----------------------
StandardHours:
80 0.161818
Name: StandardHours, dtype: float64
-----------------------
StockOptionLevel:
0 0.251586
3 0.220339
1 0.085202
2 0.065574
Name: StockOptionLevel, dtype: float64
-----------------------
TotalWorkingYears:
40 1.000000
1 0.522388
2 0.285714
11 0.280000
7 0.261538
34 0.250000
0 0.200000
31 0.200000
3 0.187500
4 0.183673
8 0.175000
5 0.161765
6 0.160920
33 0.142857
12 0.142857
24 0.142857
9 0.130435
13 0.125000
15 0.121212
19 0.117647
10 0.112500
18 0.111111
26 0.111111
22 0.100000
14 0.095238
23 0.058824
17 0.047619
20 0.045455
16 0.037037
21 NaN
25 NaN
27 NaN
28 NaN
29 NaN
30 NaN
32 NaN
35 NaN
36 NaN
37 NaN
38 NaN
Name: TotalWorkingYears, dtype: float64
-----------------------
TrainingTimesLastYear:
0 0.250000
4 0.191489
2 0.181818
3 0.145119
6 0.125000
5 0.123596
1 0.100000
Name: TrainingTimesLastYear, dtype: float64
-----------------------
WorkLifeBalance:
1 0.269841
4 0.194175
2 0.175781
3 0.141593
Name: WorkLifeBalance, dtype: float64
-----------------------
YearsAtCompany:
1 0.358209
32 0.333333
31 0.333333
0 0.266667
24 0.250000
2 0.212766
33 0.200000
6 0.153846
10 0.152174
4 0.150000
3 0.144330
13 0.133333
5 0.132450
7 0.128571
16 0.125000
19 0.125000
9 0.109375
11 0.105263
8 0.101695
21 0.100000
22 0.076923
12 NaN
14 NaN
15 NaN
17 NaN
18 NaN
20 NaN
25 NaN
26 NaN
27 NaN
29 NaN
30 NaN
34 NaN
36 NaN
37 NaN
Name: YearsAtCompany, dtype: float64
-----------------------
YearsInCurrentRole:
15 0.333333
0 0.288043
1 0.219512
2 0.189091
4 0.168675
7 0.133333
14 0.125000
12 0.125000
3 0.121495
9 0.080000
8 0.072464
10 0.066667
6 0.038462
5 NaN
11 NaN
13 NaN
16 NaN
17 NaN
18 NaN
Name: YearsInCurrentRole, dtype: float64
-----------------------
YearsSinceLastPromotion:
6 0.250000
7 0.206897
13 0.200000
0 0.191011
3 0.175000
2 0.162602
1 0.149606
9 0.142857
15 0.090909
5 0.083333
4 0.069767
8 NaN
10 NaN
11 NaN
12 NaN
14 NaN
Name: YearsSinceLastPromotion, dtype: float64
-----------------------
YearsWithCurrManager:
0 0.319797
14 0.250000
1 0.169492
4 0.150685
7 0.147239
2 0.138462
3 0.134615
9 0.130435
6 0.100000
5 0.100000
8 0.083333
10 0.058824
11 0.055556
12 NaN
13 NaN
15 NaN
16 NaN
17 NaN
Name: YearsWithCurrManager, dtype: float64
-----------------------
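Compute the attrition rate for each value of every numeric field to get a rough sense of how it relates to attrition. A NaN means that no employee with that value left in the training set.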
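Because Attrition is coded as 0 and 1, the same per-value rate can be obtained as a group mean, which reports 0.0 instead of NaN for values with no leavers. This is a minimal sketch on made-up rows; `attrition_rate_by` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def attrition_rate_by(df, column, target="Attrition"):
    """Share of rows with target == 1 for each value of `column`, highest first."""
    return df.groupby(column)[target].mean().sort_values(ascending=False)

# Made-up rows: level 1 has 3 leavers out of 4, level 2 has 1 of 2, level 3 has none
toy = pd.DataFrame({
    "JobLevel":  [1, 1, 1, 1, 2, 2, 3],
    "Attrition": [1, 1, 1, 0, 0, 1, 0],
})
rates = attrition_rate_by(toy, "JobLevel")
print(rates)
```

Values backed by only a handful of employees still give noisy rates, so it helps to read them next to the group sizes.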
# Attrition rate for each category of every object (string) column
for i in data_train.columns:
    if data_train[i].dtype == 'O':
        print(i + ':')
        print((data_train[data_train['Attrition'] == 1.0][i].value_counts()/data_train[i].value_counts()).sort_values(ascending=False))
        print('-----------------------')
BusinessTravel:
Travel_Frequently 0.224390
Travel_Rarely 0.156290
Non-Travel 0.083333
Name: BusinessTravel, dtype: float64
-----------------------
Department:
Human Resources 0.214286
Sales 0.202417
Research & Development 0.140303
Name: Department, dtype: float64
-----------------------
EducationField:
Human Resources 0.315789
Technical Degree 0.239130
Marketing 0.212598
Life Sciences 0.151515
Medical 0.136499
Other 0.111111
Name: EducationField, dtype: float64
-----------------------
Gender:
Male 0.166922
Female 0.154362
Name: Gender, dtype: float64
-----------------------
JobRole:
Sales Representative 0.403509
Human Resources 0.272727
Laboratory Technician 0.209756
Research Scientist 0.185520
Sales Executive 0.170040
Manufacturing Director 0.079208
Manager 0.062500
Healthcare Representative 0.050000
Research Director 0.035714
Name: JobRole, dtype: float64
-----------------------
MaritalStatus:
Single 0.259669
Married 0.124000
Divorced 0.092437
Name: MaritalStatus, dtype: float64
-----------------------
Over18:
Y 0.161818
Name: Over18, dtype: float64
-----------------------
OverTime:
Yes 0.320261
No 0.100756
Name: OverTime, dtype: float64
-----------------------
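Education field and business travel frequency both have a clear effect, single employees are more likely to leave, and among job roles Sales Representatives have the highest turnover.
3. Data processing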
plt.figure(figsize=(14,5))
sns.barplot(x='Age', y='Attrition', data = data , palette = 'Set2')
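Common sense says age is an important factor in attrition: older employees tend to be more stable and less inclined to leave. The bar plot shows high turnover among employees under 24 and at age 58, a lower and fairly flat attrition rate across the ages in between, and a few specific ages with no leavers at all.
def resetAge(name):
    # High-turnover ages: 19 to 23, and 58
    if (18 < name < 24) or (name == 58):
        return 1
    # Ages with no leavers in the training set
    elif name in (18, 48, 54, 57) or name > 58:
        return 0
    else:
        return 2
Define a function that labels each age group, turning the many distinct ages into a small set of discrete categories.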
facet = sns.FacetGrid(data,hue = 'Attrition' ,aspect=3)
facet.map(sns.kdeplot,'MonthlyIncome',shade = True)
facet.set(xlim=(0,data['MonthlyIncome'].max()))
facet.add_legend()
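Plot the kernel density estimate (KDE) of monthly income for leavers and stayers. Leavers are most concentrated between 0 and 7000.
def resetSalary(s):
    # Chained comparisons avoid the precedence trap of mixing & with < and >=
    if 0 < s < 3725:
        return 0
    elif 3725 <= s < 11250:
        return 1
    else:
        return 2
Define a function that bins salary into a few tiers, which makes classification easier.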
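The same binning can be written without a hand-rolled function by using `pd.cut`. The sketch below assumes the tier edges 3725 and 11250 from the function above and strictly positive incomes (the field ranges from 1009 to 19999); `bin_income` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def bin_income(income):
    """Map monthly income to tier 0 (< 3725), 1 (3725 to < 11250) or 2 (>= 11250)."""
    edges = [0, 3725, 11250, np.inf]
    # right=False makes each bin closed on the left: [0, 3725), [3725, 11250), ...
    return pd.cut(income, bins=edges, labels=[0, 1, 2], right=False).astype(int)

income = pd.Series([1009, 3724, 3725, 11249, 11250, 19999])
print(bin_income(income).tolist())
```

Unlike an element-wise `apply`, this runs on the whole column at once and keeps the bin edges in one visible list.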
plt.figure(figsize=(14,5))
sns.barplot(x='PercentSalaryHike', y='Attrition', data = data , palette = 'Set2')
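def resetPerHike(s):
    # Hikes of 22 to 24 percent show the highest attrition
    if 22 <= s < 25:
        return 0
    elif (11 <= s < 14) or (14 < s < 22):
        return 1
    # 14 and 25 percent show the lowest attrition
    else:
        return 2
Apply the functions to their respective columns.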
data['PercentSalaryHike'] = data['PercentSalaryHike'].apply(resetPerHike)
data['MonthlyIncome'] = data['MonthlyIncome'].apply(resetSalary)
data['Age'] = data['Age'].apply(resetAge)
cata_result = pd.DataFrame()
for i in data.columns:
    if data[i].dtype == 'O':
        # One-hot encode the column, prefixing each new column with the field name
        cata = pd.get_dummies(data[i],prefix=i)
        cata_result = pd.concat([cata_result,cata],axis=1)
Convert every field of dtype object to one-hot encoding for modeling.
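For reference, `pd.get_dummies` can also be called on a whole DataFrame, in which case it encodes the string and categorical columns and drops the originals in one step. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows with one numeric and two string columns
toy = pd.DataFrame({
    "Age": [30, 41, 25],
    "OverTime": ["Yes", "No", "Yes"],
    "Gender": ["Male", "Female", "Male"],
})
# Numeric columns pass through; each string column becomes one column per category
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

The new columns are named `<field>_<category>`, which is why a constant field such as Over18 ends up as a single column, Over18_Y.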
for i in data.columns:
    if data[i].dtype == 'O':
        data = data.drop(i,axis=1)
data = pd.concat([data,cata_result],axis=1)
data = data.drop(['StandardHours','Over18_Y','EmployeeNumber'],axis =1)
Drop the uninformative columns.
4. Modeling and prediction
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
sep = 1100
X = data.iloc[0:sep,:].drop('Attrition',axis = 1)
y = data.iloc[0:sep,:]['Attrition']
data_test_use = data.iloc[sep:,:]
data_test_use1 = data_test_use.drop('Attrition',axis=1)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=2)
model = {}
model['LR'] = LogisticRegression()
model['svm'] = svm.SVC()
# Set the random forest hyperparameters
model['RMF'] = RandomForestClassifier(random_state = 10, warm_start = True,
n_estimators = 26,
max_depth = 6,
max_features = 'sqrt')
model['CART'] = DecisionTreeClassifier()
model['KNN'] = KNeighborsClassifier()
for i in model:
    model[i].fit(X,y)
    score = cross_val_score(model[i],X,y,cv=5,scoring='accuracy')
    print("%s:%.3f(%.3f)"%(i,score.mean(),score.std()))
LR:0.875(0.014)
svm:0.843(0.005)
RMF:0.852(0.006)
CART:0.781(0.019)
KNN:0.829(0.010)
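Import the modeling packages: random forest, decision tree, KNN, SVM classification, and logistic regression.
Split the data back into training and test sets and evaluate the models with cross-validation.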
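The overall attrition rate in the training set is about 16% (the 0.161818 shown for StandardHours and Over18), so always predicting "stays" would already reach roughly 84% accuracy. As a complement to the accuracy scores, the sketch below runs stratified 5-fold cross-validation with ROC AUC on synthetic data of similar imbalance. The data from `make_classification`, the `max_iter` value, and the `_demo` names are assumptions for illustration, not the author's setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with roughly 16% positives, mimicking the class imbalance
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.84, 0.16], random_state=0
)

# Stratified folds keep the positive share similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
)
print("AUC: %.3f (%.3f)" % (scores.mean(), scores.std()))
```

AUC measures how well leavers are ranked above stayers, so it does not reward a model for simply predicting the majority class.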
result = (model['LR'].predict(data_test_use1)).astype('int')
data_predict = pd.DataFrame()
data_predict['result'] = result
data_predict.to_csv('sample.csv',index=None)
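Write the result file. Submitted to the DC data competition, it scored 0.899, ranking in the top 1%.
P.S. I also tried xgboost once, but the results were not good, probably because I had not finished tuning the parameters.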
import xgboost as xgb
xgb_train = xgb.DMatrix(data=X,label=y)
Trate = 0.25
# Parameter settings
params = {'booster':'gbtree',
'eta':0.1,
'max_delta_step':0,
'subsample':0.9,
'colsample_bytree':0.9,
'base_score':Trate,
'objective':'binary:logistic',
'lambda':5,
'alpha':8,
'seed':100
}
# Use AUC as the evaluation metric
params['eval_metric'] = 'auc'
xgb_model = xgb.train(params,xgb_train,num_boost_round=200,maximize = True,verbose_eval=True)
# Predict with the trained xgboost model
res = xgb_model.predict(xgb.DMatrix(data_test_use1))
# The predictions are probabilities between 0 and 1, so convert them to class labels
for i in range(len(res)):
    if res[i] < 0.5:
        res[i] = 0
    else:
        res[i] = 1
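The thresholding loop above can also be written as a single vectorized NumPy expression. A minimal sketch on made-up probabilities, using the same 0.5 cutoff:

```python
import numpy as np

# Made-up predicted probabilities standing in for the xgboost output
probs = np.array([0.12, 0.5, 0.73, 0.49])

# Values below 0.5 become 0, everything else becomes 1
labels = (probs >= 0.5).astype(int)
print(labels.tolist())
```

This returns an integer array instead of overwriting the float probabilities in place, which keeps the raw scores available if a different cutoff is tried later.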