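I recently looked at a fairly popular HR dataset on Kaggle. It uses a handful of attributes to analyze why good employees leave, and then predicts each person's probability of leaving. I am skeptical of this analysis: I think data scarcity misleads us into believing that these few indicators make for a workable analysis. This post first explores the data in good faith, then raises my questions.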
** Data **
** Tools **
- R: ggplot2, corrplot, dplyr, Hmisc, caret, rpart, e1071, caTools
** Data exploration **
First, let's look at the content and completeness of the data:
10 Variables 14999 Observations
----------------------------------------------------------------------------------------------------------
satisfaction_level
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
14999 0 92 1 0.6128 0.2823 0.11 0.21 0.44 0.64 0.82
.90 .95
0.92 0.96
lowest : 0.09 0.10 0.11 0.12 0.13, highest: 0.96 0.97 0.98 0.99 1.00
----------------------------------------------------------------------------------------------------------
last_evaluation
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
14999 0 65 1 0.7161 0.1973 0.46 0.49 0.56 0.72 0.87
.90 .95
0.95 0.98
lowest : 0.36 0.37 0.38 0.39 0.40, highest: 0.96 0.97 0.98 0.99 1.00
----------------------------------------------------------------------------------------------------------
number_project
n missing distinct Info Mean Gmd
14999 0 6 0.945 3.803 1.367
lowest : 2 3 4 5 6, highest: 3 4 5 6 7
2 (2388, 0.159), 3 (4055, 0.270), 4 (4365, 0.291), 5 (2761, 0.184), 6 (1174, 0.078), 7 (256, 0.017)
----------------------------------------------------------------------------------------------------------
average_montly_hours
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75
14999 0 215 1 201.1 57.48 130 137 156 200 245
.90 .95
267 275
lowest : 96 97 98 99 100, highest: 306 307 308 309 310
----------------------------------------------------------------------------------------------------------
time_spend_company
n missing distinct Info Mean Gmd
14999 0 8 0.905 3.498 1.43
lowest : 2 3 4 5 6, highest: 5 6 7 8 10
2 (3244, 0.216), 3 (6443, 0.430), 4 (2557, 0.170), 5 (1473, 0.098), 6 (718, 0.048), 7 (188, 0.013), 8
(162, 0.011), 10 (214, 0.014)
----------------------------------------------------------------------------------------------------------
Work_accident
n missing distinct Info Sum Mean Gmd
14999 0 2 0.371 2169 0.1446 0.2474
----------------------------------------------------------------------------------------------------------
left
n missing distinct Info Sum Mean Gmd
14999 0 2 0.544 3571 0.2381 0.3628
----------------------------------------------------------------------------------------------------------
promotion_last_5years
n missing distinct Info Sum Mean Gmd
14999 0 2 0.062 319 0.02127 0.04163
----------------------------------------------------------------------------------------------------------
sales
n missing distinct
14999 0 10
lowest : accounting hr IT management marketing
highest: product_mng RandD sales support technical
Value accounting hr IT management marketing product_mng RandD
Frequency 767 739 1227 630 858 902 787
Proportion 0.051 0.049 0.082 0.042 0.057 0.060 0.052
Value sales support technical
Frequency 4140 2229 2720
Proportion 0.276 0.149 0.181
----------------------------------------------------------------------------------------------------------
salary
n missing distinct
14999 0 3
high (1237, 0.082), low (7316, 0.488), medium (6446, 0.430)
----------------------------------------------------------------------------------------------------------
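All 10 variables are in good shape (no missing values), and the means all look fairly "average".
Next, look at how the variables correlate with one another: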
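A minimal sketch of how such a correlation matrix could be computed with base R. The data frame `hr_toy` and its values are invented for illustration; with the real data one would pass the numeric columns of the Kaggle table instead, and `corrplot::corrplot(cor_mat)` could then draw the result.

```r
# Toy stand-in for the HR table (values are invented, not from the dataset).
hr_toy <- data.frame(
  satisfaction_level   = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.92, 0.58),
  number_project       = c(2, 5, 7, 5, 2, 2, 4, 3),
  average_montly_hours = c(157, 262, 272, 223, 159, 153, 180, 170),
  left                 = c(1, 1, 1, 1, 1, 1, 0, 0)
)

# Pairwise Pearson correlations between all numeric columns.
numeric_cols <- vapply(hr_toy, is.numeric, logical(1))
cor_mat <- cor(hr_toy[, numeric_cols])

print(round(cor_mat, 2))
```

Categorical columns such as `sales` and `salary` would have to be dropped or encoded first, since `cor()` only accepts numeric input.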
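** Preliminary analysis **
The author (Ludovic Benistant) wanted to know exactly who had left, and drew bar charts of the leavers as the basis for his subsequent analysis:
** Starting the inference **
The author focuses on "good employees", defined as Last Evaluation greater than or equal to 0.7, or Time Spend Company greater than or equal to 4, or Number Of Project greater than 5:
library(dplyr)
hr_good_leaving_people <- hr_leaving_people %>%
  filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
nrow(hr_good_leaving_people)
He then ran a correlation analysis of the attributes for these "good employees":
The correlation of left with number_project, average_montly_hours, and time_spend_company appears to rise considerably.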
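** Model inference **
The first step is to choose the data and the training/test sets. With the "good employees" as the main subject, the author builds decision tree, naive Bayes, and logistic regression models and concludes that accuracy reaches 95%, so logistic regression can be used to predict each person's probability of leaving and to target them for coaching.
Resampling method (cross-validation):
# repeats only takes effect with method = "repeatedcv"; with method = "cv" it is ignored
train_control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)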
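Decision tree prediction:
rpartmodel <- train(left ~ ., data = hr_model, trControl = train_control, method = "rpart")
# make predictions (on the same data the model was trained on)
predictions <- predict(rpartmodel, hr_model)
Naive Bayes prediction:
e1071model2 <- train(left ~ ., data = hr_model, trControl = train_control, method = "nb")
# make predictions (on the same data the model was trained on)
predictions <- predict(e1071model2, hr_model)
Logistic regression (boosted, via caret's LogitBoost method):
gmlmodel <- train(left ~ ., data = hr_model, trControl = train_control, method = "LogitBoost")
# make predictions (on the same data the model was trained on)
predictions <- predict(gmlmodel, hr_model)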
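The predictions above are made on `hr_model`, the same rows the models were fitted on, so any accuracy computed from them is likely to be optimistic. A minimal sketch of a held-out check follows. It uses base R `glm` (plain logistic regression, swapped in here for the caret models) on invented data; the column values, coefficients, and the 70/30 split are all assumptions for illustration, not the author's setup.

```r
set.seed(42)  # make the synthetic example reproducible

# Synthetic stand-in for hr_model (invented data, not the Kaggle table).
n <- 500
toy <- data.frame(
  number_project       = sample(2:7, n, replace = TRUE),
  average_montly_hours = round(runif(n, 96, 310))
)
# Invented rule: more projects and more hours raise the chance of leaving.
p_left   <- plogis(-6 + 0.6 * toy$number_project + 0.015 * toy$average_montly_hours)
toy$left <- rbinom(n, 1, p_left)

# Hold out 30% of the rows; the model never sees them during fitting.
test_idx <- sample(n, size = round(0.3 * n))
train    <- toy[-test_idx, ]
test     <- toy[test_idx, ]

fit <- glm(left ~ ., data = train, family = binomial)

# Accuracy at a 0.5 cutoff for a given model and data set.
accuracy <- function(model, data) {
  pred <- as.integer(predict(model, newdata = data, type = "response") > 0.5)
  mean(pred == data$left)
}
acc_train <- accuracy(fit, train)
acc_test  <- accuracy(fit, test)
print(c(train = acc_train, test = acc_test))
```

Comparing the two numbers gives a rough sense of how much of a reported accuracy comes from scoring rows the model has already seen.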
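Scatter plot of the predicted probability of leaving against Performance (last_evaluation):
** Questions **
- First, a fallacy in the outcome-based reasoning: if these good employees are dissatisfied because they spend too much time at the company and carry too many projects, then reducing their workload would, by the author's own definition, turn them from good employees into bad ones. That is a paradox.
- Next, consider two concrete incentives, promotion and salary. I really do not understand why the author did not start from these fundamentals.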
library(dplyr)
library(data.table)

hr_promo <- data.table(hr %>% filter(promotion_last_5years > 0))
hr_high_salary <- data.table(hr %>% filter(salary == "high"))
# the trailing comma selects rows, so this also works when hr is a plain data.frame
p_quit <- nrow(hr[hr$left == 1, ]) / nrow(hr)
p_quit_high_salary <- nrow(hr_high_salary[hr_high_salary$left == 1, ]) / nrow(hr_high_salary)
p_quit_got_promo <- nrow(hr_promo[hr_promo$left == 1, ]) / nrow(hr_promo)
p_promo <- nrow(hr_promo) / nrow(hr)
p_high_salary <- nrow(hr_high_salary) / nrow(hr)
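As a cross-check that does not need the data file, the three overall shares can be reproduced from the counts printed in the `describe()` output earlier in the post. The variable names `n_total`, `n_left`, `n_promo`, and `n_high_salary` are introduced here only for illustration.

```r
# Counts copied from the describe() output earlier in the post.
n_total       <- 14999
n_left        <- 3571   # Sum of `left`
n_promo       <- 319    # Sum of `promotion_last_5years`
n_high_salary <- 1237   # rows with salary == "high"

p_quit        <- n_left / n_total
p_promo       <- n_promo / n_total
p_high_salary <- n_high_salary / n_total

# Shares in percent, rounded to two decimals.
print(round(100 * c(quit = p_quit, promo = p_promo, high_salary = p_high_salary), 2))
```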
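This shows an overall attrition rate of 23.8%. About 8% of employees are in the High Salary band and 2.13% received a Promotion, and attrition within both groups is quite low: 6.62% (High Salary) and 5.95% (Got Promotion). Does that mean these people are all good employees? Next, check whether these two groups meet the criteria for a good employee: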
mean(hr_high_salary$last_evaluation)
mean(hr_high_salary$number_project)
mean(hr_high_salary$time_spend_company)
mean(hr_promo$last_evaluation)
mean(hr_promo$number_project)
mean(hr_promo$time_spend_company)
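Clearly, Last Evaluation >= 0.7 is the only criterion that both groups meet. I will call the employees who meet it the high-performance group. That leaves me with some questions:
- What is the evaluation of the high-performance group based on? What characterizes the group?
- Why do members of the high-performance group actually leave, and how strongly does that relate to the other attributes?
The author's goal is to retain the high-performance group, so I will explore that group first and then the non-high-performance group for comparison.
** Exploring the high- and low-performance groups **
First convert all the data into numeric values that can be analyzed, then split it into six groups:
- High-performance group
- Low-performance group
- High performers who left
- High performers who stayed
- Low performers who left
- Low performers who stayed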
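The split above could be sketched in base R as follows. The rows in `hr_toy` are invented; the names `hr_high_perf2`, `hr_low_perf2`, `hr_high_perf2_quit`, and `hr_high_perf2_nquit` appear later in the post, while the two `hr_low_perf2_*` names are hypothetical. The numeric salary encoding (low = 1, medium = 2, high = 3) is an assumption, though it is consistent with the `salary2` mean of about 1.59 reported in the summaries.

```r
# Toy rows standing in for the HR table (invented values).
hr_toy <- data.frame(
  last_evaluation = c(0.53, 0.86, 0.88, 0.52, 0.91, 0.60),
  left            = c(1, 1, 0, 0, 0, 1),
  salary          = c("low", "medium", "high", "low", "medium", "low"),
  stringsAsFactors = FALSE
)

# Encode the salary band as a number (assumed mapping: low = 1, medium = 2, high = 3).
salary_levels  <- c(low = 1, medium = 2, high = 3)
hr_toy$salary2 <- unname(salary_levels[hr_toy$salary])

# Split on the evaluation threshold, then on whether the person left.
hr_high_perf2       <- hr_toy[hr_toy$last_evaluation >= 0.7, ]
hr_low_perf2        <- hr_toy[hr_toy$last_evaluation <  0.7, ]
hr_high_perf2_quit  <- hr_high_perf2[hr_high_perf2$left == 1, ]
hr_high_perf2_nquit <- hr_high_perf2[hr_high_perf2$left == 0, ]
hr_low_perf2_quit   <- hr_low_perf2[hr_low_perf2$left == 1, ]   # hypothetical name
hr_low_perf2_nquit  <- hr_low_perf2[hr_low_perf2$left == 0, ]   # hypothetical name

# Row counts of the two top-level groups.
print(sapply(list(high = hr_high_perf2, low = hr_low_perf2), nrow))
```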
Start with a few simple summary statistics to see what characterizes the high-performance group:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company
Min. :0.0900 Min. :0.7000 Min. :2.000 Min. : 96.0 Min. : 2.000
1st Qu.:0.5100 1st Qu.:0.7800 1st Qu.:3.000 1st Qu.:175.5 1st Qu.: 3.000
Median :0.7000 Median :0.8600 Median :4.000 Median :223.0 Median : 3.000
Mean :0.6266 Mean :0.8562 Mean :4.172 Mean :215.5 Mean : 3.666
3rd Qu.:0.8300 3rd Qu.:0.9300 3rd Qu.:5.000 3rd Qu.:255.0 3rd Qu.: 4.000
Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0 Max. :10.000
Work_accident left promotion_last_5years salary2
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :1.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
Median :0.0000 Median :0.0000 Median :0.00000 Median :2.000
Mean :0.1439 Mean :0.2386 Mean :0.02133 Mean :1.592
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:2.000
Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :3.000
Then compare with the non-high-performance group:
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company
Min. :0.090 Min. :0.3600 Min. :2.000 Min. : 96.0 Min. : 2.000
1st Qu.:0.420 1st Qu.:0.5000 1st Qu.:2.000 1st Qu.:146.0 1st Qu.: 3.000
Median :0.580 Median :0.5500 Median :3.000 Median :173.0 Median : 3.000
Mean :0.597 Mean :0.5553 Mean :3.379 Mean :184.4 Mean : 3.306
3rd Qu.:0.780 3rd Qu.:0.6200 3rd Qu.:4.000 3rd Qu.:223.0 3rd Qu.: 3.000
Max. :1.000 Max. :0.6900 Max. :7.000 Max. :310.0 Max. :10.000
Work_accident left promotion_last_5years salary2
Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :1.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
Median :0.0000 Median :0.0000 Median :0.00000 Median :2.000
Mean :0.1455 Mean :0.2375 Mean :0.02119 Mean :1.597
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:2.000
Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :3.000
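Because the attributes are limited, we first select the ones that might plausibly be used in the evaluation: number_project, average_montly_hours, time_spend_company, and Work_accident, and run a t-test on each to find the attributes that may differ: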
data: hr_high_perf2$number_project and hr_low_perf2$number_project
t = 41.529, df = 14740, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.7558328 0.8307159
sample estimates:
mean of x mean of y
4.172427 3.379152
data: hr_high_perf2$average_montly_hours and hr_low_perf2$average_montly_hours
t = 40.08, df = 14781, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
29.58806 32.63087
sample estimates:
mean of x mean of y
215.5359 184.4264
data: hr_high_perf2$time_spend_company and hr_low_perf2$time_spend_company
t = 15.299, df = 14968, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.3143728 0.4067681
sample estimates:
mean of x mean of y
3.666126 3.305556
data: hr_high_perf2$Work_accident and hr_low_perf2$Work_accident
t = -0.2813, df = 14699, p-value = 0.7785
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.012909190 0.009668988
sample estimates:
mean of x mean of y
0.1438553 0.1454754
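Only Work Accident shows no significant difference (p = 0.7785), so among the tested attributes the difference between the high- and non-high-performance groups comes from these three: number_project, average_montly_hours, and time_spend_company.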
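For reference, a two-sample comparison of this kind can be run with `t.test()` from base R. The degrees of freedom in the output above are below n - 2 = 14997, which is what Welch's unequal-variance test (the `t.test()` default) would produce, but that is an inference rather than something the post states. The two samples below are invented.

```r
set.seed(1)  # reproducible invented samples

# Invented project counts for two groups (not the Kaggle data).
high_perf_projects <- sample(2:7, 200, replace = TRUE, prob = c(1, 2, 3, 3, 2, 1))
low_perf_projects  <- sample(2:7, 200, replace = TRUE, prob = c(3, 3, 2, 1, 1, 1))

# t.test() defaults to var.equal = FALSE, i.e. Welch's unequal-variance test.
res <- t.test(high_perf_projects, low_perf_projects)

print(res$estimate)   # the two group means
print(res$conf.int)   # 95% confidence interval for the difference in means
print(res$p.value)
```

With roughly 15,000 rows, even small differences in means come out as highly significant, so the size of the difference in the confidence interval is worth reading alongside the p-value.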
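** Exploring attrition in the high-performance group **
Here I ran t-tests directly on each attribute between the high performers who left and those who stayed, and every attribute shows a significant difference. Which difference is the largest? If our strategy is to close the gap between leavers and stayers, we can measure each attribute's gap as a Z score, using the stayers' mean and standard deviation: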
(mean(hr_high_perf2_quit$satisfaction_level)-
mean(hr_high_perf2_nquit$satisfaction_level))/sd(hr_high_perf2_nquit$satisfaction_level)
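Repeating the same calculation for every attribute could look like the sketch below. The helper `z_gap` and the two toy data frames are hypothetical, and ranking by absolute value is an assumption, since the post does not say how the sign of a gap is treated.

```r
# Standardized gap: (mean of leavers - mean of stayers) / sd of stayers.
z_gap <- function(quit, stay) (mean(quit) - mean(stay)) / sd(stay)

# Invented values for two attributes (not the Kaggle data).
quit_toy <- data.frame(
  number_project     = c(6, 5, 7, 6),
  satisfaction_level = c(0.10, 0.80, 0.11, 0.75)
)
stay_toy <- data.frame(
  number_project     = c(3, 4, 4, 3, 5),
  satisfaction_level = c(0.70, 0.60, 0.90, 0.50, 0.80)
)

# One gap per attribute, named after the column it came from.
gaps <- sapply(names(stay_toy), function(col) z_gap(quit_toy[[col]], stay_toy[[col]]))

# Rank by absolute size (an assumption about how the ordering was built).
print(sort(abs(gaps), decreasing = TRUE))
```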
number_project > average_montly_hours > satisfaction_level > time_spend_company > last_evaluation > salary > Work_accident > promotion_last_5years
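What I find very questionable is that salary ranks so low. Because we only have salary bands and not the raw figures, we cannot compare actual amounts, only counts. The chart shows that the low salary band accounts for roughly 60% of the high performers who left:
Turning to the top-ranked difference, the high performers who stayed average fewer than 4 projects fairly evenly across salary levels, whereas those who left are above that average in every band:
Counting the people with many projects (>= 4) by salary band: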
> data.table(hr_high_perf2_quit)[number_project>=4, .N, by = list(salary)]
salary N
1: medium 699
2: low 1110
3: high 28
> data.table(hr_high_perf2_nquit)[number_project>=4, .N, by = list(salary)]
salary N
1: low 1683
2: medium 1623
3: high 342
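The share of each salary band within the two groups can be checked directly from the counts printed above. The vector names are introduced here only for illustration.

```r
# Counts copied from the two tables above (number_project >= 4).
quit_counts  <- c(low = 1110, medium = 699,  high = 28)
nquit_counts <- c(low = 1683, medium = 1623, high = 342)

# Share of each salary band within each group, in percent.
quit_share  <- round(100 * prop.table(quit_counts), 1)
nquit_share <- round(100 * prop.table(nquit_counts), 1)

print(rbind(left = quit_share, stayed = nquit_share))
```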
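About 60% of those who left (1110 of 1837) are in the low salary band, which suggests they left out of dissatisfaction with unequal pay for equal work, while the stayers are spread more evenly, with low and medium each in the mid-40% range. Now check another possible cause, Promotion: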
> data.table(hr_high_perf2_quit)[number_project>=4, .N, by = list(promotion_last_5years)]
promotion_last_5years N
1: 0 1833
2: 1 4
> data.table(hr_high_perf2_nquit)[number_project>=4, .N, by = list(promotion_last_5years)]
promotion_last_5years N
1: 0 3537
2: 1 111
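In both groups the unpromoted dominate, at 99.8% of leavers and 97.0% of stayers, but a test of the means shows that the two groups do differ, so Promotion is also a substantial factor for those who left. The same analysis can be applied to each attribute down the ranking, for example Average Monthly Hours >= 202, to get a rough picture of the causes.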
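One way to test whether the promoted share differs between the two groups is a two-sample test of proportions on the counts above. This is a sketch using base R `prop.test()`; the post does not say which test was actually used, and the vector names are hypothetical.

```r
# Promotion counts copied from the two tables above (number_project >= 4).
promoted <- c(left = 4,    stayed = 111)
total    <- c(left = 1837, stayed = 3648)

# Share promoted in each group, in percent.
print(round(100 * promoted / total, 2))

# Two-sample test of equal proportions (chi-squared, with continuity correction).
res <- prop.test(x = promoted, n = total)
print(res$p.value)
```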
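** Data scarcity **
Promotions and high salaries are very rare at this company. Suppose we want to keep costs under control and instead close the concrete gaps between the two groups, such as number_project, average_montly_hours, and time_spend_company. Would that effectively reduce attrition? These factors also affect Performance, so could we end up in an Evaluation loop that causes even more attrition?
Having come this far, I keep thinking about a data scarcity problem: are we going in circles within a limited set of attributes while overlooking other, external attributes that could be collected and analyzed and that matter more?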