电信客户流失数据分析

分析背景

某电信公司市场部为了预防用户流失，收集了已经打好流失标签的用户数据。现在要对流失用户情况进行分析，找出哪些用户可能会流失？

理解数据

采集数据

本数据集描述了电信用户是否流失以及其相关信息，共包含7043条数据，共21个字段，分别介绍如下：

customerID ：用户ID。
gender：性别。（Female & Male）
SeniorCitizen ：老年用户（1表示是，0表示不是）
Partner ：伴侣用户（Yes or No）
Dependents ：亲属用户（Yes or No）
tenure ：在网时长（0-72月）
PhoneService ：是否开通电话服务业务（Yes or No）
MultipleLines：是否开通了多线业务（Yes 、No or No phoneservice 三种）
InternetService：是否开通互联网服务（No, DSL数字网络，fiber optic光纤网络三种）
OnlineSecurity：是否开通网络安全服务（Yes，No，No internetserive 三种）
OnlineBackup：是否开通在线备份业务（Yes，No，No internetserive 三种）
DeviceProtection：是否开通了设备保护业务（Yes，No，No internetserive 三种）
TechSupport：是否开通了技术支持服务（Yes，No，No internetserive 三种）
StreamingTV：是否开通网络电视（Yes，No，No internetserive 三种）
StreamingMovies：是否开通网络电影（Yes，No，No internetserive 三种）
Contract：签订合同方式（按月，一年，两年）
PaperlessBilling：是否开通电子账单（Yes or No）
PaymentMethod：付款方式（bank transfer，credit card，electronic check，mailed check）
MonthlyCharges：月费用
TotalCharges：总费用
Churn：该用户是否流失（Yes or No）

导入数据

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv(r"D:\PycharmProjects\WA_Fn-UseC_-Telco-Customer-Churn.csv")

查看数据

df.head()

	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	...	Yes	No	No	No	One year	No	Mailed check	56.95	1889.5	No
2	3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	...	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	...	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	...	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes

5 rows × 21 columns

print ("行: " ,df.shape[0])
print ("列: " ,df.shape[1])
print ("特征: \n" ,df.columns.tolist())
print ("缺失值：", df.isnull().sum().values.sum())
print ("重复值：", df.customerID.duplicated().sum())
print ("唯一值:\n",df.nunique())

行:  7043
列:  21
特征: 
 ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
缺失值： 0
重复值： 0
唯一值:
 customerID          7043
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                73
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
dtype: int64

df.describe()

	SeniorCitizen	tenure	MonthlyCharges
count	7043.000000	7043.000000	7043.000000
mean	0.162147	32.371149	64.761692
std	0.368612	24.559481	30.090047
min	0.000000	0.000000	18.250000
25%	0.000000	9.000000	35.500000
50%	0.000000	29.000000	70.350000
75%	0.000000	55.000000	89.850000
max	1.000000	72.000000	118.750000

数据清洗、数据一致化

df.dtypes#查看数据类型

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

replace_list=['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
for i in replace_list:
    df[i]=df[i].str.replace('No internet service','No')
df['InternetService']=df['InternetService'].str.replace('Fiber optic','Yes')
df['InternetService']=df['InternetService'].str.replace('DSL','Yes')
df['MultipleLines']=df['MultipleLines'].str.replace('No phone service','No')

df.SeniorCitizen=df.SeniorCitizen.replace({0:'No',1:'Yes'})
df.Churn=df.Churn.replace({'No':'非流失客户','Yes':'流失客户'})

df.TotalCharges=pd.to_numeric(df.TotalCharges,errors="coerce")#转换类型
df.TotalCharges.dtypes#查看数据类型

dtype('float64')

df.tenure.describe()

count    7043.000000
mean       32.371149
std        24.559481
min         0.000000
25%         9.000000
50%        29.000000
75%        55.000000
max        72.000000
Name: tenure, dtype: float64

#在网时长分组
bins_t = [0,6,12,18,24,30,36,42,48,54,60,66,72]
level_t = ['0.5年','1年', '1.5年', '2年', '2.5年', '3年', '3.5年', '4年', '4.5年','5年','5.5年','6年']
df['tenure_group']=pd.cut(df.tenure,bins=bins_t,labels=level_t,right=True)

df.MonthlyCharges.describe()

count    7043.000000
mean       64.761692
std        30.090047
min        18.250000
25%        35.500000
50%        70.350000
75%        89.850000
max       118.750000
Name: MonthlyCharges, dtype: float64

#月租费分组
bins_M = [0,20,40,60,80,100,120]
level_M = ['20','40', '60', '80', '100', '120']
df['MonthlyCharges_group']=pd.cut(df.MonthlyCharges,bins=bins_M,labels=level_M,right=True)

df.dropna(inplace=True)#缺失值数量不多，删除

df.isnull().sum()

customerID              0
gender                  0
SeniorCitizen           0
Partner                 0
Dependents              0
tenure                  0
PhoneService            0
MultipleLines           0
InternetService         0
OnlineSecurity          0
OnlineBackup            0
DeviceProtection        0
TechSupport             0
StreamingTV             0
StreamingMovies         0
Contract                0
PaperlessBilling        0
PaymentMethod           0
MonthlyCharges          0
TotalCharges            0
Churn                   0
tenure_group            0
MonthlyCharges_group    0
dtype: int64

df.Churn.value_counts()

非流失客户    5163
流失客户     1869
Name: Churn, dtype: int64

数据可视化呈现

计算整体流失率

df_Churn=df[df['Churn']=='流失客户']#流失客户
Rate_Churn=df[df['Churn']=='流失客户'].shape[0]/df['Churn'].shape[0]
print('经计算，整体流失率={:.2%}'.format(Rate_Churn))

经计算，整体流失率=26.58%

fig=plt.figure(num=1,figsize=(5,5))
plt.pie(df['Churn'].value_counts(),autopct="%.2f%%",colors=['grey','lightcoral'])
plt.title('Proportion of Customer Churn')
plt.legend(labels=['非流失客户','流失客户'],loc='best')

1.png

问题1：流失用户的特征是什么？

对指标进行归纳梳理，分用户画像指标，消费产品指标，消费信息指标。

用户画像指标
- 人口统计指标：'gender','SeniorCitizen','Partner','Dependents'
- 用户活跃度: 'tenure'
消费产品指标
- 手机服务： 'PhoneService', 'MultipleLines'
- 网络服务： 'InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies'
消费信息指标
- 收入指标:'MonthlyCharges','TotalCharges'
- 收入相关指标:'Contract', 'PaperlessBilling', 'PaymentMethod'

采用整体流失率作为标准，用于后面分析各维度的流失率做对比。

人口统计指标：'gender','SeniorCitizen','Partner','Dependents'

fig,axes=plt.subplots(2,2,figsize=(15,15))
for i ,j in enumerate(['gender','SeniorCitizen','Partner','Dependents']):
    plt.subplot(2,2,i+1)
    ax=sns.countplot(x=j,hue='Churn',data=df,palette="Set2",order=df_Churn.groupby(j)['Churn'].value_counts().index.levels[0])
    plt.title(str(j), fontsize=20)
    plt.xlabel('Churn',fontsize=15)
    plt.xticks(fontsize=15)
    plt.legend(fontsize=15)
    lent=df_Churn.groupby(j)['Churn'].value_counts().shape[0]
    for p in range(lent):
        Rate=df_Churn.groupby(j)['Churn'].value_counts()[p]/df.groupby(j,as_index=False)['Churn'].size()[p]    
        plt.text(p,100,'流失率{:.2%}'.format(Rate),fontsize=12)
    i+=1

2.png

结论1

性别对客户流失并无显著影响,与整体流失率相当
老年用户、未婚用户、非亲属用户流失占比相对较高

用户活跃度: 'tenure'

fig=plt.figure(num=3,figsize=(15,15))

ax1=fig.add_subplot(2,2,1)
ax1=sns.boxplot(df_Churn["tenure"],palette="Set2",orient = "v")
plt.title('流失用户在网时长')

ax2=fig.add_subplot(2,2,2)
ax2=sns.boxplot(df["tenure"],palette="Set2",orient = "v")
plt.title('整体用户在网时长')

ax3=fig.add_subplot(2,2,3)
ax3=sns.countplot(x='tenure_group',hue='Churn',data=df,palette="Set2")
plt.title('流失用户在网时长分布')

grouped_values=df_Churn.groupby('tenure_group').sum().reset_index()
tenure_churn_per=df_Churn.groupby('tenure_group')['Churn'].count()/df_Churn.groupby('tenure_group')['Churn'].count().sum()

ax4=fig.add_subplot(2,2,4)
ax4=tenure_churn_per.plot(linestyle='dashed', marker='o')
ax4.axhline(y=Rate_Churn,ls=":",c="red")
ax4.annotate(s='整体流失率均值{:.2%}'.format(Rate_Churn),xy=(0.28,0.25),fontsize=15)
plt.title('流失用户各在网时段流失率均值分布')
for index,row in grouped_values.iterrows():
    plt.text(index+0.025,tenure_churn_per.values[index],'{:.2%}'.format(tenure_churn_per.values[index],2),fontsize=12)

3.png

结论2：

流失用户的在网时长较短，多为10个月，且呈左偏分布。
新用户在1年内的流失率高于整体流失率。

消费产品指标

fig,axes=plt.subplots(3,3,figsize=(18,18))

for i ,j in enumerate(['PhoneService', 'MultipleLines', 'InternetService','OnlineSecurity', 
                       'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies']):

    plt.subplot(3,3,i+1)
    ax=sns.countplot(x=j,hue='Churn',data=df,palette="Set2",order=df_Churn.groupby(j)['Churn'].value_counts().index.levels[0])
    plt.title(str(j), fontsize=15)
    plt.xlabel('Churn',fontsize=10)
    plt.xticks(fontsize=10)
    plt.legend(fontsize=10)
    lent=df_Churn.groupby(j)['Churn'].value_counts().shape[0]
    for p in range(lent):
        Rate=df_Churn.groupby(j)['Churn'].value_counts()[p]/df.groupby(j,as_index=False)['Churn'].size()[p]    
        plt.text(p,100,'流失率{:.2%}'.format(Rate),fontsize=12)
    i+=1

4.png

结论

手机服务中，订购PhoneService的用户群流失率基本上与整体流失率持平，订购MultipleLines的流失率比整体流失率高；
网络服务中
1.其中订购InternetService的用户流失率较整体流失率偏高。
2.其中技术性服务（OnlineSecurity、OnlineBackup、DeviceProtection、TechSupport）中订购的用户流失率均比整体流失率低，而未订购的则高出整体流失率不少。
3.其中在娱乐性服务（StreamingTV、StreamingMovies）上，订购的流失率都比整体流失率高。

消费信息指标

收入指标

fig = plt.figure(num=5,figsize=(15,15))
ax1 = fig.add_subplot(3,2,1)
ax1=sns.distplot(df.MonthlyCharges, hist=True,color='green')
ax1.set_ylim(0,0.032)
ax1.set_title("整体客户月消费分布")

ax2 = fig.add_subplot(3,2,2)
ax2=sns.distplot(df.TotalCharges,hist=True,color='green')
ax2.set_title("整体客户整体消费分布")

ax3 = fig.add_subplot(3,2,3)
ax3=sns.distplot(df_Churn['MonthlyCharges'],hist=True)
ax3.set_ylim(0,0.032)
ax3.set_title("流失客户月消费分布")

ax4=fig.add_subplot(3,2,4)
ax4=sns.distplot(df_Churn['TotalCharges'],hist=True)
plt.title('流失用户整体消费分布')

grouped_values=df_Churn.groupby('MonthlyCharges_group').sum().reset_index()
MonthlyCharges_churn_per=df_Churn.groupby('MonthlyCharges_group')['Churn'].count()/df_Churn.groupby('MonthlyCharges_group')['Churn'].count().sum()

ax5=fig.add_subplot(3,2,5)
ax5=MonthlyCharges_churn_per.plot(linestyle='dashed', marker='o')
ax5.axhline(y=Rate_Churn,ls=":",c="red")
ax5.annotate(s='整体流失率均值{:.2%}'.format(Rate_Churn),xy=(0.28,0.25),fontsize=15)
plt.title('流失用户月消费流失率均值分布')
for index,row in grouped_values.iterrows():
    plt.text(index+0.025,MonthlyCharges_churn_per.values[index],'{:.2%}'.format(MonthlyCharges_churn_per.values[index],2),fontsize=12)

5.png

收入相关指标

fig=plt.figure(num=6,figsize=(15,15))

for i ,j in enumerate(["Contract","PaperlessBilling",'PaymentMethod']):
    plt.subplot(3,1,i+1)
    ax=sns.countplot(x=j,hue='Churn',data=df,palette="Set2",order=df_Churn.groupby(j)['Churn'].value_counts().index.levels[0])
    plt.title(str(j), fontsize=15)
    plt.xlabel('Churn',fontsize=15)
    plt.xticks(fontsize=10)
    plt.legend(fontsize=10)
    lent=df_Churn.groupby(j)['Churn'].value_counts().shape[0]
    for p in range(lent):
        Rate=df_Churn.groupby(j)['Churn'].value_counts()[p]/df.groupby(j,as_index=False)['Churn'].size()[p]    
        plt.text(p+0.1,100,'流失率{:.2%}'.format(Rate),fontsize=10)
    i+=1

6.png

结论

流失用户的主要来源是月消费80-100元的用户。
合同期限上，选择月签的用户流失率最高，达到42.7%,时长越长，流失越低。
选择无纸账单的用户流失率高于选择纸账单的用户。
支付上，选择Electronic check支付方式的用户流失最高，达到45%，其他三种流失率相差无几。

总结

以下用户群的流失率是高于整体流失率的:

用户维度：老人，单身，无亲属，在网时长小于10个月。
产品维度：开通多线服务，开通光纤网络，不开通技术性增值服务,开通娱乐性服务。
消费特征：月消费80-100元，选择月签，Electronic check支付。

最后编辑于：2019.10.07 14:32:28

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 216,142评论 6赞 498
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 92,298评论 3赞 392
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 162,068评论 0赞 351
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 58,081评论 1赞 291
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 67,099评论 6赞 388
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 51,071评论 1赞 295
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 39,990评论 3赞 417
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 38,832评论 0赞 273
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 45,274评论 1赞 310
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 37,488评论 2赞 331
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 39,649评论 1赞 347
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 35,378评论 5赞 343
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 40,979评论 3赞 325
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 31,625评论 0赞 21
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 32,796评论 1赞 268
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 47,643评论 2赞 368
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 44,545评论 2赞 352