分析背景
某电信公司市场部为了预防用户流失,收集了已经打好流失标签的用户数据。现在要对流失用户情况进行分析,找出哪些用户可能会流失?
理解数据
采集数据
本数据集描述了电信用户是否流失以及其相关信息,共包含7043条数据,共21个字段,分别介绍如下:
- customerID : 用户ID。
- gender:性别。(Female & Male)
- SeniorCitizen :老年用户 (1表示是,0表示不是)
- Partner :伴侣用户 (Yes or No)
- Dependents :亲属用户 (Yes or No)
- tenure : 在网时长(0-72月)
- PhoneService : 是否开通电话服务业务 (Yes or No)
- MultipleLines: 是否开通了多线业务(Yes 、No or No phoneservice 三种)
- InternetService:是否开通互联网服务 (No, DSL数字网络,fiber optic光纤网络 三种)
- OnlineSecurity:是否开通网络安全服务(Yes,No,No internetserive 三种)
- OnlineBackup:是否开通在线备份业务(Yes,No,No internetserive 三种)
- DeviceProtection:是否开通了设备保护业务(Yes,No,No internetserive 三种)
- TechSupport:是否开通了技术支持服务(Yes,No,No internetserive 三种)
- StreamingTV:是否开通网络电视(Yes,No,No internetserive 三种)
- StreamingMovies:是否开通网络电影(Yes,No,No internetserive 三种)
- Contract:签订合同方式 (按月,一年,两年)
- PaperlessBilling:是否开通电子账单(Yes or No)
- PaymentMethod:付款方式(bank transfer,credit card,electronic check,mailed check)
- MonthlyCharges:月费用
- TotalCharges:总费用
- Churn:该用户是否流失(Yes or No)
导入数据
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False
import warnings
warnings.filterwarnings("ignore")
df=pd.read_csv(r"D:\PycharmProjects\WA_Fn-UseC_-Telco-Customer-Churn.csv")
查看数据
df.head()
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
print ("行: " ,df.shape[0])
print ("列: " ,df.shape[1])
print ("特征: \n" ,df.columns.tolist())
print ("缺失值:", df.isnull().sum().values.sum())
print ("重复值:", df.customerID.duplicated().sum())
print ("唯一值:\n",df.nunique())
行: 7043
列: 21
特征:
['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
缺失值: 0
重复值: 0
唯一值:
customerID 7043
gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 73
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
MonthlyCharges 1585
TotalCharges 6531
Churn 2
dtype: int64
df.describe()
SeniorCitizen | tenure | MonthlyCharges | |
---|---|---|---|
count | 7043.000000 | 7043.000000 | 7043.000000 |
mean | 0.162147 | 32.371149 | 64.761692 |
std | 0.368612 | 24.559481 | 30.090047 |
min | 0.000000 | 0.000000 | 18.250000 |
25% | 0.000000 | 9.000000 | 35.500000 |
50% | 0.000000 | 29.000000 | 70.350000 |
75% | 0.000000 | 55.000000 | 89.850000 |
max | 1.000000 | 72.000000 | 118.750000 |
数据清洗、数据一致化
df.dtypes#查看数据类型
customerID object
gender object
SeniorCitizen int64
Partner object
Dependents object
tenure int64
PhoneService object
MultipleLines object
InternetService object
OnlineSecurity object
OnlineBackup object
DeviceProtection object
TechSupport object
StreamingTV object
StreamingMovies object
Contract object
PaperlessBilling object
PaymentMethod object
MonthlyCharges float64
TotalCharges object
Churn object
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID 7043 non-null object
gender 7043 non-null object
SeniorCitizen 7043 non-null int64
Partner 7043 non-null object
Dependents 7043 non-null object
tenure 7043 non-null int64
PhoneService 7043 non-null object
MultipleLines 7043 non-null object
InternetService 7043 non-null object
OnlineSecurity 7043 non-null object
OnlineBackup 7043 non-null object
DeviceProtection 7043 non-null object
TechSupport 7043 non-null object
StreamingTV 7043 non-null object
StreamingMovies 7043 non-null object
Contract 7043 non-null object
PaperlessBilling 7043 non-null object
PaymentMethod 7043 non-null object
MonthlyCharges 7043 non-null float64
TotalCharges 7043 non-null object
Churn 7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
replace_list=['OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies']
for i in replace_list:
df[i]=df[i].str.replace('No internet service','No')
df['InternetService']=df['InternetService'].str.replace('Fiber optic','Yes')
df['InternetService']=df['InternetService'].str.replace('DSL','Yes')
df['MultipleLines']=df['MultipleLines'].str.replace('No phone service','No')
df.SeniorCitizen=df.SeniorCitizen.replace({0:'No',1:'Yes'})
df.Churn=df.Churn.replace({'No':'非流失客户','Yes':'流失客户'})
df.TotalCharges=pd.to_numeric(df.TotalCharges,errors="coerce")#转换类型
df.TotalCharges.dtypes#查看数据类型
dtype('float64')
df.tenure.describe()
count 7043.000000
mean 32.371149
std 24.559481
min 0.000000
25% 9.000000
50% 29.000000
75% 55.000000
max 72.000000
Name: tenure, dtype: float64
#在网时长分组
bins_t = [0,6,12,18,24,30,36,42,48,54,60,66,72]
level_t = ['0.5年','1年', '1.5年', '2年', '2.5年', '3年', '3.5年', '4年', '4.5年','5年','5.5年','6年']
df['tenure_group']=pd.cut(df.tenure,bins=bins_t,labels=level_t,right=True)
df.MonthlyCharges.describe()
count 7043.000000
mean 64.761692
std 30.090047
min 18.250000
25% 35.500000
50% 70.350000
75% 89.850000
max 118.750000
Name: MonthlyCharges, dtype: float64
#月租费分组
bins_M = [0,20,40,60,80,100,120]
level_M = ['20','40', '60', '80', '100', '120']
df['MonthlyCharges_group']=pd.cut(df.MonthlyCharges,bins=bins_M,labels=level_M,right=True)
df.dropna(inplace=True)#缺失值数量不多,删除
df.isnull().sum()
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
tenure_group 0
MonthlyCharges_group 0
dtype: int64
df.Churn.value_counts()
非流失客户 5163
流失客户 1869
Name: Churn, dtype: int64
数据可视化呈现
计算整体流失率
df_Churn=df[df['Churn']=='流失客户']#流失客户
Rate_Churn=df[df['Churn']=='流失客户'].shape[0]/df['Churn'].shape[0]
print('经计算,整体流失率={:.2%}'.format(Rate_Churn))
经计算,整体流失率=26.58%
fig=plt.figure(num=1,figsize=(5,5))
plt.pie(df['Churn'].value_counts(),autopct="%.2f%%",colors=['grey','lightcoral'])
plt.title('Proportion of Customer Churn')
plt.legend(labels=['非流失客户','流失客户'],loc='best')
问题1:流失用户的特征是什么?
对指标进行归纳梳理,分用户画像指标,消费产品指标,消费信息指标。
- 用户画像指标
- 人口统计指标:'gender','SeniorCitizen','Partner','Dependents'
- 用户活跃度: 'tenure'
- 消费产品指标
- 手机服务: 'PhoneService', 'MultipleLines'
- 网络服务: 'InternetService','OnlineSecurity','OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies'
- 消费信息指标
- 收入指标:'MonthlyCharges','TotalCharges'
- 收入相关指标:'Contract', 'PaperlessBilling', 'PaymentMethod'
采用整体流失率作为标准,用于后面分析各维度的流失率做对比。
人口统计指标:'gender','SeniorCitizen','Partner','Dependents'
fig,axes=plt.subplots(2,2,figsize=(15,15))
for i ,j in enumerate(['gender','SeniorCitizen','Partner','Dependents']):
plt.subplot(2,2,i+1)
ax=sns.countplot(x=j,hue='Churn',data=df,palette="Set2",order=df_Churn.groupby(j)['Churn'].value_counts().index.levels[0])
plt.title(str(j), fontsize=20)
plt.xlabel('Churn',fontsize=15)
plt.xticks(fontsize=15)
plt.legend(fontsize=15)
lent=df_Churn.groupby(j)['Churn'].value_counts().shape[0]
for p in range(lent):
Rate=df_Churn.groupby(j)['Churn'].value_counts()[p]/df.groupby(j,as_index=False)['Churn'].size()[p]
plt.text(p,100,'流失率{:.2%}'.format(Rate),fontsize=12)
i+=1
结论1
- 性别对客户流失并无显著影响,与整体流失率相当
- 老年用户、未婚用户、非亲属用户流失占比相对较高
用户活跃度: 'tenure'
fig=plt.figure(num=3,figsize=(15,15))
ax1=fig.add_subplot(2,2,1)
ax1=sns.boxplot(df_Churn["tenure"],palette="Set2",orient = "v")
plt.title('流失用户在网时长')
ax2=fig.add_subplot(2,2,2)
ax2=sns.boxplot(df["tenure"],palette="Set2",orient = "v")
plt.title('整体用户在网时长')
ax3=fig.add_subplot(2,2,3)
ax3=sns.countplot(x='tenure_group',hue='Churn',data=df,palette="Set2")
plt.title('流失用户在网时长分布')
grouped_values=df_Churn.groupby('tenure_group').sum().reset_index()
tenure_churn_per=df_Churn.groupby('tenure_group')['Churn'].count()/df_Churn.groupby('tenure_group')['Churn'].count().sum()
ax4=fig.add_subplot(2,2,4)
ax4=tenure_churn_per.plot(linestyle='dashed', marker='o')
ax4.axhline(y=Rate_Churn,ls=":",c="red")
ax4.annotate(s='整体流失率均值{:.2%}'.format(Rate_Churn),xy=(0.28,0.25),fontsize=15)
plt.title('流失用户各在网时段流失率均值分布')
for index,row in grouped_values.iterrows():
plt.text(index+0.025,tenure_churn_per.values[index],'{:.2%}'.format(tenure_churn_per.values[index],2),fontsize=12)
结论2:
- 流失用户的在网时长较短,多为10个月,且呈左偏分布。
- 新用户在1年内的流失率高于整体流失率。
消费产品指标
fig,axes=plt.subplots(3,3,figsize=(18,18))
for i ,j in enumerate(['PhoneService', 'MultipleLines', 'InternetService','OnlineSecurity',
'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV', 'StreamingMovies']):
plt.subplot(3,3,i+1)
ax=sns.countplot(x=j,hue='Churn',data=df,palette="Set2",order=df_Churn.groupby(j)['Churn'].value_counts().index.levels[0])
plt.title(str(j), fontsize=15)
plt.xlabel('Churn',fontsize=10)
plt.xticks(fontsize=10)
plt.legend(fontsize=10)
lent=df_Churn.groupby(j)['Churn'].value_counts().shape[0]
for p in range(lent):
Rate=df_Churn.groupby(j)['Churn'].value_counts()[p]/df.groupby(j,as_index=False)['Churn'].size()[p]
plt.text(p,100,'流失率{:.2%}'.format(Rate),fontsize=12)
i+=1
结论
- 手机服务中,订购PhoneService的用户群流失率基本上与整体流失率持平,订购MultipleLines的流失率比整体流失率高;
- 网络服务中
1.其中订购InternetService的用户流失率较整体流失率偏高。
2.其中技术性服务(OnlineSecurity、OnlineBackup、DeviceProtection、TechSupport)中订购的用户流失率均比整体流失率低,而未订购的则高出整体流失率不少。
3.其中在娱乐性服务(StreamingTV、StreamingMovies)上,订购的流失率都比整体流失率高。
消费信息指标
收入指标
fig = plt.figure(num=5,figsize=(15,15))
ax1 = fig.add_subplot(3,2,1)
ax1=sns.distplot(df.MonthlyCharges, hist=True,color='green')
ax1.set_ylim(0,0.032)
ax1.set_title("整体客户月消费分布")
ax2 = fig.add_subplot(3,2,2)
ax2=sns.distplot(df.TotalCharges,hist=True,color='green')
ax2.set_title("整体客户整体消费分布")
ax3 = fig.add_subplot(3,2,3)
ax3=sns.distplot(df_Churn['MonthlyCharges'],hist=True)
ax3.set_ylim(0,0.032)
ax3.set_title("流失客户月消费分布")
ax4=fig.add_subplot(3,2,4)
ax4=sns.distplot(df_Churn['TotalCharges'],hist=True)
plt.title('流失用户整体消费分布')
grouped_values=df_Churn.groupby('MonthlyCharges_group').sum().reset_index()
MonthlyCharges_churn_per=df_Churn.groupby('MonthlyCharges_group')['Churn'].count()/df_Churn.groupby('MonthlyCharges_group')['Churn'].count().sum()
ax5=fig.add_subplot(3,2,5)
ax5=MonthlyCharges_churn_per.plot(linestyle='dashed', marker='o')
ax5.axhline(y=Rate_Churn,ls=":",c="red")
ax5.annotate(s='整体流失率均值{:.2%}'.format(Rate_Churn),xy=(0.28,0.25),fontsize=15)
plt.title('流失用户月消费流失率均值分布')
for index,row in grouped_values.iterrows():
plt.text(index+0.025,MonthlyCharges_churn_per.values[index],'{:.2%}'.format(MonthlyCharges_churn_per.values[index],2),fontsize=12)
收入相关指标
fig=plt.figure(num=6,figsize=(15,15))
for i ,j in enumerate(["Contract","PaperlessBilling",'PaymentMethod']):
plt.subplot(3,1,i+1)
ax=sns.countplot(x=j,hue='Churn',data=df,palette="Set2",order=df_Churn.groupby(j)['Churn'].value_counts().index.levels[0])
plt.title(str(j), fontsize=15)
plt.xlabel('Churn',fontsize=15)
plt.xticks(fontsize=10)
plt.legend(fontsize=10)
lent=df_Churn.groupby(j)['Churn'].value_counts().shape[0]
for p in range(lent):
Rate=df_Churn.groupby(j)['Churn'].value_counts()[p]/df.groupby(j,as_index=False)['Churn'].size()[p]
plt.text(p+0.1,100,'流失率{:.2%}'.format(Rate),fontsize=10)
i+=1
结论
- 流失用户的主要来源是月消费80-100元的用户。
- 合同期限上,选择月签的用户流失率最高,达到42.7%,时长越长,流失越低 。
- 选择无纸账单的用户流失率高于选择纸账单的用户。
- 支付上,选择Electronic check支付方式的用户流失最高,达到45%,其他三种流失率相差无几。
总结
以下用户群的流失率是高于整体流失率的:
- 用户维度:老人,单身,无亲属,在网时长小于10个月。
- 产品维度:开通多线服务,开通光纤网络,不开通技术性增值服务,开通娱乐性服务。
- 消费特征:月消费80-100元,选择月签,Electronic check支付。