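The principles behind the Wide & Deep model are not covered in detail here.
In this article, we use the model to predict customer churn on the Telco customer dataset, which can be downloaded from: https://www.kaggle.com/blastchar/telco-customer-churn/download
Assume the raw data has already been preprocessed, giving the data shown in the figure below:
As the figure shows, the string-valued categories in the raw data have been converted to integers.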
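The preprocessing itself is not shown in this article. Purely as an illustration, one way to turn string categories into integer codes is sketched below in plain Python. The helper name `encode_column` and the sample values are hypothetical, and pandas offers equivalents such as `pandas.factorize`.

```python
def encode_column(values):
    """Map each distinct string to an integer code (codes follow sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical sample values for a 'Contract'-style column.
contract = ["Month-to-month", "One year", "Two year", "One year"]
codes, mapping = encode_column(contract)
print(codes)    # [0, 1, 2, 1]
print(mapping)
```

Keeping the mapping around matters in practice, because new samples have to be encoded with the same codes that were used for training.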
Next, we prepare the training data.
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv(r'D:\new-telco-customer-churn.csv')
train, test = train_test_split(data, test_size=0.2, random_state=40)
train_y = train.pop('Churn')
test_y = test.pop('Churn')
Next, we define the feature columns.
# Continuous numeric features
tenure = tf.feature_column.numeric_column('tenure')
MonthlyCharges = tf.feature_column.numeric_column('MonthlyCharges')
TotalCharges = tf.feature_column.numeric_column('TotalCharges')
# Categorical features
CATEGORICAL_COLUMNS = [
    'gender', 'SeniorCitizen', 'Partner',
    'Dependents', 'PhoneService', 'MultipleLines',
    'InternetService', 'OnlineSecurity', 'OnlineBackup',
    'DeviceProtection', 'TechSupport', 'StreamingTV',
    'StreamingMovies', 'Contract', 'PaperlessBilling',
    'PaymentMethod'
]
# Collect the distinct values of each categorical feature from the training set
vocabulary = {}
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary[feature_name] = train[feature_name].unique()
tenure_buckets = tf.feature_column.bucketized_column(tenure, boundaries=[0, 6, 12, 24, 36, 48, 80])
MonthlyCharges_buckets = tf.feature_column.bucketized_column(MonthlyCharges, boundaries=[10, 25, 40, 55, 70, 90, 120])
TotalCharges_buckets = tf.feature_column.bucketized_column(TotalCharges, boundaries=[0, 500, 1000, 2000, 4000, 6000, 9000])
# Build base_columns
base_columns = [tenure_buckets, MonthlyCharges_buckets, TotalCharges_buckets]
for feature_name in CATEGORICAL_COLUMNS:
    temp_feature = tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            feature_name, vocabulary[feature_name])
    )
    base_columns.append(temp_feature)
crossed_columns = [
    tf.feature_column.crossed_column([tenure_buckets, MonthlyCharges_buckets], hash_bucket_size=36),
    tf.feature_column.crossed_column([tenure_buckets, TotalCharges_buckets], hash_bucket_size=16)
]
wide_columns = base_columns + crossed_columns
deep_columns = [tenure, MonthlyCharges, TotalCharges]
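To make the bucketing and crossing above more concrete, here is a small pure-Python sketch of the idea. It assumes the documented `bucketized_column` convention (each bucket includes its left boundary and excludes its right one) and uses `zlib.crc32` as a stand-in hash. TensorFlow uses its own fingerprint function, so the actual slot ids of a `crossed_column` will differ. The helper names are hypothetical.

```python
import bisect
import zlib

TENURE_BOUNDARIES = [0, 6, 12, 24, 36, 48, 80]
MONTHLY_BOUNDARIES = [10, 25, 40, 55, 70, 90, 120]


def bucket_index(value, boundaries):
    """Return the bucket id; n boundaries give n + 1 buckets."""
    return bisect.bisect_right(boundaries, value)


def crossed_bucket(indices, hash_bucket_size):
    """Hash a combination of bucket ids into a fixed number of slots.

    Stand-in for a crossed column: crc32 is NOT the hash TensorFlow uses.
    """
    key = "_X_".join(str(i) for i in indices).encode("utf-8")
    return zlib.crc32(key) % hash_bucket_size


tenure_id = bucket_index(5, TENURE_BOUNDARIES)       # 5 lies in [0, 6)
monthly_id = bucket_index(70, MONTHLY_BOUNDARIES)    # 70 lies in [70, 90)
slot = crossed_bucket([tenure_id, monthly_id], hash_bucket_size=36)

n_combinations = (len(TENURE_BOUNDARIES) + 1) * (len(MONTHLY_BOUNDARIES) + 1)
print(tenure_id, monthly_id, slot, n_combinations)
```

With 8 buckets on each side there are 64 possible combinations, so a `hash_bucket_size` of 36 (or 16) necessarily maps several combinations to the same slot. Whether that is acceptable depends on the data; a larger size reduces collisions at the cost of more weights.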
Next, we create the model and the input function.
from tensorflow import keras
model_wd = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    linear_optimizer=keras.optimizers.Ftrl(learning_rate=0.001, l2_regularization_strength=1.0),
    dnn_feature_columns=deep_columns,
    dnn_optimizer=keras.optimizers.Adagrad(learning_rate=0.1),
    dnn_hidden_units=[64, 32]  # sizes of the hidden layers
)
def input_fn(X, y, n_epochs=None, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices((dict(X), y))
    if shuffle:
        dataset = dataset.shuffle(500)
    dataset = dataset.repeat(n_epochs)
    dataset = dataset.batch(100)
    return dataset
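The order of operations in `input_fn` matters: because `repeat` comes before `batch`, a batch can straddle two epochs, and the final batch may be smaller than the batch size. The following pure-Python sketch imitates that behaviour without shuffling. It is only an illustration, not how `tf.data` is implemented, and `repeat_then_batch` is a hypothetical helper.

```python
from itertools import chain, islice, repeat


def repeat_then_batch(rows, n_epochs, batch_size):
    """Yield batches from `rows` repeated `n_epochs` times (finite epochs only)."""
    stream = chain.from_iterable(repeat(rows, n_epochs))
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


batches = list(repeat_then_batch(list(range(7)), n_epochs=2, batch_size=3))
print(batches)
# [[0, 1, 2], [3, 4, 5], [6, 0, 1], [2, 3, 4], [5, 6]]
```

The third batch mixes the end of the first epoch with the start of the second. For evaluation with `n_epochs=1` there is no such mixing, only a possibly shorter last batch.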
Now we can train the model.
model_wd.train(input_fn=lambda:input_fn(train, train_y),max_steps=10000)
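Since `input_fn` repeats the training data indefinitely by default (`n_epochs=None`), `max_steps` is what ends training. A quick calculation shows how many passes over the data that implies. The row count used below is a hypothetical placeholder, so substitute `len(train)`.

```python
def epochs_covered(n_rows, batch_size=100, max_steps=10000):
    """Number of passes over the training set implied by max_steps."""
    return max_steps * batch_size / n_rows


# 5000 is a placeholder row count, not the size of the actual training set.
print(epochs_covered(5000))  # 200.0
```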
Then we evaluate it on the test set.
result = model_wd.evaluate(input_fn=lambda:input_fn(test, test_y, shuffle=False, n_epochs=1))
If the results look good, we can use the model to predict on new samples.
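As a hedged sketch of that last step: canned `tf.estimator` classifiers yield one dictionary per example from `predict`, and for binary classification the `'probabilities'` entry holds the probabilities of class 0 and class 1. The helper below (a hypothetical name) extracts the class-1 probability, assuming `Churn` was encoded as 1 for churned customers. It is exercised here with hand-written stand-in dictionaries rather than real model output.

```python
def churn_probabilities(predictions):
    """Collect the class-1 probability from each prediction dictionary."""
    return [float(p["probabilities"][1]) for p in predictions]


# Stand-in values shaped like the estimator's output, not real predictions.
fake_predictions = [
    {"probabilities": [0.9, 0.1]},
    {"probabilities": [0.25, 0.75]},
]
print(churn_probabilities(fake_predictions))  # [0.1, 0.75]

# With the trained model, a features-only input function would be needed
# (new_X is a hypothetical DataFrame with the same feature columns):
# preds = model_wd.predict(
#     input_fn=lambda: tf.data.Dataset.from_tensor_slices(dict(new_X)).batch(100))
# probs = churn_probabilities(preds)
```

Note that the `input_fn` defined earlier expects labels, so it cannot be reused as is for unlabeled samples.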