Kaggle: Titanic with a Decision Tree

This is only a first look at how to use a decision tree; the Titanic disaster data leaves plenty of room for deeper analysis.

import pandas as pd
# Load the data
train_data = pd.read_csv(r'C:\Users\wxw\Downloads\train.csv')
test_data = pd.read_csv(r'C:\Users\wxw\Downloads\test.csv')

# Explore the data
# info() prints its summary itself and returns None, so it is not wrapped in print()
train_data.info()
print('-'*30)
print(train_data.describe())
print('-'*30)
print(train_data.describe(include=['O']))
print('-'*30)
print(train_data.head())
print('-'*30)
print(train_data.tail())

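# Fill missing Age values with the mean age
# (assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas)
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].mean())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].mean())

# Fill missing Fare values with the mean fare
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].mean())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())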

# Inspect the values of Embarked
print(train_data['Embarked'].value_counts())
# Fill missing Embarked values with the most common port of embarkation ('S')
train_data['Embarked'] = train_data['Embarked'].fillna('S')
test_data['Embarked'] = test_data['Embarked'].fillna('S')
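# Feature selection
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
train_features = train_data[features]
train_labels = train_data['Survived']
test_features = test_data[features]
from sklearn.feature_extraction import DictVectorizer
# One-hot encode the string columns (Sex, Embarked); numeric columns pass through unchanged
dvec = DictVectorizer(sparse=False)
# orient must be 'records'; the abbreviation 'record' is rejected by pandas 2.x
train_features = dvec.fit_transform(train_features.to_dict(orient='records'))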

print(dvec.feature_names_)
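from sklearn.tree import DecisionTreeClassifier
# Build a decision tree that splits on entropy (information gain, as in ID3;
# scikit-learn's own implementation is CART-based)
clf = DecisionTreeClassifier(criterion='entropy')
# Train the decision tree
clf.fit(train_features, train_labels)
# Encode the test set with the vectorizer already fitted on the training set
test_features = dvec.transform(test_features.to_dict(orient='records'))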
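# Predict with the decision tree
pred_labels = clf.predict(test_features)
# Accuracy on the training set (optimistic, since the tree has already seen these rows)
acc_decision_tree = round(clf.score(train_features, train_labels), 6)
print('Training-set accuracy: %.4f' % acc_decision_tree)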
import numpy as np
from sklearn.model_selection import cross_val_score
# Estimate the decision tree's accuracy with 10-fold cross-validation
print('cross_val_score accuracy: %.4f' % np.mean(cross_val_score(clf, train_features, train_labels, cv=10)))
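
The script reports two numbers: accuracy on the training set and mean accuracy under 10-fold cross-validation. The sketch below shows why the two can differ a lot for an unpruned tree. It does not use the Titanic files. The data comes from `make_classification` with some label noise, which is purely an assumption for illustration, and the `max_depth=3` tree is a hypothetical variant that is not part of the script above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 7 continuous features, 10% of labels flipped.
X, y = make_classification(n_samples=500, n_features=7, n_informative=4,
                           flip_y=0.1, random_state=0)

# Unrestricted tree: it keeps splitting until every training row is classified correctly.
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
full_tree.fit(X, y)
train_acc = full_tree.score(X, y)

# 10-fold cross-validation scores the same kind of tree on rows it was not trained on.
cv_acc = np.mean(cross_val_score(full_tree, X, y, cv=10))

# Hypothetical depth-limited tree, scored the same way for comparison.
shallow_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
shallow_cv_acc = np.mean(cross_val_score(shallow_tree, X, y, cv=10))

print('training accuracy (full tree):  %.4f' % train_acc)
print('10-fold CV accuracy (full tree): %.4f' % cv_acc)
print('10-fold CV accuracy (max_depth=3): %.4f' % shallow_cv_acc)
```

The training score of the full tree is not a useful estimate of how it will do on `test.csv`; the cross-validated figure is the one to compare when trying parameters such as `max_depth`.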