First Steps with Kaggle

This was my first hands-on session on Kaggle. First impressions: the interface is clean, the guided tutorials are friendly and well thought out, and overall it feels like a mature platform.
The datasets are plentiful too: historical European football scores, US presidential election analysis, a programming-language usage survey, human-resources analytics, historical plane-crash statistics, IMDB movie-rating data, and some anonymized loan and credit-risk records.

1 Getting started

I found the Titanic dataset and worked through my first task with the guide. The DataCamp course provides task descriptions: you write code following the hints and submit it, and if it is wrong you get feedback to correct it, until you get it right. It feels a lot like the quest tutorial when starting a new video game.

2 Requirements analysis

There are two sets of information about Titanic passengers: a Train set and a Test set. By analyzing the Train set's features together with the label "Survived", we clean the data, select features, and build a decision-tree model to predict whether each passenger in the Test set survived.

Field descriptions
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

Python code
import numpy as np
from sklearn import tree
import pandas as pd

Installing the numpy and scipy libraries from the official source kept failing. I eventually found a download page that hosts many unofficial Python wheels:
cmd >> python -m pip install xx.whl >> installed successfully.

Step 1: Import and inspect the data

train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)
test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

# Print summary statistics of the train and test dataframes
print(train.describe())
print(test.describe())
Output of test.describe():
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200

The summary shows that some Fare and Age values are missing; they must be imputed before training the model.
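A quick way to quantify those gaps (my own addition, not part of the exercise) is to count the null values per column:

# Count missing values per column in the train and test dataframes
print(train.isnull().sum())
print(test.isnull().sum())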

European gentlemen of the era upheld the "ladies first" tradition, so let's look at how sex relates to the prediction label.

Step 2: Analyze the relationship between the Sex feature and the target label
# Passengers that survived vs passengers that passed away
print(train["Survived"].value_counts())

# As proportions
print(train["Survived"].value_counts(normalize = True))

# Males that survived vs males that passed away
print(train["Survived"][train["Sex"] == 'male'].value_counts())

# Females that survived vs Females that passed away
print(train["Survived"][train["Sex"] == 'female'].value_counts())

# Normalized male survival
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))

# Normalized female survival
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))

<script.py> output:
    0    549
    1    342
    Name: Survived, dtype: int64
    0    0.616162
    1    0.383838
    Name: Survived, dtype: float64
    0    468
    1    109
    Name: Survived, dtype: int64
    1    233
    0     81
    Name: Survived, dtype: int64
    0    0.811092
    1    0.188908
    Name: Survived, dtype: float64
    1    0.742038
    0    0.257962
    Name: Survived, dtype: float64

Breaking survival down by sex shows that about 19% of the men and 74% of the women survived. A rule that predicts survival from sex alone ("every woman survives, every man perishes") would be right for 74% of the women and 81% of the men, roughly 79% of all training passengers, which makes a useful baseline.
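As a sanity check (again my own addition, not from the exercise), that baseline accuracy can be computed directly on the training set:

# Predict survival purely by sex: women survive (1), men do not (0)
sex_baseline = (train["Sex"] == "female").astype(int)

# Fraction of training passengers this simple rule gets right (~0.79)
print((sex_baseline == train["Survived"]).mean())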

We also know that young children were given priority for the lifeboats.

Step 3: Analyze the relationship between the Age feature and the Survived label

To simplify the statistics and the later decision-tree training, we convert the continuous Age variable into a discrete categorical one.

# Create the column Child and initialize it to NaN
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older, then print
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0

print(train)

# Print normalized Survival Rates for passengers under 18
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print(train["Survived"][train["Child"] == 0].value_counts(normalize = True))
    1    0.539823
    0    0.460177
    Name: Survived, dtype: float64
    0    0.618968
    1    0.381032
    Name: Survived, dtype: float64

About 54% of the minors survived, versus 38% of the adults.

Step 4: Data cleaning and format conversion
# Convert the male and female groups to integer form
train.loc[train["Sex"] == "male", "Sex"] = 0
train.loc[train["Sex"] == "female", "Sex"] = 1

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
train.loc[train["Embarked"] == 'S', "Embarked"] = 0
train.loc[train["Embarked"] == 'C', "Embarked"] = 1
train.loc[train["Embarked"] == 'Q', "Embarked"] = 2

For the decision-tree model to work correctly and efficiently, the data must be cleaned first:

  1. Convert Sex into a 0/1 variable
  2. Impute the missing Age values with the mean (a sketch follows below)
  3. Convert Embarked into a discrete numeric variable

As a rule of thumb, data cleaning and feature selection take up 70%-80% of the total analysis time, and they largely determine whether the predictions end up accurate.
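Item 2 in the list above is not shown in the snippets; a minimal sketch, assuming the column mean is used as the fill value:

# Fill missing ages with the mean age of the training set
# (mean is an assumption here; a median would work just as well)
train["Age"] = train["Age"].fillna(train["Age"].mean())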

Step 5: Build and train the decision-tree model
# Import the Numpy library
import numpy as np
# Import 'tree' from scikit-learn library
from sklearn import tree

# Print the train data to see the available features
print(train)

# Create the target and features numpy arrays: target, features_one
target = train["Survived"].values
features_one = train[["Pclass", "Sex", "Age", "Fare"]].values

# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

# Look at the importance and score of the included features
print(my_tree_one.feature_importances_)
print(my_tree_one.score(features_one, target))

Here we use the scientific-computing library Numpy and the machine-learning library sklearn to fit a model to the selected feature values.

[ 0.12545743  0.31274009  0.23086653  0.33093596]
0.977553310887

Unexpectedly, the Fare field carries the largest weight for the prediction, about 33%. The reported accuracy is 97.8%, but note that this is measured on the training set itself.
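Since that score is computed on the data the tree was fitted to, it mostly reflects memorization by the unpruned tree. A hold-out split (my own addition, not part of the exercise) gives a more honest estimate:

from sklearn.model_selection import train_test_split

# Hold out 20% of the training data for evaluation
X_tr, X_val, y_tr, y_val = train_test_split(
    features_one, target, test_size = 0.2, random_state = 1)

# Fit on the remaining 80% and score on the unseen 20%
val_tree = tree.DecisionTreeClassifier(random_state = 1)
val_tree.fit(X_tr, y_tr)
print(val_tree.score(X_val, y_val))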

Step 6: Predict the test data with the trained model
# Impute the one missing Fare value (row 152) with the median
test.loc[152, "Fare"] = test["Fare"].median()

# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

# Make your prediction using the test set and print them.
my_prediction = my_tree_one.predict(test_features)
print(my_prediction)

We fill the missing Fare value with the median, then use the trained model to predict on the test set.
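One caveat: predict expects the test features in the same numeric form as the training features, so the step-4 cleaning has to be mirrored on the test set before the call above (a sketch; the course's test data may already be preprocessed this way):

# Mirror the training-set cleaning on the test set
test.loc[test["Sex"] == "male", "Sex"] = 0
test.loc[test["Sex"] == "female", "Sex"] = 1
test["Age"] = test["Age"].fillna(test["Age"].mean())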

Step 7: Export the predictions to CSV
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])

Extra 1: Tuning the decision-tree parameters
# Create a new array with the added features: features_two
features_two = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values

# Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5: my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

# Print the feature importances and score of the new decision tree
print(my_tree_two.feature_importances_)
print(my_tree_two.score(features_two, target))

Maybe we can improve the overfit model by making it less complex. In DecisionTreeClassifier, the depth of the model is controlled by two parameters: the max_depth parameter determines when the splitting of the decision tree stops, and the min_samples_split parameter monitors the number of observations in a bucket; if a certain threshold is not reached (e.g. a minimum of 10 passengers), no further splitting is done.

To reduce the risk of overfitting, the decision tree can be "pruned" by tuning the following parameters:

**max_features:** the maximum number of features considered when searching for the best split.
**max_depth:** (default=None) the maximum depth of the tree; by default, nodes are expanded until every leaf is pure or holds fewer than min_samples_split samples.
**min_samples_split:** the minimum number of samples required to split a node.
**min_samples_leaf:** the minimum number of samples required at a leaf node.
**max_leaf_nodes:** (default=None) the maximum number of leaf nodes.
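To check whether the pruning actually helps, one option (my own addition) is to compare cross-validated accuracy for the unpruned and pruned trees:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy: unpruned vs. pruned tree
full_tree = tree.DecisionTreeClassifier(random_state = 1)
pruned_tree = tree.DecisionTreeClassifier(max_depth = 10, min_samples_split = 5, random_state = 1)
print(cross_val_score(full_tree, features_two, target, cv = 5).mean())
print(cross_val_score(pruned_tree, features_two, target, cv = 5).mean())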

Extra 2: Feature engineering -- creating new features

Data Science is an art that benefits from a human element. Enter feature engineering: creatively engineering your own features by combining the different existing variables.
While feature engineering is a discipline in itself, too broad to be covered here in detail, you will have a look at a simple example by creating your own new predictive attribute: family_size.

# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two["SibSp"] + train_two["Parch"] + 1
print(train_two["family_size"])

# Create a new feature set and add the new feature
features_three = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier()
my_tree_three = my_tree_three.fit(features_three,target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))
print(my_tree_three.feature_importances_)

Extra 3: Trying a new algorithm -- random forests

A detailed study of Random Forests would take this tutorial a bit too far. However, since it's an often used machine learning technique, gaining a general understanding in Python won't hurt.

In layman's terms, the Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set. At the time of prediction, each tree is used to come up with a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees with 2 saying a passenger in the test set will survive and 1 says he will not, the passenger will be classified as a survivor. This approach of overtraining trees, but having the majority's vote count as the actual classification decision, avoids overfitting.

A random forest grows many decision trees; each tree votes on the final prediction, and the majority vote becomes the result, which avoids overfitting.

Advantages:

a. It performs well on many datasets; the two sources of randomness (bootstrap sampling and random feature selection) make it hard for a random forest to overfit.

b. On many current datasets it holds a clear advantage over other algorithms; the same two sources of randomness give it good noise tolerance.

c. It can handle very high-dimensional data (many features) without explicit feature selection, and adapts well to different datasets: it handles both discrete and continuous variables, and the data need not be normalized.

d. It can produce a proximity matrix Proximities = (p_ij) measuring similarity between samples, where p_ij = a_ij / N, a_ij is the number of times samples i and j fall in the same leaf node, and N is the number of trees in the forest.

e. While the forest is built, an unbiased estimate of the generalization error is obtained (from the out-of-bag samples).

f. Training is fast, and it yields a ranking of variable importance (computed two ways: the increase in OOB misclassification rate, or the Gini decrease at split points).

g. It can detect interactions between features during training.

h. It parallelizes easily.

i. It is relatively simple to implement.

# Import the `RandomForestClassifier`
from sklearn.ensemble import RandomForestClassifier

# We want the Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked variables
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the score of the fitted random forest
print(my_forest.score(features_forest, target))

# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch", "Embarked"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))
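To submit the forest's predictions, the same export pattern as step 7 applies (a sketch; the filename is my own choice):

# Package the forest predictions in the two-column submission format
forest_solution = pd.DataFrame(pred_forest, PassengerId, columns = ["Survived"])
forest_solution.to_csv("my_solution_forest.csv", index_label = ["PassengerId"])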

Summary:

We walked through the full data-analysis workflow. Because the data quality is high and the business scenario is simple, the model predicts well and we ran into no real-world complications.
Also, both the modeling and the training here lean entirely on sklearn's ready-made methods; given the time, it would be worth implementing the decision-tree algorithm in plain Python once, to understand the model more deeply.
Finally, a good next step is learning data-visualization techniques, e.g. using matplotlib to get an intuitive view of the data.
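For instance, a minimal matplotlib sketch (my own illustration, not from the exercise) of survival rate by sex:

import matplotlib.pyplot as plt

# Bar chart of survival rate by sex on the training data
# (Sex is 0 = male, 1 = female after the step-4 conversion)
train.groupby("Sex")["Survived"].mean().plot(kind = "bar")
plt.ylabel("Survival rate")
plt.show()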
