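Recap:
The previous post covered free data sources. This post uses one of them for a classic machine learning example: the Iris data set, implemented in Julia.
About Iris (the iris flower data set):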
Edgar Anderson's Iris Data
Description:
This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
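The data set has 5 fields (RDatasets drops the dots, so they appear as SepalLength, SepalWidth and so on in the output at the end of this post):
- Sepal.Length: sepal length in cm
- Sepal.Width: sepal width in cm
- Petal.Length: petal length in cm
- Petal.Width: petal width in cm
- Species: setosa, versicolor or virginica
The machine learning task is to determine the species of an iris from the four measurements: sepal length, sepal width, petal length and petal width.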
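Preparation is the same as in earlier posts:
- Example source: https://github.com/scidom/StatsLearningByExample.jl.git
- It is already synced on Juliabox.com
- Open 07-01-Example-IrisFlowers.ipynb in the 07-MachineLearning directory
- Run all the code
- Note: when you reach the step that maps clusters to species, set the mapping according to the output of the preceding cell
(The run output is included at the end of this post for reference.)
The example uses K-means, one of the clustering algorithms. The key code is summarized below: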
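- First, load the required packages
using Clustering # clustering package
using Gadfly # plotting package
using RDatasets # package with the data sets that ship with R
- Load the iris data set into a data frame
IrisFrame = dataset("datasets", "iris")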
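- This is the key step: calling the K-means algorithm
result = kmeans(convert(Matrix, IrisFrame[:, 1:4])', 3)
K-means needs no target variable. The first argument is a matrix, so convert is used to turn every field except the species (fields 1 through 4) into a matrix. kmeans in Clustering.jl expects one observation per column, so the 150×4 matrix is transposed with ' into a 4×150 matrix:
convert(Matrix, IrisFrame[:, 1:4])'
The second argument is the number of clusters. The data set contains exactly 3 species, so it is set to 3.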
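Clustering.jl's kmeans treats each column of its input as one observation, whereas a data frame keeps one observation per row, so the matrix has to be turned on its side before clustering. The sketch below is plain Julia 1.x and does not touch the real data frame: the small matrix is typed in by hand from the first three rows of the iris table, and the variable names are only illustrative.

```julia
# Three flowers (rows) with four measurements each (columns),
# laid out the way a data frame stores them.
rows_as_samples = [5.1 3.5 1.4 0.2;
                   4.9 3.0 1.4 0.2;
                   4.7 3.2 1.3 0.2]

# Swap the two axes so that each column holds one flower.
cols_as_samples = permutedims(rows_as_samples)

println(size(rows_as_samples))  # (3, 4)
println(size(cols_as_samples))  # (4, 3)
println(cols_as_samples[:, 1])  # [5.1, 3.5, 1.4, 0.2]
```

For a plain numeric matrix, permutedims and the postfix ' give the same result; permutedims is used here only because it also works for non-numeric arrays.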
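- Check whether the kmeans run converged
result.converged
- Map the cluster numbers 1, 2 and 3 returned by the clustering back to the species
# pre-1.0 syntax; Julia 1.x needs Array{AbstractString}(undef, n) and .= for the assignments below
classification = Array(AbstractString, length(result.assignments))
classification[result.assignments .== 2] = "setosa"
classification[result.assignments .== 3] = "versicolor"
classification[result.assignments .== 1] = "virginica"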
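Note: the mapping above has to be set according to what your own run actually produces.
For example, if result.assignments starts with 1 and ends with 2, the code has to be changed to:
classification = Array(AbstractString, length(result.assignments))
classification[result.assignments .== 1] = "setosa" # the number in the first row
classification[result.assignments .== 3] = "versicolor" # the number that is in neither the first nor the last row
classification[result.assignments .== 2] = "virginica" # the number in the last row
Check more than a single row before trusting this shortcut, because an individual flower can land in the wrong cluster. In the sample run at the end of this post the very last row is labeled 3, although 1 is the most common label among the final rows shown.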
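The mapping can also be derived from the data instead of being read off by eye. The following is a minimal sketch in plain Julia 1.x and is not part of the original notebook: majority_label is a hypothetical helper, and the short assignments and species vectors are made-up stand-ins for result.assignments and the Species column. Each cluster number is given the species that occurs most often among its members.

```julia
# Hypothetical helper: the most frequent species among the rows in cluster `c`.
function majority_label(assignments, species, c)
    counts = Dict{String,Int}()
    for (a, s) in zip(assignments, species)
        if a == c
            counts[s] = get(counts, s, 0) + 1
        end
    end
    best_label, best_count = "", 0
    for (label, n) in counts
        if n > best_count
            best_label, best_count = label, n
        end
    end
    return best_label
end

# Made-up stand-ins for result.assignments and the Species column.
assignments = [2, 2, 2, 3, 3, 1, 1, 3, 1]
species = ["setosa", "setosa", "setosa",
           "versicolor", "versicolor", "versicolor",
           "virginica", "virginica", "virginica"]

# Cluster number => species name, for every cluster number that occurs.
mapping = Dict(c => majority_label(assignments, species, c) for c in unique(assignments))

# Translate each cluster number into a species name.
classification = [mapping[a] for a in assignments]
println(mapping)
```

A majority vote can hand the same species to two clusters when the clustering is poor, so it is worth checking that the three names in mapping are all different.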
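This happens because K-means has no target variable and starts from randomly chosen seeds, so the clusters have to be matched to the species by hand afterwards. (K-means only puts similar rows into the same group; what each group should be called takes a quick look at the result.)
[Note] This does not mean that K-means is unstable. Run it a few times: rows that belong together generally end up in the same cluster again, and only the cluster number changes.
See an introduction to the K-means algorithm for details.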
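To make the note above concrete, here is a bare-bones sketch of the standard K-means iteration (Lloyd's algorithm) in plain Julia 1.x. It is a teaching sketch, not the implementation inside Clustering.jl: lloyd_kmeans and its init argument (the columns used as starting centers) are hypothetical names, the six 2-D points are made up, and the starting centers are passed in explicitly to stand in for the random seeding. Starting from the same two points in the opposite order gives the same grouping with the cluster numbers swapped.

```julia
# Hypothetical teaching sketch of Lloyd's algorithm.
# X is d×n (one point per column); init lists the columns used as starting centers.
function lloyd_kmeans(X::Matrix{Float64}, init::Vector{Int}; maxiter::Int = 100)
    n = size(X, 2)
    k = length(init)
    centers = X[:, init]            # indexing copies, so X is left untouched
    assignments = zeros(Int, n)
    for _ in 1:maxiter
        # Assignment step: each point joins its nearest center (squared distance).
        new_assignments = [argmin([sum(abs2, X[:, j] .- centers[:, c]) for c in 1:k])
                           for j in 1:n]
        new_assignments == assignments && break   # nothing moved: converged
        assignments = new_assignments
        # Update step: each center moves to the mean of its members.
        for c in 1:k
            members = findall(a -> a == c, assignments)
            if !isempty(members)
                centers[:, c] = vec(sum(X[:, members], dims = 2)) ./ length(members)
            end
        end
    end
    return assignments, centers
end

# Two made-up, well-separated groups of three points each.
X = [1.0 1.2 0.8 9.0 9.2 8.8;
     1.0 0.9 1.1 9.0 9.1 8.9]

labels_a, _ = lloyd_kmeans(X, [1, 4])   # start from point 1, then point 4
labels_b, _ = lloyd_kmeans(X, [4, 1])   # same starting points, opposite order

println(labels_a)  # [1, 1, 1, 2, 2, 2]
println(labels_b)  # [2, 2, 2, 1, 1, 1]
```

The two runs agree on which points belong together and differ only in which group is called 1 and which is called 2, which is why the names have to be attached after the fact.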
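- Combine the predictions with the original data column-wise
AugmentedIrisFrame = hcat(IrisFrame, classification)
- Then look at the plots to judge the clustering
[Figures: original data (left), predicted data (right)]
By eye, the large majority of points are classified correctly.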
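A small exercise:
The example code does not measure how well K-means performed (for example its accuracy or misclassification rate).
Try computing these yourself.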
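As a starting point for the exercise, here is one possible sketch in plain Julia 1.x. The two short vectors are made-up stand-ins for the Species column and the classification vector, and accuracy and confusion_counts are hypothetical helper names, not functions from any package.

```julia
# Fraction of rows whose predicted label equals the true label.
accuracy(truth, predicted) = count(truth .== predicted) / length(truth)

# Count how often each (true, predicted) pair of labels occurs.
function confusion_counts(truth, predicted)
    counts = Dict{Tuple{String,String},Int}()
    for pair in zip(truth, predicted)
        counts[pair] = get(counts, pair, 0) + 1
    end
    return counts
end

# Made-up stand-ins for the Species column and the classification vector.
truth     = ["setosa", "setosa", "versicolor", "versicolor",
             "virginica", "virginica", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica",
             "virginica", "virginica", "versicolor", "virginica"]

acc = accuracy(truth, predicted)
confusion = confusion_counts(truth, predicted)

println("accuracy = ", acc)        # accuracy = 0.75
println("error rate = ", 1 - acc)  # error rate = 0.25
println(confusion[("virginica", "versicolor")])  # 1
```

On the real data the same two helpers would be applied to the Species column and the label column of the combined data frame, after converting both to plain string vectors.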
Example 07-01: Iris Flowers
In [15]:
using Clustering
using Gadfly
using RDatasets
In [16]:
IrisFrame = dataset("datasets", "iris")
Out[16]:
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |
11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa |
12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa |
13 | 4.8 | 3.0 | 1.4 | 0.1 | setosa |
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa |
15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa |
16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa |
17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa |
18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa |
21 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
22 | 5.1 | 3.7 | 1.5 | 0.4 | setosa |
23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
24 | 5.1 | 3.3 | 1.7 | 0.5 | setosa |
25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa |
26 | 5.0 | 3.0 | 1.6 | 0.2 | setosa |
27 | 5.0 | 3.4 | 1.6 | 0.4 | setosa |
28 | 5.2 | 3.5 | 1.5 | 0.2 | setosa |
29 | 5.2 | 3.4 | 1.4 | 0.2 | setosa |
30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
In [17]:
plot(IrisFrame, x="PetalWidth", y="SepalWidth", color="Species", Geom.point)
Out[17]:
In [18]:
result = kmeans(convert(Matrix, IrisFrame[:, 1:4])', 3)
Out[18]:
Clustering.KmeansResult{Float64}([6.85385 5.006 5.88361; 3.07692 3.428 2.74098; 5.71538 1.462 4.38852; 2.05385 0.246 1.43443], [2, 2, 2, 2, 2, 2, 2, 2, 2, 2 … 1, 1, 3, 1, 1, 1, 3, 1, 1, 3], [0.01998, 0.20038, 0.17398, 0.27598, 0.03558, 0.45838, 0.17238, 0.00438, 0.65198, 0.14158 … 0.157337, 0.441953, 0.731626, 0.112722, 0.272722, 0.355799, 0.822118, 0.399645, 0.691953, 0.7072], [39, 50, 61], [39.0, 50.0, 61.0], 78.85566582597737, 6, true)
In [19]:
result.converged
Out[19]:
true
In [20]:
result.assignments
Out[20]:
150-element Array{Int64,1}:
2
2
2
2
2
2
2
2
2
2
2
2
2
⋮
3
1
1
1
3
1
1
1
3
1
1
3
In [26]:
classification = Array(AbstractString, length(result.assignments))
classification[result.assignments .== 2] = "setosa"
classification[result.assignments .== 1] = "versicolor"
classification[result.assignments .== 3] = "virginica"
Out[26]:
"virginica"
In [27]:
AugmentedIrisFrame = hcat(IrisFrame, classification)
Out[27]:
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species | x1 |
---|---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa | setosa |
7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa | setosa |
8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa | setosa |
9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | setosa |
10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa | setosa |
11 | 5.4 | 3.7 | 1.5 | 0.2 | setosa | setosa |
12 | 4.8 | 3.4 | 1.6 | 0.2 | setosa | setosa |
13 | 4.8 | 3.0 | 1.4 | 0.1 | setosa | setosa |
14 | 4.3 | 3.0 | 1.1 | 0.1 | setosa | setosa |
15 | 5.8 | 4.0 | 1.2 | 0.2 | setosa | setosa |
16 | 5.7 | 4.4 | 1.5 | 0.4 | setosa | setosa |
17 | 5.4 | 3.9 | 1.3 | 0.4 | setosa | setosa |
18 | 5.1 | 3.5 | 1.4 | 0.3 | setosa | setosa |
19 | 5.7 | 3.8 | 1.7 | 0.3 | setosa | setosa |
20 | 5.1 | 3.8 | 1.5 | 0.3 | setosa | setosa |
21 | 5.4 | 3.4 | 1.7 | 0.2 | setosa | setosa |
22 | 5.1 | 3.7 | 1.5 | 0.4 | setosa | setosa |
23 | 4.6 | 3.6 | 1.0 | 0.2 | setosa | setosa |
24 | 5.1 | 3.3 | 1.7 | 0.5 | setosa | setosa |
25 | 4.8 | 3.4 | 1.9 | 0.2 | setosa | setosa |
26 | 5.0 | 3.0 | 1.6 | 0.2 | setosa | setosa |
27 | 5.0 | 3.4 | 1.6 | 0.4 | setosa | setosa |
28 | 5.2 | 3.5 | 1.5 | 0.2 | setosa | setosa |
29 | 5.2 | 3.4 | 1.4 | 0.2 | setosa | setosa |
30 | 4.7 | 3.2 | 1.6 | 0.2 | setosa | setosa |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
In [28]:
AugmentedIrisFrame[120:150,:]
Out[28]:
Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species | x1 |
---|---|---|---|---|---|---|
1 | 6.0 | 2.2 | 5.0 | 1.5 | virginica | virginica |
2 | 6.9 | 3.2 | 5.7 | 2.3 | virginica | versicolor |
3 | 5.6 | 2.8 | 4.9 | 2.0 | virginica | virginica |
4 | 7.7 | 2.8 | 6.7 | 2.0 | virginica | versicolor |
5 | 6.3 | 2.7 | 4.9 | 1.8 | virginica | virginica |
6 | 6.7 | 3.3 | 5.7 | 2.1 | virginica | versicolor |
7 | 7.2 | 3.2 | 6.0 | 1.8 | virginica | versicolor |
8 | 6.2 | 2.8 | 4.8 | 1.8 | virginica | virginica |
9 | 6.1 | 3.0 | 4.9 | 1.8 | virginica | virginica |
10 | 6.4 | 2.8 | 5.6 | 2.1 | virginica | versicolor |
11 | 7.2 | 3.0 | 5.8 | 1.6 | virginica | versicolor |
12 | 7.4 | 2.8 | 6.1 | 1.9 | virginica | versicolor |
13 | 7.9 | 3.8 | 6.4 | 2.0 | virginica | versicolor |
14 | 6.4 | 2.8 | 5.6 | 2.2 | virginica | versicolor |
15 | 6.3 | 2.8 | 5.1 | 1.5 | virginica | virginica |
16 | 6.1 | 2.6 | 5.6 | 1.4 | virginica | versicolor |
17 | 7.7 | 3.0 | 6.1 | 2.3 | virginica | versicolor |
18 | 6.3 | 3.4 | 5.6 | 2.4 | virginica | versicolor |
19 | 6.4 | 3.1 | 5.5 | 1.8 | virginica | versicolor |
20 | 6.0 | 3.0 | 4.8 | 1.8 | virginica | virginica |
21 | 6.9 | 3.1 | 5.4 | 2.1 | virginica | versicolor |
22 | 6.7 | 3.1 | 5.6 | 2.4 | virginica | versicolor |
23 | 6.9 | 3.1 | 5.1 | 2.3 | virginica | versicolor |
24 | 5.8 | 2.7 | 5.1 | 1.9 | virginica | virginica |
25 | 6.8 | 3.2 | 5.9 | 2.3 | virginica | versicolor |
26 | 6.7 | 3.3 | 5.7 | 2.5 | virginica | versicolor |
27 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | versicolor |
28 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | virginica |
29 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | versicolor |
30 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | versicolor |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
In [29]:
levels(AugmentedIrisFrame[:x1])
Out[29]:
3-element Array{AbstractString,1}:
"setosa"
"versicolor"
"virginica"
In [31]:
plot(AugmentedIrisFrame, x="PetalWidth", y="SepalWidth", color="x1", Geom.point)
Out[31]:
KevinZhang
Sep 1, 2018