4.1
A decision tree splits the data on attribute values, and every leaf node corresponds to a class. The algorithm stops splitting in three situations: (1) all samples at the current node belong to the same class, so no split is needed; (2) the attribute set is empty, or all samples take identical values on every remaining attribute, so no split is possible; (3) the sample set at the current node is empty, so no split can be made.
The question assumes the data contain no conflicting samples, so identical feature vectors always carry the same label. This matches situation (1): samples with identical attributes end up in the same node with the same label and need no further splitting. A leaf that contains samples of a single class contributes no training error, so a decision tree consistent with the training set exists and its training error is 0.
If conflicting data did exist, samples with exactly the same attribute values (but different labels) would eventually reach situation (2): all attribute values are identical and no further split is possible. Such a node is labelled with its majority class, and the minority samples become training errors.
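A quick numerical check (my own sketch, not part of the original answer; it assumes scikit-learn is available): on conflict-free data an unrestricted decision tree fits the training set perfectly.

from sklearn.tree import DecisionTreeClassifier

# toy conflict-free data: identical feature vectors always carry the same label
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

clf = DecisionTreeClassifier()   # no depth or size restriction
clf.fit(X, y)
print(clf.score(X, y))           # training accuracy: 1.0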
4.2
The minimum-training-error criterion only minimises the error on the training set. With a finite sample, if node splits are chosen purely to minimise training error, the error keeps shrinking as the number of nodes approaches the number of samples, and a tree trained this way overfits the training data and lacks generalisation ability.
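To see the effect concretely, a small sketch (my own, assuming scikit-learn and its bundled Iris data; the exact numbers depend on the random split): the unrestricted tree drives its training error to zero, but its test accuracy is typically no better than that of a depth-limited tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# grown until the training error is minimal vs. restricted to depth 2
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

print('full  tree: train %.2f  test %.2f' % (full.score(X_tr, y_tr), full.score(X_te, y_te)))
print('small tree: train %.2f  test %.2f' % (small.score(X_tr, y_tr), small.score(X_te, y_te)))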
4.3
The code below is adapted from an implementation I found online; after working through it I added comments to the functions and steps I was initially unsure about.
import collections
from math import log
import operator
# Shared helper: information entropy (Shannon entropy) of a data set
def calcShannonEnt(dataSet):
    """
    :param dataSet: the data set; the last element of each sample is its class label
    :return: the Shannon entropy of the class labels
    """
    # total number of samples
    numEntries = len(dataSet)
    # counts per class label; defaultdict(int) returns 0 for a key that has not been seen yet
    labelCounts = collections.defaultdict(int)
    # count how many samples fall into each class
    for featVec in dataSet:
        # the class label is the last element of the sample
        currentLabel = featVec[-1]
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        # proportion of this class among all samples
        prob = float(labelCounts[key]) / numEntries
        # accumulate -p_k * log2(p_k)
        shannonEnt -= prob * log(prob, 2)
    # Ent(D) = -sum_k p_k * log2(p_k)
    return shannonEnt
'''
Handling continuous features
'''
# Split the data set into "not greater than value" and "greater than value" parts
def splitDataSetForSeries(dataSet, axis, value):
    """
    :param dataSet: data set to split
    :param axis: index of the continuous feature
    :param value: split threshold
    :return: (samples with feature <= value, samples with feature > value)
    """
    # samples not greater than the threshold
    eltDataSet = []
    # samples greater than the threshold
    gtDataSet = []
    # split; the continuous feature itself is kept in both subsets
    for feat in dataSet:
        if feat[axis] <= value:
            eltDataSet.append(feat)
        else:
            gtDataSet.append(feat)
    return eltDataSet, gtDataSet
# Information gain of a continuous feature
def calcInfoGainForSeries(dataSet, i, baseEntropy):
    """
    :param dataSet: the whole data set
    :param i: index of the continuous feature
    :param baseEntropy: entropy of the whole data set
    :return: the best information gain and the corresponding split point
    """
    # best information gain found so far
    maxInfoGain = 0.0
    # best split point found so far
    bestMid = -1
    # all values of this feature
    featList = [example[i] for example in dataSet]
    # all class labels
    classList = [example[-1] for example in dataSet]
    # zip the two lists into (value, label) pairs and turn them into a dict
    dictList = dict(zip(featList, classList))
    # sort by the continuous value, ascending
    sortedFeatList = sorted(dictList.items(), key=operator.itemgetter(0))
    # number of distinct values
    numberForFeatList = len(sortedFeatList)
    # candidate split points: midpoints of adjacent values, rounded to 3 decimals
    midFeatList = [round((sortedFeatList[i][0] + sortedFeatList[i + 1][0]) / 2.0, 3)
                   for i in range(numberForFeatList - 1)]
    # information gain of every candidate split point
    for mid in midFeatList:
        # split into "<= mid" and "> mid"
        eltDataSet, gtDataSet = splitDataSetForSeries(dataSet, i, mid)
        # weighted entropy of the two parts (weights are |Dv| / |D|)
        newEntropy = (len(eltDataSet) / len(dataSet) * calcShannonEnt(eltDataSet)
                      + len(gtDataSet) / len(dataSet) * calcShannonEnt(gtDataSet))
        # information gain of this split
        infoGain = baseEntropy - newEntropy
        # print('split point: ' + str(mid) + ', information gain: ' + str(infoGain))
        if infoGain > maxInfoGain:
            bestMid = mid
            maxInfoGain = infoGain
    # return the largest information gain and its split point
    return maxInfoGain, bestMid
'''
Handling discrete features
'''
# Return the samples whose feature `axis` equals `value`, with that feature removed
def splitDataSet(dataSet, axis, value):
    """
    :param dataSet: data set
    :param axis: index of the feature
    :param value: only samples whose feature equals this value are returned
    :return: the matching samples with the feature column removed
    """
    # build a new list so the original data set is not modified
    retDataSet = []
    # walk over the whole data set
    for featVec in dataSet:
        # the sample takes the requested value on this feature
        if featVec[axis] == value:
            # keep everything before the feature ...
            reducedFeatVec = featVec[:axis]
            # ... and everything after it, i.e. drop the feature column itself
            reducedFeatVec.extend(featVec[axis + 1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
# Information gain of a discrete feature
def calcInfoGain(dataSet, featList, i, baseEntropy):
    """
    :param dataSet: data set
    :param featList: values of the current feature
    :param i: index of the current feature
    :param baseEntropy: entropy of the whole data set
    :return: the information gain of splitting on feature i
    """
    # distinct values of the feature
    uniqueVals = set(featList)
    # conditional entropy after splitting on this feature
    newEntropy = 0.0
    # walk over all possible values of the feature
    for value in uniqueVals:
        # samples taking this value on feature i
        subDataSet = splitDataSet(dataSet=dataSet, axis=i, value=value)
        # weight of the subset
        prob = len(subDataSet) / float(len(dataSet))
        # weighted entropy of the subset
        newEntropy += prob * calcShannonEnt(subDataSet)
    # information gain = Ent(D) - sum_v |Dv|/|D| * Ent(Dv)
    infoGain = baseEntropy - newEntropy
    return infoGain
# Return the class label that occurs most often
def majorityCnt(classList):
    """
    :param classList: list of class labels
    :return: the most frequent label
    """
    # vote count per label
    classCount = collections.defaultdict(int)
    # count every label
    for vote in classList:
        classCount[vote] += 1
    # sort by count, descending
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # the most frequent label
    return sortedClassCount[0][0]
'''
Shared
'''
# Choose the best feature to split on by information gain; continuous features are handled too
def chooseBestFeatureToSplit(dataSet, labels):
    """
    :param dataSet: data set
    :return: index of the best feature; for a continuous feature, (index, split point)
    """
    # number of features = length of a sample minus the class label
    numFeatures = len(dataSet[0]) - 1
    # entropy of the whole data set
    baseEntropy = calcShannonEnt(dataSet)
    # best information gain so far
    bestInfoGain = 0.0
    # best feature so far
    bestFeature = -1
    # whether the current best feature is continuous: 1 continuous, 0 discrete
    flagSeries = 0
    # split point, used when the best feature is continuous
    bestSeriesMid = 0.0
    # compute the information gain of every feature
    for i in range(numFeatures):
        # all values of feature i
        featList = [example[i] for example in dataSet]
        # string values mean a discrete feature, otherwise the feature is continuous
        if isinstance(featList[0], str):
            infoGain = calcInfoGain(dataSet, featList, i, baseEntropy)
        else:
            # bestMid is the best split point of this continuous feature
            infoGain, bestMid = calcInfoGainForSeries(dataSet, i, baseEntropy)
        # keep the feature with the largest information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
            flagSeries = 0
            if not isinstance(dataSet[0][bestFeature], str):
                flagSeries = 1
                bestSeriesMid = bestMid
    # print('feature with the largest information gain: ' + labels[bestFeature])
    if flagSeries:
        return bestFeature, bestSeriesMid
    else:
        return bestFeature
# Build the decision tree
def createTree(dataSet, labels):
    """
    :param dataSet: data set
    :param labels: feature names
    :return: the tree as nested dicts
    """
    # class labels of all samples
    classList = [example[-1] for example in dataSet]
    # if every sample carries the same label, stop splitting and return that label
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # if only the class label is left in each sample, all features have been used up;
    # return the majority label
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    # choose the best feature to split on
    bestFeat = chooseBestFeatureToSplit(dataSet=dataSet, labels=labels)
    # label of this node
    bestFeatLabel = ''
    # whether this split is continuous (1) or discrete (0)
    flagSeries = 0
    # split point, used when the split is continuous
    midSeries = 0.0
    # a tuple means the chosen feature is continuous: (index, split point)
    if isinstance(bestFeat, tuple):
        # node label, e.g. '密度不大于0.381?' ('不大于' means "not greater than")
        bestFeatLabel = str(labels[bestFeat[0]]) + '不大于' + str(bestFeat[1]) + '?'
        # split point
        midSeries = bestFeat[1]
        # feature index
        bestFeat = bestFeat[0]
        flagSeries = 1
    else:
        # discrete feature: the node is labelled with the feature name
        bestFeatLabel = labels[bestFeat]
        flagSeries = 0
    # the tree is stored as a dict keyed by the node label
    myTree = {bestFeatLabel: {}}
    # all values of the chosen feature (used by the discrete branch below)
    featValues = [example[bestFeat] for example in dataSet]
    # continuous feature
    if flagSeries:
        # split into "<= split point" and "> split point"
        eltDataSet, gtDataSet = splitDataSetForSeries(dataSet, bestFeat, midSeries)
        # remaining feature names (a continuous feature may be reused, so nothing is deleted)
        subLabels = labels[:]
        # recurse on the "<=" part ('不大于')
        subTree = createTree(eltDataSet, subLabels)
        myTree[bestFeatLabel]['不大于'] = subTree
        # recurse on the ">" part ('大于')
        subTree = createTree(gtDataSet, subLabels)
        myTree[bestFeatLabel]['大于'] = subTree
        return myTree
    # discrete feature
    else:
        # remove the chosen feature from the feature-name list
        del(labels[bestFeat])
        # distinct values of the feature
        uniqueVals = set(featValues)
        # one branch per value
        for value in uniqueVals:
            # remaining feature names
            subLabels = labels[:]
            # recurse on the samples taking this value (the feature column is removed first)
            subTree = createTree(splitDataSet(dataSet=dataSet, axis=bestFeat, value=value), subLabels)
            # attach the subtree under this branch value
            myTree[bestFeatLabel][value] = subTree
        return myTree
# Build the sample data set
def createDataSet():
    """
    Create the sample data set; the last two features (density, sugar content) are continuous.
    :return: (data set, feature names)
    """
    dataSet = [
        # 1
        ['青绿', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.697, 0.460, '好瓜'],
        # 2
        ['乌黑', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', 0.774, 0.376, '好瓜'],
        # 3
        ['乌黑', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.634, 0.264, '好瓜'],
        # 4
        ['青绿', '蜷缩', '沉闷', '清晰', '凹陷', '硬滑', 0.608, 0.318, '好瓜'],
        # 5
        ['浅白', '蜷缩', '浊响', '清晰', '凹陷', '硬滑', 0.556, 0.215, '好瓜'],
        # 6
        ['青绿', '稍蜷', '浊响', '清晰', '稍凹', '软粘', 0.403, 0.237, '好瓜'],
        # 7
        ['乌黑', '稍蜷', '浊响', '稍糊', '稍凹', '软粘', 0.481, 0.149, '好瓜'],
        # 8
        ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '硬滑', 0.437, 0.211, '好瓜'],
        # ----------------------------------------------------
        # 9
        ['乌黑', '稍蜷', '沉闷', '稍糊', '稍凹', '硬滑', 0.666, 0.091, '坏瓜'],
        # 10
        ['青绿', '硬挺', '清脆', '清晰', '平坦', '软粘', 0.243, 0.267, '坏瓜'],
        # 11
        ['浅白', '硬挺', '清脆', '模糊', '平坦', '硬滑', 0.245, 0.057, '坏瓜'],
        # 12
        ['浅白', '蜷缩', '浊响', '模糊', '平坦', '软粘', 0.343, 0.099, '坏瓜'],
        # 13
        ['青绿', '稍蜷', '浊响', '稍糊', '凹陷', '硬滑', 0.639, 0.161, '坏瓜'],
        # 14
        ['浅白', '稍蜷', '沉闷', '稍糊', '凹陷', '硬滑', 0.657, 0.198, '坏瓜'],
        # 15
        ['乌黑', '稍蜷', '浊响', '清晰', '稍凹', '软粘', 0.360, 0.370, '坏瓜'],
        # 16
        ['浅白', '蜷缩', '浊响', '模糊', '平坦', '硬滑', 0.593, 0.042, '坏瓜'],
        # 17
        ['青绿', '蜷缩', '沉闷', '稍糊', '稍凹', '硬滑', 0.719, 0.103, '坏瓜']
    ]
    # feature names
    labels = ['色泽', '根蒂', '敲击', '纹理', '脐部', '触感', '密度', '含糖率']
    return dataSet, labels
# main
if __name__ == '__main__':
    # build the decision tree (with continuous features handled) and print it
    dataSet, labels = createDataSet()
    myTree = createTree(dataSet, labels)
    print(myTree)
Output: (the tree is printed as a nested dict)
4.4
This, too, is adapted from code found online. Its pre-pruning and post-pruning do not use the Gini value as the criterion; instead they compare the prediction error rate on a held-out validation set before and after a split.
import numpy as np
dataset = np.array([[0,0,0,0,0,0,1],
[1,0,1,0,0,0,1],
[1,0,0,0,0,0,1],
[0,0,1,0,0,0,1],
[2,0,0,0,0,0,1],
[0,1,0,0,1,1,1],
[1,1,0,1,1,1,1],
[1,1,0,0,1,0,1],
[1,1,1,1,1,0,0],
[0,2,2,0,2,1,0],
[2,2,2,2,2,0,0],
[2,0,0,2,2,1,0],
[0,1,0,1,0,0,0],
[2,1,1,1,0,0,0],
[1,1,0,0,1,1,0],
[2,0,0,2,2,0,0],
[0,0,1,1,1,0,0]]
)
# training data: the whole data set minus the held-out validation rows
train_dataset = []
a = set(range(17))                 # set() builds an unordered collection of unique elements
b = {3, 4, 7, 8, 10, 11, 12}       # indices held out as the validation (test) set
c = a - b
for i in list(c):
    train_dataset.append(dataset[i])
train_dataset = np.array(train_dataset)   # convert to an array
# validation (test) data
test_dataset = []
for i in [3, 4, 7, 8, 10, 11, 12]:
    test_dataset.append(dataset[i])
test_dataset = np.array(test_dataset)
# Decision tree built with the Gini index on the discretised watermelon data (CART-style)
# Gini value of a data set (two classes: 1 = good melon, 0 = bad melon)
def calGini(dataset):
    count1 = 0                     # number of samples labelled 1 (good melons)
    count0 = 0                     # number of samples labelled 0 (bad melons)
    numdataset = len(dataset)
    for ai_dataset in dataset:
        ai_label = ai_dataset[-1]
        if ai_label == 1:
            count1 += 1
        else:
            count0 += 1
    prob1 = float(count1) / numdataset
    prob0 = float(count0) / numdataset
    Gini = 1 - (prob1 ** 2 + prob0 ** 2)
    return Gini
# Gini index of feature i: the Gini value of each subset weighted by its proportion
def calAGain(dataset, i):          # i is the index of the feature
    dataset = np.array(dataset)
    a_dataset = dataset[:, i]      # column i of the data
    values = list(set(a_dataset))  # all values taken by feature i
    Dv_Gini = 0
    # weight each value by the proportion of samples taking it
    for ai in values:
        ai_dataset = []
        for j in range(len(dataset)):
            if ai == a_dataset[j]:
                ai_dataset.append(dataset[j])
        num_ai = len(ai_dataset)
        Gini = calGini(ai_dataset)
        prob = float(num_ai) / len(dataset)
        Dv_Gini += prob * Gini
    return Dv_Gini
# Choose the feature with the smallest Gini index
def choosenbestGini(dataset):
    numA = len(dataset[0])
    a_Gini_dict = {}
    for i in range(numA - 1):
        a_Gini_dict[i] = calAGain(dataset, i)
    a_Gini_list = sorted(a_Gini_dict.items(), key=lambda x: x[1])[0]
    min_a_Gini_i = a_Gini_list[0]
    min_a_Gini_value = a_Gini_list[1]
    return min_a_Gini_i, min_a_Gini_value   # index of the best feature and its Gini index
# Remove the parent node's feature column from the samples matching its value
def splitdataset(dataset, axis, value):
    retdataset = []
    for featvec in dataset:
        if featvec[axis] == value:
            reducefeatvec = np.hstack((featvec[:axis], featvec[axis + 1:]))
            retdataset.append(reducefeatvec)
    return retdataset
# feature names
labels = ['color', 'root', 'sound', 'textile', 'belly', 'feel']
# Build the decision tree
def TREE(dataset, labels):
    classlist = [example[-1] for example in dataset]
    # if all samples at this node belong to one class, stop splitting and return that class
    if classlist.count(classlist[0]) == len(classlist):
        return classlist[0]
    bestA, bestGini = choosenbestGini(dataset)
    bestfeatlabel = labels[bestA]
    mytree = {bestfeatlabel: {}}
    del(labels[bestA])
    featvalues = [example[bestA] for example in dataset]
    uniquevals = set(featvalues)
    for value in uniquevals:
        sublabels = labels[:]
        mytree[bestfeatlabel][value] = TREE(splitdataset(dataset, bestA, value), sublabels)
    return mytree
a = TREE(train_dataset, labels)
#print(a)
#{'color': {0: {'sound':
# {0: 1,
# 1: 0,
# 2: 0}},
# 1: {'root':
# {0: 1,
# 1: {
# 'textile':
# {0: 0,
# 1: 1}}}},
# 2: 0}}
# A classifier written out by hand from the generated tree; with a tree this shallow it can be
# coded directly, a deeper tree would need a generic tree walker instead.
# map feature names to their column indices
labels = ['color', 'root', 'sound', 'textile', 'belly', 'feel']
labels_dict = {}
for i in labels:
    labels_dict[i] = labels.index(i)
def train_test_label(dataset):
    train_test_label_list = []
    for vector in dataset:
        if vector[labels_dict['color']] == 0:
            if vector[labels_dict['sound']] == 0:
                train_test_label_list.append(1)
            else:
                train_test_label_list.append(0)
        elif vector[labels_dict['color']] == 1:
            if vector[labels_dict['root']] == 0:
                train_test_label_list.append(1)
            elif vector[labels_dict['root']] == 1:
                if vector[labels_dict['textile']] == 0:
                    train_test_label_list.append(0)
                elif vector[labels_dict['textile']] == 1:
                    # the generated tree predicts good (1) for textile == 1
                    train_test_label_list.append(1)
                else:
                    train_test_label_list.append(5)   # 5 marks a value the tree never saw
            else:
                train_test_label_list.append(5)
        else:
            train_test_label_list.append(0)
    return train_test_label_list
# predictions of the tree on the validation set
test_result0 = train_test_label(test_dataset)
# accuracy: fraction of predictions that match the true labels
def calrightrate(test_result, real_result):
    num = len(test_result)             # number of validation samples
    count = 0
    for i in range(num):
        if test_result[i] == real_result[i]:
            count += 1
    right_rate = float(count) / num
    return right_rate
# Pre-pruning
# Validation accuracy if the root is not split at all and is simply marked as class 1 (good)
num = len(test_dataset)
num0floor = list(test_dataset[:, -1]).count(1)   # validation samples whose true label is 1
rightrate = float(num0floor) / num
# If the root is split on 'color', each branch should be labelled with the majority class of the
# training samples taking that value; here the labels are taken as color 0 -> 1, 1 -> 1, 2 -> 0
# (good, good, bad) without recomputing the counts in code.
# Validation accuracy after this split:
num1floor = 0
for vector in test_dataset:
    if vector[labels_dict['color']] == 0 and vector[-1] == 1:
        num1floor += 1
    elif vector[labels_dict['color']] == 1 and vector[-1] == 1:
        num1floor += 1
    elif vector[labels_dict['color']] == 2 and vector[-1] == 0:
        num1floor += 1
right1rate = float(num1floor) / num
def cnt(right0, right1):
    if right1 <= right0:
        return '保留'   # splitting does not raise the validation accuracy
    else:
        return '剪去'   # splitting raises the validation accuracy
print(cnt(rightrate, right1rate))
# Post-pruning
# A rather slow, hand-written approach; pruning a single branch already pushes the accuracy up
# sharply (mainly because the data set is so small).
def latejudge(dataset):
    all_test_result = np.array(train_test_label(dataset))   # predictions of the unpruned tree
    real_result = dataset[:, -1]                             # true labels (last column)
    # accuracy before pruning
    right_rate = calrightrate(all_test_result, real_result)
    print(right_rate)
    # collect the validation samples that fall into the branch color == 0, root == 1
    judge_dataset = []
    label_num = []
    m = 0
    for vector in dataset:
        m += 1
        if vector[labels_dict['color']] == 0 and vector[labels_dict['root']] == 1:
            judge_dataset.append(vector)
            label_num.append(m - 1)
    # the majority class among those samples becomes the label of the pruned branch
    branch_labels = [v[-1] for v in judge_dataset]
    count0 = branch_labels.count(0)
    count1 = branch_labels.count(1)
    if count0 > count1:
        a = 0
    elif count0 == count1:
        a = np.random.randint(0, 2)   # break ties at random
    else:
        a = 1
    # replace the predictions inside the pruned branch with that label
    pruned_result = list(all_test_result)
    for n in label_num:
        pruned_result[n] = a
    cnt1tree_rate = calrightrate(pruned_result, real_result)   # accuracy after pruning
    if cnt1tree_rate > right_rate:
        print(cnt1tree_rate)
latejudge(test_dataset)
4.5
Logistic regression is used to produce the split at each node. Experimental data:
编号,色泽,根蒂,敲声,纹理,脐部,触感,密度,含糖率,好瓜
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否
Code:
import plotTree
import numpy as np
import pandas as pd
# possible values of every discrete attribute and of the class
D_keys = {
    '色泽': ['青绿', '乌黑', '浅白'],
    '根蒂': ['蜷缩', '硬挺', '稍蜷'],
    '敲声': ['清脆', '沉闷', '浊响'],
    '纹理': ['稍糊', '模糊', '清晰'],
    '脐部': ['凹陷', '稍凹', '平坦'],
    '触感': ['软粘', '硬滑'],
    '好瓜': ['否', '是'],
}
class_name = '好瓜'
# columns fed to the logistic regression ('b' is the bias term)
names = ['色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '密度', '含糖率', 'b']
def p1_function(xi, beta):
    # p(y = 1 | x) = exp(beta^T x) / (1 + exp(beta^T x))
    beta_T_x = np.dot(beta.T, xi)
    exp_beta_T_x = np.exp(beta_T_x)
    return exp_beta_T_x / (1 + exp_beta_T_x)
def l_function(beta, x, y):
    # value of Eq. (3.27): a convex, differentiable function of beta
    result = 0
    for xi, yi in zip(x, y):
        xi = xi.reshape(xi.shape[0], 1)      # column vector
        beta_T_x = np.dot(beta.T, xi)        # beta^T * x
        exp_beta_T_x = np.exp(beta_T_x)
        result += -yi * beta_T_x + np.log(1 + exp_beta_T_x)
    return result
# Solve Eq. (3.27) for beta with Newton's method
def run(x, y, iterate=100):
    # beta is a column vector; the last entry plays the role of the bias
    beta = np.zeros((x.shape[1], 1))
    beta[-1] = 1
    old_l = 0      # value of Eq. (3.27) at the previous iteration
    cur_iter = 0
    while cur_iter < iterate:
        cur_iter += 1
        cur_l = l_function(beta, x, y)
        # stop once the objective changes by less than 1e-4
        if np.abs(cur_l - old_l) <= 10e-5:
            break
        # print(cur_l)
        old_l = cur_l
        d1_beta, d2_beta = 0, 0
        # Newton update of beta
        for xi, yi in zip(x, y):
            xi = xi.reshape(xi.shape[0], 1)
            p1 = p1_function(xi, beta)
            # first derivative with respect to beta
            d1_beta -= np.dot(xi, yi - p1)
            # second derivative (Hessian) with respect to beta
            d2_beta += np.dot(xi, xi.T) * p1 * (1 - p1)
        # print(beta)
        try:
            beta = beta - np.dot(np.linalg.inv(d2_beta), d1_beta)
        except Exception:
            break
    return beta
# read the data
def loadData(filename):
    dataSet = pd.read_csv(filename)
    return dataSet
# preprocessing: map every discrete value to an integer and add a bias column
def processData(dataSet):
    dataSet['b'] = 1
    for key, value in D_keys.items():
        dataSet[key].replace(value, list(range(len(value))), inplace=True)
    # dataSet.drop(columns=['编号'], inplace=True)
# do all samples in D take the same value on every attribute?
def same_value(D):
    for key in D.columns:
        # the ID column and the class column are not attributes
        if key in ('编号', class_name):
            continue
        if len(D[key].value_counts()) > 1:
            return False
    return True
# a leaf node is labelled with the class that appears most often in D
def choose_largest_example(D):
    Count = D[class_name].value_counts()
    Max, Max_key = -1, None
    for key, value in Count.items():
        if value > Max:
            Max = value
            Max_key = key
    return D_keys[class_name][Max_key]
# TreeGenerate builds the tree recursively; the recursion returns when
# 1. all samples at the current node belong to one class
# 2. the attribute set is empty, or all samples take the same values on every attribute
# 3. the sample set at the current node is empty
def TreeGenerate(D):
    Count = D[class_name].value_counts()
    # case 1: only one class left
    if len(Count) == 1:
        key = Count.keys()[0]
        return D_keys[class_name][key]
    # case 2: all samples identical on the attributes
    if same_value(D):
        return choose_largest_example(D)
    # fit a logistic regression on the attributes; its prediction splits the node in two
    x = np.array(D[names])
    y = np.array(D[class_name])
    beta = run(x, y, 1)
    node, Dvs_index = {}, [[], []]
    for i, xi, yi in zip(D['编号'].values, x, y):
        p1 = p1_function(xi, beta)
        # predicted bad melon
        if p1 < 0.5:
            Dvs_index[0].append(i)
        # predicted good melon
        else:
            Dvs_index[1].append(i)
    # if one branch would receive every sample, stop recursing below this node
    flag = max([len(Dv_index) for Dv_index in Dvs_index]) < D.shape[0]
    for index, Dv_index in enumerate(Dvs_index):
        Dv = D.loc[D['编号'].isin(Dv_index)]
        if flag:
            node[index] = TreeGenerate(Dv)
        else:
            node[index] = choose_largest_example(D)
    # plotTree.plotTree(Tree)
    # the node is keyed by the fitted beta, stored as a tuple of coefficients
    l = tuple(value[0] for value in beta)
    return {l: node}
if __name__ == '__main__':
    # read the data
    filename = 'data_3.txt'
    dataSet = loadData(filename)
    processData(dataSet)
    Tree = TreeGenerate(dataSet)
    print(Tree)
    plotTree.createPlot(Tree)
The code is adapted from an online implementation. It relies on a plotTree module for drawing the tree; its code is attached below (it is also easy to find online). When I ran it, only the dictionary form of the tree was printed; drawing the figure was extremely slow and never finished, possibly because of the huge figure size used in createPlot rather than a hardware problem. Output:
Appendix: the plotTree code:
import matplotlib.pyplot as plt
import matplotlib as mpl
# text-box and arrow styles for decision nodes, leaf nodes and branches
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")
mpl.rcParams['font.sans-serif'] = ['SimHei']   # default font that can display Chinese
mpl.rcParams['axes.unicode_minus'] = False     # keep minus signs from rendering as boxes
def plotMidText(cntrPt, parentPt, txtString):
    # put the branch label halfway between parent and child
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString)
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    # draw a node with an arrow pointing to it from its parent
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords="axes fraction", xytext=centerPt,
                            textcoords="axes fraction", va="center", ha="center",
                            bbox=nodeType, arrowprops=arrow_args)
def getNumLeafs(myTree):
    # number of leaf nodes
    numLeafs = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs
def getTreeDepth(myTree):
    # depth of the tree
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth
def retrieveTree(i):
    # two predefined trees, useful for testing the plotting functions
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}]
    return listOfTrees[i]
def plotTree(myTree, parentPt, nodeTxt):
    numLeafs = getNumLeafs(myTree)
    getTreeDepth(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD
def createPlot(inTree):
    # a 10 x 6 inch figure; extremely large figsize values (such as 600 x 30) make rendering crawl
    fig = plt.figure(1, figsize=(10, 6), facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
4.6
The code is adapted from an online implementation. It trains classifiers with three different criteria on the same data set and compares the accuracy with and without pre-pruning.
The original code defined its own function for splitting the data set; parts of it were unclear to me, so I commented it out and used sklearn's random train/test split instead. Compared with the original split, the accuracy without pruning drops slightly, while the accuracy with pruning actually improves a little.
The data come from the UCI repository (http://archive.ics.uci.edu/ml/index.php); I use the popular Iris data set:
http://archive.ics.uci.edu/ml/machine-learning-databases/iris/
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']
# read the data
def loadData(filename):
    dataSet = pd.read_csv(filename, names=names)
    return dataSet
# train one classifier and report its accuracy on the test set
def run(data_train, data_test, criterion='gini', min_impurity_decrease=0):
    train_data, train_target = data_train[names[:-1]], data_train[names[-1]]
    test_data, test_target = data_test[names[:-1]], data_test[names[-1]]
    # criterion: 'gini', 'entropy' (decision trees) or 'LogisticRegression'
    # min_impurity_decrease: > 0 enables a simple pre-pruning; the default 0 means no pruning
    # logistic regression
    if criterion == 'LogisticRegression':
        # "pruning" is imitated by shrinking the iteration budget
        max_iter = int(100 - 900 * min_impurity_decrease)
        clf = linear_model.LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=max_iter)
        clf = clf.fit(train_data, train_target)
    else:
        clf = tree.DecisionTreeClassifier(criterion=criterion, min_impurity_decrease=min_impurity_decrease)
        clf = clf.fit(train_data, train_target)
    predict_target = clf.predict(test_data)
    # Accuracy
    acc = metrics.accuracy_score(test_target, predict_target)
    print('Accuracy:', acc)
if __name__ == '__main__':
    filename = 'iris.data'
    dataSet = loadData(filename)
    # D = processData(dataSet, 3)
    # random 50/50 train/test split (used instead of the original splitting function)
    data_train, data_test = train_test_split(dataSet, test_size=0.5, random_state=1)
    # D[0].append(D[1]), D[2]
    # criteria: 'gini' for the Gini impurity,
    # 'entropy' for the information gain,
    # 'LogisticRegression' for a logistic-regression classifier
    kinds = ['gini', 'entropy', 'LogisticRegression']
    prunes = [0, 0.1]
    # train every criterion with and without pre-pruning
    for kind in kinds:
        for pre_prune in prunes:
            print(kind, pre_prune, end=' ')
            run(data_train, data_test, kind, pre_prune)
Output:
4.7
A queue is first-in, first-out.
During splitting, a node becomes a leaf when all of its samples belong to one class or when the samples take identical values on every attribute; splitting also stops once the maximum depth MaxDepth is reached.
Beyond that this runs into areas I have not studied yet.
The pseudocode below is adapted from an answer found online, but it actually uses a stack (depth-first); a queue-based sketch of my own follows after it.
----
Input:  training set D = {(x1,y1),(x2,y2),...,(xm,ym)}.
        attribute set A = {a1, a2,...,ad}.
Process: function TreeGenerate(D,A):
1.  create the root node root;
2.  initialise the depth: depth = 0;
3.  create a stack `stack` (it saves a node together with its data D and its depth);
4.
5.  while D != Φ OR stack is not empty:
6.      if D != Φ, then
7.          if all samples in D belong to the same class C, then
8.              mark root as a class-C leaf, D = Φ, continue;
9.          end if
10.         if depth == MaxDepth OR all samples in D take the same values on A, then
11.             mark root as a leaf with the majority class of D, D = Φ, continue;
12.         end if
13.         choose the best split attribute a* from A; let Dv be the subset of D taking value a*v on a*;
14.         create a child node child and add a branch from root to child;
15.         push [root, D\{Dv}, A, depth] onto stack;
16.         let root = child, D = Dv, A = A\{a*}, depth = depth+1;
17.     else
18.         pop [root, D, A, depth] from stack;
19.     end if
Output: the root node root (i.e. the tree rooted at root).
----
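For comparison, a queue-based (breadth-first) version in Python, my own sketch rather than the referenced answer: choose_attr below is only a placeholder split rule (an information-gain or Gini criterion would be dropped in for real use), and the node layout (dicts with 'attr'/'branches'/'leaf' keys) is my own choice.

from collections import Counter, deque

def tree_generate_bfs(D, A, max_depth):
    """D: list of samples, class label last; A: list of usable attribute indices."""
    def majority(samples):
        return Counter(s[-1] for s in samples).most_common(1)[0][0]

    def choose_attr(samples, attrs):
        # placeholder split rule: take the first usable attribute
        return attrs[0]

    root = {}
    queue = deque([(root, D, list(A), 0)])   # FIFO queue: nodes are expanded level by level
    while queue:
        node, data, attrs, depth = queue.popleft()
        labels = [s[-1] for s in data]
        if len(set(labels)) == 1:            # all samples share one class -> leaf
            node['leaf'] = labels[0]
            continue
        if not attrs or depth == max_depth:  # no attributes left or depth bound hit -> majority leaf
            node['leaf'] = majority(data)
            continue
        a = choose_attr(data, attrs)
        node['attr'] = a
        node['branches'] = {}
        rest = [x for x in attrs if x != a]
        for v in set(s[a] for s in data):
            child = {}
            node['branches'][v] = child
            subset = [s for s in data if s[a] == v]
            queue.append((child, subset, rest, depth + 1))
    return root

# tiny usage example on a two-attribute toy set
data = [[0, 0, 'no'], [0, 1, 'no'], [1, 0, 'no'], [1, 1, 'yes']]
print(tree_generate_bfs(data, A=[0, 1], max_depth=2))

Because the queue is FIFO, all nodes of depth d are processed before any node of depth d+1, so the MaxDepth bound is enforced level by level instead of path by path.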
4.8
Depth-first search: going one level deeper only requires remembering the nodes along the current path (plus their unexpanded siblings) so that the search can backtrack; what is stored is essentially a single path.
Breadth-first search: going one level deeper requires keeping every node of the current level; what is stored is an entire level, which usually takes considerably more memory.
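A small illustration of my own: on a complete binary tree, the peak size of a DFS stack grows with the depth, while the peak size of a BFS queue grows with the width of the largest level (about 2^depth here).

from collections import deque

def peak_frontier(depth, fifo):
    # walk a complete binary tree generated on the fly and record the largest frontier size
    frontier = deque([(0, 0)])               # (node id, level)
    peak = 1
    while frontier:
        node, level = frontier.popleft() if fifo else frontier.pop()
        if level < depth:
            frontier.append((2 * node + 1, level + 1))
            frontier.append((2 * node + 2, level + 1))
        peak = max(peak, len(frontier))
    return peak

d = 10
print('DFS peak frontier:', peak_frontier(d, fifo=False))   # about d + 1 nodes
print('BFS peak frontier:', peak_frontier(d, fifo=True))    # about 2**d nodes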
4.9
Substituting the three weighted quantities ρ, p̃_k and r̃_v (the three formulas the book defines for handling missing values) into the Gini index gives the missing-value version of the Gini index, in exact analogy with the information-gain case.
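Written out explicitly (my own sketch, using the book's notation: w_x is the sample weight, D̃ the subset of D with no missing value on attribute a, ρ the weighted fraction of such samples, p̃_k the weighted fraction of class k within D̃, and r̃_v the weighted fraction of D̃ taking value a^v):

\mathrm{Gini}(\tilde{D}) = 1 - \sum_{k=1}^{|\mathcal{Y}|} \tilde{p}_k^{\,2}

\mathrm{Gini\_index}(D, a) = \rho \times \sum_{v=1}^{V} \tilde{r}_v \, \mathrm{Gini}(\tilde{D}^v)

a_* = \arg\min_{a \in A} \mathrm{Gini\_index}(D, a)

As with information gain, a sample whose value on a is missing is sent into every branch, with its weight multiplied by the corresponding r̃_v.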
4.10
https://blog.csdn.net/icefire_tyh/article/details/52082051
References:
https://blog.csdn.net/weixin_43759518/article/details/105893107
https://blog.csdn.net/lzaya0000/article/details/104631182
https://blog.csdn.net/weixin_37922777/article/details/88838130
https://blog.csdn.net/weixin_37922777/article/details/88832147
https://blog.csdn.net/icefire_tyh/article/details/52082054