Machine Learning — Decision Trees (Part 2): Algorithm Implementation

Lyndon_zheng     2022-12-09

Keywords:

Decision tree

Machine Learning (5) — Decision Trees (Part 1): Principles covered how decision trees are generated and pruned, and walked through the CART, ID3, and C4.5 algorithms. CART, which handles both regression and classification, is based on Gini impurity and is not implemented here. This post implements ID3 and C4.5, both of which are based on information entropy; since no pruning is involved, the two algorithms end up producing the same result on this example. Let's first look at the overall ID3 framework (C4.5 is essentially the same, differing only in how the best feature is selected):

  • Algorithm 4.1 ID3(D)
  • Input: an attribute-valued dataset D
  • Output: a decision tree
    1. if D is "pure" OR the attribute set is empty then
    2.   return class
    3. end if
    4. for all attributes a in D do
    5.   compute the information gain and select the best feature
    6. end for
    7. a_best = best attribute (feature)
    8. Tree = create a decision node that tests a_best in the root
    9. D_v = induced sub-datasets of D based on a_best
    10. for all D_v do
    11.    Tree_v = ID3(D_v)
    12. end for
    13. return Tree

Algorithm implementation

(1) Create the training dataset:
Read the data from a .txt file, strip whitespace, split each line, and finally return the dataset together with the attribute (feature) names.

# process the training data set
# input: path to the data file
# output: dataset, attribute
from math import log       # used below when computing the entropy
import operator             # used below when sorting class counts

def proData(path):
    fileset = open(path)                                            # load the data file
    dataset = [data.strip().split('\t') for data in fileset.readlines()]
    fileset.close()
    attribute = dataset[0]                                          # the first row holds the attribute names
    del(dataset[0])
    return dataset, attribute
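proData assumes a tab-separated text file whose first line holds the attribute names. A minimal usage sketch (the attribute names below are only an assumption, taken from the tree printed in section (8); the actual layout of the author's lenses.txt is not shown here):

# minimal usage sketch (assumed file layout, for illustration only)
dataset, attributes = proData('lenses.txt')
print(attributes)   # e.g. ['age', 'prescriptor', 'astigmatic', 'tearRate', 'class']
print(dataset[0])   # one sample: a list of strings whose last element is the class label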

(2) Compute the information entropy:
First count the total number of training samples, then count the number of samples in each class label to obtain its probability, and finally compute the information entropy:

H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i
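As a quick worked example (hypothetical counts chosen only for illustration, not from the lenses data): for a split with 9 samples of one class and 5 of another,

H = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940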

# calculate the information entropy
# input: dataset
# output: entropy

def calcEntropy(dataset):
    numEntries = len(dataset)
    attributeCounts = {}                         # count the samples of each class label
    for item in dataset:
        currentAttribute = item[-1]              # the class label is the last column
        if currentAttribute not in attributeCounts.keys():
            attributeCounts[currentAttribute] = 0
        attributeCounts[currentAttribute] += 1
    entropy = 0.0
    for key in attributeCounts:
        prob = float(attributeCounts[key])/numEntries
        entropy -= prob * log(prob, 2)
    return entropy

(3) Split into sub-datasets:
After the best splitting feature has been chosen, build a new sub-dataset for each of its values, grouping the samples by value and removing the chosen attribute (feature) from each sample.

# split the data according to one value of a given attribute
# input: dataset, axis (feature index), value (feature value to match)
# output: sub-dataset with the chosen feature column removed
def splitData(dataset, axis, value):
    splitdata = []
    for feature in dataset:
        if feature[axis] == value:
            tempFeaVec = feature[:axis]           # copy everything before the chosen column
            tempFeaVec.extend(feature[axis+1:])   # and everything after it
            splitdata.append(tempFeaVec)
    return splitdata
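A minimal sketch of how splitData behaves on toy rows (hypothetical data, not the lenses set; the last column is the class label):

# keep rows whose feature 0 equals 'young' and drop that column
rows = [['young', 'yes', 'soft'],
        ['young', 'no',  'hard'],
        ['old',   'yes', 'soft']]
print(splitData(rows, 0, 'young'))   # [['yes', 'soft'], ['no', 'hard']]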

(4) Select the best feature:
ID3 selects the best feature by information gain, while C4.5 selects it by information gain ratio (the formulas are given below each heading).
ID3: information gain
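The information gain of attribute A on dataset D is the drop in entropy obtained by splitting on A; this is the quantity computed by selectBestFeature below:

g(D, A) = H(D) - H(D \mid A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|} H(D_v)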

# calculate the entropy of different features
# input: dataset
# output: best feature
def selectBestFeature(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcEntropy(dataset)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset] # collect the values of the i-th attribute
        uniqueVals = set(featList)                       # distinct values of that attribute
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet) 
        infoGain = baseEntropy - newEntropy
        if (infoGain > bestInfoGain):
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

C4.5: information gain ratio
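The gain ratio divides the information gain by the split information of the attribute, which penalizes attributes with many distinct values; this is the quantity computed by selectBestFeature_C4 below:

g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}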

# calculate the information gain ratio for different features
# input: dataset
# output: best feature
def selectBestFeature_C4(dataset):
    numFeatures = len(dataset[0]) - 1
    baseEntropy = calcEntropy(dataset)
    bestInfoGainRatio = 0.0; bestFeature = -1
    for i in range(numFeatures):
        featList = [features[i] for features in dataset] # collect the values of the i-th attribute
        uniqueVals = set(featList)                       # distinct values of that attribute
        newEntropy = 0.0; Splitentropy = 0.0
        for value in uniqueVals:
            subDataSet = splitData(dataset, i, value)
            prob = float(len(subDataSet))/len(dataset)
            newEntropy += prob * calcEntropy(subDataSet)
            Splitentropy -= prob * log(prob, 2)
        if Splitentropy == 0.0:                          # attribute has a single value: no split information, skip it
            continue
        infoGainRatio = (baseEntropy - newEntropy)/Splitentropy
        if (infoGainRatio > bestInfoGainRatio):
            bestInfoGainRatio = infoGainRatio
            bestFeature = i
    return bestFeature

(5) Build the decision tree:
First compute the information gain (or gain ratio) of every attribute (feature) against the empirical entropy of the current dataset and pick the best one, then split the dataset into sub-datasets according to the values of that attribute, and recursively build the tree on each sub-dataset until a sub-dataset can no longer be split or the attribute set is empty.
Building the ID3 decision tree

# train decision tree ID3
# input: dataset, attributes
# output: decision tree
def createTreeID3(dataset, attributes):
    classList = [example[-1] for example in dataset]
    classCount = {}
    if classList.count(classList[0]) == len(classList):
        return classList[0]                             # stop splitting when all data belong to the same label
    if len(dataset[0]) == 1:                            # stop splitting when the attribute set is empty, return the majority class
        for value in classList:
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    bestFeature = selectBestFeature(dataset)
    bestAttribute = attributes[bestFeature]
    myTree = {bestAttribute: {}}                        # the tree is a nested dict keyed by attribute and value
    del(attributes[bestFeature])
    featureValues = [example[bestFeature] for example in dataset] # select the training data of the child node
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]
        myTree[bestAttribute][value] = createTreeID3(splitData(dataset, bestFeature, value), subattributes)
    return myTree
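A quick sanity check on a toy dataset (hypothetical rows and attribute names, not the lenses data; note that createTreeID3 deletes entries from the attribute list it is given, so pass a fresh copy each time):

# toy usage sketch
toyData = [['sunny', 'yes', 'play'],
           ['sunny', 'no',  'stay'],
           ['rainy', 'yes', 'stay'],
           ['rainy', 'no',  'stay']]
toyAttrs = ['outlook', 'free']
print(createTreeID3(toyData, toyAttrs))
# expected: {'outlook': {'sunny': {'free': {'yes': 'play', 'no': 'stay'}}, 'rainy': 'stay'}}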

Building the C4.5 decision tree

# train decision tree C4.5
# input: dataset, attributes
# output: decision tree
def createTreeC4(dataset, attributes):
    classList = [example[-1] for example in dataset]
    classCount = {}
    if classList.count(classList[0]) == len(classList):
        return classList[0]                             # stop splitting when all data belong to the same label
    if len(dataset[0]) == 1:                            # stop splitting when the attribute set is empty, return the majority class
        for value in classList:
            if value not in classCount.keys():
                classCount[value] = 0
            classCount[value] += 1
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]
    bestFeature = selectBestFeature_C4(dataset)
    bestAttribute = attributes[bestFeature]
    myTree = {bestAttribute: {}}                        # the tree is a nested dict keyed by attribute and value
    del(attributes[bestFeature])
    featureValues = [example[bestFeature] for example in dataset] # select the training data of the child node
    uniqueVals = set(featureValues)
    for value in uniqueVals:
        subattributes = attributes[:]
        myTree[bestAttribute][value] = createTreeC4(splitData(dataset, bestFeature, value), subattributes)
    return myTree

(6) Main function:
Point the script at the data file, then print and plot the resulting trees.

# main function
if __name__ == "__main__":
    # data set processing
    dataset = []
    attributes = []
    path = 'F:\\Program\\Python\\Machine_Learning\\Decision_tree\\lenses.txt'
    dataset, attributes = proData(path)
    myTreeID3 = createTreeID3(dataset, attributes)
    dataset, attributes = proData(path)      # reload: createTreeID3 modified dataset and attributes in place
    myTreeC4 = createTreeC4(dataset, attributes)
    print(str(myTreeID3))
    createPlot(myTreeID3)
    print(str(myTreeC4))
    createPlot(myTreeC4)

(7) Plotting functions:
The generated decision tree is not very intuitive when inspected as text; with a small plotting helper the tree can be drawn directly, which makes the result much easier to read.

# Project: Machine learning-decision tree
# Author: Lyndon
# date: 2015/10/27

from matplotlib import pyplot as plt

# define the format of text and arrow
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrowRrgs = dict(arrowstyle="<-")

# calculate the number of tree leaves and the depth of the tree
# input: decision tree
# output: number of leaves, depth of the tree
def calNumLeaves(tree):
    numLeaves = 0
    maxDepth = 0
    firstNode = list(tree.keys())[0]                         # the attribute tested at this node
    secondDict = tree[firstNode]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':         # a nested dict is an internal node, not a leaf
            subnumLeaves, submaxDepth = calNumLeaves(secondDict[key])
            numLeaves += subnumLeaves
            thisDepth = 1 + submaxDepth
        else:
            numLeaves += 1
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return numLeaves, maxDepth
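As a sanity check, the counts below were worked out by hand from the ID3 tree printed in section (8), so treat them as illustrative:

# numLeaves, depth = calNumLeaves(myTreeID3)   # for the lenses tree: 9 leaves, depth 4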

# plot a node and the text on the connecting edge
# input: node text, edge text, node center, parent position, node type
# output: null
def plotsubtree(node,text,center,parent,nodeType):
    createPlot.ax1.annotate(node,xy=parent,xycoords='axes fraction',
                            xytext=center,textcoords='axes fraction',
                            va='center',ha='center',bbox=nodeType,arrowprops=arrowRrgs)
    xMid = (parent[0]-center[0])/2.0+center[0]
    yMid = (parent[1]-center[1])/2.0+center[1]
    createPlot.ax1.text(xMid,yMid,text,va='center',ha='center',rotation=30)

# plot the tree recursively
# input: tree, parent node position, edge text
# output: null
def plotTree(tree, parent, nodetxt):
    numLeaves, depth = calNumLeaves(tree)
    firstNode = list(tree.keys())[0]
    center = (plotTree.xOff + (1 + float(numLeaves))/2.0/plotTree.num, plotTree.yOff)
    plotsubtree(firstNode, nodetxt, center, parent, decisionNode)
    secondDict = tree[firstNode]
    plotTree.yOff -= 1.0/plotTree.depth
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            plotTree(secondDict[key], center, str(key))      # internal node: recurse into the subtree
        else:
            plotTree.xOff += 1.0/plotTree.num
            plotsubtree(secondDict[key], str(key), (plotTree.xOff, plotTree.yOff), center, leafNode)
    plotTree.yOff += 1.0/plotTree.depth

# plot the Tree
# input: Tree
# output: Null
def createPlot(tree):
    fig = plt.figure(1,facecolor='white')
    fig.clf()
    axprops = dict(xticks=[],yticks=[])
    createPlot.ax1 = plt.subplot(111,frameon=False,**axprops) 
    plotTree.num, plotTree.depth = calNumLeaves(tree)
    plotTree.xOff = -0.5/plotTree.num; plotTree.yOff = 1.0
    plotTree(tree,(0.5,1.0),'')
    plt.show()

(8) Classification results:
Decision tree as text output:
{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescriptor': {'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'pre': 'soft', 'presbyopic': {'prescriptor': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
Decision tree diagram:

In this example there is no pruning step, so the ID3 and C4.5 implementations end up producing the same tree.
PS:
This post implements the ID3 and C4.5 decision tree algorithms in Python, using only information gain and information gain ratio to drive the classification. The code follows Machine Learning in Action; the complete code and data
