正文

修剪 sklearn 决策树以确保单调性

 2023-03-12  262

关键词：

【中文标题】修剪 sklearn 决策树以确保单调性【英文标题】：Prune sklearn decision tree to ensure monotony 【发布时间】：2021-10-01 00:36:38 【问题描述】：

我需要修剪 sklearn 决策树分类器，以使指示的概率（图像右侧的值）单调增加。例如，如果你用 python 编写一个基本的树，你有：

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.tree._tree import TREE_LEAF
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
 
iris = load_iris()
X, y = iris.data[:, 0].reshape(-1,1), np.where(iris.target==0,0,1)
tree = DecisionTreeClassifier(max_depth=3, random_state=123)
tree.fit(X,y)
percentages = tree.tree_.value[:,0,1]/np.sum(tree.tree_.value.reshape(-1,2), axis=1)

现在必须消除不遵循单调的叶子，如所示。

剩余如下：

虽然所示示例没有显示，但要考虑的规则是如果叶子有不同的父节点，则保留数据量最大的叶子。为了解决这个问题，我一直在尝试做一个蛮力算法，但它只执行第一次迭代，我需要将算法应用于更大的树。答案可能是使用递归，但是用sklearn树结构，我真的不知道该怎么做。

【问题讨论】：

【参考方案1】：

执行以下操作可以满足您建议的修剪要求：遍历树，识别非单调叶子，每次删除具有最少成员的父节点的非单调叶子并重复此操作，直到叶子之间的单调性为持续。尽管这种每次移除一个节点的方法增加了时间复杂度，但树通常具有有限的深度。会议论文"Pruning for Monotone Classification Trees" 帮助我理解了树木的单调性。然后我推导出这种方法来维持你的场景。

由于需要从左到右识别非单调叶子，所以第一步是post-order traverse the tree。如果您不熟悉树遍历，这是完全正常的。我建议在了解功能之前通过互联网资源学习来了解它的机制。你可以运行遍历函数来查看它的发现。实际输出会帮助你理解。

#We will define a traversal algorithm which will scan the nodes and leaves from left to right
#The traversal is recursive, we declare global lists to collect values from each recursion
traversal=[] #List to collect traversal steps
parents=[]#List to collect the parents of the collected nodes or leaves
is_leaves=[] #List to collect if the collected traversal item are leaves or not
# A function to do postorder tree traversal
def postOrderTraversal(tree,root,parent):
    if root!=-1:
        #Recursion on left child
        postOrderTraversal(tree,tree.tree_.children_left[root],root)
        #Recursion on right child
        postOrderTraversal(tree,tree.tree_.children_right[root],root)  
        traversal.append(root) #Collect the name of node or leaf
        parents.append(parent) #Collect the parent of the collected node or leaf
        is_leaves.append(is_leaf(tree,root)) #Collect if the collected object is leaf

上面，我们用递归调用节点的左右子节点，这是通过decision tree structure 提供的方法。使用的is_leaf() 是一个辅助函数，如下所示。

def is_leaf(tree,node):
  if tree.tree_.children_left[node]==-1:
    return True
  else:
    return False

决策树节点总是有两个叶子。因此，仅检查左孩子的存在会产生有关对象是节点还是叶的信息。如果所询问的孩子不存在，则树返回 -1。

由于您已定义非单调性条件，因此需要叶内 1 类的比率。我称之为positive_ratio()（这就是你所说的“百分比”。）

def positive_ratio(tree): #The frequency of 1 values of leaves in binary classification tree: 
  #Number of samples with value 1 in leaves/total number of samples in nodes/leaves
  return tree.tree_.value[:,0,1]/np.sum(tree.tree_.value.reshape(-1,2), axis=1)

下面的最后一个辅助函数返回具有最少样本数的节点（1、2、3 等）的树索引。此功能需要其叶子表现出非单调行为的节点列表。我们在这个辅助函数中调用树结构的n_node_samples 属性。找到的节点就是要移除其叶子的节点。

def min_samples_node(tree, nodes): #Finds the node with the minimum number of samples among the provided list
  #Make a dictionary of number of samples of given nodes, and their index in the nodes list
  samples_dict=tree.tree_.n_node_samples[node]:i for i,node in enumerate(nodes)
  min_samples=min(samples_dict.keys()) #The minimum number of samples among the samples of nodes
  i_min=samples_dict[min_samples] #Index of the node with minimum number of samples
  return nodes[i_min] #The number of node with the minimum number of samples

在定义了辅助函数之后，执行修剪的包装函数迭代直到树的单调性得以维持。返回所需的单调树。

def prune_nonmonotonic(tree): #Prune non-monotonic nodes of a binary classification tree
  while True: #Repeat until monotonicity is sustained
    #Clear the traversal lists for a new scan
    traversal.clear()
    parents.clear()
    is_leaves.clear()
    #Do a post-order traversal of tree so that the leaves will be returned in order from left to right
    postOrderTraversal(tree,0,None)
    #Filter the traversal outputs by keeping only leaves and leaving out the nodes
    leaves=[traversal[i] for i,leaf in enumerate(is_leaves) if leaf == True]
    leaves_parents=[parents[i] for i,leaf in enumerate(is_leaves) if leaf == True]
    pos_ratio=positive_ratio(tree) #List of positive samples ratio of the nodes of binary classification tree
    leaves_pos_ratio=[pos_ratio[i] for i in leaves] #List of positive samples ratio of the traversed leaves
    #Detect the non-monotonic pairs by comparing the leaves side-by-side
    nonmonotone_pairs=[[leaves[i],leaves[i+1]] for i,ratio in enumerate(leaves_pos_ratio[:-1]) if (ratio>=leaves_pos_ratio[i+1])]
    #Make a flattened and unique list of leaves out of pairs
    nonmonotone_leaves=[]
    for pair in nonmonotone_pairs:
      for leaf in pair:
        if leaf not in nonmonotone_leaves:
          nonmonotone_leaves.append(leaf)
    if len(nonmonotone_leaves)==0: #If all leaves show monotonic properties, then break
      break
    #List the parent nodes of the non-monotonic leaves
    nonmonotone_leaves_parents=[leaves_parents[i] for i in [leaves.index(leave) for leave in nonmonotone_leaves]]
    node_min=min_samples_node(tree, nonmonotone_leaves_parents) #The node with minimum number of samples
    #Prune the tree by removing the children of the detected non-monotonic and lowest number of samples node
    tree.tree_.children_left[node_min]=-1
    tree.tree_.children_right[node_min]=-1
  return tree

所有包含“while”的循环一直持续到遍历的叶子不再表现出非单调性的迭代。 min_samples_node() 标识包含非单调叶子的节点，它是同类中最低的成员。当它的左右子节点被值“-1”替换时，树被修剪，下一次“while”迭代将产生完全不同的树遍历，以识别并去除剩余的非单调性。

下图分别显示了未修剪和已修剪的树。

【讨论】：

很好的答案，我最终做了类似的事情，用床单和他们各自的父母填写了一个清单。感谢您的帮助:)

决策树 sklearn：网球数据集

】决策树sklearn：网球数据集【英文标题】：DecisionTreesklearn:PlayTennisDataSet【发布时间】：2018-06-2306:16:39【问题描述】：我正在练习将sklearn用于决策树，我正在使用打网球数据集play_是目标列。根据我的纸笔计算熵和信息增益，根... 查看详情

在 sklearn DecisionTreeClassifier 中修剪不必要的叶子

】在sklearnDecisionTreeClassifier中修剪不必要的叶子【英文标题】：PruneunnecessaryleavesinsklearnDecisionTreeClassifier【发布时间】：2018-12-2602:01:04【问题描述】：我使用sklearn.tree.DecisionTreeClassifier来构建决策树。使用最优参数设置，我得到一... 查看详情

决策树修剪的效果

】决策树修剪的效果【英文标题】：TheeffectofDecisionTreePruning【发布时间】：2011-04-2821:13:29【问题描述】：我想知道我是否从训练和验证集构建了一个类似ID3的决策树A，但A未修剪。同时，我在ID3中也有另一个决策树B，它是从相同... 查看详情

使用 rpart 为决策树修剪选择 CP 值

】使用rpart为决策树修剪选择CP值【英文标题】：SelectingCPvaluefordecisiontreepruningusingrpart【发布时间】：2016-10-0919:35:48【问题描述】：我了解选择CP值的常见做法是选择具有最小xerror值的最低级别。但是，在我的以下情况下，使用cp&... 查看详情

再探决策树算法之利用sklearn进行决策树实战

sklearn模块提供了决策树的解决方案，不用自己去造轮子了（不会造，感觉略复杂）：下面是笔记：Sklearn.tree参数介绍及使用建议参数介绍及使用建议官网：http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.htmlclasss... 查看详情

sklearn实践：决策树(代码片段)

sklearn实践（二）：决策树一、数据处理继续上次聚类的练习，基于稍作处理的数据，在决策树中，只需划分一下训练集和测试集即可这里用到的是sklearn.model_selection.train_test_split()函数原型：sklearn.model_selection.``train_test_split(**array... 查看详情

决策树过拟合检验

...题描述】：我目前正在处理容易过拟合的数据，因此我在sklearn上读到max_depth通常是树过拟合的原因时，通过测试每个深度的roc_auc分数来制作函数。但我不确定我的想法是否正确这里有我的结果图片：我也尝试使用后修剪方法，... 查看详情

sklearn决策树的BFS遍历

】sklearn决策树的BFS遍历【英文标题】：BFStraversalofsklearndecisiontree【发布时间】：2020-08-0300:12:29【问题描述】：如何进行sklearn决策树的广度优先搜索遍历？在我的代码中，我尝试了sklearn.tree_库并使用了各种函数，例如tree_.feature... 查看详情

我们可以选择在 sklearn 中使用啥决策树算法吗？

】我们可以选择在sklearn中使用啥决策树算法吗？【英文标题】：CanwechoosewhatDecisionTreealgorithmtouseinsklearn?我们可以选择在sklearn中使用什么决策树算法吗？【发布时间】：2016-03-1718:02:09【问题描述】：我的问题是我们可以选择在skle... 查看详情

决策树中特定类的 Sklearn 决策规则

】决策树中特定类的Sklearn决策规则【英文标题】：SklearnDecisionRulesforSpecificClassinDecisiontree【发布时间】：2019-03-0409:57:19【问题描述】：我正在创建决策树。我的数据属于以下类型X1|X2|X3|.....X50|Y_____________________________________1|5|7|....... 查看详情

SkLearn 的决策树：过度拟合还是错误？

】SkLearn的决策树：过度拟合还是错误？【英文标题】：DecisionTreeOfSkLearn:OverfittingorBug?【发布时间】：2014-07-1916:46:31【问题描述】：我正在使用sklearn的tree包分析我的决策树模型的训练误差和验证误差。#computethermserrordefcompute_error(... 查看详情

决策树sklearn：预测准确率100%

】决策树sklearn：预测准确率100%【英文标题】：decisiontreesklearn:predictionaccuracy100%【发布时间】：2017-05-3102:41:58【问题描述】：我的决策树分类器准确度为1.0，决策树输出中只有一个节点，混淆矩阵中也只有一个元素。随机森林也... 查看详情

决策树唯一性sklearn

】决策树唯一性sklearn【英文标题】：DecisionTreeUniquenesssklearn【发布时间】：2018-05-2402:14:05【问题描述】：我有一些关于决策树和随机森林分类器的问题。问题1：经过训练的决策树是否独一无二？我认为它应该是独一无二的，因... 查看详情

决策树（decisiontree）sklearn

#!/usr/bin/envpython#-*-coding:utf-8-*-fromsklearn.feature_extractionimportDictVectorizerimportcsvfromsklearnimporttreefromsklearnimportpreprocessingfromsklearn.externals.siximportStringIO#Readinthecs 查看详情

sklearn1.分类决策树(代码片段)

...，本篇博文不注重相关数学原理，主要注重使用sklearn实现分类树的效果。参考课程见【2020机器学习全集】菜菜的sklearn完整版决策树简介决策树（DecisionTree）是一种非参数的有监督学习方法查看详情

决策树 Sklearn - 树的深度和准确性

】决策树Sklearn-树的深度和准确性【英文标题】：DecisionTreeSklearn-DepthOftreeandaccuracy【发布时间】：2018-08-2315:47:42【问题描述】：我正在使用sklearn将决策树应用于数据集在Sklearn中有一个参数可以选择树的深度-dtree=DecisionTreeClassifier... 查看详情

sklearn中的交叉验证+决策树

】sklearn中的交叉验证+决策树【英文标题】：crossvalidation+decisiontreesinsklearn【发布时间】：2016-05-0722:12:03【问题描述】：尝试使用sklearn和panads创建具有交叉验证的决策树。我的问题在下面的代码中，交叉验证拆分数据，然后我将... 查看详情

决策树分类器 sklearn 中节点的不同颜色表示啥？

】决策树分类器sklearn中节点的不同颜色表示啥？【英文标题】：whatdoesdifferentcolorofnodesindecisiontreeclassifiersklearnindicate?决策树分类器sklearn中节点的不同颜色表示什么？【发布时间】：2021-08-2305:25:02【问题描述】：我正在尝试可视... 查看详情