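How to Implement a Decision Tree in Python

Implementing a decision tree in Python can be summarized in six steps: data preprocessing, feature selection, node construction, recursive splitting, pruning, and model evaluation. Of these, feature selection is the most critical, because it determines how each node splits the data and therefore how accurate the final model is. Features are usually chosen with information gain, the Gini index, or the chi-square test.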

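I. Data Preprocessing

Data preprocessing is the groundwork for building a decision tree and covers data cleaning, data encoding, and data standardization. Data cleaning handles missing values, outliers, and duplicate records. Data encoding converts categorical variables into numeric ones, most commonly through one-hot encoding or label encoding. Data standardization rescales features to a common scale for later modeling; threshold-based tree splits are not affected by such rescaling, so this step matters mainly when the same data also feeds scale-sensitive models.

1. Data Cleaning

Data cleaning is the first preprocessing step and deals with missing values, outliers, and duplicate records. For missing values, either drop the affected samples or fill them with the mean, median, or mode. Outliers can be removed or replaced with a reasonable value. Duplicate records can be dropped or merged.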

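import pandas as pd
# Load the data
data = pd.read_csv('data.csv')
# Option 1: drop the samples that contain missing values
# data = data.dropna()
# Option 2: fill missing values in numeric columns with the column mean
data = data.fillna(data.mean(numeric_only=True))
# Drop duplicate rows
data = data.drop_duplicates()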
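
To see what each cleaning step does, here is a minimal sketch on a small hand-made DataFrame. The column names `age` and `city` and all values are made up for illustration; they are not taken from `data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and one exact duplicate row
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 40.0],
    "city": ["A", "B", "C", "C"],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill missing numeric values with the column mean
filled = toy.fillna(toy.mean(numeric_only=True))

# Remove exact duplicate rows (the first occurrence is kept)
deduped = filled.drop_duplicates()

print(len(dropped), filled.loc[1, "age"], len(deduped))
```

Dropping and filling are alternatives: once rows with missing values have been dropped, there is nothing left to fill.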

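2. Data Encoding

Data encoding converts categorical variables into numeric ones, usually with one-hot encoding or label encoding. One-hot encoding expands a categorical variable into one binary column per category, while label encoding maps each category to a single integer.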

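from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# One-hot encoding (the result is a sparse matrix by default)
encoder = OneHotEncoder()
onehot_encoded = encoder.fit_transform(data[['categorical_column']])
# Label encoding, stored separately so the one-hot result is not overwritten.
# LabelEncoder is designed for target labels; for input features scikit-learn provides OrdinalEncoder.
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(data['categorical_column'])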
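
To make the difference between the two encodings concrete, the sketch below reproduces their output by hand with NumPy instead of scikit-learn. The `colors` array is a hypothetical column; the integer codes follow alphabetical category order, which is also how `LabelEncoder` orders its classes.

```python
import numpy as np

# Hypothetical categorical column
colors = np.array(["red", "green", "blue", "green"])

# Label encoding: each category becomes one integer.
# np.unique sorts the categories, so blue=0, green=1, red=2.
categories, label_encoded = np.unique(colors, return_inverse=True)

# One-hot encoding: one binary column per category.
# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(categories), dtype=int)[label_encoded]

print(categories)
print(label_encoded)
print(one_hot)
```

Label encoding keeps a single column but imposes an artificial order on the categories; one-hot encoding avoids that order at the cost of more columns.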

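3. Data Standardization

Data standardization rescales features to a common scale for later modeling. The two common methods are standardization and normalization. Standardization transforms a feature so that its mean is 0 and its standard deviation is 1. Normalization (min-max scaling) rescales a feature to the range [0, 1].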

from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization: mean 0, standard deviation 1
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data[['numerical_column']])
# Normalization: rescale to the range [0, 1]
minmax_scaler = MinMaxScaler()
normalized_data = minmax_scaler.fit_transform(data[['numerical_column']])
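
The two transformations are simple enough to write out by hand. The following sketch applies both formulas to a made-up array with plain NumPy; it illustrates the arithmetic and is not a replacement for the scikit-learn scalers.

```python
import numpy as np

# Hypothetical numeric feature
values = np.array([10.0, 20.0, 30.0, 40.0])

# Standardization (z-score): subtract the mean, divide by the
# population standard deviation (ddof=0, NumPy's default)
standardized = (values - values.mean()) / values.std()

# Min-max normalization: map the minimum to 0 and the maximum to 1
normalized = (values - values.min()) / (values.max() - values.min())

print(standardized)
print(normalized)
```

Both transformations preserve the order of the values, which is why a threshold split on the scaled feature separates the samples exactly as a split on the raw feature would.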

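II. Feature Selection

Feature selection is the key step in building a decision tree: it determines how each node splits the data and how accurate the final model is. Common criteria are information gain, the Gini index, and the chi-square test.

1. Information Gain

Information gain measures how well a feature separates the classes: the larger the gain, the stronger the feature's discriminating power. It is the entropy of the labels before the split minus the weighted entropy of the subsets after the split. Lower entropy means purer data.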

import numpy as np
from collections import Counter
def entropy(y):
    counts = Counter(y)
    probabilities = [count / len(y) for count in counts.values()]
    return -sum(p * np.log2(p) for p in probabilities)
def information_gain(X, y, feature_index):
    unique_values = np.unique(X[:, feature_index])
    weighted_entropy = 0
    for value in unique_values:
        subset_y = y[X[:, feature_index] == value]
        weighted_entropy += len(subset_y) / len(y) * entropy(subset_y)
    return entropy(y) - weighted_entropy
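
As a quick sanity check of the two functions above, the sketch below evaluates them on a four-sample toy dataset. The arrays are made up for illustration, and the functions are repeated so that the snippet runs on its own.

```python
import numpy as np
from collections import Counter

def entropy(y):
    counts = Counter(y)
    probabilities = [count / len(y) for count in counts.values()]
    return -sum(p * np.log2(p) for p in probabilities)

def information_gain(X, y, feature_index):
    unique_values = np.unique(X[:, feature_index])
    weighted_entropy = 0
    for value in unique_values:
        subset_y = y[X[:, feature_index] == value]
        weighted_entropy += len(subset_y) / len(y) * entropy(subset_y)
    return entropy(y) - weighted_entropy

# Hypothetical data: feature 0 equals the label, feature 1 is unrelated to it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])

base_entropy = entropy(y)            # two balanced classes: 1 bit
gain_f0 = information_gain(X, y, 0)  # both subsets are pure
gain_f1 = information_gain(X, y, 1)  # both subsets stay 50/50

print(base_entropy, gain_f0, gain_f1)
```

Feature 0 removes all uncertainty, so its gain equals the full entropy of the labels, while feature 1 leaves both subsets as mixed as before and gains nothing.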

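2. Gini Index

The Gini index measures impurity: the smaller it is, the purer the data. It is the probability that two samples drawn at random from the data belong to different classes.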

def gini(y):
    counts = Counter(y)
    probabilities = [count / len(y) for count in counts.values()]
    return 1 - sum(p ** 2 for p in probabilities)
def gini_index(X, y, feature_index):
    unique_values = np.unique(X[:, feature_index])
    weighted_gini = 0
    for value in unique_values:
        subset_y = y[X[:, feature_index] == value]
        weighted_gini += len(subset_y) / len(y) * gini(subset_y)
    return weighted_gini
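
A few hand-checkable values help build intuition for the Gini index. The sketch below uses made-up label lists and a compact copy of the `gini` function so that it runs on its own.

```python
from collections import Counter

def gini(y):
    # 1 minus the sum of squared class proportions
    counts = Counter(y)
    return 1 - sum((count / len(y)) ** 2 for count in counts.values())

pure = gini([1, 1, 1, 1])      # a single class: no impurity
balanced = gini([0, 0, 1, 1])  # two classes at 50/50: maximum for two classes
skewed = gini([0, 0, 0, 1])    # 75/25 split: somewhere in between

print(pure, balanced, skewed)
```

For two classes the index ranges from 0 (pure) to 0.5 (perfectly mixed), so a split that drives the weighted index of the subsets toward 0 is preferred.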

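3. Chi-Square Test

The chi-square test checks whether a feature and the target variable are independent by comparing the observed frequencies with the frequencies expected under independence. The larger the chi-square statistic, the stronger the association between the feature and the target.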

from scipy.stats import chi2_contingency
def chi_square(X, y, feature_index):
    contingency_table = pd.crosstab(X[:, feature_index], y)
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    return chi2
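
To show where the statistic comes from, the sketch below computes the Pearson chi-square value by hand for a made-up 2x2 contingency table. It applies no continuity correction, whereas `chi2_contingency` applies Yates' correction to 2x2 tables by default, so the two numbers will differ for a table like this one.

```python
import numpy as np

# Hypothetical contingency table: rows = feature value, columns = class
observed = np.array([[8.0, 2.0],
                     [3.0, 7.0]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
total = observed.sum()

# Expected counts if the feature and the class were independent
expected = row_totals * col_totals / total

# Pearson chi-square: squared differences scaled by the expected counts
chi2 = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi2)
```

Here every expected count is 5.5 or 4.5, and the observed counts deviate from them by 2.5 in each cell, which gives a statistic of about 5.05.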

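III. Node Construction

Node construction is the core step of a decision tree: the data is split on the selected feature to produce child nodes. Each internal node stores a feature and a threshold that divide the data into two subsets. The child nodes are then split recursively until a stopping condition is met.

1. Defining the Node Class

First, define a node class that stores the node's feature, threshold, left child, and right child, plus the predicted value for leaf nodes.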

class Node:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

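2. Choosing the Best Feature and Threshold

At each node, choose the feature and threshold whose split scores best under the selected criterion (information gain by default).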

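def best_split(X, y, criterion='information_gain'):
    best_feature_index, best_threshold, best_score = None, None, float('-inf')
    for feature_index in range(X.shape[1]):
        thresholds = np.unique(X[:, feature_index])
        for threshold in thresholds:
            left_mask = X[:, feature_index] <= threshold
            right_mask = X[:, feature_index] > threshold
            left_y, right_y = y[left_mask], y[right_mask]
            # Skip thresholds that leave one side empty (they separate nothing)
            if len(left_y) == 0 or len(right_y) == 0:
                continue
            # Score this particular binary split, not the feature as a whole
            w_left, w_right = len(left_y) / len(y), len(right_y) / len(y)
            if criterion == 'information_gain':
                score = entropy(y) - (w_left * entropy(left_y) + w_right * entropy(right_y))
            elif criterion == 'gini':
                # Lower weighted Gini is better, so negate it to keep "higher is better"
                score = -(w_left * gini(left_y) + w_right * gini(right_y))
            elif criterion == 'chi_square':
                # Chi-square statistic of the table "side of the split" vs. class
                score = chi2_contingency(pd.crosstab(left_mask, y))[0]
            else:
                raise ValueError(f"Unknown criterion: {criterion}")
            if score > best_score:
                best_feature_index, best_threshold, best_score = feature_index, threshold, score
    return best_feature_index, best_threshold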

3. Building the Nodes

Split the data on the selected feature and threshold to produce the child nodes.

def build_tree(X, y, criterion='information_gain', max_depth=None, min_samples_split=2):
    # Stop when the node is pure, too small to split, or the depth budget is used up
    if len(np.unique(y)) == 1 or len(y) < min_samples_split or max_depth == 0:
        return Node(value=Counter(y).most_common(1)[0][0])
    feature_index, threshold = best_split(X, y, criterion)
    if feature_index is None:
        return Node(value=Counter(y).most_common(1)[0][0])
    left_mask = X[:, feature_index] <= threshold
    right_mask = ~left_mask
    # A split that leaves one side empty separates nothing, so return a leaf
    if not left_mask.any() or not right_mask.any():
        return Node(value=Counter(y).most_common(1)[0][0])
    # Keep None (unlimited depth) as None instead of computing None - 1
    next_depth = None if max_depth is None else max_depth - 1
    left_node = build_tree(X[left_mask], y[left_mask], criterion, next_depth, min_samples_split)
    right_node = build_tree(X[right_mask], y[right_mask], criterion, next_depth, min_samples_split)
    return Node(feature_index=feature_index, threshold=threshold, left=left_node, right=right_node)
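
To check that the pieces of this section fit together, here is a minimal self-contained sketch that builds a tree on a one-feature toy dataset and predicts with it. It supports information gain only, the data is made up, and `majority_class` and `predict_one` are hypothetical helper names introduced just for this example.

```python
import numpy as np
from collections import Counter

class Node:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

def entropy(y):
    counts = Counter(y)
    probabilities = [count / len(y) for count in counts.values()]
    return -sum(p * np.log2(p) for p in probabilities)

def majority_class(y):
    return Counter(y).most_common(1)[0][0]

def best_split(X, y):
    # Score every (feature, threshold) pair by the information gain of its binary split
    best_feature, best_threshold, best_gain = None, None, float('-inf')
    for feature_index in range(X.shape[1]):
        for threshold in np.unique(X[:, feature_index]):
            left_mask = X[:, feature_index] <= threshold
            left_y, right_y = y[left_mask], y[~left_mask]
            if len(left_y) == 0 or len(right_y) == 0:
                continue  # one side is empty, so this is not a real split
            w_left = len(left_y) / len(y)
            gain = entropy(y) - (w_left * entropy(left_y) + (1 - w_left) * entropy(right_y))
            if gain > best_gain:
                best_feature, best_threshold, best_gain = feature_index, threshold, gain
    return best_feature, best_threshold

def build_tree(X, y, max_depth=None, min_samples_split=2):
    if len(np.unique(y)) == 1 or len(y) < min_samples_split or max_depth == 0:
        return Node(value=majority_class(y))
    feature_index, threshold = best_split(X, y)
    if feature_index is None:
        return Node(value=majority_class(y))
    left_mask = X[:, feature_index] <= threshold
    next_depth = None if max_depth is None else max_depth - 1
    return Node(
        feature_index=feature_index,
        threshold=threshold,
        left=build_tree(X[left_mask], y[left_mask], next_depth, min_samples_split),
        right=build_tree(X[~left_mask], y[~left_mask], next_depth, min_samples_split),
    )

def predict_one(node, x):
    # Walk down from the root until a leaf (a node with a value) is reached
    while node.value is None:
        node = node.left if x[node.feature_index] <= node.threshold else node.right
    return node.value

# Hypothetical data: the class flips between x = 2 and x = 3
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

tree = build_tree(X, y, max_depth=2)
preds = [predict_one(tree, x) for x in X]
print(tree.feature_index, tree.threshold, preds)
```

On this data the root splits at the threshold 2.0, because that is the only cut that produces two pure children, and both children become leaves immediately.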

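IV. Recursive Splitting

Recursive splitting is the core algorithm of the decision tree: nodes are built recursively, splitting the data again and again until a stopping condition is met.

1. Defining the Stopping Conditions

Stopping conditions terminate the recursion. Common ones are reaching the maximum depth, having fewer samples than the minimum required for a split, and reaching 100% purity.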

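def is_leaf(node):
    return node.value is not None
def stopping_criteria(node, max_depth, min_samples_split):
    # Assumes node.y holds the labels that reached this node (attached when the node is created).
    # Stop on: depth limit, too few samples, a pure node, or a node that is already a leaf.
    return ((max_depth is not None and max_depth <= 0) or len(node.y) < min_samples_split
            or len(np.unique(node.y)) == 1 or is_leaf(node))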

2. Splitting Recursively

Build nodes recursively, splitting the data until a stopping condition is met.

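def recursive_split(node, max_depth, min_samples_split, criterion):
    # node.X and node.y are the training samples attached to this node
    if stopping_criteria(node, max_depth, min_samples_split):
        if node.value is None:
            node.value = Counter(node.y).most_common(1)[0][0]  # majority class
        return
    feature_index, threshold = best_split(node.X, node.y, criterion)
    left_mask = None if feature_index is None else node.X[:, feature_index] <= threshold
    # Turn the node into a leaf when no split exists or one side would be empty
    if left_mask is None or left_mask.all() or not left_mask.any():
        node.value = Counter(node.y).most_common(1)[0][0]
        return
    right_mask = ~left_mask
    # The Node constructor takes no X/y arguments, so attach the subsets as attributes
    left_node, right_node = Node(), Node()
    left_node.X, left_node.y = node.X[left_mask], node.y[left_mask]
    right_node.X, right_node.y = node.X[right_mask], node.y[right_mask]
    node.feature_index = feature_index
    node.threshold = threshold
    node.left = left_node
    node.right = right_node
    # Keep None (unlimited depth) as None instead of computing None - 1
    next_depth = None if max_depth is None else max_depth - 1
    recursive_split(left_node, next_depth, min_samples_split, criterion)
    recursive_split(right_node, next_depth, min_samples_split, criterion)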

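V. Pruning

Pruning reduces the complexity of a decision tree by removing unnecessary nodes, which in turn reduces overfitting. The two common approaches are pre-pruning and post-pruning.

1. Pre-Pruning

Pre-pruning limits the depth and complexity of the tree while it is being built, by setting stopping conditions.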

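def pre_pruning(node, max_depth, min_samples_split, criterion):
    # Pre-pruning: refuse to split once a stopping condition holds and make the node a leaf.
    # node.X and node.y are the training samples attached to this node.
    if stopping_criteria(node, max_depth, min_samples_split):
        if node.value is None:
            node.value = Counter(node.y).most_common(1)[0][0]  # majority class
        return
    feature_index, threshold = best_split(node.X, node.y, criterion)
    left_mask = None if feature_index is None else node.X[:, feature_index] <= threshold
    # Also stop when no split exists or one side would be empty
    if left_mask is None or left_mask.all() or not left_mask.any():
        node.value = Counter(node.y).most_common(1)[0][0]
        return
    right_mask = ~left_mask
    # The Node constructor takes no X/y arguments, so attach the subsets as attributes
    left_node, right_node = Node(), Node()
    left_node.X, left_node.y = node.X[left_mask], node.y[left_mask]
    right_node.X, right_node.y = node.X[right_mask], node.y[right_mask]
    node.feature_index = feature_index
    node.threshold = threshold
    node.left = left_node
    node.right = right_node
    # Keep None (unlimited depth) as None instead of computing None - 1
    next_depth = None if max_depth is None else max_depth - 1
    pre_pruning(left_node, next_depth, min_samples_split, criterion)
    pre_pruning(right_node, next_depth, min_samples_split, criterion)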

2. Post-Pruning

Post-pruning removes unnecessary nodes after the tree has been fully built, which reduces its complexity.

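def post_pruning(node, X_val, y_val):
    # Reduced-error pruning: X_val and y_val are the validation samples that reach this node
    if is_leaf(node) or len(y_val) == 0:
        return
    # Route the validation samples to the children and prune bottom-up
    left_mask = X_val[:, node.feature_index] <= node.threshold
    post_pruning(node.left, X_val[left_mask], y_val[left_mask])
    post_pruning(node.right, X_val[~left_mask], y_val[~left_mask])
    if is_leaf(node.left) and is_leaf(node.right):
        # Correct predictions made by the current two-leaf subtree
        y_pred = np.where(left_mask, node.left.value, node.right.value)
        subtree_correct = np.sum(y_pred == y_val)
        # Correct predictions if the node were collapsed into a single leaf.
        # The node keeps no training labels, so the validation majority stands in for the leaf label.
        majority = Counter(y_val).most_common(1)[0][0]
        leaf_correct = np.sum(y_val == majority)
        if leaf_correct >= subtree_correct:
            node.left = None
            node.right = None
            node.value = majority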

VI. Model Evaluation

Model evaluation measures how well the decision tree performs. Common metrics are accuracy, precision, recall, and the F1 score.

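1. Accuracy

Accuracy is the fraction of all samples that are classified correctly.

from sklearn.metrics import accuracy_score
def evaluate_accuracy(tree, X_test, y_test):
    # predict handles one sample at a time, so apply it to every row
    y_pred = [predict(tree, x) for x in X_test]
    return accuracy_score(y_test, y_pred)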

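2. Precision

Precision is the fraction of samples predicted as positive that are actually positive.

from sklearn.metrics import precision_score
def evaluate_precision(tree, X_test, y_test):
    # The default settings assume binary labels with 1 as the positive class
    y_pred = [predict(tree, x) for x in X_test]
    return precision_score(y_test, y_pred)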

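3. Recall

Recall is the fraction of actual positive samples that are predicted as positive.

from sklearn.metrics import recall_score
def evaluate_recall(tree, X_test, y_test):
    # The default settings assume binary labels with 1 as the positive class
    y_pred = [predict(tree, x) for x in X_test]
    return recall_score(y_test, y_pred)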

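4. F1 Score

The F1 score is the harmonic mean of precision and recall.

from sklearn.metrics import f1_score
def evaluate_f1(tree, X_test, y_test):
    # The default settings assume binary labels with 1 as the positive class
    y_pred = [predict(tree, x) for x in X_test]
    return f1_score(y_test, y_pred)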
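
The four metrics are easy to verify by hand from the confusion-matrix counts. The sketch below does so in plain Python on made-up labels, with 1 as the positive class; it is meant to illustrate the definitions, not to replace the scikit-learn functions.

```python
# Hypothetical true and predicted labels for a binary problem (1 = positive)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

With 3 true positives, 1 true negative, 1 false positive, and 1 false negative, accuracy is 4/6 while precision, recall, and F1 are all 0.75.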

VII. Example Code

The complete example below puts the pieces together: it builds a decision tree, predicts with it, and evaluates it.

import numpy as np
import pandas as pd
from collections import Counter
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
class Node:
    def __init__(self, feature_index=None, threshold=None, left=None, right=None, value=None):
        self.feature_index = feature_index
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
def entropy(y):
    counts = Counter(y)
    probabilities = [count / len(y) for count in counts.values()]
    return -sum(p * np.log2(p) for p in probabilities)
def information_gain(X, y, feature_index):
    unique_values = np.unique(X[:, feature_index])
    weighted_entropy = 0
    for value in unique_values:
        subset_y = y[X[:, feature_index] == value]
        weighted_entropy += len(subset_y) / len(y) * entropy(subset_y)
    return entropy(y) - weighted_entropy
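def best_split(X, y, criterion='information_gain'):
    best_feature_index, best_threshold, best_score = None, None, float('-inf')
    for feature_index in range(X.shape[1]):
        thresholds = np.unique(X[:, feature_index])
        for threshold in thresholds:
            left_mask = X[:, feature_index] <= threshold
            right_mask = X[:, feature_index] > threshold
            left_y, right_y = y[left_mask], y[right_mask]
            # Skip thresholds that leave one side empty (they separate nothing)
            if len(left_y) == 0 or len(right_y) == 0:
                continue
            if criterion == 'information_gain':
                # Gain of this particular binary split, not of the feature as a whole
                w_left, w_right = len(left_y) / len(y), len(right_y) / len(y)
                score = entropy(y) - (w_left * entropy(left_y) + w_right * entropy(right_y))
            else:
                raise ValueError(f"Unknown criterion: {criterion}")
            if score > best_score:
                best_feature_index, best_threshold, best_score = feature_index, threshold, score
    return best_feature_index, best_threshold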
def build_tree(X, y, criterion='information_gain', max_depth=None, min_samples_split=2):
    # Stop when the node is pure, too small to split, or the depth budget is used up
    if len(np.unique(y)) == 1 or len(y) < min_samples_split or max_depth == 0:
        return Node(value=Counter(y).most_common(1)[0][0])
    feature_index, threshold = best_split(X, y, criterion)
    if feature_index is None:
        return Node(value=Counter(y).most_common(1)[0][0])
    left_mask = X[:, feature_index] <= threshold
    right_mask = ~left_mask
    # A split that leaves one side empty separates nothing, so return a leaf
    if not left_mask.any() or not right_mask.any():
        return Node(value=Counter(y).most_common(1)[0][0])
    # Keep None (unlimited depth) as None instead of computing None - 1
    next_depth = None if max_depth is None else max_depth - 1
    left_node = build_tree(X[left_mask], y[left_mask], criterion, next_depth, min_samples_split)
    right_node = build_tree(X[right_mask], y[right_mask], criterion, next_depth, min_samples_split)
    return Node(feature_index=feature_index, threshold=threshold, left=left_node, right=right_node)
def is_leaf(node):
    return node.value is not None
def predict(node, X):
    if is_leaf(node):
        return node.value
    if X[node.feature_index] <= node.threshold:
        return predict(node.left, X)
    else:
        return predict(node.right, X)
def evaluate_accuracy(tree, X_test, y_test):
    y_pred = [predict(tree, x) for x in X_test]
    return accuracy_score(y_test, y_pred)
def evaluate_precision(tree, X_test, y_test):
    y_pred = [predict(tree, x) for x in X_test]
    return precision_score(y_test, y_pred)
def evaluate_recall(tree, X_test, y_test):
    y_pred = [predict(tree, x) for x in X_test]
    return recall_score(y_test, y_pred)
def evaluate_f1(tree, X_test, y_test):
    y_pred = [predict(tree, x) for x in X_test]
    return f1_score(y_test, y_pred)
# Sample dataset