A Step-by-Step Guide to Implementing a Simple Decision Tree by Hand in Python

2023-09-19 09:14:29

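1. Introduction to Decision Trees

A decision tree is a common machine learning model, most often used for classification. It represents the decision process as a tree: each sample is routed from the root node through a series of attribute tests until it reaches a leaf node, which gives its class. A tree is built as follows: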

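Select the best splitting attribute: among all attributes, choose the one that partitions the dataset best (this article measures that with information gain).
Split the dataset: partition the data into subsets according to the values of the selected attribute.
Recurse: apply the two steps above to each subset, until every sample in a subset belongs to the same class or no attribute can split it further.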
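
To make "best split" concrete, one common criterion is information gain: the drop in label entropy after grouping the samples by an attribute. The following is a minimal sketch on made-up toy data; the function names `entropy` and `information_gain` and the sample arrays are illustrative assumptions, not part of the implementation in section 2.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(values, labels):
    """Entropy of `labels` minus the weighted entropy after grouping by `values`."""
    values = np.asarray(values)
    labels = np.asarray(labels)
    remainder = 0.0
    for v in np.unique(values):
        mask = values == v
        remainder += mask.mean() * entropy(labels[mask])
    return entropy(labels) - remainder

# Hypothetical toy data: the attribute separates the two classes perfectly
weather = np.array(["sunny", "sunny", "rainy", "rainy"])
play = np.array(["yes", "yes", "no", "no"])
print(entropy(play))                    # 1.0 bit: two equally likely classes
print(information_gain(weather, play))  # 1.0: the split removes all uncertainty
```

An attribute that takes the same value for every sample has a gain of zero, so it can never be chosen as a split.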

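2. Implementing a Simple Decision Tree by Hand

We will walk through the construction with a small classification problem. Suppose each data point has two attributes, age and income, and we want to assign it to one of three classes: high income, middle income, or low income. The code treats every distinct attribute value as its own discrete category.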
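
Age and income are continuous quantities, so one common preprocessing step is to group them into discrete intervals before building a tree. The sketch below shows one way to do that with `np.digitize`; the bin edges and the names `age_edges`, `income_edges`, and `samples` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical bin edges, chosen only for illustration
age_edges = np.array([30, 40])           # intervals: <30, 30-39, >=40
income_edges = np.array([40000, 60000])  # intervals: <40000, 40000-59999, >=60000

samples = np.array([
    [25, 30000],
    [33, 52000],
    [45, 75000],
])

# np.digitize returns, for each value, the index of the interval it falls into
age_bins = np.digitize(samples[:, 0], age_edges)
income_bins = np.digitize(samples[:, 1], income_edges)
binned = np.column_stack([age_bins, income_bins])
print(binned)
```

The implementation that follows works on the raw values directly and does not apply this binning.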

import numpy as np

# Define the dataset (columns: age, income)
dataset = np.array([
    [30, 50000],
    [25, 30000],
    [40, 70000],
    [35, 55000],
    [28, 40000],
    [42, 60000],
    [33, 52000],
    [27, 35000],
    [45, 75000],
    [38, 60000]
])

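# Define the target labels (one per row of the dataset)
target = np.array([
    "high income",
    "low income",
    "high income",
    "middle income",
    "low income",
    "high income",
    "middle income",
    "low income",
    "high income",
    "middle income"
])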

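# Build the decision tree.
# A leaf is a class label (str); an internal node is a tuple
# (attribute index, {attribute value: subtree}, majority label).
def build_tree(dataset, target):
    # Compute the information gain of splitting on one attribute (column index)
    def calculate_information_gain(dataset, target, attribute):
        # Entropy of an array of class labels
        def calculate_entropy(labels):
            _, counts = np.unique(labels, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log2(p))

        # Conditional entropy of the labels given the attribute
        def calculate_conditional_entropy(dataset, target, attribute):
            unique_values = np.unique(dataset[:, attribute])
            conditional_entropy = 0
            for value in unique_values:
                mask = dataset[:, attribute] == value
                conditional_entropy += np.mean(mask) * calculate_entropy(target[mask])
            return conditional_entropy

        # Information gain = entropy - conditional entropy
        entropy = calculate_entropy(target)
        conditional_entropy = calculate_conditional_entropy(dataset, target, attribute)
        information_gain = entropy - conditional_entropy
        return information_gain

    # Majority class at this node; used for leaves and for unseen attribute values
    labels, counts = np.unique(target, return_counts=True)
    majority_label = str(labels[np.argmax(counts)])

    # Stop when all samples belong to the same class
    if len(labels) == 1:
        return majority_label

    # Select the best splitting attribute (every column of dataset is an attribute)
    attributes = np.size(dataset, 1)
    max_information_gain = 0
    best_attribute = None
    for attribute in range(attributes):
        information_gain = calculate_information_gain(dataset, target, attribute)
        if information_gain > max_information_gain:
            max_information_gain = information_gain
            best_attribute = attribute

    # Stop when no attribute reduces the entropy
    if best_attribute is None:
        return majority_label

    # Split the dataset and build one subtree per attribute value
    unique_values = np.unique(dataset[:, best_attribute])
    subtrees = {}
    for value in unique_values:
        mask = dataset[:, best_attribute] == value
        subtrees[value] = build_tree(dataset[mask], target[mask])

    # Return the decision tree
    return best_attribute, subtrees, majority_label

# Train the decision tree
tree = build_tree(dataset, target)

# Use the decision tree to classify a new data point
def predict(data, tree):
    # A leaf is stored as a plain class label
    if isinstance(tree, str):
        return tree
    attribute, subtrees, majority_label = tree
    value = data[attribute]
    # A value never seen during training has no branch:
    # fall back to the majority class of this node
    if value not in subtrees:
        return majority_label
    return predict(data, subtrees[value])

# Test the decision tree.
# Every age in the training data is unique, so age has the highest information
# gain and the root splits on it. Age 32 does not occur in the training data,
# so this prediction falls back to the root's majority class.
new_data = np.array([32, 45000])
prediction = predict(new_data, tree)
print("Prediction:", prediction)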
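
To see what a tree has learned, it helps to print its structure. The sketch below assumes the representation used above: an internal node is a tuple whose first two items are the attribute index and a dict of subtrees, and a leaf is a label string. The names `describe_tree` and `attribute_names`, and the hand-built `sample_tree`, are hypothetical and only serve the illustration.

```python
def describe_tree(tree, attribute_names, indent=""):
    """Return the lines of an indented text rendering of a nested-tuple tree."""
    if isinstance(tree, str):  # leaf: a class label
        return [f"{indent}-> {tree}"]
    attribute, subtrees = tree[0], tree[1]
    lines = []
    for value, subtree in subtrees.items():
        lines.append(f"{indent}{attribute_names[attribute]} == {value}")
        lines.extend(describe_tree(subtree, attribute_names, indent + "    "))
    return lines

# Hand-built example tree (not produced by the training code above)
sample_tree = (0, {
    25: "low income",
    40: (1, {60000: "middle income", 70000: "high income"}),
})
print("\n".join(describe_tree(sample_tree, ["age", "income"])))
```

Because the function reads only the first two items of each node, it does not matter whether a node carries extra fields such as a fallback label.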