Decision Tree
Background Knowledge
- sample group $D$: the training set, with $p_k$ the proportion of class $k$ samples in $D$, $k = 1, 2, \ldots, K$.
- attribute $a$: a discrete attribute with $V$ possible values, $a = \{a^1, a^2, \ldots, a^V\}$; splitting $D$ on $a$ yields the subsets $D^v = \{x \in D \mid a(x) = a^v\}$.
- information entropy: measures the impurity of $D$,
  $$\mathrm{Ent}(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$
- information gain: the reduction in entropy obtained by splitting $D$ on attribute $a$,
  $$\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)$$
- gain ratio: normalizes information gain by the intrinsic value of $a$, counteracting the bias toward many-valued attributes,
  $$\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)}, \qquad \mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$$
- Gini index: the probability that two samples drawn at random from $D$ belong to different classes,
  $$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2$$
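For intuition, a two-class set split evenly between the classes has the maximum entropy of one bit, while a pure set has zero entropy:

$$\mathrm{Ent}(D) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1, \qquad \mathrm{Ent}(D_{\text{pure}}) = -1 \cdot \log_2 1 = 0$$

Entropy thus ranges over $[0, \log_2 K]$, and a good split drives the weighted subset entropies, and hence the second term of $\mathrm{Gain}(D, a)$, toward zero.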
Example 1
Table 1: Loan Application Sample Dataset
This table presents the features of 15 applicants (Age, Employed, Owns House, Credit Status) and their loan application results (Class). This dataset is commonly used for training and analyzing decision tree algorithms.
ID | Age | Employed | Owns House | Credit Status | Class |
---|---|---|---|---|---|
1 | Young | No | No | Fair | No |
2 | Young | No | No | Good | No |
3 | Young | Yes | No | Good | Yes |
4 | Young | Yes | Yes | Fair | Yes |
5 | Young | No | No | Fair | No |
6 | Middle-aged | No | No | Fair | No |
7 | Middle-aged | No | No | Good | No |
8 | Middle-aged | Yes | Yes | Good | Yes |
9 | Middle-aged | No | Yes | Excellent | Yes |
10 | Middle-aged | No | Yes | Excellent | Yes |
11 | Senior | No | Yes | Excellent | Yes |
12 | Senior | No | Yes | Good | Yes |
13 | Senior | Yes | No | Good | Yes |
14 | Senior | Yes | No | Excellent | Yes |
15 | Senior | No | No | Fair | No |
ID3
Firstly, calculate the information entropy of $D$. Among the 15 samples, 9 are classified 'Yes' and 6 are classified 'No':

$$\mathrm{Ent}(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971$$

Then, calculate the information gain of each attribute over $D$:

$$\mathrm{Gain}(D, \text{Age}) \approx 0.083, \quad \mathrm{Gain}(D, \text{Employed}) \approx 0.324, \quad \mathrm{Gain}(D, \text{Owns House}) \approx 0.420, \quad \mathrm{Gain}(D, \text{Credit Status}) \approx 0.363$$
Since attribute 'Owns House' has the highest information gain, it is selected as the optimal feature. Splitting $D$ on 'Owns House' yields two subsets: $D_1$ (Owns House = Yes) and $D_2$ (Owns House = No). All labels in $D_1$ are 'Yes', so $D_1$ becomes a leaf node labeled 'Yes'; $D_2$, however, still needs to be split.
```mermaid
graph TD
    A[Owns House] -- yes --> B["yes, D1"]
    A -- no --> C["D2"]
```
Next, compute the information gain of each remaining attribute on $D_2$ (3 'Yes' and 6 'No' samples, so $\mathrm{Ent}(D_2) \approx 0.918$):

$$\mathrm{Gain}(D_2, \text{Age}) \approx 0.251, \quad \mathrm{Gain}(D_2, \text{Employed}) \approx 0.918, \quad \mathrm{Gain}(D_2, \text{Credit Status}) \approx 0.474$$
Since attribute 'Employed' has the highest information gain on $D_2$, it is selected as the optimal feature. Both resulting subsets are pure (Employed = Yes samples are all 'Yes', Employed = No samples are all 'No'), so we obtain the final tree.
```mermaid
graph TD
    A[Owns House] -- yes --> B["yes"]
    A -- no --> C["Employed"]
    C -- yes --> D["yes"]
    C -- no --> E["no"]
```
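To make the procedure concrete, here is a minimal, self-contained sketch of ID3 in plain Python, run on the dataset of Table 1. The names (`DATA`, `ATTRS`, `entropy`, `info_gain`, `id3`) are illustrative choices of mine, not from any library.

```python
from collections import Counter
from math import log2

# The loan dataset from Table 1: (Age, Employed, Owns House, Credit Status, Class)
DATA = [
    ("Young", "No", "No", "Fair", "No"),
    ("Young", "No", "No", "Good", "No"),
    ("Young", "Yes", "No", "Good", "Yes"),
    ("Young", "Yes", "Yes", "Fair", "Yes"),
    ("Young", "No", "No", "Fair", "No"),
    ("Middle-aged", "No", "No", "Fair", "No"),
    ("Middle-aged", "No", "No", "Good", "No"),
    ("Middle-aged", "Yes", "Yes", "Good", "Yes"),
    ("Middle-aged", "No", "Yes", "Excellent", "Yes"),
    ("Middle-aged", "No", "Yes", "Excellent", "Yes"),
    ("Senior", "No", "Yes", "Excellent", "Yes"),
    ("Senior", "No", "Yes", "Good", "Yes"),
    ("Senior", "Yes", "No", "Good", "Yes"),
    ("Senior", "Yes", "No", "Excellent", "Yes"),
    ("Senior", "No", "No", "Fair", "No"),
]
ATTRS = ["Age", "Employed", "Owns House", "Credit Status"]

def entropy(rows):
    """Ent(D) = -sum p_k log2 p_k over the class labels (last column)."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_gain(rows, attr_idx):
    """Gain(D, a) = Ent(D) - sum |D^v|/|D| * Ent(D^v)."""
    n = len(rows)
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr_idx], []).append(r)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(rows) - remainder

def id3(rows, attr_indices):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not attr_indices:               # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: info_gain(rows, i))
    rest = [i for i in attr_indices if i != best]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[best], []).append(r)
    return {ATTRS[best]: {v: id3(s, rest) for v, s in subsets.items()}}

print(id3(DATA, list(range(len(ATTRS)))))
```

Running it prints the nested-dict tree `{'Owns House': {'No': {'Employed': {'No': 'No', 'Yes': 'Yes'}}, 'Yes': 'Yes'}}`, matching the diagram above.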
Pruning
Definition: loss function
Suppose a decision tree $T$ has $|T|$ leaf nodes; each leaf node $t$ contains $N_t$ samples, of which $N_{tk}$ belong to class $k$, $k = 1, 2, \ldots, K$.

loss function

$$C_\alpha(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|, \qquad \alpha \geq 0$$

empirical entropy

$$H_t(T) = -\sum_{k=1}^{K} \frac{N_{tk}}{N_t} \log_2 \frac{N_{tk}}{N_t}$$

combination

Writing $C(T) = \sum_{t=1}^{|T|} N_t H_t(T)$ for the first term, the loss function becomes $C_\alpha(T) = C(T) + \alpha |T|$.

purpose

$C(T)$ measures how well the tree fits the training data, while $\alpha |T|$ penalizes model complexity; pruning chooses the subtree that minimizes $C_\alpha(T)$ for the given $\alpha$.
algorithm

We can define a DP table over subtrees and use the idea of dynamic programming to find the optimal pruning; a sketch in Python follows the list.

- input: the generated tree $T$, parameter $\alpha$, and training data $X = \{(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots, (\vec{x}_N, y_N)\}$
- output: the pruned tree $T_\alpha$
- Step 1: for each leaf node $t$ of $T$, calculate the empirical entropy $H_t(T)$
- Step 2: recursively backtrack from the leaf nodes upwards; for each internal node, compare the loss $C_\alpha(T_B)$ of the tree before pruning it with the loss $C_\alpha(T_A)$ after collapsing it into a single leaf, and prune whenever $C_\alpha(T_A) \leq C_\alpha(T_B)$
- Step 3: return to Step 2 until no further pruning is possible, then output $T_\alpha$
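Below is a minimal sketch of Step 2 under stated assumptions: a hypothetical `Node` class of my own that stores per-node class counts (so $N_t$ and $H_t$ are computable at every node); it is not any particular library's API.

```python
from math import log2

class Node:
    """A tree node: a leaf if it has no children; every node keeps the
    class counts of the training samples that reach it."""
    def __init__(self, counts, children=None):
        self.counts = counts            # dict: class label -> sample count
        self.children = children or {}  # attribute value -> Node

    def is_leaf(self):
        return not self.children

def leaf_cost(counts):
    """N_t * H_t(T) for one leaf: empirical entropy weighted by leaf size."""
    n = sum(counts.values())
    return -sum(c * log2(c / n) for c in counts.values() if c > 0)

def cost(node):
    """Return (sum of N_t H_t over leaves, number of leaves) for a subtree."""
    if node.is_leaf():
        return leaf_cost(node.counts), 1
    total, leaves = 0.0, 0
    for child in node.children.values():
        c, l = cost(child)
        total, leaves = total + c, leaves + l
    return total, leaves

def prune(node, alpha):
    """Bottom-up pruning: collapse an internal node into a leaf whenever
    C_alpha(after) <= C_alpha(before) for its subtree."""
    if node.is_leaf():
        return node
    node.children = {v: prune(c, alpha) for v, c in node.children.items()}
    fit, leaves = cost(node)
    before = fit + alpha * leaves
    after = leaf_cost(node.counts) + alpha   # subtree replaced by one leaf
    if after <= before:
        node.children = {}                   # prune: become a leaf
    return node

# Example: a stump over 10 samples whose children barely separate them.
root = Node({"yes": 6, "no": 4}, {
    "left": Node({"yes": 5, "no": 3}),
    "right": Node({"yes": 1, "no": 1}),
})
prune(root, alpha=2.0)
print(root.is_leaf())  # True: for this alpha the split isn't worth two leaves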
CART
Regression Tree
A regression tree corresponds to a partition of the input space together with an output value on each partition unit.

Suppose the input space has been partitioned into $M$ units $R_1, R_2, \ldots, R_M$, and each unit $R_m$ carries a fixed output value $c_m$; the regression tree model can then be defined as follows.

regression tree model

$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m)$$

Mean Squared Error

With the squared error $\sum_{\vec{x}_i \in R_m} (y_i - f(\vec{x}_i))^2$ as the loss on each unit, the optimal output value is the mean of the responses falling in that unit:

$$\hat{c}_m = \operatorname{ave}(y_i \mid \vec{x}_i \in R_m)$$
splitting algorithm

- input: training data $D$
- output: regression tree $f(x)$
- Step 1: scan the splitting variable $j$ and split point $s$, choosing the pair that solves
  $$\min_{j,\,s}\left[\min_{c_1}\sum_{\vec{x}_i \in R_1(j,s)}(y_i - c_1)^2 + \min_{c_2}\sum_{\vec{x}_i \in R_2(j,s)}(y_i - c_2)^2\right]$$
  where $R_1(j, s) = \{x \mid x^{(j)} \leq s\}$ and $R_2(j, s) = \{x \mid x^{(j)} > s\}$
- Step 2: compute the output value $\hat{c}_m = \operatorname{ave}(y_i \mid \vec{x}_i \in R_m)$ on each new region
- Step 3: repeat Steps 1 and 2 on each region until a stopping condition is met, then output $f(x) = \sum_{m=1}^{M} \hat{c}_m I(x \in R_m)$ (a sketch of this search follows the list)
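As an illustration, here is a small sketch of this least-squares split search in Python with NumPy. `best_split`, `build`, and `predict` are illustrative names of mine, and the stopping rule (max depth, minimum samples, purity) is an assumption, not from the original.

```python
import numpy as np

def best_split(X, y):
    """Scan every (feature j, threshold s) pair and return the one minimizing
    the summed squared error of R1 (x_j <= s) and R2 (x_j > s)."""
    best_j, best_s, best_loss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:   # candidate split points
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # min over c of sum (y - c)^2 is attained at c = mean(y)
            loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if loss < best_loss:
                best_j, best_s, best_loss = j, s, loss
    return best_j, best_s

def build(X, y, depth=0, max_depth=3, min_samples=2):
    """Recursive partitioning; a leaf stores c_m = ave(y_i | x_i in R_m)."""
    if depth >= max_depth or len(y) < min_samples or np.all(y == y[0]):
        return float(y.mean())
    j, s = best_split(X, y)
    if j is None:                            # no valid split remains
        return float(y.mean())
    mask = X[:, j] <= s
    return {"j": j, "s": float(s),
            "left": build(X[mask], y[mask], depth + 1, max_depth, min_samples),
            "right": build(X[~mask], y[~mask], depth + 1, max_depth, min_samples)}

def predict(tree, x):
    """Route x down the tree until a leaf value is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["j"]] <= tree["s"] else tree["right"]
    return tree

# Toy usage: fit a step function of a single feature.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.where(X[:, 0] < 5, 1.0, 3.0)
tree = build(X, y)
print(predict(tree, np.array([2.0])), predict(tree, np.array([7.0])))  # 1.0 3.0
```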
Classification Tree
A classification tree selects splits with the Gini index. CART grows binary trees: testing whether feature $A$ takes value $a$ divides $D$ into $D_1$ and $D_2$, and the Gini index of that split is

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

The feature and split point with the smallest Gini index are chosen, and the tree is grown recursively as in ID3; a short sketch follows.
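To tie this back to Table 1, here is a brief sketch that scores every candidate binary split of the loan dataset by Gini index; it reuses the illustrative `DATA` and `ATTRS` from the ID3 sketch above.

```python
from collections import Counter

def gini(rows):
    """Gini(D) = 1 - sum_k p_k^2 over the class labels (last column)."""
    n = len(rows)
    return 1.0 - sum((c / n) ** 2 for c in Counter(r[-1] for r in rows).values())

def gini_index(rows, attr_idx, value):
    """Gini(D, A) for the binary split D1 = {A == value}, D2 = {A != value}."""
    d1 = [r for r in rows if r[attr_idx] == value]
    d2 = [r for r in rows if r[attr_idx] != value]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Score every (feature, value) binary split of the loan data; smallest wins.
for i, name in enumerate(ATTRS):
    for v in sorted({r[i] for r in DATA}):
        print(f"Gini(D, {name} = {v}) = {gini_index(DATA, i, v):.3f}")
```

Running it, 'Owns House' yields the smallest Gini index (about 0.27), so CART also chooses 'Owns House' as the first split on this dataset, consistent with the ID3 result above.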