Fully Connected Networks

In this post, we use both code and mathematical derivations to explain how a fully connected neural network (FCNN) can be applied to handwritten digit recognition. The post covers data preprocessing, model construction, and the forward and backward passes in detail, and derives the mathematics behind the gradient computations. By comparing training runs with different batch sizes, it demonstrates the practical advantages of mini-batch gradient descent, and it goes on to examine how L1/L2 regularization affects the model's ability to generalize. It is aimed at readers who want a deeper understanding of both the theory and the practice of neural networks.

Code Implementation and Mathematical Principles

First, download the MNIST handwritten digit dataset:

import torchvision
import numpy as np
import random

# Download MNIST; transform=None keeps the raw PIL images
mnist_train = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=True,
                                         download=True, transform=None)
mnist_test = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=False,
                                        download=True, transform=None)
print(mnist_train[0])

Running this prints (<PIL.Image.Image image mode=L size=28x28 at 0x1397411D0>, 5), which tells us that each sample's data part is a \(28 \times 28\) grayscale image and its target is an integer label. We therefore convert the data to arrays, normalize it, and flatten each image:

# Scale pixel values to [0, 1]
X_train = np.array(mnist_train.data) / 255.0
y_train = np.array(mnist_train.targets)
X_test = np.array(mnist_test.data) / 255.0
y_test = np.array(mnist_test.targets)

# Flatten each 28x28 image into a 784-dimensional vector
X_train = X_train.reshape(-1, 28 * 28)
X_test = X_test.reshape(-1, 28 * 28)

We train with (mini-)batch gradient descent, so we need a helper that splits the data into batches:

def data_iter(batch_size, features, labels):
    # Yield shuffled mini-batches of (features, labels), reshuffling on every call
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = np.array(indices[i: min(i + batch_size, num_examples)])
        yield features[j], labels[j]
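
A quick, illustrative sanity check (not part of the training code) shows that the iterator yields shuffled mini-batches of the expected shapes:

# Draw a single mini-batch and inspect its shapes
X_batch, y_batch = next(data_iter(64, X_train, y_train))
print(X_batch.shape, y_batch.shape)  # expected: (64, 784) (64,)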

Define the activation functions and their derivatives:

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Derivative of relu: 1 where x > 0, 0 elsewhere
    return (x > 0).astype(float)

def softmax(x):
    # Subtract the row-wise max for numerical stability before exponentiating
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

Define the loss function:

def cross_entropy(y, y_hat):
    # Mean cross-entropy over the batch; y is one-hot, y_hat are predicted probabilities
    return - np.sum(y * np.log(y_hat)) / len(y)

Define one-hot encoding:

def one_hot(y, num_classes):
    # Convert integer labels to one-hot row vectors
    y_onehot = np.zeros((y.size, num_classes))
    y_onehot[np.arange(y.size), y] = 1
    return y_onehot

For example, the label 5 becomes (0, 0, 0, 0, 0, 1, 0, 0, 0, 0).
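
A quick illustrative check of this behaviour:

print(one_hot(np.array([5]), 10))
# [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]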

Choose the network dimensions and initialize the weights:

input_size = 784
hidden_size = 128
output_size = 10

# Weights drawn from a standard normal distribution, biases initialized to zero
w_1 = np.random.normal(0, 1, (input_size, hidden_size))
b_1 = np.zeros((1, hidden_size))
w_2 = np.random.normal(0, 1, (hidden_size, output_size))
b_2 = np.zeros((1, output_size))

We apply \(\text{relu}\) between the input and hidden layers and \(\text{softmax}\) between the hidden and output layers, so the forward pass is:

def forward(X):
    Z_1 = np.dot(X, w_1) + b_1    # (batch_size, 128)
    A_1 = relu(Z_1)
    Z_2 = np.dot(A_1, w_2) + b_2  # (batch_size, 10)
    A_2 = softmax(Z_2)
    return Z_1, A_1, Z_2, A_2

Using the cross-entropy loss, we now derive the backpropagation formulas. The core of backpropagation is computing the derivative (gradient) of the loss with respect to each layer's weights and biases, which follows from the chain rule. Let \(Y\) denote the one-hot target at the output layer; for each layer, \(Z\) denotes the linear (pre-activation) output and \(A\) the activation, and the input is denoted \(X\):

\[ \begin{aligned} Z_1 &= w_1 X + b_1 \\ A_1 &= \text{relu}(Z_1) \\ Z_2 &= w_2 A_1 + b_2 \\ A_{2, j} &= \text{softmax}(Z_{2, j}) = \frac{\text{e}^{Z_{2, j}}}{\sum\limits_{k} \text{e}^{Z_{2, k}}} \end{aligned} \]

At the output layer, the loss is:

\[ L = - \sum_{i = 1}^{C} (Y_i \times \ln A_{2,i}) \]

where \(C\) is the number of classes. Differentiating with respect to \(A_2\) componentwise:

\[ \frac{\partial L}{\partial A_{2,i}} = - \frac{Y_i}{A_{2,i}} \]

Since \(A_2 = \text{softmax}(Z_2)\), whose Jacobian is \(\frac{\partial A_{2,i}}{\partial Z_{2,j}} = A_{2,i}(\delta_{ij} - A_{2,j})\), we get:

\[ \begin{aligned} \frac{\partial L}{\partial Z_{2,j}} &= \sum_{i} \frac{\partial L}{\partial A_{2,i}} \cdot \frac{\partial A_{2,i}}{\partial Z_{2,j}} \\ &= \biggl(- \frac{Y_j}{A_{2,j}}\biggr) \cdot A_{2,j} (1 - A_{2,j}) + \sum_{i \neq j} \biggl(- \frac{Y_i}{A_{2,i}}\biggr) \cdot (-A_{2,i} A_{2,j}) \\ &= -Y_j + A_{2,j} \sum_{i} Y_i \\ &= A_{2,j} - Y_j \end{aligned} \]

Since \(\sum_i Y_i = 1\), this is \(\frac{\partial L}{\partial Z_2} = A_2 - Y\) in vector form.

From \(Z_2 = w_2 A_1 + b_2\):

\[ \begin{aligned} \frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial w_2} = (A_2 - Y) \cdot A_1 \\ \frac{\partial L}{\partial b_2} &= \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial b_2} = (A_2 - Y) \\ \frac{\partial L}{\partial A_1} &= \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial A_1} = (A_2 - Y) \cdot w_2 \end{aligned} \]

From \(A_1 = \text{relu}(Z_1)\):

\[ \frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} = (A_2 - Y) \cdot w_2 \cdot \frac{\partial A_1}{\partial Z_1} \]

where \(\frac{\partial A_1}{\partial Z_1}\) follows directly from the definition of \(\text{relu}\): it is 1 where \(Z_1 > 0\) and 0 otherwise (see relu_grad in the code).

From \(Z_1 = w_1 X + b_1\) we obtain analogous expressions for \(\frac{\partial L}{\partial w_1}\) and \(\frac{\partial L}{\partial b_1}\).

Note that this derivation does not track the actual matrix shapes; the implementation must handle the transposes and the averaging over the batch, as shown in the sketch below.
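
To connect the formulas to NumPy code, here is a minimal sketch of one gradient step without regularization, assuming the parameters and helpers defined above (w_1, b_1, w_2, b_2, forward, relu_grad, one_hot) and a mini-batch (X_batch, y_batch) drawn from data_iter; the regularized variants appear in the Improvements section.

# One mini-batch gradient step (no regularization)
Z_1, A_1, Z_2, A_2 = forward(X_batch)
y_onehot = one_hot(y_batch, output_size)
m = len(X_batch)  # actual batch size (the last batch may be smaller)

dZ_2 = A_2 - y_onehot                               # dL/dZ_2 = A_2 - Y         (m, 10)
dw_2 = np.dot(A_1.T, dZ_2) / m                      # dL/dw_2 = A_1^T dZ_2 / m  (128, 10)
db_2 = np.sum(dZ_2, axis=0, keepdims=True) / m      # dL/db_2                   (1, 10)

dA_1 = np.dot(dZ_2, w_2.T)                          # dL/dA_1 = dZ_2 w_2^T      (m, 128)
dZ_1 = dA_1 * relu_grad(Z_1)                        # elementwise relu'(Z_1)    (m, 128)
dw_1 = np.dot(X_batch.T, dZ_1) / m                  # (784, 128)
db_1 = np.sum(dZ_1, axis=0, keepdims=True) / m      # (1, 128)

# Gradient-descent update with learning rate lr
w_1 -= lr * dw_1
b_1 -= lr * db_1
w_2 -= lr * dw_2
b_2 -= lr * db_2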

Results and Analysis

The batch GD implementation above lets us compare the behaviour of small and large batch sizes. The results are listed in the table below (a sketch of how such a sweep might be scripted follows the table); mini-batch gradient descent with small batches clearly achieves the better accuracy.

Batch size    Max training accuracy    Test accuracy
10            0.9696                   0.9455
20            0.9615                   0.9421
30            0.9543                   0.9354
40            0.9506                   0.9368
50            0.9423                   0.9300
60            0.9376                   0.9195
70            0.9385                   0.9242
80            0.9382                   0.9240
90            0.9301                   0.9191
500           0.8982                   0.8908
600           0.8913                   0.8855
700           0.8877                   0.8896
800           0.8912                   0.8891
900           0.8872                   0.8830
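
As an illustration of how this comparison might be scripted, the sketch below assumes a hypothetical train(batch_size) helper that re-initializes the weights, runs the training loop from the reference code, and returns the maximum training accuracy and the final test accuracy:

# Hypothetical sweep over batch sizes; train() wraps the training loop shown in the reference code
for bs in [10, 20, 30, 40, 50, 60, 70, 80, 90, 500, 600, 700, 800, 900]:
    max_train_acc, test_acc = train(batch_size=bs)
    print(f"batch_size={bs}: max train acc {max_train_acc:.4f}, test acc {test_acc:.4f}")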

Improvements

The code so far uses no regularization. Below we add L2 regularization to reduce overfitting.

...existing code...
lr = 0.1
num_epochs = 10
batch_size = 64
l2_lambda = 0.01

...existing code...
dZ_2 = A_2 - y_onehot  # (batch_size, 10)
dw_2 = np.dot(A_1.T, dZ_2) / batch_size + l2_lambda * w_2  # (128, 10)
db_2 = np.sum(dZ_2, axis=0, keepdims=True) / batch_size  # (1, 10)

dA_1 = np.dot(dZ_2, w_2.T)  # (batch_size, 128)
dZ_1 = dA_1 * relu_grad(Z_1)  # (batch_size, 128)
dw_1 = np.dot(X_batch.T, dZ_1) / batch_size + l2_lambda * w_1  # (784, 128)
db_1 = np.sum(dZ_1, axis=0, keepdims=True) / batch_size  # (1, 128)
...existing code...
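
The update above adds the penalty gradient l2_lambda * w, which corresponds to adding (l2_lambda / 2) * (||w_1||^2 + ||w_2||^2) to the objective. If you also want the reported training loss to include this term, a minimal sketch (not part of the original code, assuming train_pred comes from forward(X_train) as in the reference code's evaluation step) is:

# Optional: report the regularized objective rather than the data loss alone
data_loss = cross_entropy(one_hot(y_train, output_size), train_pred)
l2_penalty = 0.5 * l2_lambda * (np.sum(w_1 ** 2) + np.sum(w_2 ** 2))
total_loss = data_loss + l2_penalty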

The comparison with the unregularized model is shown in the table below: the maximum training accuracies are nearly identical, but the test accuracies differ by almost \(1\%\).

Regularization    Max training accuracy    Test accuracy
L2                0.9363                   0.9370
None              0.9364                   0.9264

Next, we try L1 regularization to reduce overfitting.

...existing code...
lr = 0.1
num_epochs = 10
batch_size = 64
l1_lambda = 0.01

...existing code...
dZ_2 = A_2 - y_onehot  # (batch_size, 10)
dw_2 = np.dot(A_1.T, dZ_2) / batch_size + l1_lambda * np.sign(w_2)  # (128, 10)
db_2 = np.sum(dZ_2, axis=0, keepdims=True) / batch_size  # (1, 10)

dA_1 = np.dot(dZ_2, w_2.T)  # (batch_size, 128)
dZ_1 = dA_1 * relu_grad(Z_1)  # (batch_size, 128)
dw_1 = np.dot(X_batch.T, dZ_1) / batch_size + l1_lambda * np.sign(w_1)  # (784, 128)
db_1 = np.sum(dZ_1, axis=0, keepdims=True) / batch_size  # (1, 128)
...existing code...

With the same hyperparameters, L1 and L2 regularization compare as shown below; here L1 performs far worse than L2.

Regularization    Max training accuracy    Test accuracy
L2                0.9363                   0.9370
L1                0.8463                   0.8373

After reducing the L1 coefficient to 0.001 (l1_lambda = 0.001, the value used in the reference code below), L1's performance improves and edges past the unregularized result by a small margin.

Regularization    Max training accuracy    Test accuracy
L1                0.9341                   0.9294

Reference Code

import numpy as np
import torchvision
import random

mnist_train = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=True,
                                         download=True, transform=None)
mnist_test = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=False,
                                        download=True, transform=None)

print(mnist_train[0])  # (<PIL.Image.Image image mode=L size=28x28 at 0x1397411D0>, 5)

# Scale pixel values to [0, 1] and flatten each image to a 784-dimensional vector
X_train = np.array(mnist_train.data) / 255.0
y_train = np.array(mnist_train.targets)
X_test = np.array(mnist_test.data) / 255.0
y_test = np.array(mnist_test.targets)

X_train = X_train.reshape(-1, 28 * 28)
X_test = X_test.reshape(-1, 28 * 28)

def data_iter(batch_size, features, labels):
    # Yield shuffled mini-batches of (features, labels)
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = np.array(indices[i: min(i + batch_size, num_examples)])
        yield features[j], labels[j]

input_size = 784
hidden_size = 128
output_size = 10

# Weights drawn from a standard normal distribution, biases initialized to zero
w_1 = np.random.normal(0, 1, (input_size, hidden_size))
b_1 = np.zeros((1, hidden_size))
w_2 = np.random.normal(0, 1, (hidden_size, output_size))
b_2 = np.zeros((1, output_size))

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Derivative of relu: 1 where x > 0, 0 elsewhere
    return (x > 0).astype(float)

def softmax(x):
    # Subtract the row-wise max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy(y, y_hat):
    # Mean cross-entropy over the batch; y is one-hot, y_hat are predicted probabilities
    return - np.sum(y * np.log(y_hat)) / len(y)

def one_hot(y, num_classes):
    y_onehot = np.zeros((y.size, num_classes))
    y_onehot[np.arange(y.size), y] = 1
    return y_onehot

lr = 0.1
num_epochs = 10
batch_size = 64
l1_lambda = 0.001

def forward(X):
    Z_1 = np.dot(X, w_1) + b_1    # (batch_size, 128)
    A_1 = relu(Z_1)
    Z_2 = np.dot(A_1, w_2) + b_2  # (batch_size, 10)
    A_2 = softmax(Z_2)
    return Z_1, A_1, Z_2, A_2

for epoch in range(1, num_epochs + 1):
    for X_batch, y_batch in data_iter(batch_size, X_train, y_train):
        Z_1, A_1, Z_2, A_2 = forward(X_batch)

        y_onehot = one_hot(y_batch, output_size)

        # Backward pass with L1 regularization on the weights
        dZ_2 = A_2 - y_onehot  # (batch_size, 10)
        dw_2 = np.dot(A_1.T, dZ_2) / batch_size + l1_lambda * np.sign(w_2)  # (128, 10)
        db_2 = np.sum(dZ_2, axis=0, keepdims=True) / batch_size  # (1, 10)

        dA_1 = np.dot(dZ_2, w_2.T)  # (batch_size, 128)
        dZ_1 = dA_1 * relu_grad(Z_1)  # (batch_size, 128)
        dw_1 = np.dot(X_batch.T, dZ_1) / batch_size + l1_lambda * np.sign(w_1)  # (784, 128)
        db_1 = np.sum(dZ_1, axis=0, keepdims=True) / batch_size  # (1, 128)

        # Gradient-descent update
        w_1 -= lr * dw_1
        b_1 -= lr * db_1
        w_2 -= lr * dw_2
        b_2 -= lr * db_2

    _, _, _, train_pred = forward(X_train)  # predicted class probabilities
    train_loss = cross_entropy(one_hot(y_train, output_size), train_pred)
    train_acc = np.mean(np.argmax(train_pred, axis=1) == y_train)
    print(f"Epoch {epoch}, Loss: {train_loss:.4f}, Acc: {train_acc:.4f}")

_, _, _, test_pred = forward(X_test)
test_acc = np.mean(np.argmax(test_pred, axis=1) == y_test)
print(f"Test Accuracy: {test_acc:.4f}")
