In this article, we work through both the code and the underlying mathematics of applying a fully connected neural network (FCNN) to handwritten digit recognition. It covers data preprocessing, model construction, and the forward and backward passes in detail, and derives the gradient computations step by step. By comparing training results across different batch sizes, it demonstrates the advantages of mini-batch gradient descent in practical deep learning tasks, and it further examines how L1/L2 regularization affects the model's generalization. The article is intended for readers who want a deeper understanding of neural network theory and practice.
Code Implementation and Mathematical Principles
First, download the handwritten digit dataset:
```python
import torchvision
import numpy as np
import random

mnist_train = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=True, download=True, transform=None)
mnist_test = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=False, download=True, transform=None)
print(mnist_train[0])
```
The output is (<PIL.Image.Image image mode=L size=28x28 at 0x1397411D0>, 5), which shows that each sample's data part is a \(28 \times 28\) grayscale image and its target is an integer. Based on this, we normalize the pixel values and flatten each image (a quick shape check follows the code):
```python
X_train = np.array(mnist_train.data) / 255.0
y_train = np.array(mnist_train.targets)
X_test = np.array(mnist_test.data) / 255.0
y_test = np.array(mnist_test.targets)

X_train = X_train.reshape(-1, 28 * 28)
X_test = X_test.reshape(-1, 28 * 28)
```
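A quick sanity check of the resulting shapes (MNIST provides 60,000 training and 10,000 test samples, each flattened to 784 features):

```python
print(X_train.shape, y_train.shape)  # (60000, 784) (60000,)
print(X_test.shape, y_test.shape)    # (10000, 784) (10000,)
print(X_train.min(), X_train.max())  # pixel values scaled to [0.0, 1.0]
```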
We will use (mini-)batch gradient descent; the batching helper is as follows:
```python
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = np.array(indices[i: min(i + batch_size, num_examples)])
        yield features[j], labels[j]
```
Define the activation functions and their derivatives:
```python
def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)
```
Define the loss function:
```python
def cross_entropy(y, y_hat):
    return - np.sum(y * np.log(y_hat)) / len(y)
```
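Note that `np.log(y_hat)` produces `-inf` (and the loss becomes `nan`) if any predicted probability underflows to exactly zero. The code in this article does not guard against this; a minimal sketch of a clipped variant, should it be needed:

```python
def cross_entropy_stable(y, y_hat, eps=1e-12):
    # Clip probabilities away from zero so that log() stays finite.
    y_hat = np.clip(y_hat, eps, 1.0)
    return - np.sum(y * np.log(y_hat)) / len(y)
```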
Define one-hot encoding:
```python
def one_hot(y, num_classes):
    y_onehot = np.zeros((y.size, num_classes))
    y_onehot[np.arange(y.size), y] = 1
    return y_onehot
```
For example, the label 5 is transformed into (0, 0, 0, 0, 0, 1, 0, 0, 0, 0).
Specify the network sizes and initialize the weights:
```python
input_size = 784
hidden_size = 128
output_size = 10

w_1 = np.random.normal(0, 1, (input_size, hidden_size))
b_1 = np.zeros((1, hidden_size))
w_2 = np.random.normal(0, 1, (hidden_size, output_size))
b_2 = np.zeros((1, output_size))
```
ReLU is used from the input layer to the hidden layer, and softmax from the hidden layer to the output layer, so the forward pass is defined as:
```python
def forward(X):
    Z_1 = np.dot(X, w_1) + b_1
    A_1 = relu(Z_1)
    Z_2 = np.dot(A_1, w_2) + b_2
    A_2 = softmax(Z_2)
    return Z_1, A_1, Z_2, A_2
```
Using the cross-entropy loss, we now derive the backpropagation formulas. The core of backpropagation is computing the derivative (gradient) of the loss with respect to each layer's weights and biases, which follows from the chain rule. Let \(Y\) denote the one-hot label at the output layer; for the other layers, \(Z\) denotes the result of the linear transformation and \(A\) the activation, and \(X\) denotes the input at the input layer.
\[
\begin{aligned}
Z_1 &= w_1 X + b_1
\\
A_1 &= \text{relu}(Z_1)
\\
Z_2 &= w_2 A_1 + b_2
\\
A_{2, j} &= \text{softmax}(Z_{2, j}) = \frac{\text{e}^{Z_{2, j}}}{\sum\limits_{k} \text{e}^{Z_{2, k}}}
\end{aligned}
\]
At the output layer, the loss is computed as:
\[
L = - \sum_{i = 1}^{C} (Y_i \times \ln A_{2,i})
\]
where \(C\) is the number of classes. Differentiating with respect to \(A_2\) componentwise:
\[
\frac{\partial L}{\partial A_{2,i}} = - \frac{Y_i}{A_{2,i}}
\]
Since \(A_2 = \text{softmax}(Z_2)\), whose Jacobian is \(\frac{\partial A_{2,i}}{\partial Z_{2,j}} = A_{2,i}(1 - A_{2,i})\) for \(i = j\) and \(-A_{2,i} A_{2,j}\) for \(i \neq j\), we get:
\[
\begin{aligned}
\frac{\partial L}{\partial Z_{2, j}} &= \sum_{i} \frac{\partial L}{\partial A_{2, i}} \cdot \frac{\partial A_{2, i}}{\partial Z_{2, j}}
\\
&= \biggl(- \frac{Y_j}{A_{2, j}}\biggr) \cdot A_{2, j} (1 - A_{2, j}) + \sum_{i \neq j} \biggl(- \frac{Y_i}{A_{2, i}}\biggr) \cdot (-A_{2, i} A_{2, j})
\\
&= -Y_j + A_{2, j} \sum_{i} Y_i
\\
&= A_{2, j} - Y_j
\end{aligned}
\]
since the one-hot target satisfies \(\sum_{i} Y_i = 1\). In vector form, \(\frac{\partial L}{\partial Z_2} = A_2 - Y\).
From \(Z_2 = w_2 A_1 + b_2\) we get:
\[
\begin{aligned}
\frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial w_2} = (A_2 - Y) \cdot A_1
\\
\frac{\partial L}{\partial b_2} &= \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial b_2} = (A_2 - Y)
\\
\frac{\partial L}{\partial A_1} &= \frac{\partial L}{\partial Z_2} \cdot \frac{\partial Z_2}{\partial A_1} = (A_2 - Y) \cdot w_2
\end{aligned}
\]
From \(A_1 = \text{relu}(Z_1)\) we get:
\[
\frac{\partial L}{\partial Z_1} = \frac{\partial L}{\partial A_1} \cdot \frac{\partial A_1}{\partial Z_1} = (A_2 - Y) \cdot w_2 \cdot \frac{\partial A_1}{\partial Z_1}
\]
where \(\frac{\partial A_1}{\partial Z_1}\) follows directly from the definition of ReLU (it is 1 where \(Z_1 > 0\) and 0 elsewhere); see the code.
From \(Z_1 = w_1 X + b_1\), analogous expressions follow for \(\frac{\partial L}{\partial w_1}\) and \(\frac{\partial L}{\partial b_1}\).
Note that the derivation above does not track the actual matrix shapes; the code handles the transposes and the averaging over the batch.
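To connect the derivation with the implementation, here is a minimal sketch of the backward pass for one mini-batch. It simply wraps the gradient steps that appear later in the reference code into a function; `X_batch` and `y_onehot` are the batch inputs and one-hot labels, and `Z_1`, `A_1`, `A_2` are the values returned by `forward`:

```python
def backward(X_batch, y_onehot, Z_1, A_1, A_2):
    m = len(X_batch)  # number of samples in this mini-batch
    # Output layer: dL/dZ_2 = A_2 - Y (softmax + cross-entropy)
    dZ_2 = A_2 - y_onehot
    dw_2 = np.dot(A_1.T, dZ_2) / m
    db_2 = np.sum(dZ_2, axis=0, keepdims=True) / m
    # Hidden layer: propagate through w_2, then apply the ReLU derivative
    dA_1 = np.dot(dZ_2, w_2.T)
    dZ_1 = dA_1 * relu_grad(Z_1)
    dw_1 = np.dot(X_batch.T, dZ_1) / m
    db_1 = np.sum(dZ_1, axis=0, keepdims=True) / m
    return dw_1, db_1, dw_2, db_2
```

Dividing by the number of samples averages the per-sample gradients, matching the `cross_entropy` definition above (the reference code divides by the fixed `batch_size` instead, which differs only for the last, smaller batch).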
Results
The code above implements (mini-)batch gradient descent, so we can use it to compare small versus large batch sizes. The results are listed in the table below and show that small mini-batches achieve noticeably better accuracy (a sketch of the comparison loop is given after the table).
| Batch size | Train accuracy | Test accuracy |
| --- | --- | --- |
| 10 | 0.9696 | 0.9455 |
| 20 | 0.9615 | 0.9421 |
| 30 | 0.9543 | 0.9354 |
| 40 | 0.9506 | 0.9368 |
| 50 | 0.9423 | 0.9300 |
| 60 | 0.9376 | 0.9195 |
| 70 | 0.9385 | 0.9242 |
| 80 | 0.9382 | 0.9240 |
| 90 | 0.9301 | 0.9191 |
| 500 | 0.8982 | 0.8908 |
| 600 | 0.8913 | 0.8855 |
| 700 | 0.8877 | 0.8896 |
| 800 | 0.8912 | 0.8891 |
| 900 | 0.8872 | 0.8830 |
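For reference, a hedged sketch of how such a comparison could be run, reusing the functions defined above. The helper re-initializes the global weights, trains without regularization, and reports train/test accuracy for each batch size; the exact epoch count, learning rate, and random seed behind the table are not recorded here, so the numbers will not reproduce the table exactly:

```python
def train_once(bs, lr=0.1, num_epochs=10, seed=0):
    # Re-initialize the network, train with batch size bs, return accuracies.
    global w_1, b_1, w_2, b_2
    rng = np.random.default_rng(seed)
    w_1 = rng.normal(0, 1, (input_size, hidden_size))
    b_1 = np.zeros((1, hidden_size))
    w_2 = rng.normal(0, 1, (hidden_size, output_size))
    b_2 = np.zeros((1, output_size))
    for _ in range(num_epochs):
        for X_batch, y_batch in data_iter(bs, X_train, y_train):
            Z_1, A_1, Z_2, A_2 = forward(X_batch)
            dZ_2 = A_2 - one_hot(y_batch, output_size)
            dw_2 = np.dot(A_1.T, dZ_2) / bs
            db_2 = np.sum(dZ_2, axis=0, keepdims=True) / bs
            dA_1 = np.dot(dZ_2, w_2.T)
            dZ_1 = dA_1 * relu_grad(Z_1)
            dw_1 = np.dot(X_batch.T, dZ_1) / bs
            db_1 = np.sum(dZ_1, axis=0, keepdims=True) / bs
            w_1 -= lr * dw_1
            b_1 -= lr * db_1
            w_2 -= lr * dw_2
            b_2 -= lr * db_2
    train_acc = np.mean(np.argmax(forward(X_train)[3], axis=1) == y_train)
    test_acc = np.mean(np.argmax(forward(X_test)[3], axis=1) == y_test)
    return train_acc, test_acc

for bs in [10, 50, 90, 500, 900]:
    train_acc, test_acc = train_once(bs)
    print(f"batch_size={bs}: train={train_acc:.4f}, test={test_acc:.4f}")
```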
Improvements
The code so far uses no regularization. We first add L2 regularization to reduce overfitting (the corresponding penalty term and its gradient are written out after the code):
```python
# ...existing code...
lr = 0.1
num_epochs = 10
batch_size = 64
l2_lambda = 0.01

# ...existing code...
dZ_2 = A_2 - y_onehot
dw_2 = np.dot(A_1.T, dZ_2) / batch_size + l2_lambda * w_2
db_2 = np.sum(dZ_2, axis=0, keepdims=True) / batch_size

dA_1 = np.dot(dZ_2, w_2.T)
dZ_1 = dA_1 * relu_grad(Z_1)
dw_1 = np.dot(X_batch.T, dZ_1) / batch_size + l2_lambda * w_1
db_1 = np.sum(dZ_1, axis=0, keepdims=True) / batch_size
# ...existing code...
```
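The extra terms `l2_lambda * w_2` and `l2_lambda * w_1` correspond to adding an L2 penalty on the weights to the per-batch loss (writing \(\lambda\) for `l2_lambda`):

\[
L_{\text{reg}} = L + \frac{\lambda}{2} \bigl(\lVert w_1 \rVert_2^2 + \lVert w_2 \rVert_2^2\bigr)
\quad\Rightarrow\quad
\frac{\partial L_{\text{reg}}}{\partial w_i} = \frac{\partial L}{\partial w_i} + \lambda w_i
\]

As in the code, the biases are left unregularized.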
Compared with the unregularized model, the results are shown in the table below: the best training accuracies are almost identical, but the test accuracies differ by nearly \(1\%\).
| L2 regularization | Train accuracy | Test accuracy |
| --- | --- | --- |
| With | 0.9363 | 0.9370 |
| Without | 0.9364 | 0.9264 |
Next, we try L1 regularization (again, the penalty term and its gradient are written out after the code):
```python
# ...existing code...
lr = 0.1
num_epochs = 10
batch_size = 64
l1_lambda = 0.01

# ...existing code...
dZ_2 = A_2 - y_onehot
dw_2 = np.dot(A_1.T, dZ_2) / batch_size + l1_lambda * np.sign(w_2)
db_2 = np.sum(dZ_2, axis=0, keepdims=True) / batch_size

dA_1 = np.dot(dZ_2, w_2.T)
dZ_1 = dA_1 * relu_grad(Z_1)
dw_1 = np.dot(X_batch.T, dZ_1) / batch_size + l1_lambda * np.sign(w_1)
db_1 = np.sum(dZ_1, axis=0, keepdims=True) / batch_size
# ...existing code...
```
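Analogously, `l1_lambda * np.sign(w_1)` and `l1_lambda * np.sign(w_2)` are the (sub)gradient of an L1 penalty on the weights (writing \(\lambda\) for `l1_lambda`):

\[
L_{\text{reg}} = L + \lambda \bigl(\lVert w_1 \rVert_1 + \lVert w_2 \rVert_1\bigr)
\quad\Rightarrow\quad
\frac{\partial L_{\text{reg}}}{\partial w_i} = \frac{\partial L}{\partial w_i} + \lambda \,\operatorname{sign}(w_i)
\]

As with L2, the biases are left unregularized.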
With the same hyperparameters, L1 and L2 regularization compare as follows; here L1 performs far worse than L2.
| Regularization | Train accuracy | Test accuracy |
| --- | --- | --- |
| L2 | 0.9363 | 0.9370 |
| L1 | 0.8463 | 0.8373 |
With the L1 coefficient reduced to 0.001 (the value used in the reference code below), L1's performance improves and narrowly surpasses the unregularized result.
Reference Code
```python
import numpy as np
import torchvision
import random

mnist_train = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=True, download=True, transform=None)
mnist_test = torchvision.datasets.MNIST(root='~/Datasets/MNIST', train=False, download=True, transform=None)

print(mnist_train[0])

X_train = np.array(mnist_train.data) / 255.0
y_train = np.array(mnist_train.targets)
X_test = np.array(mnist_test.data) / 255.0
y_test = np.array(mnist_test.targets)

X_train = X_train.reshape(-1, 28 * 28)
X_test = X_test.reshape(-1, 28 * 28)

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = np.array(indices[i: min(i + batch_size, num_examples)])
        yield features[j], labels[j]

input_size = 784
hidden_size = 128
output_size = 10

w_1 = np.random.normal(0, 1, (input_size, hidden_size))
b_1 = np.zeros((1, hidden_size))
w_2 = np.random.normal(0, 1, (hidden_size, output_size))
b_2 = np.zeros((1, output_size))

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def cross_entropy(y, y_hat):
    return - np.sum(y * np.log(y_hat)) / len(y)

def one_hot(y, num_classes):
    y_onehot = np.zeros((y.size, num_classes))
    y_onehot[np.arange(y.size), y] = 1
    return y_onehot

lr = 0.1
num_epochs = 10
batch_size = 64
l1_lambda = 0.001

def forward(X):
    Z_1 = np.dot(X, w_1) + b_1
    A_1 = relu(Z_1)
    Z_2 = np.dot(A_1, w_2) + b_2
    A_2 = softmax(Z_2)
    return Z_1, A_1, Z_2, A_2

for epoch in range(1, num_epochs + 1):
    for X_batch, y_batch in data_iter(batch_size, X_train, y_train):
        Z_1, A_1, Z_2, A_2 = forward(X_batch)

        y_onehot = one_hot(y_batch, output_size)

        dZ_2 = A_2 - y_onehot
        dw_2 = np.dot(A_1.T, dZ_2) / batch_size + l1_lambda * np.sign(w_2)
        db_2 = np.sum(dZ_2, axis=0, keepdims=True) / batch_size

        dA_1 = np.dot(dZ_2, w_2.T)
        dZ_1 = dA_1 * relu_grad(Z_1)
        dw_1 = np.dot(X_batch.T, dZ_1) / batch_size + l1_lambda * np.sign(w_1)
        db_1 = np.sum(dZ_1, axis=0, keepdims=True) / batch_size

        w_1 -= lr * dw_1
        b_1 -= lr * db_1
        w_2 -= lr * dw_2
        b_2 -= lr * db_2

    _, _, _, train_pred = forward(X_train)
    train_loss = cross_entropy(one_hot(y_train, output_size), train_pred)
    train_acc = np.mean(np.argmax(train_pred, axis=1) == y_train)
    print(f"Epoch {epoch}, Loss: {train_loss:.4f}, Acc: {train_acc:.4f}")

_, _, _, test_pred = forward(X_test)
test_acc = np.mean(np.argmax(test_pred, axis=1) == y_test)
print(f"Test Accuracy: {test_acc:.4f}")
```