We first analyze the case where the input is a single vector, and then build on that analysis for the case where the input is a batch of vectors (a matrix).
(The matrix-input case will be covered in a later update.)
All vectors below are column vectors. Lowercase letters denote vectors, uppercase letters denote matrices, and a subscripted lowercase (uppercase) letter denotes an entry of a vector (matrix).
Specifically,
\[W_{-,i} \] denotes the i-th column vector of matrix W, and
\[W_{i,-} \] denotes the i-th row vector of matrix W.
In addition,
\[W^k \] denotes the weight matrix W of layer k.
Whether the input is a vector or a matrix, the loss function LOSS evaluates to a scalar. Taking mean squared error as an example:
When the input is a vector:
\[L(W)=\frac{1}{2}(label-y)^T(label-y) \]
When the input is a matrix (a batch of b_size column vectors, with the loss averaged over the batch):
\[L(W)=\frac{1}{2\times b\_size}\sum^{b\_size}_{i=1}(LABEL_i-Y_i)^T(LABEL_i-Y_i) \]
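As a quick sanity check on these two formulas, here is a minimal numpy sketch; the function names, shapes, and numbers are illustrative choices, not from the derivation itself.

```python
import numpy as np

def loss_vector(label, y):
    """L(W) = 1/2 (label - y)^T (label - y) for one column vector."""
    d = label - y
    return 0.5 * (d.T @ d).item()

def loss_batch(LABEL, Y):
    """Batch MSE: the per-sample loss averaged over b_size columns."""
    b_size = LABEL.shape[1]
    return sum(loss_vector(LABEL[:, [i]], Y[:, [i]])
               for i in range(b_size)) / b_size

label = np.array([[1.0], [0.0]])
y = np.array([[0.9], [0.2]])
print(loss_vector(label, y))                    # 0.5 * (0.01 + 0.04) = 0.025
print(loss_batch(np.hstack([label, label]),
                 np.hstack([y, y])))            # 0.025 as well
```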
First, the relevant definitions for the neural network:
\[\begin{aligned} &We\ have\ m\ layers;\ in\ each\ layer\ (1\le k\le m):\\ &W^k\ (shape:d^k\times{c^k})\\ &y^{k-1}\ (also\ the\ x^k,\ shape:c^k\times1)\\ &b^k\ (shape:d^k\times1)\\ &n^k=W^ky^{k-1}+b^k\ (shape:d^k\times1)\\ &y^k=F^k(n^k)\ (shape:d^k\times1) \end{aligned} \]
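To make the shapes concrete, here is a minimal forward-pass sketch following these definitions; the sigmoid activation and the layer sizes are illustrative assumptions (the derivation works for any differentiable F^k).

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

# Layer k has W^k of shape (d^k, c^k), b^k of shape (d^k, 1), and consumes
# y^{k-1} (= x^k) of shape (c^k, 1), matching the definitions above.
rng = np.random.default_rng(0)
dims = [3, 4, 2]                      # c^1 = 3, d^1 = c^2 = 4, d^2 = 2
W = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(len(dims) - 1)]
b = [rng.standard_normal((dims[k + 1], 1)) for k in range(len(dims) - 1)]

y = rng.standard_normal((dims[0], 1))   # y^0 = x
for Wk, bk in zip(W, b):
    n = Wk @ y + bk                     # n^k = W^k y^{k-1} + b^k
    y = sigmoid(n)                      # y^k = F^k(n^k)
print(y.shape)                          # (2, 1)
```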
The core of neural-network backpropagation (BP) is computing the gradient of the loss function LOSS with respect to each layer's weight matrix W:
\[\frac{dL(W)}{dW^k_{i,j}}=\frac{dL(W)}{dn^k_i}\times\frac{dn^k_i}{dW^k_{i,j}}\\ \\ \because\frac{dn^k_i}{dW^k_{i,j}}=y^{k-1}_j\\ \\ \therefore\frac{dL(W)}{dW^k_{i,j}}=\frac{dL(W)}{dn^k_i}\times{y^{k-1}_j}\\ \\ \frac{dL(W)}{dW^k}= \begin{pmatrix} \frac{dL(W)}{dn^k_1}\times{y^{k-1}_1},...,\frac{dL(W)}{dn^k_1}\times{y^{k-1}_{c^k}}\\ ...\\ \frac{dL(W)}{dn^k_{d^k}}\times{y^{k-1}_1},...,\frac{dL(W)}{dn^k_{d^k}}\times{y^{k-1}_{c^k}} \end{pmatrix} \\= \begin{pmatrix} \frac{dL(W)}{dn^k_1}\\ ...\\ \frac{dL(W)}{dn^k_{d^k}} \end{pmatrix}· (y^{k-1}_1,...,y^{k-1}_{c^k})\\ =\frac{dL(W)}{dn^k}·(y^{k-1})^T \]
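In code, the final identity dL(W)/dW^k = (dL(W)/dn^k)·(y^{k-1})^T is a single outer product; a small sketch with assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
d_k, c_k = 4, 3
dL_dn = rng.standard_normal((d_k, 1))    # dL/dn^k,  shape (d^k, 1)
y_prev = rng.standard_normal((c_k, 1))   # y^{k-1},  shape (c^k, 1)

# dL/dW^k = dL/dn^k · (y^{k-1})^T: an outer product with the shape of W^k.
dL_dW = dL_dn @ y_prev.T
print(dL_dW.shape)                       # (4, 3), i.e. (d^k, c^k)
```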
The step above reduces the gradient of LOSS with respect to each layer's weight matrix W to the gradient of LOSS with respect to that layer's net-input vector n. The next step expresses the gradient with respect to layer k's net-input vector n in terms of the gradient with respect to layer k+1's:
\[\frac{dL(W)}{dn^k}\ (shape:d^k\times{1})=\frac{d(n^{k+1})^T}{dn^k}\ (shape:d^k\times{d^{k+1}})\ ·\ \frac{dL(W)}{dn^{k+1}}\ (shape:d^{k+1}\times1)\\ \\ \frac{d(n^{k+1})^T}{dn^k}\ is\ a\ Jacobian\ matrix;\ let\ J=\frac{d(n^{k+1})^T}{dn^k}\\ J_{i,j}=\frac{dn^{k+1}_{j}}{dn^k_{i}}\ (note\ the\ order\ of\ i\ and\ j)\\ =\frac{d(W^{k+1}_{j,-}y^{k}+b^{k+1}_j)}{dn^k_i}\\ =\frac{d(W^{k+1}_{j,i}\times{y^k_i})}{dn^k_i}\ (only\ term\ i\ of\ the\ row\ depends\ on\ n^k_i)\\ =W^{k+1}_{j,i}\times\frac{dy^k_i}{dn^k_i}\\ \\ J= \begin{pmatrix} W^{k+1}_{1,1}\times\frac{dy^k_1}{dn^k_1},...,W^{k+1}_{d^{k+1},1}\times\frac{dy^k_1}{dn^k_1}\\ ...\\ W^{k+1}_{1,d^k}\times\frac{dy^k_{d^k}}{dn^k_{d^k}},...,W^{k+1}_{d^{k+1},d^k}\times\frac{dy^k_{d^k}}{dn^k_{d^k}} \end{pmatrix}\\ =\left(W^{k+1}\ (shape:{d^{k+1}}\times{d^k},\ d^k=c^{k+1})· \begin{pmatrix} \frac{dy^k_1}{dn^k_1},0,...,0\\ 0,\frac{dy^k_2}{dn^k_2},...,0\\ ...\\ 0,0,...,\frac{dy^k_{d^k}}{dn^k_{d^k}} \end{pmatrix}\ (shape:d^k\times{d^k})\right)^T\\ = \begin{pmatrix} \frac{dy^k_1}{dn^k_1},0,...,0\\ 0,\frac{dy^k_2}{dn^k_2},...,0\\ ...\\ 0,0,...,\frac{dy^k_{d^k}}{dn^k_{d^k}} \end{pmatrix}\times (W^{k+1})^T \]
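A finite-difference check can make this Jacobian identity tangible. The sketch below assumes a sigmoid activation and small random matrices (both illustrative choices) and verifies that J equals the diagonal matrix of activation derivatives times (W^{k+1})^T:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def dsigmoid(n):
    s = sigmoid(n)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
d_k, d_k1 = 3, 2                           # d^k and d^{k+1}, illustrative
n_k = rng.standard_normal((d_k, 1))
W_next = rng.standard_normal((d_k1, d_k))  # W^{k+1}
b_next = rng.standard_normal((d_k1, 1))    # b^{k+1}

def n_next(n_k):
    """n^{k+1} = W^{k+1} F^k(n^k) + b^{k+1}, with F^k = sigmoid here."""
    return W_next @ sigmoid(n_k) + b_next

# Analytic Jacobian from the derivation: J = diag(dy^k/dn^k) · (W^{k+1})^T
J = np.diagflat(dsigmoid(n_k)) @ W_next.T   # shape (d^k, d^{k+1})

# Finite-difference approximation: J[i, j] ~ delta n^{k+1}_j / delta n^k_i
eps = 1e-6
J_num = np.zeros_like(J)
for i in range(d_k):
    e = np.zeros((d_k, 1))
    e[i] = eps
    J_num[i, :] = ((n_next(n_k + e) - n_next(n_k)) / eps).ravel()
print(np.allclose(J, J_num, atol=1e-4))     # True
```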
Define a matrix G to denote the diagonal matrix that appears in the last step above:
\[\because\ y^k_i=(F^k(n^k))_i\\ let\ f^k(n^k_i)=({(F^k)}^{'}(n^k))_i=\frac{dy^k_i}{dn^k_i}\\ then\ \begin{pmatrix} \frac{dy^k_1}{dn^k_1},0,...,0\\ 0,\frac{dy^k_2}{dn^k_2},...,0\\ ...\\ 0,0,...,\frac{dy^k_{d^k}}{dn^k_{d^k}} \end{pmatrix}= \begin{pmatrix} f^k(n^k_1),0,0,...,0\\ 0,f^k(n^k_2),0,...,0\\ ...\\ 0,0,0,...,f^k(n^k_{d^k}) \end{pmatrix}\\ let\ G^k=\begin{pmatrix} f^k(n^k_1),0,0,...,0\\ 0,f^k(n^k_2),0,...,0\\ ...\\ 0,0,0,...,f^k(n^k_{d^k}) \end{pmatrix}\\ then\ \frac{dL(W)}{dn^k}=\frac{d(n^{k+1})^T}{dn^k}·\frac{dL(W)}{dn^{k+1}}\\ =G^{k}·(W^{k+1})^T·\frac{dL(W)}{dn^{k+1}}\\ therefore,\ we\ can\ compute\ \frac{dL(W)}{dn^k}\ recursively,\ starting\ from\ k=m\\ \]
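Because G^k is diagonal, an implementation usually avoids materializing it and applies f^k'(n^k) as an elementwise product instead; a minimal sketch of one backward step, again assuming sigmoid:

```python
import numpy as np

def dsigmoid(n):
    s = 1.0 / (1.0 + np.exp(-n))
    return s * (1.0 - s)

rng = np.random.default_rng(3)
d_k, d_k1 = 4, 2
n_k = rng.standard_normal((d_k, 1))          # n^k
W_next = rng.standard_normal((d_k1, d_k))    # W^{k+1}
dL_dn_next = rng.standard_normal((d_k1, 1))  # dL/dn^{k+1}

# dL/dn^k = G^k (W^{k+1})^T dL/dn^{k+1}; multiplying by the diagonal G^k
# is the same as an elementwise multiply by f^k'(n^k).
dL_dn_k = dsigmoid(n_k) * (W_next.T @ dL_dn_next)
print(dL_dn_k.shape)                         # (4, 1)
```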
Finally, we compute the gradient of LOSS with respect to the last layer's net-input vector n:
\[\frac{dL(W)}{dn^m}\ (shape:d^m\times1)=\frac{d\left(\frac{1}{2}(label-y^{m})^T(label-y^{m})\right)}{dn^m}\\ =\frac{d\left(\frac{1}{2}\sum_{i=1}^{d^m}(label_i-y^m_i)^2\right)}{dn^m}\\ =\frac{d\left(\sum_{i=1}^{d^m}(-label_i\times{y^m_i})+\frac{1}{2}\sum_{i=1}^{d^m}(y^m_i)^2\right)}{dn^m}\ (the\ constant\ (label_i)^2\ terms\ vanish)\\ = \begin{pmatrix} -label_1\times\frac{dy^m_1}{dn^m_1}\\ -label_2\times\frac{dy^m_2}{dn^m_2}\\ ...\\ -label_{d^m}\times\frac{dy^m_{d^m}}{dn^m_{d^m}} \end{pmatrix}+ \begin{pmatrix} y^m_1\frac{dy^m_1}{dn^m_1}\\ y^m_2\frac{dy^m_2}{dn^m_2}\\ ...\\ y^m_{d^m}\frac{dy^m_{d^m}}{dn^m_{d^m}} \end{pmatrix}\\ = \begin{pmatrix} \frac{dy^m_1}{dn^m_1},0,0,...,0\\ 0,\frac{dy^m_2}{dn^m_2},0,...,0\\ ...\\ 0,0,0,...,\frac{dy^m_{d^m}}{dn^m_{d^m}} \end{pmatrix} \begin{pmatrix} y^m_1-label_1\\ y^m_2-label_2\\ ...\\ y^m_{d^m}-label_{d^m} \end{pmatrix}\\ =G^m·(y^m-label) \]
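The same elementwise shortcut applies at the output layer; a short sketch assuming sigmoid for F^m:

```python
import numpy as np

def dsigmoid(n):
    s = 1.0 / (1.0 + np.exp(-n))
    return s * (1.0 - s)

rng = np.random.default_rng(4)
d_m = 2
n_m = rng.standard_normal((d_m, 1))          # net input of the last layer
y_m = 1.0 / (1.0 + np.exp(-n_m))             # y^m = F^m(n^m)
label = np.array([[1.0], [0.0]])

# dL/dn^m = G^m (y^m - label); the diagonal G^m again reduces to an
# elementwise multiply by f^m'(n^m).
dL_dn_m = dsigmoid(n_m) * (y_m - label)
print(dL_dn_m.shape)                         # (2, 1)
```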
Using this result, we can compute the gradient of LOSS with respect to any layer's net-input vector n by unrolling the recursion:
\[\begin{aligned} \frac{dL(W)}{dn^k}&=G^k·(W^{k+1})^T·\frac{dL(W)}{dn^{k+1}}\\ &=G^k·(W^{k+1})^T·G^{k+1}·(W^{k+2})^T·\frac{dL(W)}{dn^{k+2}}\\ &=G^k·(W^{k+1})^T\cdots G^{m-1}·(W^m)^T·G^m·(y^m-label) \end{aligned} \]
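The unrolled chain can be verified end to end against numerical differentiation of the loss on a tiny two-layer network (sigmoid assumed throughout, sizes illustrative):

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def dsigmoid(n):
    s = sigmoid(n)
    return s * (1.0 - s)

rng = np.random.default_rng(5)
d1, d2 = 3, 2
n1 = rng.standard_normal((d1, 1))
W2 = rng.standard_normal((d2, d1))
b2 = rng.standard_normal((d2, 1))
label = rng.standard_normal((d2, 1))

def loss_from_n1(n1):
    """L as a function of the first layer's net input n^1."""
    y2 = sigmoid(W2 @ sigmoid(n1) + b2)
    d = label - y2
    return 0.5 * (d.T @ d).item()

# Analytic chain: dL/dn^1 = G^1 (W^2)^T G^2 (y^2 - label)
y1 = sigmoid(n1)
n2 = W2 @ y1 + b2
y2 = sigmoid(n2)
g = dsigmoid(n1) * (W2.T @ (dsigmoid(n2) * (y2 - label)))

# Finite-difference comparison
eps = 1e-6
g_num = np.zeros_like(g)
for i in range(d1):
    e = np.zeros((d1, 1))
    e[i] = eps
    g_num[i] = (loss_from_n1(n1 + e) - loss_from_n1(n1)) / eps
print(np.allclose(g, g_num, atol=1e-4))   # True
```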
Next we compute the gradient of LOSS with respect to the bias vector b; it turns out to be exactly the gradient of LOSS with respect to the net-input vector n:
\[\frac{dL(W)}{db^k_i}=\frac{dL(W)}{dn^k_i}\times\frac{dn^k_i}{db^k_i}\\ =\frac{dL(W)}{dn^k_i}\times1\\ \therefore\frac{dL(W)}{db^k}=\frac{dL(W)}{dn^k} \]
The mathematical groundwork is complete. The pseudocode below describes the full BP process (a single iteration):

```
INPUT:  input vector x, label vector label, layer count m,
        per-layer activations F, per-layer sizes (c, d), learning rate lr
OUTPUT: per-layer weight matrices W, per-layer bias vectors b

FOR i = 1 TO m                    // forward pass
    calculate n[i]
    calculate y[i]
END FOR
FOR i = m TO 1                    // backward pass
    calculate G[i]
    calculate dL_n[i]
    calculate dL_W[i]
    W[i] = W[i] - lr * dL_W[i]
    b[i] = b[i] - lr * dL_n[i]    // since dL_b[i] = dL_n[i]
END FOR
```
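And a runnable counterpart of the pseudocode, with sigmoid standing in for every F^k (an illustrative assumption; any differentiable activation works the same way):

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def dsigmoid(n):
    s = sigmoid(n)
    return s * (1.0 - s)

def bp_one_iteration(x, label, W, b, lr):
    """One BP iteration following the pseudocode above.

    W and b are lists indexed 0..m-1 (layer k in the text is index k-1 here),
    and sigmoid stands in for each F^k."""
    m = len(W)
    # Forward pass: n^k = W^k y^{k-1} + b^k, y^k = F^k(n^k)
    n, y = [], [x]
    for k in range(m):
        n.append(W[k] @ y[k] + b[k])
        y.append(sigmoid(n[k]))
    # Backward pass: dL/dn^m = G^m (y^m - label), then
    # dL/dn^k = G^k (W^{k+1})^T dL/dn^{k+1}, with G^k applied elementwise.
    dL_n = [None] * m
    dL_n[m - 1] = dsigmoid(n[m - 1]) * (y[m] - label)
    for k in range(m - 2, -1, -1):
        dL_n[k] = dsigmoid(n[k]) * (W[k + 1].T @ dL_n[k + 1])
    # Updates: dL/dW^k = dL/dn^k (y^{k-1})^T and dL/db^k = dL/dn^k
    for k in range(m):
        W[k] -= lr * (dL_n[k] @ y[k].T)
        b[k] -= lr * dL_n[k]
    return W, b

def loss(x, label, W, b):
    y = x
    for Wk, bk in zip(W, b):
        y = sigmoid(Wk @ y + bk)
    d = label - y
    return 0.5 * (d.T @ d).item()

# Tiny smoke test with illustrative sizes.
rng = np.random.default_rng(0)
dims = [3, 4, 2]
W = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(2)]
b = [rng.standard_normal((dims[k + 1], 1)) for k in range(2)]
x = rng.standard_normal((3, 1))
label = np.array([[1.0], [0.0]])
print(loss(x, label, W, b))              # before training
for _ in range(100):
    W, b = bp_one_iteration(x, label, W, b, lr=0.5)
print(loss(x, label, W, b))              # should be much smaller
```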