\(\text{Suppose we have an input }X\text{ with }n\text{ components and a linear neuron with random weights }W\text{. We can write:}\)
\[\begin{align} Y = W_1X_1+W_2X_2+...+W_nX_n \end{align} \]\(\text{Consider one term, }W_iX_i\text{; its variance is:}\)
\[\begin{align} Var(W_iX_i) = \mathbb{E}(X_i)^2Var(W_i)+\mathbb{E}(W_i)^2Var(X_i)+Var(X_i)Var(W_i) \end{align} \]\(\text{If we assume that the variables are zero-mean, this simplifies to:}\)
\[\begin{align} Var(W_iX_i) = Var(X_i)Var(W_i) \end{align} \]\(\text{Furthermore, if we assume that the }X_i\text{ and }W_i\text{ are i.i.d., we obtain:}\)
\[\begin{align} Var(Y) &= Var(\sum_iW_iX_i)\\ &=n \cdot Var(W_i)Var(X_i) \end{align} \]\(\large \text{From this we can see that the output's variance }Var(Y)\text{ tracks the input's variance, scaled by }n\cdot Var(W_i). \text{ Therefore, if we want to control the output's variance (e.g., keep it equal to the input's variance), we need:}\)
\[Var(W_i) = \frac{1}{n} = \frac{1}{fan\_in} \]\(\text{This is the condition for the forward pass; if we consider backpropagation, we similarly need:}\)
\[Var(W_i) =\frac{1}{fan\_out} \]\(\text{However, in real NN architectures it is not common to have the same number of input and output neurons. As a compromise, we take the average of the two:}\)
\[\begin{align} Var(W_i) = \frac{2}{fan\_in+fan\_out} \end{align} \]\(\text{In summary, the assumptions needed to derive these results are:}\)
\(\text{I. }W,X\text{ are zero-mean}\)
\(\text{II. }W,X\text{ are i.i.d.}\)
\(\text{III. Biases are initialized as zeros}\)
\(\text{IV. We use the }\tanh()\text{ activation function, which is approximately linear for small inputs: }Var(a^{[l]})\approx Var(z^{[l]})\)
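With these assumptions in place, the Xavier condition is easy to check numerically. Below is a minimal NumPy sketch (the helper name `xavier_init` and the layer sizes are illustrative, not from the original paper): weights are drawn from a zero-mean Gaussian with \(Var(W_i)=\frac{2}{fan\_in+fan\_out}\), the bias is zero, and the inputs are small so that \(\tanh\) is roughly linear.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    # Zero-mean Gaussian with Var(W) = 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
fan_in = fan_out = 512                           # here Var(W) reduces to 1 / fan_in

X = rng.normal(0.0, 0.1, size=(10_000, fan_in))  # zero-mean, small inputs (assumptions I, IV)
W = xavier_init(fan_in, fan_out, rng)
Z = X @ W                                        # bias initialized to zero (assumption III)
A = np.tanh(Z)                                   # ~linear for small inputs, so Var(A) ~ Var(Z)

print(X.var(), Z.var(), A.var())                 # all three are close to 0.01
```

If the weight variance were much larger than \(1/fan\_in\), `Z.var()` would already exceed `X.var()` after one layer, and in a deep network the mismatch would compound layer by layer.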
Many blog posts online do not explain this well; most of them only state the result. Here, we follow the paper to derive it:
\(\text{From Xavier's results, we know that }Var[y_l] = n_lVar[w_lx_l]. \text{ Then, if }w_l\text{ is zero-mean, we can further obtain:}\)
\[\begin{align} Var[y_l] = n_lVar[w_l]\mathbb{E}[x_l^2] \end{align} \]
For the \(ReLU\) activation, \(x_l = \max\{0,y_{l-1}\}\), so \(x_l\) is no longer zero-mean and \(\mathbb{E}(x_l^2)\neq Var(x_l)\).
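This is easy to verify numerically; a small sketch (assuming, purely for illustration, a standard-normal \(y_{l-1}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=1_000_000)  # y_{l-1}: zero-mean and symmetric
x = np.maximum(0.0, y)                     # x_l = max(0, y_{l-1})

print(x.mean())          # ~0.40: the ReLU output is not zero-mean
print((x ** 2).mean())   # ~0.50: E[x_l^2]
print(x.var())           # ~0.34: Var(x_l), so E[x_l^2] != Var(x_l)
```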
Assume \(w_{l-1}\) is symmetrically distributed around \(0\) and \(b_{l-1}=0\); then \(y_{l-1}\) is also symmetrically distributed around \(0\) (and has zero mean). Now consider \(\mathbb{E}(x_l^2)=\mathbb{E}(\max(0,y_{l-1})^2)\). By this symmetry, the positive half of \(y_{l-1}\) contributes exactly half of \(\mathbb{E}[y_{l-1}^2]\), which gives:
\[\begin{align} \mathbb{E}[x_l^2] &=\mathbb{E}[\max(0,y_{l-1})^2]\\ &=\frac{1}{2}\mathbb{E}[y_{l-1}^2]\\ &=\frac{1}{2}Var[y_{l-1}] \end{align} \]Therefore:
\[\begin{align} Var[y_l] &= \frac{1}{2}n_lVar[w_l]Var[y_{l-1}] \end{align} \]\(\text{Consider all }L\text{ layers: }\)
\[\begin{align} Var[y_L] = Var[y_1]\left(\prod_{l=2}^L\frac{1}{2}n_lVar[w_l]\right) \end{align} \]A proper initialization method should avoid reducing or magnifying the magnitudes of input signals exponentially.
From this, it is clear that a sufficient condition is:
\[\begin{align} \frac{1}{2}n_lVar[w_l]=1 \end{align} \]This leads to a zero-mean Gaussian distribution whose standard deviation (std) is \(\sqrt{2/n_l}\). This is our way of initialization. We also initialize \(b = 0\).
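As a minimal sketch of this rule (the helper name `kaiming_init` is illustrative, not from the paper; PyTorch exposes the same scheme as `torch.nn.init.kaiming_normal_`):

```python
import numpy as np

def kaiming_init(fan_in, fan_out, rng):
    # Zero-mean Gaussian with std = sqrt(2 / n_l), where n_l = fan_in
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = kaiming_init(256, 256, rng)
b = np.zeros(256)            # biases initialized to zero
print(W.mean(), W.std())     # ~0 and ~sqrt(2/256) ~ 0.088
```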
\(\text{The difference between this Kaiming initialization and the forward-pass Xavier condition }Var(W_i)=\frac{1}{fan\_in}\text{ is exactly the extra factor of }2\text{, which compensates for the }\frac{1}{2}\text{ introduced by the ReLU activation function.}\)
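To see why that factor matters in practice, here is a short illustrative simulation (layer width, depth, and function names are my own choices) that pushes the same random input through a deep stack of ReLU layers under the two initializations: with Xavier's \(1/n_l\) the signal variance shrinks by roughly \(1/2\) per layer, while Kaiming's \(2/n_l\) keeps it roughly constant.

```python
import numpy as np

def output_variance(depth, n, weight_std, rng):
    """Variance of the activations after `depth` ReLU layers with zero biases."""
    x = rng.normal(0.0, 1.0, size=(10_000, n))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(n, n))
        x = np.maximum(0.0, x @ W)          # x_l = max(0, W x_{l-1})
    return x.var()

rng = np.random.default_rng(0)
n, depth = 256, 20
xavier_std  = np.sqrt(1.0 / n)   # 2 / (fan_in + fan_out) with fan_in = fan_out = n
kaiming_std = np.sqrt(2.0 / n)   # sqrt(2 / n_l)

print("Xavier :", output_variance(depth, n, xavier_std, rng))   # shrinks roughly as (1/2)^depth
print("Kaiming:", output_variance(depth, n, kaiming_std, rng))  # stays O(1)
```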