Forgive me, writing in Chinese is too tiring, and I trust that everyone reading this has a reasonable command of English.
Consider some unknown distribution $p(x)$, and suppose we have modeled it using an approximating distribution $q(x)$. If we use $q(x)$ to construct a coding scheme for transmitting values of $x$ to a receiver, then, because we use $q(x)$ rather than the true distribution $p(x)$, some additional information is needed to specify the value of $x$. The average additional amount of information required (in nats) is:
$\begin{aligned}D_{K L}(p \| q) &=-\int p(x) \log q(x) \, d x-\left(-\int p(x) \log p(x) \, d x\right) \\&=-\int p(x) \log \frac{q(x)}{p(x)} \, d x\end{aligned}$
Thus, for the two orderings of the distributions:
$D_{K L}(p \| q)=-\int p(x) \log \frac{q(x)}{p(x)} d x$
$D_{K L}(q \| p)=-\int q(x) \log \frac{p(x)}{q(x)} d x$
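As a concrete illustration, here is a minimal NumPy sketch of the discrete KL divergence in nats; the example distributions are made up, and all probabilities are assumed to be strictly positive.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(p || q) in nats.

    Assumes p and q are probability vectors with strictly positive
    entries that each sum to 1; zero entries are not handled here.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Hypothetical example distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # D_KL(p || q)
print(kl_divergence(q, p))  # generally different: KL is not symmetric
```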
The KL divergence measures, to a certain extent, how different two distributions are. It satisfies $D_{K L}(p \| q) \geq 0$, with equality if and only if $p = q$ (almost everywhere), but it is not symmetric and does not satisfy the triangle inequality, so it is not a true metric.
Given $\alpha \in \mathbb{R}$ with $\alpha \neq 0, 1$, the $\alpha$-divergence between two discrete distributions $P_{1}, P_{2} \in \mathcal{P}$ (with probability mass functions $p_{1}, p_{2}$) can be defined as
$\frac{1}{\alpha(1-\alpha)}\left(1-\sum\limits _{x} p_{2}(x)\left(\frac{p_{1}(x)}{p_{2}(x)}\right)^{\alpha}\right)$
The KL divergence is a limiting case of the $\alpha$-divergence: $K L\left(P_{1}, P_{2}\right)$ and $K L\left(P_{2}, P_{1}\right)$ correspond to the limits $\alpha \rightarrow 1$ and $\alpha \rightarrow 0$, respectively.
The Amari divergence comes from the above by the transformation $\alpha=\frac{1+t}{2}$.
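A small sketch of the discrete $\alpha$-divergence defined above, checking numerically that it approaches the two KL divergences as $\alpha \rightarrow 1$ and $\alpha \rightarrow 0$ (the example distributions are again made up):

```python
import numpy as np

def alpha_divergence(p1, p2, alpha):
    """alpha-divergence (alpha != 0, 1) between discrete distributions."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return (1.0 - np.sum(p2 * (p1 / p2) ** alpha)) / (alpha * (1.0 - alpha))

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])

# As alpha -> 1 the alpha-divergence tends to KL(P1, P2);
# as alpha -> 0 it tends to KL(P2, P1).
print(alpha_divergence(p1, p2, 0.999), kl(p1, p2))
print(alpha_divergence(p1, p2, 0.001), kl(p2, p1))
```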
To construct a symmetric quantity, the two KL divergences can be combined, which yields the JS divergence (Jensen-Shannon divergence):
$D_{J S}(p \| q)=\frac{1}{2} D_{K L}\left(p \| \frac{p+q}{2}\right)+\frac{1}{2} D_{K L}\left(q \| \frac{p+q}{2}\right)$
Properties: $D_{J S}$ is symmetric in $p$ and $q$, non-negative, equal to zero if and only if $p = q$, and bounded above by $\ln 2$ (in nats).
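A minimal sketch of the JS divergence built directly from the two KL terms against the mixture $(p+q)/2$ (example distributions are made up, entries strictly positive):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(js_divergence(p, q))               # symmetric: equals js_divergence(q, p)
print(js_divergence(p, q) <= np.log(2))  # bounded above by ln 2
```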
Given a convex function $f(t): \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}$ with $f(1)=0$, $f^{\prime}(1)=0$, $f^{\prime \prime}(1)=1$, the $f$-divergence on $\mathcal{P}$ is defined by
$\sum \limits _{x} p_{2}(x) f\left(\frac{p_{1}(x)}{p_{2}(x)}\right)$
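As a sanity check, the sketch below evaluates the $f$-divergence for one assumed choice, $f(t)=t \ln t-(t-1)$, which satisfies $f(1)=0$, $f^{\prime}(1)=0$, $f^{\prime \prime}(1)=1$ and recovers $D_{K L}\left(p_{1} \| p_{2}\right)$:

```python
import numpy as np

def f_divergence(p1, p2, f):
    """Generic f-divergence: sum_x p2(x) * f(p1(x) / p2(x))."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return np.sum(p2 * f(p1 / p2))

# Assumed choice of f: f(t) = t*ln(t) - (t - 1); it satisfies
# f(1) = 0, f'(1) = 0, f''(1) = 1 and yields the KL divergence.
def f_kl(t):
    return t * np.log(t) - (t - 1.0)

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])

print(f_divergence(p1, p2, f_kl))    # f-divergence with this f
print(np.sum(p1 * np.log(p1 / p2)))  # direct D_KL(p1 || p2) for comparison
```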
The harmonic mean similarity is a similarity on $\mathcal{P}$ defined by
$2 \sum \limits _{x} \frac{p_{1}(x) p_{2}(x)}{p_{1}(x)+p_{2}(x)} .$
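A one-line sketch of the harmonic mean similarity on the same made-up discrete distributions:

```python
import numpy as np

def harmonic_mean_similarity(p1, p2):
    """Harmonic mean similarity: 2 * sum_x p1(x) * p2(x) / (p1(x) + p2(x))."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return 2.0 * np.sum(p1 * p2 / (p1 + p2))

print(harmonic_mean_similarity([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
```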
The fidelity similarity (or Bhattacharya coefficient, Hellinger affinity) on $\mathcal{P}$ is
$\rho\left(P_{1}, P_{2}\right)=\sum_{x} \sqrt{p_{1}(x) p_{2}(x)} .$
In terms of the fidelity similarity $\rho$ , the Hellinger metric (or Matusita distance, Hellinger-Kakutani metric) on $\mathcal{P}$ is defined by
$\left(\sum\limits_{x}\left(\sqrt{p_{1}(x)}-\sqrt{p_{2}(x)}\right)^{2}\right)^{\frac{1}{2}}=\sqrt{2\left(1-\rho\left(P_{1}, P_{2}\right)\right)}$
In terms of the fidelity similarity $\rho$ , the Bhattacharya distance 1 (1946) is
$\left(\arccos \rho\left(P_{1}, P_{2}\right)\right)^{2} $
for $P_{1}, P_{2} \in \mathcal{P}$. Twice this distance is the Rao distance. It is also used in Statistics and Machine Learning, where it is called the Fisher distance.
The Bhattacharya distance 2 (1943) on $\mathcal{P}$ is defined by
$-\ln \rho\left(P_{1}, P_{2}\right)$
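A small sketch computing the fidelity similarity $\rho$ and the Hellinger and Bhattacharya distances derived from it, following the definitions quoted above (example distributions are made up):

```python
import numpy as np

def fidelity(p1, p2):
    """Bhattacharya coefficient: rho = sum_x sqrt(p1(x) * p2(x))."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return np.sum(np.sqrt(p1 * p2))

def hellinger(p1, p2):
    """Hellinger metric: sqrt(2 * (1 - rho))."""
    return np.sqrt(2.0 * (1.0 - fidelity(p1, p2)))

def bhattacharya_1(p1, p2):
    """Bhattacharya distance 1: (arccos rho)^2."""
    return np.arccos(fidelity(p1, p2)) ** 2

def bhattacharya_2(p1, p2):
    """Bhattacharya distance 2: -ln(rho)."""
    return -np.log(fidelity(p1, p2))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])
print(fidelity(p1, p2), hellinger(p1, p2),
      bhattacharya_1(p1, p2), bhattacharya_2(p1, p2))
```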
The $\chi^{2}$ -distance (or Pearson $\chi^{2} $-distance) is a quasi-distance on $\mathcal{P}$ , defined by
$\sum_{x} \frac{\left(p_{1}(x)-p_{2}(x)\right)^{2}}{p_{2}(x)}$
The Neyman $\chi^{2}$ -distance is a quasi-distance on $\mathcal{P} $, defined by
$\sum_{x} \frac{\left(p_{1}(x)-p_{2}(x)\right)^{2}}{p_{1}(x)} .$
Half of this $\chi^{2}$-distance is also called Kagan's divergence.
The probabilistic symmetric $\chi^{2}$ -measure is a distance on $\mathcal{P} $, defined by
$2 \sum_{x} \frac{\left(p_{1}(x)-p_{2}(x)\right)^{2}}{p_{1}(x)+p_{2}(x)} .$
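A sketch of the three $\chi^{2}$-type quantities just defined, again on made-up discrete distributions with strictly positive entries:

```python
import numpy as np

def pearson_chi2(p1, p2):
    """Pearson chi^2-distance: sum_x (p1 - p2)^2 / p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return np.sum((p1 - p2) ** 2 / p2)

def neyman_chi2(p1, p2):
    """Neyman chi^2-distance: sum_x (p1 - p2)^2 / p1."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return np.sum((p1 - p2) ** 2 / p1)

def symmetric_chi2(p1, p2):
    """Probabilistic symmetric chi^2-measure: 2 * sum_x (p1 - p2)^2 / (p1 + p2)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return 2.0 * np.sum((p1 - p2) ** 2 / (p1 + p2))

p1 = np.array([0.5, 0.3, 0.2])
p2 = np.array([0.4, 0.4, 0.2])
print(pearson_chi2(p1, p2), neyman_chi2(p1, p2), symmetric_chi2(p1, p2))
```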
Since I have no immediate use for the remaining distances, I have not written them up here.
This post draws on the Encyclopedia of Distances, in particular the chapter "Distances on Distribution Laws"; contact the blogger if you need the e-book.
I also referred to another "borrower's" blog, 《机器学习中的数学》 (Mathematics in Machine Learning).