PaperNotes: Attention Series (2) - ANMT


1. paper

Effective Approaches to Attention-based Neural Machine Translation 2015

2. keypoint

The paper proposes global attention and local attention for NMT. Global attention is similar to soft attention, while local attention is a variant that combines soft attention and hard attention.

3. Introduction

At the time, NMT was already in use, but it lacked a suitable attention architecture. This paper proposes two: global attention and local attention.

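4. Model

4.1 Overview

The RNN unit used in the model is the LSTM, arranged as a stacked (multi-layer) LSTM.

At prediction time, the output h_t of the top LSTM layer and the context vector c_t computed by the attention mechanism are combined into an attentional hidden state: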

\mathbf{\widetilde{h}_t}=tanh(\mathbf{W_c}[\mathbf{c_t;h_t}])

$\widetilde{h}_t$ is then passed through a softmax layer to obtain the probability of each candidate target word.

p(y_t|y_{<t}) = softmax(\mathbf{W_s\widetilde{h}_t})
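The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`matvec`, `softmax`, `predict`), the toy dimensions (hidden size 2, vocabulary size 3) and all weight values are assumptions made for the example, not values from the paper.

```python
import math

def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(c_t, h_t, W_c, W_s):
    """Return (h_tilde, probs) for one decoding step."""
    concat = c_t + h_t                                  # [c_t; h_t] as list concatenation
    h_tilde = [math.tanh(v) for v in matvec(W_c, concat)]
    probs = softmax(matvec(W_s, h_tilde))               # distribution over the vocabulary
    return h_tilde, probs

# Toy inputs (hypothetical values chosen for illustration).
c_t = [0.5, -0.1]
h_t = [0.2, 0.4]
W_c = [[0.1, 0.2, 0.3, 0.4],
       [-0.3, 0.1, 0.0, 0.2]]
W_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_tilde, probs = predict(c_t, h_t, W_c, W_s)
```

Here `probs` has one entry per vocabulary word and sums to 1, while every component of `h_tilde` lies in (-1, 1) because of the tanh.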

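The paper proposes two models, global and local, which differ mainly in how c_t is computed. At prediction time, the global model computes alignment weights between the target hidden state h_t and all source hidden states $\overline{h}_s$, then takes the weighted average of those source states under the alignment weights as c_t.

The local model instead predicts, from the target side, an aligned source position p_t; only the source hidden states inside a window around p_t take part in the attention and weighted-average computation.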

4.2 global attention

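See Figure 2 of the paper. Computing c_t takes all encoder hidden states into account. The alignment vector $\alpha_t$ here has variable length, because the source sentence length varies.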

\mathbf{\alpha_t}(s) = align(\mathbf{h_t, \overline{h}_s})
=\frac{exp(score(\mathbf{h_t, \overline{h}_s}))}{\sum_{s'}{exp(score(\mathbf{h_t, \overline{h}_{s'}}))}}
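As a rough sketch of the formula above, the snippet below computes the alignment weights as a softmax over the scores of all source states and then forms c_t as their weighted average. It assumes a plain dot product as the score function, and the function name `global_attention` and the toy vectors are made up for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def global_attention(h_t, source_states):
    """Return (alpha, c_t), attending over every source hidden state."""
    scores = [dot(h_t, h_s) for h_s in source_states]   # assumed score: dot product
    m = max(scores)                                     # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                   # one weight per source position
    c_t = [sum(a * h_s[i] for a, h_s in zip(alpha, source_states))
           for i in range(len(h_t))]                    # weighted average of source states
    return alpha, c_t

# Three toy source states of dimension 2 (hypothetical values).
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_t = [2.0, 0.0]
alpha, c_t = global_attention(h_t, source_states)
```

The length of `alpha` equals the number of source states, which is why it varies from sentence to sentence.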

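The score can be computed in a content-based way, for which there are three alternatives. The three are essentially the same idea: each measures how well h_t matches $\overline{h}_s$.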

score(\mathbf{h_t, \overline{h}_s})=
\begin{cases}
\mathbf{h_t^\top\overline{h}_s}&dot \\
\mathbf{h_t^\top W_a\overline{h}_s} & general \\
\mathbf{v_a}^\top tanh(\mathbf{W_a[h_t;\overline{h}_s]})& concat
\end{cases}
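To make the three cases concrete, here is a minimal sketch of each score function in plain Python. The function names and the toy weights are assumptions for illustration; in a real model W_a and v_a would be learned, and the W_a of the concat case has a different shape from the W_a of the general case.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def score_dot(h_t, h_s):
    # h_t^T h_s
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    # h_t^T W_a h_s
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    # v_a^T tanh(W_a [h_t; h_s])
    hidden = [math.tanh(v) for v in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)

# Toy vectors and weights (hypothetical values).
h_t, h_s = [1.0, 2.0], [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.1, 0.0, 0.2, 0.0],
            [0.0, 0.3, 0.0, -0.1]]
v_a = [1.0, 1.0]

s_dot = score_dot(h_t, h_s)
s_gen = score_general(h_t, h_s, identity)   # with W_a = I, general reduces to dot
s_cat = score_concat(h_t, h_s, W_concat, v_a)
```

Passing the identity matrix as W_a shows that dot is the special case of general with no learned transformation.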

There is also a location-based variant, in which the attention weights depend only on the target hidden state.

4.3 local attention

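Global attention has to attend to all source hidden states for every target word, which is computationally expensive and unacceptable for long inputs such as whole documents. Local attention was proposed to address this. For each target item, the model first predicts an aligned position p_t; given a window parameter D, the context vector c_t is the weighted average of the source hidden states in the interval [p_t-D, p_t+D]. Note that the attention weight vector $\alpha_t$ here has fixed length, because D is fixed.

Two ways of computing p_t are proposed.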

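Monotonic alignment. This plays the same role as word alignment in SMT models. One can simply set p_t = t, assuming that source items and target items are monotonically aligned.
Predictive alignment.
$p_t=S \cdot sigmoid(\mathbf{v}_p^\top tanh(\mathbf{W_ph_t}))$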

$\mathbf{W_p}$ and $\mathbf{v_p}$ are model parameters, and S is the source sentence length. To favor alignment points close to p_t, the weights are multiplied by a Gaussian centered at p_t:

\alpha_t(s)=align(\mathbf{h_t, \overline{h}_s})exp(-\frac{(s-p_t)^2}{2\sigma^2})

$\sigma=D/2$ (set empirically); p_t is a real number, and s is an integer within the window around p_t.
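The predictive variant of local attention can be sketched roughly as follows. Several things here are assumptions for the sake of a runnable example: source positions are 0-based, the window is clipped to the sentence boundaries, a dot product is used as the score, and the names `predict_position` and `local_attention` as well as all numeric values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict_position(h_t, W_p, v_p, S):
    """p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number in (0, S)."""
    hidden = [math.tanh(dot(row, h_t)) for row in W_p]
    return S / (1.0 + math.exp(-dot(v_p, hidden)))

def local_attention(h_t, source_states, p_t, D):
    """Return (positions, alpha, c_t) for the window [p_t - D, p_t + D]."""
    S = len(source_states)
    lo = max(0, math.ceil(p_t - D))            # clip window to the sentence (assumption)
    hi = min(S - 1, math.floor(p_t + D))
    positions = list(range(lo, hi + 1))        # integer positions s inside the window
    scores = [dot(h_t, source_states[s]) for s in positions]
    m = max(scores)
    exps = [math.exp(v - m) for v in scores]
    total = sum(exps)
    sigma = D / 2.0
    # align(h_t, h_s) within the window, damped by a Gaussian centered at p_t.
    # The result is not renormalized, so the weights sum to at most 1.
    alpha = [(e / total) * math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))
             for e, s in zip(exps, positions)]
    c_t = [sum(a * source_states[s][i] for a, s in zip(alpha, positions))
           for i in range(len(h_t))]
    return positions, alpha, c_t

# Toy setup: 6 source states of dimension 2, window half-width D = 2.
source_states = [[0.1 * i, 1.0 - 0.1 * i] for i in range(6)]
h_t = [1.0, 0.5]
W_p = [[0.5, -0.5], [0.3, 0.8]]
v_p = [1.0, -1.0]
p_t = predict_position(h_t, W_p, v_p, S=len(source_states))
positions, alpha, c_t = local_attention(h_t, source_states, p_t, D=2)
```

For monotonic alignment, the same `local_attention` function could be called with p_t set to the current target step t instead of the predicted value.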

4.4 input-feeding approach

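Although attention is used, the attention decision at each time step is made independently, without taking earlier alignment decisions into account. The input-feeding approach addresses this by concatenating the attentional vector $\widetilde{h}_t$ with the decoder input at the next time step.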

5. Experiments
