From b319d4813bc0bc5509a6237a96a1ef377f739fce Mon Sep 17 00:00:00 2001
From: leafee98
Date: Tue, 20 Sep 2022 14:00:13 +0800
Subject: [PATCH] new essay: attention-and-multiattention.md

---
 .../essays/attention-and-multiattention.md | 46 +++++++++++++++++++
 1 file changed, 46 insertions(+)
 create mode 100644 content/essays/attention-and-multiattention.md

diff --git a/content/essays/attention-and-multiattention.md b/content/essays/attention-and-multiattention.md
new file mode 100644
index 0000000..1614088
--- /dev/null
+++ b/content/essays/attention-and-multiattention.md
@@ -0,0 +1,46 @@
+---
+title: "Attention and Multiattention"
+date: 2022-09-19T20:35:33+08:00
+tags: []
+categories: []
+weight: 50
+show_comments: true
+katex: true
+draft: false
+---
+
+## Self-Attention
+
+As is well known, attention is a weighted sum over multiple key-value pairs, with the weights determined by matching one query against the keys:
+
+$$
+\mathrm{Attention}(Q, K, V) = V \cdot \mathrm{softmax}(\mathrm{score}(Q, K))
+$$
+
+A common choice for the score function is the (scaled) dot product between the query and each key. When Q == K == V, this computation is called self-attention (a code sketch is given in the appendix below).
+
+## Multi-Head Attention
+
+Suppose Head = 4. Then the computation is as follows, where each Q, K, V is the full-sized Q, K, V, so this amounts to running attention 4 times. Each W is a learnable parameter: since the attention computation itself contains no learnable parameters, a linear transformation is applied before computing attention, and it is the parameters of these linear transformations that get trained.
+
+$$
+head_1 = \mathrm{Attention}(W_1^Q Q, W_1^K K, W_1^V V)
+\\
+head_2 = \mathrm{Attention}(W_2^Q Q, W_2^K K, W_2^V V)
+\\
+head_3 = \mathrm{Attention}(W_3^Q Q, W_3^K K, W_3^V V)
+\\
+head_4 = \mathrm{Attention}(W_4^Q Q, W_4^K K, W_4^V V)
+$$
+
+Finally, the 4 attention results are concatenated and passed through one more large linear transformation:
+
+$$
+\mathrm{Multihead}(Q, K, V) = W^O[head_1, head_2, head_3, head_4]
+$$
+
+## References
+
+https://www.adityaagrawal.net/blog/deep_learning/attention
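+
+## Appendix: Code Sketches
+
+To make the formulas above concrete, here is a minimal sketch of self-attention in NumPy. It assumes the scaled dot product as the score function and a row-vector convention (one token per row); the function names and dimensions are illustrative, not taken from the referenced post.
+
+```python
+# Minimal scaled-dot-product attention sketch (illustrative assumptions:
+# score(Q, K) = Q K^T / sqrt(d_k), tokens stored as rows).
+import numpy as np
+
+def softmax(x, axis=-1):
+    """Numerically stable softmax along the given axis."""
+    x = x - x.max(axis=axis, keepdims=True)
+    e = np.exp(x)
+    return e / e.sum(axis=axis, keepdims=True)
+
+def attention(Q, K, V):
+    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v) -> (n_q, d_v)."""
+    d_k = Q.shape[-1]
+    scores = Q @ K.T / np.sqrt(d_k)     # match each query against all keys
+    weights = softmax(scores, axis=-1)  # each row sums to 1
+    return weights @ V                  # weighted sum of the values
+
+# Self-attention: the same matrix plays the roles of Q, K and V.
+rng = np.random.default_rng(0)
+X = rng.normal(size=(5, 8))             # 5 tokens, model dimension 8
+print(attention(X, X, X).shape)         # (5, 8)
+```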
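+
+And a matching sketch of multi-head attention with 4 heads, reusing `attention()` from above. The per-head dimension `d_head` and the right-multiplications `Q @ Wq[i]` (the row-vector counterpart of the post's $W_i^Q Q$) are assumptions for illustration.
+
+```python
+# Minimal 4-head attention sketch: project, attend per head, concatenate,
+# then apply the final linear map W^O.
+def multihead(Q, K, V, Wq, Wk, Wv, Wo):
+    heads = [
+        attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i])  # one attention per head
+        for i in range(len(Wq))
+    ]
+    concat = np.concatenate(heads, axis=-1)         # join the 4 head outputs
+    return concat @ Wo                              # learnable W^O
+
+n_heads, d_model = 4, 8
+d_head = d_model // n_heads                         # 2 dimensions per head
+rng = np.random.default_rng(1)
+Wq = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
+Wk = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
+Wv = [rng.normal(size=(d_model, d_head)) for _ in range(n_heads)]
+Wo = rng.normal(size=(n_heads * d_head, d_model))
+
+X = rng.normal(size=(5, d_model))
+print(multihead(X, X, X, Wq, Wk, Wv, Wo).shape)     # (5, 8)
+```
+
+In real implementations the per-head projections are usually fused into single matrices and all heads computed in one batched operation; the loop above simply mirrors the four-equation form of the post.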