---
title: "Self-Attention and Multi-Head Attention"
date: 2022-09-19T20:35:33+08:00
tags: []
categories: []
weight: 50
show_comments: true
katex: true
draft: false
---
<!--more-->
## Self-Attention

As is well known, attention is simply a weighted sum computed from one query and multiple key-value pairs, as follows:
$$
Attention(Q, K, V) = V \cdot softmax(score(Q, K))
$$

When Q == K == V, this computation is called self-attention.
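Below is a minimal NumPy sketch of this computation (not from the original post). It assumes the score function is the scaled dot product (QKᵀ divided by √d_k), and it uses the row-per-token convention, so the weighted sum is written `weights @ V`; calling it with Q = K = V = X gives self-attention.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (one possible choice of score function).

    Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v); each token is a row.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # score(Q, K)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of the values

# Self-attention: Q, K and V are all the same input sequence X.
X = np.random.randn(5, 8)   # 5 tokens, model dimension 8
out = attention(X, X, X)    # shape (5, 8)
```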
## Multi-Head Attention

Suppose Head = 4. Then the computation is as follows, where each Q, K, V is the full-size Q, K, V, so this amounts to doing attention 4 times. Each W is a learnable parameter: the attention computation itself contains no learnable parameters, so a linear transformation is applied to the inputs before computing attention, and it is the parameters of these linear transformations that are trained.
$$
head_1 = Attention(W_1^Q Q, W_1^K K, W_1^V V)
\\\\
head_2 = Attention(W_2^Q Q, W_2^K K, W_2^V V)
\\\\
head_3 = Attention(W_3^Q Q, W_3^K K, W_3^V V)
\\\\
head_4 = Attention(W_4^Q Q, W_4^K K, W_4^V V)
$$
Finally, the results of the 4 attention heads are concatenated and passed through one more, larger linear transformation:
$$
Multihead(Q, K, V) = W^O [head_1, head_2, head_3, head_4]
$$
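A minimal sketch of the same multi-head computation, continuing the NumPy example above (it reuses the `attention` function from the self-attention sketch; the names `head_weights` and `W_O` are illustrative, not from the original post). Because each token is a row here, the projections are written `Q @ W_Q` rather than W^Q Q.

```python
import numpy as np

def multihead(Q, K, V, head_weights, W_O):
    """Multi-head attention: project per head, attend per head,
    concatenate the head outputs, then apply the final transformation W_O.

    head_weights: list of (W_Q, W_K, W_V) tuples, one per head.
    Relies on the `attention` function from the self-attention sketch above.
    """
    heads = [attention(Q @ W_Q, K @ W_K, V @ W_V)
             for W_Q, W_K, W_V in head_weights]
    return np.concatenate(heads, axis=-1) @ W_O

# 4 heads, model dimension 8, per-head dimension 8 // 4 = 2
rng = np.random.default_rng(0)
d_model, n_heads = 8, 4
d_head = d_model // n_heads
head_weights = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
                for _ in range(n_heads)]
W_O = rng.standard_normal((n_heads * d_head, d_model))

X = rng.standard_normal((5, d_model))        # 5 tokens
out = multihead(X, X, X, head_weights, W_O)  # result: shape (5, 8)
```

In this sketch each head projects down to d_model / n_heads dimensions, as is commonly done, so the concatenated output has the model dimension again and the total cost stays comparable to a single full-size attention.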
## References
https://www.adityaagrawal.net/blog/deep_learning/attention