## An Illustrated Self-Attention Implementation

Self-attention is the common building block of BERT-family models and, more broadly, of the Transformer architecture. Its introduction did away with the sequential, step-by-step RNN computation that deep networks for "language understanding" tasks previously relied on, replacing it with a global view: attention over all positions, computed in parallel.

(This section follows the walkthrough in https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a.)

The toy input consists of three 4-dimensional vectors:

```python
import numpy as np

input = np.array([
    [1., 0., 1., 0.],  # input 1
    [0., 2., 0., 2.],  # input 2
    [1., 1., 1., 1.],  # input 3
])
```

The projection weights for the queries, keys, and values are:

```python
w_key = np.array([
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
])
w_query = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
])
w_value = np.array([
    [0, 2, 0],
    [0, 3, 0],
    [1, 0, 3],
    [1, 1, 0],
])
```
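In a real Transformer these three projection matrices are learned parameters, not hand-picked integers; the fixed values above exist only to make the arithmetic easy to follow. A minimal sketch of how such weights might be initialized (the dimensions, scale, and `_learned` names here are illustrative assumptions, not from the article):

```python
import numpy as np

d_model, d_k = 4, 3  # input width and projection width, matching the toy example
rng = np.random.default_rng(42)

# Small random initialization; distinct names so the hand-picked
# matrices above are left untouched.
w_query_learned = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
w_key_learned = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
w_value_learned = rng.normal(scale=d_model ** -0.5, size=(d_model, d_k))
```

In training, these would then be updated by backpropagation like any other dense-layer weights.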

```python
keys = np.dot(input, w_key)      # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
queries = np.dot(input, w_query) # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
values = np.dot(input, w_value)  # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```

The attention scores are the dot products of each query with every key; a row-wise softmax turns each row of scores into weights:

```python
import tensorflow as tf

attn_scores = np.dot(queries, keys.T)  # [[2, 4, 4], [4, 16, 12], [4, 12, 10]]
# .numpy() converts the TensorFlow tensor back to a NumPy array
attn_scores_softmax = tf.nn.softmax(attn_scores).numpy()
# approximately [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
```
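If TensorFlow is not at hand, the same row-wise softmax can be computed in plain NumPy; a minimal sketch (the max-subtraction is the usual numerical-stability trick, not part of the original walkthrough):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtracting the row max leaves the result unchanged but avoids overflow
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.array([[2., 4., 4.], [4., 16., 12.], [4., 12., 10.]])
weights = softmax(scores)  # each row sums to 1
```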

```python
weighted_values = values[:, None] * attn_scores_softmax.T[:, :, None]
# attn_scores_softmax.T[:, :, None], shape (3, 3, 1):
# [[[0. ]
#   [0. ]
#   [0. ]]
#  [[0.5]
#   [1. ]
#   [0.9]]
#  [[0.5]
#   [0. ]
#   [0.1]]]
# values[:, None], shape (3, 1, 3):
# [[[1. 2. 3.]]
#  [[2. 8. 0.]]
#  [[2. 6. 3.]]]
```

```python
outputs = weighted_values.sum(axis=0)
# approximately [[2.0, 7.0, 1.5], [2.0, 8.0, 0.0], [2.0, 7.8, 0.3]]
```
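The broadcast-multiply-then-sum above is just an unrolled matrix product: the whole computation collapses to softmax(Q·Kᵀ)·V. A self-contained sketch in plain NumPy (a hand-rolled softmax stands in for `tf.nn.softmax`, and `x` stands in for the input above):

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

x = np.array([[1., 0., 1., 0.],
              [0., 2., 0., 2.],
              [1., 1., 1., 1.]])
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

q, k, v = x @ w_query, x @ w_key, x @ w_value
attn = softmax(q @ k.T)
outputs = attn @ v  # same result as the broadcast version above
```

The exact values differ slightly from the rounded comments above because the walkthrough rounds the attention weights to one decimal. For the record, the Transformer paper additionally scales the scores by 1/√d_k before the softmax; this illustration, like the original article, omits that step.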