In the previous article, we implemented machine translation based on a seq2seq + attention model.
Today we will use the Transformer model to implement machine translation.
The model idea from the previous article
Model idea: Attention
It removes the fixed-length encoding bottleneck, so information passes from the Encoder to the Decoder without loss.
However:
With a GRU there is still a computational bottleneck and the degree of parallelism is low: it processes the sequence from front to back, so later words cannot be processed until the earlier words are done. For an RNN, even with attention added, parallelism is still insufficient.
Attention only exists between the Encoder and the Decoder; there is no attention within the encoder itself or within the decoder itself. Attention is a lossless way of passing information, whereas internally the encoder and decoder can only pass information through the hidden states of a GRU, LSTM, or plain RNN, and that kind of information passing loses information over long distances.
For example, when the decoder is translating a fairly long sentence, it may have already translated a certain part of the source sentence's meaning, but several hundred words later, because of information loss, it no longer remembers that this meaning has been translated and translates it again, which degrades translation quality.
Can we remove the RNN?
Can we add self-attention to the input and the output respectively?
1.1 The Transformer model
Encoder-Decoder structure
Multi-layer Encoder-Decoder
Positional encoding
Multi-head attention
Scaled dot-product attention
Add & norm
1.1.1 Model structure: Encoder-Decoder architecture
The Transformer is a multi-layer encoder-decoder architecture.
"Multi-layer" has two meanings here: first, the encoder and the decoder each consist of multiple layers; second, the encoder's output is fed into every block of the decoder.
Model structure: Encoder. Each block consists of two sub-layers, self-attention and a feed-forward neural network, and each of these sub-layers is followed by an add & normalize step.
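To make that block structure concrete, here is a minimal sketch of one encoder block built from Keras built-in layers. This is not the implementation used later in this article (step 4 of the plan defines its own MultiHeadAttention and EncoderLayer); the class name and hyperparameters below are illustrative only.

import tensorflow as tf

class TinyEncoderBlock(tf.keras.layers.Layer):
    """Illustrative encoder block: self-attention + FFN, each with add & norm."""
    def __init__(self, d_model=512, num_heads=8, dff=2048):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    def call(self, x):
        # sub-layer 1: self-attention, then add & normalize
        attn_out = self.mha(query=x, value=x, key=x)
        x = self.norm1(x + attn_out)
        # sub-layer 2: feed-forward network, then add & normalize
        ffn_out = self.ffn(x)
        return self.norm2(x + ffn_out)

x = tf.random.uniform((2, 40, 512))      # (batch_size, seq_len, d_model)
print(TinyEncoderBlock()(x).shape)       # (2, 40, 512)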
Model structure: Attention
Scaled dot-product attention
Why divide by √dk?
To keep the dot products from growing too large, which would push the softmax into its saturated, small-gradient region.
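Written out, the scaled dot-product attention implemented later in section 2.5 is the standard formulation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the keys; dividing by $\sqrt{d_k}$ keeps the logits at a reasonable scale before the softmax.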
1.1.2 Multi-head attention
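The article's own MultiheadAttention class comes later (step 4.1 of the plan) and is not shown in this excerpt. As a small sketch of the core idea (the shapes below are my own illustrative choices): d_model is split into num_heads slices, scaled dot-product attention runs on each slice in parallel, and the results are concatenated and projected back.

import tensorflow as tf

batch_size, seq_len, d_model, num_heads = 2, 40, 512, 8
depth = d_model // num_heads             # 64 dimensions per head

x = tf.random.uniform((batch_size, seq_len, d_model))

# split the last dimension into (num_heads, depth) and move the head axis forward
x = tf.reshape(x, (batch_size, seq_len, num_heads, depth))
x = tf.transpose(x, perm=[0, 2, 1, 3])   # (batch_size, num_heads, seq_len, depth)
print(x.shape)                           # (2, 8, 40, 64)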
1.1.3 Positional encoding
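The encoding implemented in section 2.3 follows the sinusoidal formulas (also quoted in the code comments there):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right),\qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Note that the implementation in section 2.3 concatenates the sine half and the cosine half instead of interleaving them, a common simplification.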
1.1.4 Add & Normalize
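In the usual formulation, each sub-layer's output goes through a residual ("add") connection followed by layer normalization:

$$\mathrm{output} = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)$$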
1.1.5 Model structure: Decoder
Training can be parallelized
Inference still has to proceed sequentially
In self-attention, earlier words must not see later words
This is implemented with a mask (see section 2.4 below)
1.1.6 Model structure: Output
A fully connected layer projecting to the vocabulary size
softmax
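A minimal sketch of that final step, assuming the decoder output has shape (batch_size, target_seq_len, d_model); the variable names here are my own. The vocabulary size of 8192 + 2 matches the subword vocabulary built later plus the start and end tokens.

import tensorflow as tf

d_model, vocab_size = 512, 8192 + 2

decoder_output = tf.random.uniform((64, 40, d_model))

# fully connected layer projecting to the vocabulary size
final_layer = tf.keras.layers.Dense(vocab_size)
logits = final_layer(decoder_output)          # (64, 40, vocab_size)

# softmax over the vocabulary gives a distribution for each target position
probs = tf.nn.softmax(logits, axis=-1)
print(probs.shape)                            # (64, 40, 8194)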
2.1 Transformer in practice
Steps:
# 1. loads data
# 2. preprocesses data -> dataset
# 3. tools
#    3.1 generates position embedding
#    3.2 creates masks (a. padding, b. decoder)
#    3.3 scaled_dot_product_attention
# 4. builds model
#    4.1 MultiheadAttention
#    4.2 EncoderLayer
#    4.3 DecoderLayer
#    4.4 EncoderModel
#    4.5 DecoderModel
#    4.6 Transformer
# 5. optimizer & loss
# 6. train step -> train
# 7. Evaluate and Visualize
2.1.1 Loading the data: we use a dataset from tfds. The dataset is subword-based, and the Transformer model here is built on subwords.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

examples, info = tfds.load('ted_hrlr_translate/pt_to_en',
                           with_info = True,
                           as_supervised = True)
train_examples, val_examples = examples['train'], examples['validation']
print(info)
Print a few examples to see what the dataset looks like:

for pt, en in train_examples.take(5):
    print(pt.numpy())
    print(en.numpy())
    print()
Output:

b'e quando melhoramos a procura , tiramos a \xc3\xbanica vantagem da impress\xc3\xa3o , que \xc3\xa9 a serendipidade .'
b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .'

b'mas e se estes fatores fossem ativos ?'
b'but what if it were active ?'

b'mas eles n\xc3\xa3o tinham a curiosidade de me testar .'
b"but they did n't test for curiosity ."

b'e esta rebeldia consciente \xc3\xa9 a raz\xc3\xa3o pela qual eu , como agn\xc3\xb3stica , posso ainda ter f\xc3\xa9 .'
b'and this conscious defiance is why i , as an agnostic , can still have faith .'

b"`` `` '' podem usar tudo sobre a mesa no meu corpo . ''"
b'you can use everything on the table on me .'
The Portuguese sentences contain some escaped UTF-8 byte sequences (such as \xc3\xa3 for ã), because print shows the raw bytes objects.
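To confirm that these are just ordinary accented characters, you can decode the bytes as UTF-8 (a quick sanity check I added, not part of the original pipeline):

for pt, en in train_examples.take(1):
    # decode the raw bytes so the accented characters display normally
    print(pt.numpy().decode('utf-8'))
    print(en.numpy().decode('utf-8'))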
2.1.2 Building subword tokenizers from the corpus

en_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples),
    target_vocab_size = 2 ** 13)
pt_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples),
    target_vocab_size = 2 ** 13)

sample_string = "Transformer is awesome."

tokenized_string = en_tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

# decode the subword ids back into the original string
origin_string = en_tokenizer.decode(tokenized_string)
print('The original string is {}'.format(origin_string))

assert origin_string == sample_string

for token in tokenized_string:
    print('{} --> "{}"'.format(token, en_tokenizer.decode([token])))
Output:

Tokenized string is [7915, 1248, 7946, 7194, 13, 2799, 7877]
The original string is Transformer is awesome.
7915 --> "T"
1248 --> "ran"
7946 --> "s"
7194 --> "former "
13 --> "is "
2799 --> "awesome"
7877 --> "."
2.2 Creating the dataset

buffer_size = 20000
batch_size = 64
max_length = 40

# convert a sentence pair into subword id sequences, adding start / end tokens
def encode_to_subword(pt_sentence, en_sentence):
    pt_sequence = [pt_tokenizer.vocab_size] \
        + pt_tokenizer.encode(pt_sentence.numpy()) \
        + [pt_tokenizer.vocab_size + 1]
    en_sequence = [en_tokenizer.vocab_size] \
        + en_tokenizer.encode(en_sentence.numpy()) \
        + [en_tokenizer.vocab_size + 1]
    return pt_sequence, en_sequence

def filter_by_max_length(pt, en):
    return tf.logical_and(tf.size(pt) <= max_length,
                          tf.size(en) <= max_length)

# wrap the python function with tf.py_function
def tf_encode_to_subword(pt_sentence, en_sentence):
    return tf.py_function(encode_to_subword,
                          [pt_sentence, en_sentence],
                          [tf.int64, tf.int64])

# map: convert every Portuguese and English sentence in train_examples to subword ids
train_dataset = train_examples.map(tf_encode_to_subword)
# filter the new dataset by length
train_dataset = train_dataset.filter(filter_by_max_length)
train_dataset = train_dataset.shuffle(
    buffer_size).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))
# padded_shapes=([-1], [-1]): pad each dimension up to the longest sequence in the batch

valid_dataset = val_examples.map(tf_encode_to_subword)
valid_dataset = valid_dataset.filter(
    filter_by_max_length).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))
After building the datasets, check that the data looks right:

for pt_batch, en_batch in valid_dataset.take(5):
    print(pt_batch.shape, en_batch.shape)
Output:

(64, 38) (64, 40)
(64, 39) (64, 35)
(64, 39) (64, 39)
(64, 39) (64, 39)
(64, 39) (64, 36)
2.3 Writing some utility functions

# PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

# pos.shape:    [sentence_length, 1]
# i.shape:      [1, d_model]
# result.shape: [sentence_length, d_model]
# compute the angle for every (position, dimension) pair
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000,
                               (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

# apply sine to the even dimensions and cosine to the odd dimensions,
# then concatenate the two halves
def get_position_embedding(sentence_length, d_model):
    angle_rads = get_angles(np.arange(sentence_length)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # sines.shape:   [sentence_length, d_model / 2]
    # cosines.shape: [sentence_length, d_model / 2]
    sines = np.sin(angle_rads[:, 0::2])
    cosines = np.cos(angle_rads[:, 1::2])
    # position_embedding.shape: [sentence_length, d_model]
    position_embedding = np.concatenate([sines, cosines], axis = -1)
    # position_embedding.shape: [1, sentence_length, d_model]
    position_embedding = position_embedding[np.newaxis, ...]
    return tf.cast(position_embedding, dtype=tf.float32)

position_embedding = get_position_embedding(50, 512)
print(position_embedding.shape)
Output:
(1, 50, 512)
def plot_position_embedding(position_embedding):
    plt.pcolormesh(position_embedding[0], cmap = 'RdBu')
    plt.xlabel('Depth')
    plt.xlim((0, 512))
    plt.ylabel('Position')
    plt.colorbar()
    plt.show()

plot_position_embedding(position_embedding)
Output: a heat map of the 50 x 512 position embedding matrix (figure not reproduced here).

2.4 Building the masks
# 1. padding mask, 2. look ahead mask

# batch_data.shape: [batch_size, seq_len]
def create_padding_mask(batch_data):
    padding_mask = tf.cast(tf.math.equal(batch_data, 0), tf.float32)
    # [batch_size, 1, 1, seq_len]
    return padding_mask[:, tf.newaxis, tf.newaxis, :]

x = tf.constant([[7, 6, 0, 0, 1],
                 [1, 2, 3, 0, 0],
                 [0, 0, 0, 4, 5]])
create_padding_mask(x)
Output:

<tf.Tensor: shape=(3, 1, 1, 5), dtype=float32, numpy=
array([[[[0., 0., 1., 1., 0.]]],

       [[[0., 0., 0., 1., 1.]]],

       [[[1., 1., 1., 0., 0.]]]], dtype=float32)>
# attention_weights.shape: [3, 3]
# [[1, 0, 0],
#  [4, 5, 0],
#  [7, 8, 9]]
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)

create_look_ahead_mask(3)
Output:

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]], dtype=float32)>
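Step 3.2 of the plan also calls for a decoder mask that combines padding and look-ahead masking; that combination is built in the full training code, which is not shown in this excerpt. As a hedged sketch of how the two helpers above are typically combined (the function name and variable names here are my own):

def create_masks(inp, tar):
    # padding mask for the encoder's self-attention
    encoder_padding_mask = create_padding_mask(inp)
    # padding mask used when the decoder attends to the encoder output
    encoder_decoder_padding_mask = create_padding_mask(inp)
    # decoder self-attention: block both padded positions and future positions
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    decoder_padding_mask = create_padding_mask(tar)
    decoder_mask = tf.maximum(decoder_padding_mask, look_ahead_mask)
    return encoder_padding_mask, decoder_mask, encoder_decoder_padding_mask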
2.5 Implementing scaled dot-product attention

def scaled_dot_product_attention(q, k, v, mask):
    """
    Args:
    - q: shape == (..., seq_len_q, depth)
    - k: shape == (..., seq_len_k, depth)
    - v: shape == (..., seq_len_v, depth_v)
    - seq_len_k == seq_len_v
    - mask: shape == (..., seq_len_q, seq_len_k)
    Returns:
    - output: weighted sum
    - attention_weights: weights of attention
    """
    # matmul_qk.shape: (..., seq_len_q, seq_len_k)
    # transpose_b: transpose the second matrix before multiplying
    matmul_qk = tf.matmul(q, k, transpose_b = True)

    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # push masked logits towards -inf so they become ~0 after softmax
        scaled_attention_logits += (mask * -1e9)

    # attention_weights.shape: (..., seq_len_q, seq_len_k)
    attention_weights = tf.nn.softmax(
        scaled_attention_logits, axis = -1)

    # output.shape: (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)

    return output, attention_weights

def print_scaled_dot_product_attention(q, k, v):
    temp_out, temp_att = scaled_dot_product_attention(q, k, v, None)
    print("Attention weights are:")
    print(temp_att)
    print("Output is:")
    print(temp_out)
Create a few temporary matrices to test whether the code is correct:

temp_k = tf.constant([[10, 0, 0],
                      [0, 10, 0],
                      [0, 0, 10],
                      [0, 0, 10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[1, 0],
                      [10, 0],
                      [100, 5],
                      [1000, 6]], dtype=tf.float32)  # (4, 2)

temp_q1 = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)

np.set_printoptions(suppress=True)
print_scaled_dot_product_attention(temp_q1, temp_k, temp_v)
Output:

Attention weights are:
tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[10.  0.]], shape=(1, 2), dtype=float32)
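As an extra check (my own addition, not in the original article), we can pass a look-ahead mask to see masking in action: each query row can then only attend to keys at its own position or earlier, so the attention weights above the diagonal come out as roughly zero.

# four queries so that the (4, 4) look-ahead mask fits the (q, k) score matrix
temp_q4 = tf.constant([[0, 10, 0],
                       [0, 0, 10],
                       [10, 0, 0],
                       [0, 10, 10]], dtype=tf.float32)  # (4, 3)

look_ahead = create_look_ahead_mask(4)                  # (4, 4)
out, att = scaled_dot_product_attention(temp_q4, temp_k, temp_v, look_ahead)
print(att)   # weights above the diagonal are ~0
print(out)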