
TensorFlow Hands-On Machine Translation (2): The Transformer Model (Part 1)



 

In the previous article, we implemented machine translation based on a seq2seq + attention model.

 

Today we will implement machine translation with the Transformer model.

 

Let's first recall the model idea from the previous article.

 

Model Idea: Attention

Attention removes the fixed-length encoding bottleneck, so information flows from the Encoder to the Decoder without loss.

However:

The model still uses a GRU, so computation remains a bottleneck and parallelism is low. An RNN processes the sequence from front to back, and a later word cannot be processed until the earlier words are done, so even with attention added, an RNN's degree of parallelism is still insufficient.
Attention exists only between the Encoder and the Decoder; there is no attention inside the encoder or inside the decoder. Attention is a lossless way of passing information, while the encoder and the decoder internally can only pass information through the hidden states of their GRU, LSTM or plain RNN, and this way of passing information loses information over long distances.

For example, when the decoder translates a fairly long sentence, it may have already translated a certain part of the source meaning, but several hundred words later, because of information loss, it no longer remembers that this part has been translated and translates it again, which lowers translation quality.

Can we get rid of the RNN entirely?

 

Can we add self-attention to the input and to the output, respectively?

 

1.1 The Transformer Model

Encoder-Decoder structure:

Multi-layer Encoder-Decoder
Positional encoding
Multi-head attention

Scaled dot-product attention

Add & norm

1.1.1 Model Structure: Encoder-Decoder Architecture

The Transformer is a multi-layer encoder-decoder architecture.

"Multi-layer" here means two things: the encoder and the decoder each consist of multiple stacked blocks, and the encoder's output is passed to every block of the decoder.

Model Structure: Encoder. Each encoder block is split into two sub-layers, self-attention and a feed-forward neural network, and each sub-layer is followed by an add & normalize step.

 

Model Structure: Attention

Scaled dot-product attention.

Why divide by √d_k? To keep the dot-product values from growing too large, which would push the softmax into regions with extremely small gradients.
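For reference, this is the scaled dot-product attention computation from the original "Attention Is All You Need" paper, where d_k is the dimension of the keys:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V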

1.1.2 Multi-Head Attention
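In short, multi-head attention projects Q, K and V several times with different learned linear maps, applies scaled dot-product attention to each projection in parallel, and concatenates the results (the standard formulation from the original paper):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)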

 

 

 

1.1.3 Positional Encoding
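Since the Transformer contains no recurrence, a positional encoding is added to the input embeddings so the model can make use of word order. The sine/cosine encoding implemented in section 2.3 below is:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))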

 

1.1.4 Add & Normalize
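Each sub-layer (self-attention or the feed-forward network) is wrapped in a residual connection followed by layer normalization, so the output of a sub-layer is:

LayerNorm(x + Sublayer(x))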

 

1.1.5 Model Structure: Decoder

Training can be fully parallelized
Inference still has to be done sequentially, token by token
In self-attention, earlier words must not see later words

This is implemented with a mask (the look-ahead mask built in section 2.4 below)

1.1.6 Model Structure: Output

A fully connected layer projecting to the vocabulary size
A softmax over the vocabulary (a short sketch of this output stage follows below)
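A minimal sketch of this output stage (the shapes and target_vocab_size used here are illustrative assumptions, not values from this post):

import tensorflow as tf

# fake decoder output: (batch_size, seq_len, d_model) -- illustrative shapes
decoder_output = tf.random.uniform((64, 40, 512))
target_vocab_size = 8192                        # illustrative vocabulary size
final_layer = tf.keras.layers.Dense(target_vocab_size)
logits = final_layer(decoder_output)            # (64, 40, target_vocab_size)
probs = tf.nn.softmax(logits, axis=-1)          # probability distribution over the vocabulary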

2.1 Transformer in Practice

Implementation steps:

 

# 1. loads data
# 2. preprocesses data -> dataset
# 3. tools
# 3.1 generates position embedding
# 3.2 create mask. (a. padding, b. decoder)
# 3.3 scaled_dot_product_attention
# 4. builds model
# 4.1 MultiheadAttention
# 4.2 EncoderLayer
# 4.3 DecoderLayer
# 4.4 EncoderModel
# 4.5 DecoderModel
# 4.6 Transformer
# 5. optimizer & loss
# 6. train step -> train
# 7. Evaluate and Visualize

2.1.1 Loading the data: we use the ted_hrlr_translate/pt_to_en dataset from tfds and encode it with a subword tokenizer, since this Transformer model works at the subword level.

# tensorflow, numpy and matplotlib are used throughout the rest of the post
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

examples, info = tfds.load('ted_hrlr_translate/pt_to_en',
                           with_info = True,
                           as_supervised = True)
train_examples, val_examples = examples['train'], examples['validation']
print(info)

 

Print a few examples to see what the dataset looks like:

 

for pt, en in train_examples.take(5):
    print(pt.numpy())
    print(en.numpy())
    print()

 

Output:

 

b'e quando melhoramos a procura , tiramos a \xc3\xbanica vantagem da impress\xc3\xa3o , que \xc3\xa9 a serendipidade .'
b'and when you improve searchability , you actually take away the one advantage of print , which is serendipity .'
b'mas e se estes fatores fossem ativos ?'
b'but what if it were active ?'
b'mas eles n\xc3\xa3o tinham a curiosidade de me testar .'
b"but they did n't test for curiosity ."
b'e esta rebeldia consciente \xc3\xa9 a raz\xc3\xa3o pela qual eu , como agn\xc3\xb3stica , posso ainda ter f\xc3\xa9 .'
b'and this conscious defiance is why i , as an agnostic , can still have faith .'
b"`` `` '' podem usar tudo sobre a mesa no meu corpo . ''"
b'you can use everything on the table on me .'

 

The Portuguese sentences contain some escaped bytes (the \xc3... sequences above are the UTF-8 encodings of accented characters).
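If you want to read the sentences without the byte escapes, the byte strings can be decoded as UTF-8 (a small optional check, not part of the original walkthrough):

for pt, en in train_examples.take(1):
    print(pt.numpy().decode('utf-8'))
    print(en.numpy().decode('utf-8'))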

 

2.1.2 Building the subword tokenizers from the corpus

 

en_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples),
    target_vocab_size = 2 ** 13)
pt_tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples),
    target_vocab_size = 2 ** 13)

 

sample_string = "Transformer is awesome."
tokenized_string = en_tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))
# decode the subword ids back into the original string
origin_string = en_tokenizer.decode(tokenized_string)
print('The original string is {}'.format(origin_string))
assert origin_string == sample_string
for token in tokenized_string:
    print('{} --> "{}"'.format(token, en_tokenizer.decode([token])))

 

Output:

 

Tokenized string is [7915, 1248, 7946, 7194, 13, 2799, 7877]
The original string is Transformer is awesome.
7915 --> "T"
1248 --> "ran"
7946 --> "s"
7194 --> "former "
13 --> "is "
2799 --> "awesome"
7877 --> "."

 

2.2 Building the Dataset

 

buffer_size = 20000
batch_size = 64
max_length = 40

# Convert a sentence pair into subword id sequences,
# adding start and end tokens (vocab_size and vocab_size + 1).
def encode_to_subword(pt_sentence, en_sentence):
    pt_sequence = [pt_tokenizer.vocab_size] \
        + pt_tokenizer.encode(pt_sentence.numpy()) \
        + [pt_tokenizer.vocab_size + 1]
    en_sequence = [en_tokenizer.vocab_size] \
        + en_tokenizer.encode(en_sentence.numpy()) \
        + [en_tokenizer.vocab_size + 1]
    return pt_sequence, en_sequence

def filter_by_max_length(pt, en):
    return tf.logical_and(tf.size(pt) <= max_length,
                          tf.size(en) <= max_length)

# Wrap the Python function with tf.py_function so it can be used in dataset.map
def tf_encode_to_subword(pt_sentence, en_sentence):
    return tf.py_function(encode_to_subword,
                          [pt_sentence, en_sentence],
                          [tf.int64, tf.int64])

# Map: convert every Portuguese / English sentence in train_examples to subword ids
train_dataset = train_examples.map(tf_encode_to_subword)
# Filter out sentence pairs that are too long
train_dataset = train_dataset.filter(filter_by_max_length)
train_dataset = train_dataset.shuffle(
    buffer_size).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))
# padded_shapes=([-1], [-1]): pad each dimension to the longest length in the batch
valid_dataset = val_examples.map(tf_encode_to_subword)
valid_dataset = valid_dataset.filter(
    filter_by_max_length).padded_batch(
    batch_size, padded_shapes=([-1], [-1]))

 

After building the dataset, check that the data looks right:

 

for pt_batch, en_batch in valid_dataset.take(5):
    print(pt_batch.shape, en_batch.shape)

 

Output:

 

(64, 38) (64, 40)
(64, 39) (64, 35)
(64, 39) (64, 39)
(64, 39) (64, 39)
(64, 39) (64, 36)

 

2.3 Writing Some Utility Functions

 

# PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
# pos.shape: [sentence_length, 1]
# i.shape  : [1, d_model]
# result.shape: [sentence_length, d_model]
# compute the angle for every (sentence position, embedding dimension) pair
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000,
                               (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates
# apply sine to the even indices and cosine to the odd indices of the angle
# matrix, then concatenate the two halves into the position embedding
def get_position_embedding(sentence_length, d_model):
    angle_rads = get_angles(np.arange(sentence_length)[:, np.newaxis],
                            np.arange(d_model)[np.newaxis, :],
                            d_model)
    # sines.shape: [sentence_length, d_model / 2]
    # cosines.shape: [sentence_length, d_model / 2]
    sines = np.sin(angle_rads[:, 0::2])
    cosines = np.cos(angle_rads[:, 1::2])
    
    # position_embedding.shape: [sentence_length, d_model]
    position_embedding = np.concatenate([sines, cosines], axis = -1)
    # position_embedding.shape: [1, sentence_length, d_model]
    position_embedding = position_embedding[np.newaxis, ...]
    
    return tf.cast(position_embedding, dtype=tf.float32)
position_embedding = get_position_embedding(50, 512)
print(position_embedding.shape)

 

Output:

 

(1, 50, 512)

 

def plot_position_embedding(position_embedding):
    plt.pcolormesh(position_embedding[0], cmap = 'RdBu')
    plt.xlabel('Depth')
    plt.xlim((0, 512))
    plt.ylabel('Position')
    plt.colorbar()
    plt.show()
    
plot_position_embedding(position_embedding)

 

Output: a heatmap of the position embedding matrix (x-axis: Depth, 0 to 512; y-axis: Position, 0 to 50).
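In the full model (built in the next part of this series), this position embedding is typically added to the scaled token embeddings, following the standard Transformer recipe. A minimal sketch under that assumption (embedding_layer and the function name are illustrative):

def embed_with_position(token_ids, embedding_layer, d_model, position_embedding):
    # token_ids: (batch_size, seq_len)
    seq_len = tf.shape(token_ids)[1]
    x = embedding_layer(token_ids)                   # (batch_size, seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(d_model, tf.float32))  # scale embeddings as in the paper
    x += position_embedding[:, :seq_len, :]          # add the positional encoding
    return x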

2.4 Building the Masks

# 1. padding mask, 2. look ahead
# batch_data.shape: [batch_size, seq_len]
def create_padding_mask(batch_data):
    # 1 where the token is padding (id 0), 0 elsewhere
    padding_mask = tf.cast(tf.math.equal(batch_data, 0), tf.float32)
    # shape: [batch_size, 1, 1, seq_len], so the mask broadcasts over
    # attention heads and query positions
    return padding_mask[:, tf.newaxis, tf.newaxis, :]
x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
create_padding_mask(x)

 

Output:

 

<tf.Tensor: shape=(3, 1, 1, 5), dtype=float32, numpy=
array([[[[0., 0., 1., 1., 0.]]],
       [[[0., 0., 0., 1., 1.]]],
       [[[1., 1., 1., 0., 0.]]]], dtype=float32)>

 

# Example: for a [3, 3] attention-weight matrix, the look-ahead mask keeps only
# the lower triangle, so each position attends to itself and earlier positions:
# [[1, 0, 0],
#  [4, 5, 0],
#  [7, 8, 9]]
def create_look_ahead_mask(size):
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask # (seq_len, seq_len)
create_look_ahead_mask(3)

 

Output:

 

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[0., 1., 1.],
       [0., 0., 1.],
       [0., 0., 0.]], dtype=float32)>
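When the decoder is assembled later, the padding mask and the look-ahead mask are usually combined so that a position is masked if either mask says so. A minimal sketch under that assumption (create_decoder_mask is an illustrative helper, not defined in this post):

def create_decoder_mask(tar):
    # tar: (batch_size, seq_len) target token ids
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    decoder_padding_mask = create_padding_mask(tar)
    # element-wise maximum: masked if padded OR in the future
    return tf.maximum(decoder_padding_mask, look_ahead_mask)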

2.5 Implementing Scaled Dot-Product Attention

def scaled_dot_product_attention(q, k, v, mask):
    """
    Args:
    - q: shape == (..., seq_len_q, depth)
    - k: shape == (..., seq_len_k, depth)
    - v: shape == (..., seq_len_v, depth_v)
    - seq_len_k == seq_len_v
    - mask: shape == (..., seq_len_q, seq_len_k)
    Returns:
    - output: weighted sum
    - attention_weights: weights of attention
    """
    
    # matmul_qk.shape: (..., seq_len_q, seq_len_k)
    # transpose_b: transpose the second matrix before multiplying
    matmul_qk = tf.matmul(q, k, transpose_b = True)

    # scale by sqrt(dk) so the dot products do not grow too large
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    if mask is not None:
        # add a large negative value to the masked positions so that,
        # after the softmax, their attention weights are close to 0
        scaled_attention_logits += (mask * -1e9)
    
    # attention_weights.shape: (..., seq_len_q, seq_len_k)
    attention_weights = tf.nn.softmax(
        scaled_attention_logits, axis = -1)
    
    # output.shape: (..., seq_len_q, depth_v)
    output = tf.matmul(attention_weights, v)
    
    return output, attention_weights
def print_scaled_dot_product_attention(q, k, v):
    temp_out, temp_att = scaled_dot_product_attention(q, k, v, None)
    print("Attention weights are:")
    print(temp_att)
    print("Output is:")
    print(temp_out)

 

Create a few temporary matrices to test whether the code is correct:

 

temp_k = tf.constant([[10, 0, 0],
                      [0, 10, 0],
                      [0, 0, 10],
                      [0, 0, 10]], dtype=tf.float32) # (4, 3)
temp_v = tf.constant([[1, 0],
                      [10, 0],
                      [100, 5],
                      [1000, 6]], dtype=tf.float32) # (4, 2)
temp_q1 = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)
np.set_printoptions(suppress=True)
print_scaled_dot_product_attention(temp_q1, temp_k, temp_v)

 

Output:

 

Attention weights are:
tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[10.  0.]], shape=(1, 2), dtype=float32)
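As an extra sanity check (not in the original post), a query that matches two keys equally should split its attention between them:

temp_q2 = tf.constant([[0, 0, 10]], dtype=tf.float32) # (1, 3)
# this query matches the 3rd and 4th keys equally, so the attention weights
# should be roughly [0, 0, 0.5, 0.5] and the output approximately [550, 5.5]
print_scaled_dot_product_attention(temp_q2, temp_k, temp_v)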

 
