Given an image like the one below, our goal is to generate a caption such as "a surfer riding on a wave".
Image source and license: Public Domain
In this example, we'll use an attention-based model. This lets us see which parts of the image the model focuses on as it generates a caption.
The model architecture below is similar to the one described in Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/abs/1502.03044).
This notebook is an end-to-end example. When you run it, it downloads the MS-COCO dataset, preprocesses and caches a subset of the images using Inception V3, trains an encoder-decoder model, and uses the trained model to generate captions on new images.
In this example, you will train the model on a relatively small amount of data: the first 30,000 captions (corresponding to about 20,000 shuffled images, because each image in the dataset has multiple captions).
from __future__ import absolute_import, division, print_function, unicode_literals
!pip install -q tensorflow==2.0.0-alpha0
import tensorflow as tf
# We'll generate plots of attention in order to see which parts of an image
# our model focuses on during captioning
import matplotlib.pyplot as plt
# Scikit-learn includes many helpful utilities
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import re
import numpy as np
import os
import time
import json
from glob import glob
from PIL import Image
import pickle
Download and prepare the MS-COCO dataset
We will use the MS-COCO dataset to train our model. The dataset contains over 82,000 images, each of which has at least 5 different caption annotations. The code below downloads and extracts the dataset automatically (http://cocodataset.org/?hl=zh-CN#home).
Caution: this starts a large download. You'll use the training set, which is a 13GB file.
annotation_zip = tf.keras.utils.get_file('captions.zip',
                                         cache_subdir=os.path.abspath('.'),
                                         origin='http://images.cocodataset.org/annotations/annotations_trainval2014.zip',
                                         extract=True)
annotation_file = os.path.dirname(annotation_zip)+'/annotations/captions_train2014.json'

name_of_zip = 'train2014.zip'
if not os.path.exists(os.path.abspath('.') + '/' + name_of_zip):
    image_zip = tf.keras.utils.get_file(name_of_zip,
                                        cache_subdir=os.path.abspath('.'),
                                        origin='http://images.cocodataset.org/zips/train2014.zip',
                                        extract=True)
    PATH = os.path.dirname(image_zip)+'/train2014/'
else:
    PATH = os.path.abspath('.')+'/train2014/'
Optionally, limit the size of the training set for faster training
In this example, we'll select a subset of 30,000 captions and use them and their corresponding images to train the model. As always, captioning quality improves if you choose to use more data.
# Read the json file
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# Store captions and image names in vectors
all_captions = []
all_img_name_vector = []

for annot in annotations['annotations']:
    caption = '<start> ' + annot['caption'] + ' <end>'
    image_id = annot['image_id']
    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)

    all_img_name_vector.append(full_coco_image_path)
    all_captions.append(caption)
# Shuffle captions and image_names together
# Set a random state
train_captions, img_name_vector = shuffle(all_captions,
                                          all_img_name_vector,
                                          random_state=1)

# Select the first 30000 captions from the shuffled set
num_examples = 30000
train_captions = train_captions[:num_examples]
img_name_vector = img_name_vector[:num_examples]
len(train_captions), len(all_captions)
(30000, 414113)
Preprocess the images using InceptionV3
Next, we will use InceptionV3 (pretrained on Imagenet) to classify each image. We will extract features from the last convolutional layer.
First, we need to convert the images into the format InceptionV3 expects by:
* Resizing the image to (299, 299)
* Using the preprocess_input method to normalize the pixels to the range of -1 to 1 (to match the format of the images used to train InceptionV3).
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
Initialize InceptionV3 and load the pretrained Imagenet weights
To do so, we'll create a tf.keras model where the output layer is the last convolutional layer in the InceptionV3 architecture.
* Each image is forwarded through the network, and the vector that we get at the end is stored in a dictionary (image_name --> feature_vector).
* We use the last convolutional layer because we are using attention in this example. The shape of the output of this layer is 8x8x2048.
* We avoid doing this during training so it does not become a bottleneck.
* After all the images are passed through the network, we pickle the dictionary and save it to disk.
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
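To confirm the 8x8x2048 shape mentioned above, you can run a dummy batch through the extraction model. This quick check is not part of the original notebook:
# Optional: a zero-filled image of the expected input size yields an 8x8x2048 feature map
dummy = tf.zeros((1, 299, 299, 3))
print(image_features_extract_model(dummy).shape)  # (1, 8, 8, 2048)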
Caching the features extracted from InceptionV3
We will pre-process each image with InceptionV3 and cache the output to disk. Caching the output in RAM would be faster but memory intensive, requiring 8 * 8 * 2048 floats per image. At the time of writing, this would exceed the memory limits of Colab (these limits may change, but currently each instance appears to have about 12GB of memory).
Performance could be improved with a more sophisticated caching strategy (for example, by sharding the images to reduce random-access disk I/O), but that would require more code; a rough sketch of one option is shown below.
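The sketch below is purely hypothetical and not used elsewhere in this notebook. It assumes you have already collected the per-image feature arrays into a dictionary (features_by_path) and simply bundles them into a few compressed shard files, so that later reads hit a small number of large files instead of thousands of tiny ones. The shard size and file naming are made up for illustration.
# Hypothetical sharded cache (not part of the original tutorial)
SHARD_SIZE = 1024  # arbitrary shard size, chosen only for illustration

def save_feature_shards(paths, features_by_path, out_dir='./feature_shards'):
    os.makedirs(out_dir, exist_ok=True)
    for start in range(0, len(paths), SHARD_SIZE):
        shard_paths = paths[start:start + SHARD_SIZE]
        # Key each array by the image file name (without extension) so it can
        # be looked up again after the shard is loaded with np.load
        arrays = {os.path.splitext(os.path.basename(p))[0]: features_by_path[p]
                  for p in shard_paths}
        shard_file = os.path.join(out_dir, 'shard_%05d.npz' % (start // SHARD_SIZE))
        np.savez_compressed(shard_file, **arrays)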
This takes about 10 minutes to run in Colab with a GPU. If you'd like to see a progress bar, install tqdm (!pip install tqdm), import it (from tqdm import tqdm), and change this line:
for img, path in image_dataset:
to:
for img, path in tqdm(image_dataset):
# Get the unique images
encode_train = sorted(set(img_name_vector))

# Feel free to change batch_size according to your system configuration
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(
    load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)
for img, path in image_dataset:
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features,
                                (batch_features.shape[0], -1, batch_features.shape[3]))

    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())
Preprocess and tokenize the captions
First, we'll tokenize the captions (for example, by splitting on spaces). This gives us a vocabulary of all of the unique words in the data (e.g., "surfing", "football", and so on).
Next, we'll limit the vocabulary size to the top 5,000 words to save memory, and replace all other words with the token "UNK" (for unknown).
Finally, we create word-to-index and index-to-word mappings.
We then pad all sequences to be the same length as the longest one.
# Find the maximum length of any caption in the dataset
def calc_max_length(tensor):
    return max(len(t) for t in tensor)
# The steps above are a general process for handling text
# Choose the top 5000 words from the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(train_captions)
train_seqs = tokenizer.texts_to_sequences(train_captions)
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
# Create the tokenized vectors
train_seqs = tokenizer.texts_to_sequences(train_captions)
# Pad each vector to the max_length of the captions
# If you do not provide a max_length value, pad_sequences calculates it automatically
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
# Calculate the max_length,
# which is used to store the attention weights
max_length = calc_max_length(train_seqs)
Split the data into training and testing
# Create training and validation sets using an 80-20 split
img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,
                                                                    cap_vector,
                                                                    test_size=0.2,
                                                                    random_state=0)
len(img_name_train), len(cap_train), len(img_name_val), len(cap_val)
(24000, 24000, 6000, 6000)
Our images and captions are ready! Next, let's create a tf.data dataset to use for training our model.
# Feel free to change these parameters according to your system's configuration
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
num_steps = len(img_name_train) // BATCH_SIZE
# Shape of the vector extracted from InceptionV3 is (64, 2048)
# These two variables represent that vector shape
features_shape = 2048
attention_features_shape = 64
# Load the numpy files
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8')+'.npy')
    return img_tensor, cap
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))

# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
    map_func, [item1, item2], [tf.float32, tf.int32]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Shuffle and batch
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
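As an optional sanity check (not part of the original notebook), you can pull a single batch from the pipeline and confirm its shapes: with the settings above, the cached image features come out as (BATCH_SIZE, 64, 2048) and the captions as (BATCH_SIZE, max_length).
# Optional: inspect one batch to verify the pipeline output
for img_tensor, target in dataset.take(1):
    print(img_tensor.shape)  # (BATCH_SIZE, 64, 2048)
    print(target.shape)      # (BATCH_SIZE, max_length)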
Model
Fun fact: the decoder below is identical to the one used in the example for Neural Machine Translation with Attention (https://tensorflow.google.cn/alpha/tutorials/sequences/nmt_with_attention?hl=zh-CN).
The model architecture is inspired by the Show, Attend and Tell paper (https://arxiv.org/pdf/1502.03044.pdf?hl=zh-CN).
In this example, we extract the features from the lower convolutional layer of InceptionV3, giving us a vector of shape (8, 8, 2048).
We squash that to a shape of (64, 2048).
This vector is then passed through the CNN encoder, which consists of a single fully connected layer.
The RNN (here a GRU) attends over the image to predict the next word (the attention step is summarized in pseudocode below).
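In pseudocode (this summary is not part of the original notebook, but it uses the same layer names as the BahdanauAttention class that follows):
score = V(tanh(W1(features) + W2(hidden_with_time_axis)))
attention_weights = softmax(score, axis=1)
context_vector = reduce_sum(attention_weights * features, axis=1)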
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features (CNN_encoder output) shape == (batch_size, 64, embedding_dim)

        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # score shape == (batch_size, 64, hidden_size)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

        # attention_weights shape == (batch_size, 64, 1)
        # we get 1 at the last axis because we are applying score to self.V
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
class CNN_Encoder(tf.keras.Model):
    # Since we have already extracted the features and serialized them with pickle,
    # this encoder just passes those features through a fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units

        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)

        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden):
        # defining attention as a separate model
        context_vector, attention_weights = self.attention(features, hidden)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # shape == (batch_size, max_length, hidden_size)
        x = self.fc1(output)

        # x shape == (batch_size * max_length, hidden_size)
        x = tf.reshape(x, (-1, x.shape[2]))

        # output shape == (batch_size * max_length, vocab)
        x = self.fc2(x)

        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)
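The mask zeroes out the loss at padded positions (token id 0). A tiny, hypothetical illustration (not in the original notebook, using random logits):
# The second position is <pad>, so only the first token contributes to the loss;
# the mean is still taken over both positions.
real_example = tf.constant([4, 0])
pred_example = tf.random.uniform((2, vocab_size))
per_token = loss_object(real_example, pred_example)   # shape (2,)
print(loss_function(real_example, pred_example))      # equals per_token[0] / 2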
Checkpoint
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
Training
We extract the features stored in the respective .npy files and then pass those features through the encoder.
The encoder output, the hidden state (initialized to 0), and the decoder input (which is the start token) are passed to the decoder.
The decoder returns the predictions and the decoder hidden state.
The decoder hidden state is then passed back into the model, and the predictions are used to calculate the loss.
We use teacher forcing to decide the next input to the decoder.
Teacher forcing is the technique where the target word is passed as the next input to the decoder.
The final step is to calculate the gradients, apply them with the optimizer, and backpropagate.
# Adding this in a separate cell because if you run the training cell
# many times, the loss_plot array will be reset
loss_plot = []
@tf.function
def train_step(img_tensor, target):
    loss = 0

    # Initializing the hidden state for each batch
    # because the captions are not related from image to image
    hidden = decoder.reset_state(batch_size=target.shape[0])

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)

        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden, _ = decoder(dec_input, features, hidden)

            loss += loss_function(target[:, i], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = (loss / int(target.shape[1]))

    trainable_variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, trainable_variables)

    optimizer.apply_gradients(zip(gradients, trainable_variables))

    return loss, total_loss
EPOCHS = 20

for epoch in range(start_epoch, EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(
                epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
    # storing the epoch-end loss value to plot later
    loss_plot.append(total_loss / num_steps)

    if epoch % 5 == 0:
        ckpt_manager.save()

    print('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                        total_loss/num_steps))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Epoch 1 Batch 0 Loss 2.0556
Epoch 1 Batch 100 Loss 1.0668
Epoch 1 Batch 200 Loss 0.8879
Epoch 1 Batch 300 Loss 0.8524
Epoch 1 Loss 1.009767
Time taken for 1 epoch 256.95692324638367 sec
Epoch 2 Batch 0 Loss 0.8081
Epoch 2 Batch 100 Loss 0.7681
Epoch 2 Batch 200 Loss 0.6946
Epoch 2 Batch 300 Loss 0.7042
Epoch 2 Loss 0.756167
Time taken for 1 epoch 186.68594098091125 sec
Epoch 3 Batch 0 Loss 0.6851
Epoch 3 Batch 100 Loss 0.6817
Epoch 3 Batch 200 Loss 0.6316
Epoch 3 Batch 300 Loss 0.6391
Epoch 3 Loss 0.679992
Time taken for 1 epoch 186.36522102355957 sec
Epoch 4 Batch 0 Loss 0.6381
Epoch 4 Batch 100 Loss 0.6314
Epoch 4 Batch 200 Loss 0.5915
Epoch 4 Batch 300 Loss 0.5961
Epoch 4 Loss 0.635389
Time taken for 1 epoch 186.6236436367035 sec
Epoch 5 Batch 0 Loss 0.5991
Epoch 5 Batch 100 Loss 0.5896
Epoch 5 Batch 200 Loss 0.5607
Epoch 5 Batch 300 Loss 0.5670
Epoch 5 Loss 0.602497
Time taken for 1 epoch 187.06984400749207 sec
Epoch 6 Batch 0 Loss 0.5679
Epoch 6 Batch 100 Loss 0.5558
Epoch 6 Batch 200 Loss 0.5350
Epoch 6 Batch 300 Loss 0.5461
Epoch 6 Loss 0.575848
Time taken for 1 epoch 187.72310757637024 sec
Epoch 7 Batch 0 Loss 0.5503
Epoch 7 Batch 100 Loss 0.5283
Epoch 7 Batch 200 Loss 0.5120
Epoch 7 Batch 300 Loss 0.5242
Epoch 7 Loss 0.551446
Time taken for 1 epoch 187.74794459342957 sec
Epoch 8 Batch 0 Loss 0.5432
Epoch 8 Batch 100 Loss 0.5078
Epoch 8 Batch 200 Loss 0.5003
Epoch 8 Batch 300 Loss 0.4915
Epoch 8 Loss 0.529145
Time taken for 1 epoch 186.81623315811157 sec
Epoch 9 Batch 0 Loss 0.5156
Epoch 9 Batch 100 Loss 0.4842
Epoch 9 Batch 200 Loss 0.4923
Epoch 9 Batch 300 Loss 0.4677
Epoch 9 Loss 0.509899
Time taken for 1 epoch 189.49438571929932 sec
Epoch 10 Batch 0 Loss 0.4995
Epoch 10 Batch 100 Loss 0.4710
Epoch 10 Batch 200 Loss 0.4750
Epoch 10 Batch 300 Loss 0.4601
Epoch 10 Loss 0.492096
Time taken for 1 epoch 189.16131472587585 sec
Epoch 11 Batch 0 Loss 0.4797
Epoch 11 Batch 100 Loss 0.4495
Epoch 11 Batch 200 Loss 0.4552
Epoch 11 Batch 300 Loss 0.4408
Epoch 11 Loss 0.474645
Time taken for 1 epoch 190.57548332214355 sec
Epoch 12 Batch 0 Loss 0.4787
Epoch 12 Batch 100 Loss 0.4315
Epoch 12 Batch 200 Loss 0.4504
Epoch 12 Batch 300 Loss 0.4293
Epoch 12 Loss 0.457647
Time taken for 1 epoch 190.24215531349182 sec
Epoch 13 Batch 0 Loss 0.4621
Epoch 13 Batch 100 Loss 0.4107
Epoch 13 Batch 200 Loss 0.4271
Epoch 13 Batch 300 Loss 0.4133
Epoch 13 Loss 0.442507
Time taken for 1 epoch 187.96875071525574 sec
Epoch 14 Batch 0 Loss 0.4383
Epoch 14 Batch 100 Loss 0.3987
Epoch 14 Batch 200 Loss 0.4239
Epoch 14 Batch 300 Loss 0.3913
Epoch 14 Loss 0.429215
Time taken for 1 epoch 185.89738130569458 sec
Epoch 15 Batch 0 Loss 0.4121
Epoch 15 Batch 100 Loss 0.3933
Epoch 15 Batch 200 Loss 0.4079
Epoch 15 Batch 300 Loss 0.3788
Epoch 15 Loss 0.415965
Time taken for 1 epoch 186.6773328781128 sec
Epoch 16 Batch 0 Loss 0.4062
Epoch 16 Batch 100 Loss 0.3752
Epoch 16 Batch 200 Loss 0.3947
Epoch 16 Batch 300 Loss 0.3715
Epoch 16 Loss 0.402814
Time taken for 1 epoch 186.04795384407043 sec
Epoch 17 Batch 0 Loss 0.3793
Epoch 17 Batch 100 Loss 0.3604
Epoch 17 Batch 200 Loss 0.3941
Epoch 17 Batch 300 Loss 0.3504
Epoch 17 Loss 0.391162
Time taken for 1 epoch 187.62019681930542 sec
Epoch 18 Batch 0 Loss 0.3685
Epoch 18 Batch 100 Loss 0.3496
Epoch 18 Batch 200 Loss 0.3744
Epoch 18 Batch 300 Loss 0.3480
Epoch 18 Loss 0.382786
Time taken for 1 epoch 185.68778085708618 sec
Epoch 19 Batch 0 Loss 0.3608
Epoch 19 Batch 100 Loss 0.3384
Epoch 19 Batch 200 Loss 0.3500
Epoch 19 Batch 300 Loss 0.3229
Epoch 19 Loss 0.371033
Time taken for 1 epoch 185.8159191608429 sec
Epoch 20 Batch 0 Loss 0.3568
Epoch 20 Batch 100 Loss 0.3288
Epoch 20 Batch 200 Loss 0.3357
Epoch 20 Batch 300 Loss 0.2945
Epoch 20 Loss 0.358618
Time taken for 1 epoch 186.8766734600067 sec
plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()
Important note!
The evaluate function is similar to the training loop, except that we do not use teacher forcing here. The input to the decoder at each time step is its previous prediction, along with the hidden state and the encoder output.
Stop predicting when the model predicts the end token.
And store the attention weights for every time step.
def evaluate(image):
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.argmax(predictions[0]).numpy()
        result.append(tokenizer.index_word[predicted_id])

        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot
def plot_attention(image, result, attention_plot):
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(10, 10))

    len_result = len(result)
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(len_result//2, len_result//2, l+1)
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())

    plt.tight_layout()
    plt.show()
# Captions on the validation set
rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])

result, attention_plot = evaluate(image)

print('Real Caption:', real_caption)
print('Prediction Caption:', ' '.join(result))
plot_attention(image, result, attention_plot)
# opening the image
Image.open(img_name_val[rid])
Real Caption: <start> a man gets ready to hit a ball with a bat <end>
Prediction Caption: a baseball player begins to bat <end>
Try it on your own images
For fun, below we've provided a method you can use to caption your own images with the model we've just trained. Keep in mind that it was trained on a relatively small amount of data, and your images may be different from the training data (so be prepared for strange results!)
image_url = 'https://tensorflow.org/images/surf.jpg'
image_extension = image_url[-4:]
image_path = tf.keras.utils.get_file('image'+image_extension,
                                     origin=image_url)

result, attention_plot = evaluate(image_path)
print('Prediction Caption:', ' '.join(result))
plot_attention(image_path, result, attention_plot)
# opening the image
Image.open(image_path)
Prediction Caption: a man riding a surf board in the water <end>
Next steps
Congrats! You've just trained an image captioning model with attention. Next, we recommend taking a look at the example Neural Machine Translation with Attention, which uses a similar architecture to translate between Spanish and English sentences. You can also experiment with training the code in this notebook on a different dataset.