
[Pre-Training] Reading the Transformers Source Code, with Hands-on Practice


This article walks through the source code of HuggingFace's open-source transformers library, using BERT as the example, and then does some hands-on practice.
The PyTorch implementation is used throughout (the TF 2.0 code follows almost the same style). It covers the Transformer Encoder used by BERT, its pre-training tasks, and its fine-tuning tasks.
Finally, it runs a few simple experiments with a pre-trained BERT: producing sentence embeddings, predicting a masked target word, and doing extractive question answering.
The article is aimed at BERT newcomers and assumes the reader has already read the original BERT paper.

 

1. Core Components

 

Transformers: State-of-the-art Natural Language Processing

 

As described in the paper above, the transformers library is built around three core components:

 

Configuration: the configuration class, usually inheriting from PretrainedConfig. It stores the hyperparameters of a model or tokenizer, such as vocabulary size, hidden dimension, and dropout rate. Configuration classes are mainly used to reproduce a model.

 

Tokenizer: the tokenization class, usually inheriting from PreTrainedTokenizer. It mainly stores the vocabulary and the token-to-index mapping. It also handles model-specific behavior, such as the special tokens ([SEP], [CLS], etc.), token type ids, and maximum sequence length, so a tokenizer is usually paired one-to-one with a model; for example, the BERT model has BertTokenizer. Tokenizers can work at the word level, the character level, or the subword level; subword methods include Byte-Pair Encoding and WordPiece. Subword tokenization is the mainstream choice for transformer-based models: it handles OOV words effectively and lets the model learn relationships between word pieces. In short, the tokenizer encodes raw text into the inputs the model expects.
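As a minimal sketch of what the tokenizer does to raw text (the exact WordPiece pieces depend on the checkpoint's vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# WordPiece splits rarer words into sub-words prefixed with "##" (output is vocabulary-dependent)
print(tokenizer.tokenize("Tokenizers handle out-of-vocabulary words"))

# encode_plus adds the special tokens and maps tokens to ids
enc = tokenizer.encode_plus("Tokenizers handle out-of-vocabulary words")
print(enc["input_ids"])       # starts with 101 ([CLS]) and ends with 102 ([SEP])
print(enc["token_type_ids"])  # all 0 for a single sentence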

 

Model: the model class. It encapsulates the computation graph of a pre-trained model and follows a common pattern: map token ids through the embedding matrix, encode with a stack of self-attention layers, and finish with a task-specific layer for prediction. Models can also be extended flexibly for downstream tasks by adding task-specific heads on top of a pre-trained base model, e.g. language-model heads or sequence-classification heads. In the codebase these are named XXXForSequenceClassification or XXXForMaskedLM, where XXX is the model name (e.g. Bert) and the suffix is the pre-training task (MaskedLM) or the downstream task type (SequenceClassification).

 

In addition, on top of these three classes the library provides the AutoConfig, AutoTokenizer and AutoModel wrappers, which locate the concrete classes from the model name alone; for instance, 'bert-base-cased' is enough to load the BERT configuration, tokenizer and model. This is very convenient, and when getting started we usually load the tokenizer and model through these Auto classes.
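For example, a minimal sketch of loading everything by name (the cache directories used later in this article are optional):

from transformers import AutoConfig, AutoTokenizer, AutoModel

config = AutoConfig.from_pretrained("bert-base-cased")        # resolves to BertConfig
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # resolves to BertTokenizer
model = AutoModel.from_pretrained("bert-base-cased")          # resolves to BertModel

print(config.vocab_size, config.hidden_size, config.num_hidden_layers)  # 28996 768 12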

 

2. Transformer-based Pre-trained model

 

All Transformer-based pre-trained models implemented so far:

 

CONFIG_MAPPING = OrderedDict(
    [
        ("retribert", RetriBertConfig,),
        ("t5", T5Config,),
        ("mobilebert", MobileBertConfig,),
        ("distilbert", DistilBertConfig,),
        ("albert", AlbertConfig,),
        ("camembert", CamembertConfig,),
        ("xlm-roberta", XLMRobertaConfig,),
        ("marian", MarianConfig,),
        ("mbart", MBartConfig,),
        ("bart", BartConfig,),
        ("reformer", ReformerConfig,),
        ("longformer", LongformerConfig,),
        ("roberta", RobertaConfig,),
        ("flaubert", FlaubertConfig,),
        ("bert", BertConfig,),
        ("openai-gpt", OpenAIGPTConfig,),
        ("gpt2", GPT2Config,),
        ("transfo-xl", TransfoXLConfig,),
        ("xlnet", XLNetConfig,),
        ("xlm", XLMConfig,),
        ("ctrl", CTRLConfig,),
        ("electra", ElectraConfig,),
        ("encoder-decoder", EncoderDecoderConfig,),
    ]
)
 

These are the models implemented by the library, including well-known pre-trained language models such as BERT, GPT-2, XLNet, RoBERTa, ALBERT, ELECTRA and T5.

 

Below, BERT is used as the example for walking through the related source code. It is worth reading the comments I added to the code carefully, especially the step-by-step breakdown, and keeping an eye on the structure, i.e. how the different classes relate to each other.

 

2.1 BertModel Transformer

 

BertModel, "the bare Bert Model transformer outputting raw hidden-states without any specific head on top". The goal of this class is simply to use the Transformer to produce encoding vectors for a sequence; this abstraction exists so the same encoder can serve different pre-training tasks. For example, the class for the MLM pre-training task is BertForMaskedLM, which holds a BertModel instance as a member: BertModel encodes the sequence, and the MaskedLM task is then built on top of the resulting hidden states for training or prediction.

 

The core constructor and forward pass look like this:

 

# BertModel constructor
def __init__(self, config):
    super().__init__(config)
    self.config = config
    self.embeddings = BertEmbeddings(config)
    self.encoder = BertEncoder(config)
    self.pooler = BertPooler(config)
    self.init_weights()

def forward(self, input_ids=None, attention_mask=None, token_type_ids=None,
            position_ids=None, head_mask=None, inputs_embeds=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=None, output_hidden_states=None,):
    # ignore some code here...
    # step 1: obtain sequence embedding, BertEmbeddings
    embedding_output = self.embeddings(
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids,
        inputs_embeds=inputs_embeds)
    # step 2: transformer encoder, BertEncoder
    encoder_outputs = self.encoder(
        embedding_output,
        attention_mask=extended_attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_extended_attention_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )
    sequence_output = encoder_outputs[0]
    # step 3: pooling to obtain sequence-level encoding, BertPooler
    pooled_output = self.pooler(sequence_output)
    outputs = (sequence_output, pooled_output,) + encoder_outputs[1:]
    return outputs  # sequence_output, pooled_output, (hidden_states), (attentions)

 

The parameters are:

input_ids: token ids including the special tokens ([CLS], [SEP]), e.g. tensor([[ 101, 1188, 1110, 1126, 7758, 1859, 102]]), where 101 and 102 are the ids of [CLS] and [SEP] respectively. Its shape is B x S, where B is the batch size and S is the sequence length; in this example it is 1 x 7.

inputs_embeds: mutually exclusive with input_ids. inputs_embeds directly supplies the token embeddings for the input tokens, e.g. word2vec word embeddings, so there is no need to look up input_ids in the (by default randomly initialized) embedding matrix.

 

attention_mask: used by self-attention, optional, same shape as input_ids. When the encoder-side sequence attends to itself it defaults to all ones, i.e. every position may be attended to; on the decoder side the default is a lower-triangular-like matrix (with ones on the diagonal as well).

 

token_type_ids: optional, same shape as input_ids. For a single-sentence input it is all zeros; for a sentence-pair input it is 0 for the first sentence and 1 for the second.

 

head_mask: used by self-attention, optional. Heads you want to keep are set to 1 (or the whole argument is None); heads you want to disable are set to 0. Its shape is [num_heads] or [num_hidden_layers x num_heads], i.e. the mask can be set per head and per layer.

 

position_ids: optional position ids; by default simply 0 to S-1.

 

encoder_hidden_states / encoder_attention_mask: used when the decoder performs cross-attention over the encoder; K and V are then computed from encoder_hidden_states.

 

In the forward pass:

 

Step 1: obtain the sequence embeddings, handled by BertEmbeddings (introduced below).

 

Step 2: encode the sequence with the Transformer, handled by BertEncoder (introduced below), producing token-level encodings of the sequence.

 

Step 3: apply a non-linear transformation to the hidden state of [CLS] to obtain a sequence-level encoding, handled by BertPooler (introduced below).

 

2.2 BertEmbeddings

 

Step 1: obtain the sequence embeddings.

 

token embedding + position embedding + segment embedding

 

embedding_output = self.embeddings(
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds) # self.embeddings is a BertEmbeddings instance

 

Token embeddings are obtained from input_ids or from inputs_embeds.

 

Position embeddings are obtained from position_ids; absolute position encoding is used here.

 

Segment embeddings are obtained from token_type_ids.

 

# BertEmbeddings core forward code:
def forward(self, input_ids=None, token_type_ids=None,
            position_ids=None, inputs_embeds=None):
    # ignore some codes here...
    # step 1: token embeddings
    if inputs_embeds is None:
        inputs_embeds = self.word_embeddings(input_ids) # token embeddings
    # step 2: position embeddings
    position_embeddings = self.position_embeddings(position_ids)
    # step 3: segment embeddings
    token_type_embeddings = self.token_type_embeddings(token_type_ids)
    embeddings = inputs_embeds + position_embeddings + token_type_embeddings
    embeddings = self.LayerNorm(embeddings)
    embeddings = self.dropout(embeddings)
    return embeddings

 

LayerNorm and dropout are also applied here. The output embeddings have shape B x S x D, with D = 768 by default; they are what BertEncoder below receives as its hidden_states input.

 

2.3 BertEncoder

 

Step 2: encode the sequence with the Transformer.

 

# self.encoder is a BertEncoder instance
encoder_outputs = self.encoder(
     embedding_output, # sequence embeddings, B x S x D
     attention_mask=extended_attention_mask, # used for self-attention over the sequence
     head_mask=head_mask, # used for self-attention over the sequence
     encoder_hidden_states=encoder_hidden_states, # decoder only, cross-attention
     encoder_attention_mask=encoder_extended_attention_mask, # cross-attention
     output_attentions=output_attentions, # whether to return attention weights
     output_hidden_states=output_hidden_states)  # whether to return each layer's hidden states

 

embedding_output: the output of BertEmbeddings, i.e. the embedding of every token of every sequence in the batch.

 

extended_attention_mask: used for self-attention. It is obtained by broadcasting attention_mask from B x S to B x 1 x 1 x S (the extra dimensions broadcast over the H attention heads and the query positions), so that it can simply be added to the attention logits before the softmax: logits + extended_attention_mask. Where attention_mask is 1, extended_attention_mask is 0; where attention_mask is 0, extended_attention_mask is -10000.0 (a very negative number), so after the softmax the probability at masked positions is close to 0, which achieves the masking.
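A minimal sketch of this broadcasting and additive-mask trick (the library builds extended_attention_mask internally; the tensors here are made up for illustration):

import torch

attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])            # B x S, 0 marks padding
# broadcast B x S -> B x 1 x 1 x S so it can be added to B x H x S x S attention logits
extended_attention_mask = attention_mask[:, None, None, :].float()
# 1 -> 0.0 (keep), 0 -> -10000.0 (effectively removed by the softmax)
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0

logits = torch.randn(1, 12, 6, 6)                               # B x H x S x S
probs = torch.softmax(logits + extended_attention_mask, dim=-1)
print(probs[0, 0, 0])   # probabilities at the two padded positions are ~0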

 

head_mask: used for self-attention. The original head_mask is likewise broadcast: before broadcasting its shape is H or L x H; after broadcasting it is L x B x H x S x S, i.e. for every sample, each token's attention over the other tokens can be masked per head, with L x H head-attention maps in total.

 

encoder_hidden_states: optional, used for cross-attention; when the decoder encodes its sequence it receives the encoder's hidden states, of shape B x S x D.

 

encoder_attention_mask: optional, used for cross-attention; the attention mask over the encoder's hidden states, analogous to extended_attention_mask, derived from a B x S mask.

 

output_attentions: bool, whether to return the attention weights; useful for visualizing attention scores.

 

output_hidden_states: bool, whether to return the hidden states of every layer.

 

# BertEncoder consists of 12 BertLayer modules
self.layer = nn.ModuleList([BertLayer(config) for _ in range(config.num_hidden_layers)])

 

# BertEncoder forward, core code
def forward(self, hidden_states,
            attention_mask=None, head_mask=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=False, output_hidden_states=False):
    # ignore some codes here...
    all_hidden_states = ()
    all_attentions = ()
    for i, layer_module in enumerate(self.layer): # 12 BertLayer modules
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)
        # step 1: BertLayer iteration
        layer_outputs = layer_module(
            hidden_states,
            attention_mask,
            head_mask[i],
            encoder_hidden_states,
            encoder_attention_mask,
            output_attentions) # BertLayer forward, the core!
        hidden_states = layer_outputs[0] # override for the next iteration
        if output_attentions:
            all_attentions = all_attentions + (layer_outputs[1],) # keep each layer's attentions, e.g. for visualization
    # Add last layer
    if output_hidden_states:
        all_hidden_states = all_hidden_states + (hidden_states,)
    outputs = (hidden_states,)
    if output_hidden_states:
        outputs = outputs + (all_hidden_states,)
    if output_attentions:
        outputs = outputs + (all_attentions,)
    return outputs  # last-layer hidden state, (all hidden states), (all attentions)

 

2.4 BertLayer

 

The most important part of the code above is the BertLayer iteration inside the loop. Its core code:

 

def forward(self, hidden_states, attention_mask=None, head_mask=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=False,):
    # step 1.0: self-attention; self.attention is a BertAttention instance
    self_attention_outputs = self.attention(
        hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
    )
    attention_output = self_attention_outputs[0]
    outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
    # step 1.1: if this is a decoder, do cross-attention. The output of step 1.0 is the
    # decoder-side self-attention result and becomes the input of step 1.1; the output of
    # step 1.1 is the decoder-side cross-attention result. self.crossattention is also a BertAttention instance.
    if self.is_decoder and encoder_hidden_states is not None:
        cross_attention_outputs = self.crossattention(
            attention_output,
            attention_mask,
            head_mask,
            encoder_hidden_states,
            encoder_attention_mask,
            output_attentions,
        )
        attention_output = cross_attention_outputs[0]
        outputs = outputs + cross_attention_outputs[1:]  # add cross attentions if we output attention weights
    # step 2: intermediate transformation, the feed-forward network (FFN) of the original paper
    intermediate_output = self.intermediate(attention_output)
    # step 3: output projection with a skip connection
    layer_output = self.output(intermediate_output, attention_output)
    outputs = (layer_output,) + outputs
    return outputs

 

Step 1 splits into two sub-steps. For an encoder (BERT only uses the encoder), only step 1.0 is active, i.e. the input sequence only attends to itself. For a seq2seq model that also uses the transformer decoder, step 1.0 is the decoder's self-attention, whose attention_mask is effectively a lower-triangular-like matrix (sketched below), and step 1.1 then takes the self-attended decoder hidden states and performs cross-attention over encoder_hidden_states. This is the key point of this part.
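The "lower-triangular-like" decoder mask mentioned above can be sketched as follows (a simplified illustration of the idea, not the library's exact code path):

import torch

S = 5
# position i may attend to itself and to positions <= i
causal_mask = torch.tril(torch.ones(S, S))            # lower triangle, diagonal included
# converted to the additive form in the same way as attention_mask above
additive_causal_mask = (1.0 - causal_mask) * -10000.0
print(causal_mask)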

 

2.4.1 BertAttention

 

BertAttention is the class behind the attention instance above, and it is the core of the transformer's self-attention. It contains a BertSelfAttention member and a BertSelfOutput member.

 

class BertAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self = BertSelfAttention(config)
        self.output = BertSelfOutput(config)

    def forward(self, hidden_states, attention_mask=None,
                head_mask=None, encoder_hidden_states=None,
                encoder_attention_mask=None, output_attentions=False):
        # step 1: self-attention, B x S x D
        self_outputs = self.self(
            hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions)
        # step 2: skip-connection, B x S x D
        attention_output = self.output(self_outputs[0], hidden_states)
        outputs = (attention_output,) + self_outputs[1:]  # add attentions if we output them
        return outputs

 

BertSelfAttention implements the attention itself. It can be instantiated as encoder-side self-attention, or as decoder-side self-attention, in which case attention_mask is non-empty (a lower-triangular-like mask matrix). It can also be instantiated as decoder-side cross-attention: hidden_states is then the decoder sequence's self-attention result, and the encoder-side encoder_hidden_states and encoder_attention_mask are passed in to perform the cross-attention.

 

def forward(self, hidden_states, attention_mask=None, head_mask=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=False):
    # step 1: map Query/Key/Value to sub-spaces
    # step 1.1: query mapping
    mixed_query_layer = self.query(hidden_states) # B x S x (H*d)
    # If this is instantiated as a cross-attention module, the keys
    # and values come from an encoder; the attention mask needs to be
    # such that the encoder's padding tokens are not attended to.
    # step 1.2: key/value mapping
    if encoder_hidden_states is not None:
        mixed_key_layer = self.key(encoder_hidden_states) # B x S x (H*d)
        mixed_value_layer = self.value(encoder_hidden_states)
        attention_mask = encoder_attention_mask
    else:
        mixed_key_layer = self.key(hidden_states) # B x S x (H*d)
        mixed_value_layer = self.value(hidden_states)
    query_layer = self.transpose_for_scores(mixed_query_layer) # B x H x S x d
    key_layer = self.transpose_for_scores(mixed_key_layer) # B x H x S x d
    value_layer = self.transpose_for_scores(mixed_value_layer) # B x H x S x d
    # step 2: compute attention scores
    # step 2.1: raw attention scores
    # B x H x S x d , B x H x d x S -> B x H x S x S
    # Take the dot product between "query" and "key" to get the raw attention scores.
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
    # step 2.2: mask if necessary
    if attention_mask is not None:
        # Apply the attention mask, B x H x S x S
        attention_scores = attention_scores + attention_mask
    # step 2.3: Normalize the attention scores to probabilities, B x H x S x S
    attention_probs = nn.Softmax(dim=-1)(attention_scores)
    # This is actually dropping out entire tokens to attend to, which might
    # seem a bit unusual, but is taken from the original Transformer paper.
    attention_probs = self.dropout(attention_probs)
    # Mask heads if we want to
    if head_mask is not None:
        attention_probs = attention_probs * head_mask
    # B x H x S x S , B x H x S x d -> B x H x S x d
    # step 4: aggregate values by attention probs to form context encodings
    context_layer = torch.matmul(attention_probs, value_layer)
    # B x S x H x d
    context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
    # B x S x D
    new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
    # B x S x D, equivalent to concatenating the heads
    context_layer = context_layer.view(*new_context_layer_shape)
    outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
    return outputs

 

The D = 768 hidden dimensions are split evenly across heads: with H = 12 heads, each head works in d = 64 dimensions. In the implementation all heads are computed together. The input to this step is the hidden_states argument (shape B x S x D), and the query/key/value projections are

Q = hidden_states * W_Q, K = hidden_states * W_K, V = hidden_states * W_V, each of shape B x S x (H*d).

transpose_for_scores then reshapes each of them from B x S x (H*d) to B x H x S x d, so that attention can be computed per head.

The raw attention scores are attention_scores = Q * K^T / sqrt(d), of shape B x H x S x S: for every sample and every head, the score of each token against every other token in the sequence. For decoder-side self-attention, the precomputed attention_mask, effectively a lower-triangular matrix (diagonal included) converted to the additive form described earlier, is added to these logits.

The scores are normalized by a softmax over the last dimension to give attention_probs (B x H x S x S). Each token then aggregates the value vectors of all tokens in the sequence, weighted by its attention probabilities, to obtain its context encoding:

context_layer = attention_probs * V, of shape B x H x S x d, which is permuted to B x S x H x d and reshaped back to B x S x D, i.e. the multi-head outputs are concatenated. Here D = 768, H = 12, d = 64.

BertSelfOutput

BertSelfOutput applies a linear layer and dropout to context_layer and then adds a skip connection back to the input, followed by LayerNorm: attention_output = LayerNorm(dropout(dense(context_layer)) + hidden_states), of shape B x S x D.
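A small sketch of the shape bookkeeping above, using B=1, S=7, H=12, d=64 (so D = H*d = 768); transpose_for_scores in the source performs the same view/permute, and the random tensors here stand in for the projected Q/K/V:

import math
import torch

B, S, H, d = 1, 7, 12, 64

def transpose_for_scores(x):
    # B x S x (H*d) -> B x H x S x d
    return x.view(B, S, H, d).permute(0, 2, 1, 3)

q = transpose_for_scores(torch.randn(B, S, H * d))
k = transpose_for_scores(torch.randn(B, S, H * d))
v = transpose_for_scores(torch.randn(B, S, H * d))

scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d)   # B x H x S x S
probs = torch.softmax(scores, dim=-1)
context = torch.matmul(probs, v)                                # B x H x S x d
context = context.permute(0, 2, 1, 3).reshape(B, S, H * d)      # B x S x D (heads concatenated)
print(context.shape)    # torch.Size([1, 7, 768])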

 

2.4.2 BertIntermediate

 

BertIntermediate computes intermediate_output = gelu(attention_output * W_1 + b_1), where W_1 has shape D x D_ff and D_ff (intermediate_size) defaults to 3072; the GELU activation is used.

 

2.4.3 BertOutput

 

BertOutput computes layer_output = LayerNorm(dropout(intermediate_output * W_2 + b_2) + attention_output), where W_2 has shape D_ff x D, so the output is back to shape B x S x D.

 

This output becomes the input of the next BertLayer, and so on through all 12 BertLayer modules; the final output is the encoder's last hidden states. A sketch of the feed-forward sub-layer follows below.
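A minimal sketch of the feed-forward sub-layer that BertIntermediate and BertOutput implement together (dimensions follow the defaults above; the real modules take their dropout rate and LayerNorm parameters from the config):

import torch
import torch.nn as nn

D, D_ff = 768, 3072

class FeedForwardSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.intermediate = nn.Linear(D, D_ff)       # BertIntermediate.dense
        self.act = nn.GELU()
        self.output = nn.Linear(D_ff, D)             # BertOutput.dense
        self.dropout = nn.Dropout(0.1)
        self.LayerNorm = nn.LayerNorm(D, eps=1e-12)

    def forward(self, attention_output):                     # B x S x D
        h = self.act(self.intermediate(attention_output))    # B x S x D_ff
        h = self.dropout(self.output(h))                      # B x S x D
        return self.LayerNorm(h + attention_output)           # residual + LayerNorm

print(FeedForwardSketch()(torch.randn(1, 7, D)).shape)   # torch.Size([1, 7, 768])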

 

2.5 BertPooler

 

Step 3: obtain the sequence-level embedding.

 

Take the output of BertEncoder above, of shape B x S x D. For each sequence in the batch, the first token along the S dimension is the hidden state of [CLS], h_CLS = sequence_output[:, 0]. The sequence-level representation is then pooled_output = tanh(h_CLS * W + b), of shape B x D. This is mainly used when fine-tuning on downstream tasks.

 

def forward(self, hidden_states):
    # We "pool" the model by simply taking the hidden state corresponding
    # to the first token.
    first_token_tensor = hidden_states[:, 0]
    pooled_output = self.dense(first_token_tensor)
    pooled_output = self.activation(pooled_output) # nn.Tanh
    return pooled_output

 

3. Bert Pre-training Tasks

 

The previous sections covered BERT's core Transformer encoder; this section covers BERT's pre-training tasks.

 

3.1 BertForMaskedLM

 

"Bert Model with a language modeling head on top." BertModel, introduced above, is mainly responsible for encoding the sequence; BertForMaskedLM takes the sequence encoding produced by BertModel and pre-trains it with the masked language modeling (MLM) task.

 

BERT uses only the Transformer encoder and is pre-trained on the encoder's sequence encodings; MLM is what allows the encoder to use bidirectional self-attention.

 

The constructor of BertForMaskedLM:

 

def __init__(self, config):
    super().__init__(config)
    assert (
        not config.is_decoder
    ), "If you want to use `BertForMaskedLM` make sure `config.is_decoder=False` for bi-directional self-attention."  # is_decoder is False; no decoder is needed
    self.bert = BertModel(config)       # BertModel encodes the sequence
    self.cls = BertOnlyMLMHead(config)  # task-specific head for the MLM multi-class prediction task
    self.init_weights()
 

The core forward code:

 

def forward(self, input_ids=None, attention_mask=None,
            token_type_ids=None, position_ids=None,
            head_mask=None, inputs_embeds=None, labels=None,
            encoder_hidden_states=None, encoder_attention_mask=None,
            output_attentions=None, output_hidden_states=None,
            **kwargs):
    # step 1: obtain sequence encoding from BertModel
    outputs = self.bert(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )
    sequence_output = outputs[0] # B x S x D
    # step 2: output scores for each token in the sequence
    prediction_scores = self.cls(sequence_output) # B x S x V, scores over the vocabulary at every position
    outputs = (prediction_scores,) + outputs[2:]  # Add hidden states and attention if they are here
    # step 3: build the loss; labels: B x S
    if labels is not None:
        loss_fct = CrossEntropyLoss()  # -100 index = padding token
        masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1)) # flatten to (B*S) x V
        outputs = (masked_lm_loss,) + outputs
    return outputs  # (masked_lm_loss), prediction_scores, (hidden_states), (attentions)

 

The parameters are essentially the same as BertModel's, plus a labels argument used to compute the MLM loss.

 

The cls member is a BertOnlyMLMHead (which essentially wraps BertLMPredictionHead). It performs the MLM multi-class prediction: it takes the sequence token-level encoding produced by BertModel, of shape B x S x D, transforms it, and projects it onto the vocabulary.

 

The vocabulary projection matrix has shape D x V, where V is the vocabulary size, so prediction_scores has shape B x S x V.

 

Note in particular the expected form of the labels:

 

labels (torch.LongTensor of shape (batch_size, sequence_length), optional, defaults to None) – Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

 

In other words, positions that should not be predicted get label -100. Usually only the [MASK] positions get a real label and every other position is set to -100, so the loss is computed only for the masked tokens to be predicted. -100 is simply the default value of the ignore_index argument of CrossEntropyLoss.
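A minimal sketch of building such labels by hand for a single masked position (a hand-rolled illustration, not the library's data collator; the non-special token ids are placeholders):

import torch

# token ids for "[CLS] Nice to meet you [SEP]" (101/102/103 are the real special-token ids,
# the other ids are made up)
input_ids = torch.tensor([[101, 8835, 1106, 2283, 1128, 102]])
labels = torch.full_like(input_ids, -100)     # -100 = ignored by CrossEntropyLoss

masked_index = 3
labels[0, masked_index] = input_ids[0, masked_index]   # supervise only the masked position
input_ids[0, masked_index] = 103                        # replace it with [MASK]

# the loss is then computed only at masked_index, e.g.:
# outputs = bert_for_masked_lm(input_ids=input_ids, labels=labels)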

 

3.2 BertForPreTraining

 

Similar to BertForMaskedLM, but with an additional next sentence prediction pre-training task. "Bert Model with two heads on top as done during the pre-training: a masked language modeling head and a next sentence prediction (classification) head."

 

The core code of the corresponding heads:

 

class BertPreTrainingHeads(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.predictions = BertLMPredictionHead(config)
        self.seq_relationship = nn.Linear(config.hidden_size, 2)

    def forward(self, sequence_output, pooled_output):
        prediction_scores = self.predictions(sequence_output)
        seq_relationship_score = self.seq_relationship(pooled_output)
        return prediction_scores, seq_relationship_score

 

BertLMPredictionHead here is the same head used by BertForMaskedLM and provides the MLM loss. The additional seq_relationship head feeds the pooled encoding through a binary linear classifier that predicts whether the second sentence is the actual next sentence, giving a next-sentence loss. The two losses are summed, as sketched below.
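A runnable sketch of how the two losses combine (random tensors stand in for the two head outputs; shapes follow the notation above):

import torch
from torch.nn import CrossEntropyLoss

B, S, V = 2, 8, 28996
prediction_scores = torch.randn(B, S, V)        # MLM head output, B x S x V
seq_relationship_score = torch.randn(B, 2)      # NSP head output, B x 2
masked_lm_labels = torch.full((B, S), -100)     # only masked positions carry real labels
masked_lm_labels[:, 3] = 7
next_sentence_label = torch.tensor([0, 1])

loss_fct = CrossEntropyLoss()
masked_lm_loss = loss_fct(prediction_scores.view(-1, V), masked_lm_labels.view(-1))
next_sentence_loss = loss_fct(seq_relationship_score.view(-1, 2), next_sentence_label.view(-1))
total_loss = masked_lm_loss + next_sentence_loss   # the two losses are simply summed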

 

3.3 BertForNextSentencePrediction

 

"Bert Model with a next sentence prediction (classification) head on top." It only uses the seq_relationship head described above to build the next-sentence loss, so no further detail is needed.

 

4. Bert Fine-tuning Tasks

 

This section covers fine-tuning a pre-trained BERT on downstream tasks. Each fine-tuning model below already adds task-specific parameters on top of BERT; you simply feed it task-specific data and run the optimization to obtain a fine-tuned model.

 

4.1 BertForSequenceClassification

 

A sentence-level task. "Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for GLUE tasks."

 

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels) # number of classes
        self.init_weights()

    # forward takes the same inputs as the pre-training tasks above
    def forward(self, input_ids=None, attention_mask=None,
                token_type_ids=None, position_ids=None,
                head_mask=None, inputs_embeds=None, labels=None,
                output_attentions=None, output_hidden_states=None):
        # step 1: transformer encoding
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states)
        # step 2: use the pooled hidden state corresponding to the [CLS] token
        # B x D
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        # B x N
        logits = self.classifier(pooled_output)
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
        # step 3: build the loss; labels: (B, )
        if labels is not None:
            if self.num_labels == 1:
                # We are doing regression
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            outputs = (loss,) + outputs
        return outputs  # (loss), logits, (hidden_states), (attentions)

 

The code above is very clear. The sequence first goes through BertModel to get its encoding; since this is sentence-level classification, the pooled hidden state of the first [CLS] token is passed through a classification layer to get the class logits, and the loss is then built from the logits and labels. This head is mainly used for sentence-level classification, but it can also be used for sentence-pair classification.

 

4.2 BertForMultipleChoice

 

A sentence-pair-level task. "Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax), e.g. for RocStories/SWAG tasks."

 

Given a prompt and several choices (one correct, the others wrong), the task is to decide which choice is correct. The input is arranged as [[prompt, choice0], [prompt, choice1], ...]. A fully connected layer on top of BertModel's pooled output produces a logit for each "sentence pair" [prompt, choice i]; a softmax over the choices then feeds a cross-entropy loss. A sketch of the input layout follows below.
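A minimal sketch of that input layout (the tokenizer and model names are simply the ones used elsewhere in this article; the classification head is randomly initialized here, so only the shapes matter, not the prediction):

import torch
from transformers import BertTokenizer, BertForMultipleChoice

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
mc_model = BertForMultipleChoice.from_pretrained("bert-base-cased")

prompt = "The man fell asleep because"
choices = ["he was very tired.", "the sky is blue."]

# one [prompt, choice] pair per candidate, padded to a common length
enc = tokenizer.batch_encode_plus([(prompt, c) for c in choices],
                                  return_tensors="pt", padding=True)
# reshape every field to B x num_choices x S, as BertForMultipleChoice expects
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}

logits = mc_model(**inputs)[0]    # B x num_choices
print(logits.shape)               # torch.Size([1, 2])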

 

4.3 BertForTokenClassification

 

A token-level downstream task. "Bert Model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks."

 

def forward(self, input_ids=None, attention_mask=None,
            token_type_ids=None, position_ids=None, head_mask=None,
            inputs_embeds=None, labels=None,
            output_attentions=None, output_hidden_states=None):
    # step 1: Transformer
    outputs = self.bert(input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, position_ids=position_ids,
                        head_mask=head_mask, inputs_embeds=inputs_embeds,
                        output_attentions=output_attentions,
                        output_hidden_states=output_hidden_states)
    # step 2: get sequence-token encoding, B x S x D
    sequence_output = outputs[0]
    # step 3: fine-tuning parameters
    sequence_output = self.dropout(sequence_output)
    # B x S x N
    logits = self.classifier(sequence_output) # nn.Linear(config.hidden_size, config.num_labels)
    outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
    # step 4: build loss; labels: B x S
    if labels is not None:
        loss_fct = CrossEntropyLoss()
        # Only keep active parts of the loss
        if attention_mask is not None:
            active_loss = attention_mask.view(-1) == 1
            active_logits = logits.view(-1, self.num_labels)
            active_labels = torch.where(
                active_loss, labels.view(-1),
                torch.tensor(loss_fct.ignore_index).type_as(labels))
            loss = loss_fct(active_logits, active_labels)
        else:
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        outputs = (loss,) + outputs
    return outputs  # (loss), scores, (hidden_states), (attentions)

 

The code is self-explanatory. This head is mainly used for token-level classification tasks such as NER.

 

4.4 BertForQuestionAnswering

 

A sentence-pair-level task, specifically extractive question answering. "Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits)."

 

class BertForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel(config)
        # num_labels is 2: the downstream parameters for start_position and end_position
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
        self.init_weights()

    # two extra arguments, start_positions and end_positions: the span labels
    # for extractive QA, each of shape (B, )
    def forward(self, input_ids=None, attention_mask=None,
                token_type_ids=None, position_ids=None,
                head_mask=None, inputs_embeds=None,
                start_positions=None, end_positions=None,
                output_attentions=None, output_hidden_states=None):
        # step 1: Transformer encoding
        outputs = self.bert(
            input_ids, # question, passage
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,)
        # B x S x D
        sequence_output = outputs[0]
        # step 2: split to obtain start and end logits
        # B x S x N (N is the number of labels, here N=2)
        logits = self.qa_outputs(sequence_output)
        # after the split: B x S x 1, B x S x 1
        start_logits, end_logits = logits.split(1, dim=-1)
        # B x S, B x S
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)
        outputs = (start_logits, end_logits,) + outputs[2:]
        # step 3: build loss; start_positions, end_positions: (B, )
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions.clamp_(0, ignored_index)
            end_positions.clamp_(0, ignored_index)
            # an S-way classification over the sequence positions
            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            outputs = (total_loss,) + outputs
        return outputs  # (loss), start_logits, end_logits, (hidden_states), (attentions)

 

The code simply passes the sequence token-level hidden states through a linear layer whose two outputs are split into the logits for the predicted start position and end position.

 

5. Bert Practice

 

This part puts BERT into practice, in three steps:

 

Use a pre-trained BERT model to produce embeddings for a target sentence.

 

Use a pre-trained BERT model to predict the real word at the [MASK] position of a target sentence.

 

Use a pre-trained BERT model for extractive question answering.

 

Pre-trained BERT checkpoints currently provided by the library include:

 

bert-base-chinese

 

bert-base-uncased

 

bert-base-cased

 

bert-base-german-cased

 

bert-base-multilingual-uncased

 

bert-base-multilingual-cased

 

bert-large-cased

 

bert-large-uncased

 

bert-large-uncased-whole-word-masking

 

bert-large-cased-whole-word-masking

 

The main differences between these pre-trained checkpoints are:

 

the language of the pre-training corpus (Chinese, English, German, multilingual, etc.);

 

whether the text is cased or uncased;

 

the number of layers;

 

whether pre-training masks individual WordPiece sub-words or whole words.

 

'bert-base-cased' is used in what follows. For the QA part, a version of 'bert-large-uncased-whole-word-masking' already fine-tuned on SQuAD is used for inference.

 

First, load the tokenizer and the model:

 

from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-cased"
# step 1: obtain the tokenizer (a BertTokenizer instance)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, cache_dir='tmp/token')
# step 2: obtain the pre-trained model (a BertModel instance)
model = AutoModel.from_pretrained(MODEL_NAME, cache_dir='tmp/model')

 

A quick look at the tokenizer (transformers.tokenization_bert.BertTokenizer):

 

# 28996 entries in total, including the special tokens: ('[UNK]', 100), ('[PAD]', 0), ('[CLS]', 101), ('[SEP]', 102), ('[MASK]', 103), ...
tokenizer.vocab

 

And the model's network structure:

 

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(28996, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (1): BertLayer(...)
      ...
      (11): BertLayer(...)  # layers (1) through (11) have exactly the same structure as layer (0)
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

 

The structure matches the BertModel source walkthrough above.

 

5.1 Embeddings produced by pre-trained BertModel

 

text = "This is an input example"
# step 1: tokenize, including add special tokens
tokens_info = tokenizer.encode_plus(text, return_tensors="pt") 
for key, value in tokens_info.items():
    print("{}:\n\t{}".format(key, value))
# observe the enriched token sequences
print(tokenizer.convert_ids_to_tokens(tokens_info['input_ids'].squeeze(0).numpy()))
# step 2: BertModel Transformer Encoding
outputs, pooled = model(**tokens_info)
print("Token wise output: {}, Pooled output: {}".format(outputs.shape, pooled.shape))
'''
step 1: outputs:
-----------------------------------------------------------
input_ids:
 tensor([[ 101, 1188, 1110, 1126, 7758, 1859,  102]])
token_type_ids:
 tensor([[0, 0, 0, 0, 0, 0, 0]])
attention_mask:
 tensor([[1, 1, 1, 1, 1, 1, 1]])
['[CLS]', 'This', 'is', 'an', 'input', 'example', '[SEP]']
step 2: outputs:
------------------------------------------------------------
Token wise output: torch.Size([1, 7, 768]), Pooled output: torch.Size([1, 768])
'''
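Besides the pooled [CLS] vector, a common recipe for a single sentence embedding is to mean-pool the token-wise outputs using the attention mask (this reuses model and tokens_info from the snippet above; it is a convention, not something the library prescribes):

import torch

with torch.no_grad():
    token_embeddings = model(**tokens_info)[0]                  # B x S x D

mask = tokens_info["attention_mask"].unsqueeze(-1).float()      # B x S x 1
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)    # torch.Size([1, 768])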

 

5.2 Predict the missing word in a sentence

 

import torch
from transformers import BertForMaskedLM

text = "Nice to [MASK] you" # the target word is replaced by [MASK]
# step 1: obtain pretrained Bert Model using MLM Loss
maskedLM_model = BertForMaskedLM.from_pretrained(MODEL_NAME, cache_dir='tmp/model')
maskedLM_model.eval() # close dropout
# step 2: tokenize
token_info = tokenizer.encode_plus(text, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(token_info['input_ids'].squeeze().numpy())
print(tokens) # ['[CLS]', 'Nice', 'to', '[MASK]', 'you', '[SEP]']
# step 3: forward to obtain prediction scores
with torch.no_grad():
    outputs = maskedLM_model(**token_info)
    predictions = outputs[0] # shape, B x S x V, [1, 6, 28996]
# step 4: top-k predicted tokens
masked_index = tokens.index('[MASK]') # 3
k = 10
probs, indices = torch.topk(torch.softmax(predictions[0, masked_index], -1), k)
predicted_tokens = tokenizer.convert_ids_to_tokens(indices.tolist())
print(list(zip(predicted_tokens, probs)))
'''
output:
[('meet', tensor(0.9712)),
 ('see', tensor(0.0267)),
 ('meeting', tensor(0.0010)),
 ('have', tensor(0.0003)),
 ('met', tensor(0.0002)),
 ('know', tensor(0.0001)),
 ('join', tensor(7.0005e-05)),
 ('find', tensor(5.8323e-05)),
 ('Meet', tensor(2.7171e-05)),
 ('tell', tensor(2.4689e-05))]
'''

 

As we can see, "meet" has by far the highest probability, 0.97.

 

5.3 Extractive QA

 

This demonstrates a sentence-pair-level downstream task.

 

from transformers import BertTokenizer, BertForQuestionAnswering
import torch
# step 1: obtain a model fine-tuned on SQuAD
# (bert-base-uncased and bert-large-uncased share the same uncased vocabulary, so this tokenizer works)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', cache_dir='tmp/token_qa')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad', cache_dir='tmp/model_qa')
# step 2: tokenize, sentence-pair, question, passage
question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer.encode_plus(question, text, return_tensors='pt')
input_ids, token_type_ids = encoding["input_ids"], encoding["token_type_ids"]
print(input_ids, token_type_ids)
# observe enriched tokens
all_tokens = tokenizer.convert_ids_to_tokens(input_ids.squeeze().numpy())
print(all_tokens)
# step 3: obtain start/end position scores, B x S
start_scores, end_scores = model(input_ids, token_type_ids=token_type_ids) # (B, S)
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1])
print(answer)
assert answer == "a nice puppet"
'''
output:
step 2:
   input_ids: tensor([[  101,  2040,  2001,  3958, 27227,  1029,   102,  3958, 27227,  2001, 1037,  3835, 13997,   102]]) 
   token_type_ids: tensor([[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]])
   all_tokens:
    ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', 'henson', 'was', 'a', 'nice', 'puppet', '[SEP]']   
step 3:
   answer:
   a nice puppet
'''

 

The model predicts the answer correctly: "a nice puppet".

 

6. Summary

 

I had never had the chance to read the BERT source code before, and this article is a first, fairly shallow set of reading notes on it. For further study, see the article "进击的 BERT:NLP 界的巨人之力与迁移学习". Overall, building on and fine-tuning with HuggingFace's transformers is quite convenient. Next time I will try combining it with AllenNLP, using transformers inside AllenNLP to solve NLP tasks.

 

7. References

 

Transformers: State-of-the-art Natural Language Processing

 

深入理解NLP Subword算法:BPE、WordPiece、ULM

 

huggingface transformers doc

 

huggingface transformers source code

 

进击的 BERT:NLP 界的巨人之力与迁移学习

 

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
