A Detailed Introduction to BERT!
Table of Contents
1. Dataset Introduction
2. Reading the Data
3. Train/Validation Split
4. Tokenizing the Data
5. Defining the Dataset (subclassing Dataset)
6. Defining the Model and Optimizer
7. Training, Evaluation, and Accuracy
8. Full Code
9. References
A detailed introduction to BERT:
https://zhangkaifang.blog.csdn.net/article/details/120507302
Project code on GitHub:
https://github.com/zhangkaifang/NLP-Learning
(Figure: the BERT classification model)
1. Dataset Introduction
The dataset used in the experiments comes from the Toutiao news client:
https://github.com/BenDerPan/toutiao-text-classfication-dataset
Baidu Cloud link:
https://pan.baidu.com/s/1yoUTdd91Dzv4c-WtHB9Teg
Extraction code: cnse
Data format: each line is one sample, with five fields separated by _!_. From left to right they are: news ID, category code (see below), category name (see below), news title (title text only), and news keywords. For example:
6552431613437805063_!_102_!_news_entertainment_!_谢娜为李浩菲澄清网络谣言,之后她的两个行为给自己加分_!_佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们
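As a quick illustration of the format (a minimal sketch; the variable names are mine), the example line above can be split into its five fields like this:

```python
# Sketch: split one raw line into its five '_!_'-separated fields.
line = "6552431613437805063_!_102_!_news_entertainment_!_谢娜为李浩菲澄清网络谣言,之后她的两个行为给自己加分_!_佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们"
news_id, code, name, title, keywords = line.strip().split("_!_")
print(code, name)   # 102 news_entertainment
print(title)        # the headline text
print(keywords)     # comma-separated keywords (may be empty on some lines)
```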
Category codes and names:
100 news_story (民生/故事, livelihood stories)
101 news_culture (文化, culture)
102 news_entertainment (娱乐, entertainment)
103 news_sports (体育, sports)
104 news_finance (财经, finance)
106 news_house (房产, real estate)
107 news_car (汽车, automobiles)
108 news_edu (教育, education)
109 news_tech (科技, technology)
110 news_military (军事, military)
112 news_travel (旅游, travel)
113 news_world (国际, world news)
114 stock (证券/股票, stocks)
115 news_agriculture (农业/三农, agriculture)
116 news_game (电竞/游戏, e-sports and gaming)
Dataset size: 382,688 samples in total, spread across 15 categories.
# Download the dataset inside a notebook
!wget https://mirror.coggle.club/dataset/toutiao_cat_data.txt.zip
# Unzip the dataset
!unzip toutiao_cat_data.txt.zip
# Show the first 5 lines
!head -n 5 toutiao_cat_data.txt
# 6551700932705387022_!_101_!_news_culture_!_京城最值得你来场文化之旅的博物馆_!_保利集团,马未都,中国科学技术馆,博物馆,新中国
# 6552368441838272771_!_101_!_news_culture_!_发酵床的垫料种类有哪些?哪种更好?_!_
# 6552407965343678723_!_101_!_news_culture_!_上联:黄山黄河黄皮肤黄土高原。怎幺对下联?_!_
# 6552332417753940238_!_101_!_news_culture_!_林徽因什幺理由拒绝了徐志摩而选择梁思成为终身伴侣?_!_
# 6552475601595269390_!_101_!_news_culture_!_黄杨木是什幺树?_!_
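To double-check the stated size, a quick count can be run after the download (a small sketch that simply re-reads the category code of every line):

```python
# Sketch: count lines and distinct category codes to verify the dataset size.
import codecs
from collections import Counter

codes = [line.split('_!_')[1] for line in codecs.open('toutiao_cat_data.txt')]
print(len(codes))          # should match the stated total of 382688 samples
print(sorted(set(codes)))  # should be the 15 category codes listed above
print(Counter(codes))      # per-category sample counts
```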
2. Reading the Data
Code:
import pandas as pd  # pandas, in case a DataFrame view of the data is needed later
import codecs        # plain-text file reading

# Labels: the category code minus 100, so labels fall in the range 0-16
news_label = [int(x.split('_!_')[1]) - 100 for x in codecs.open('toutiao_cat_data.txt')]
print(news_label[:5])

# Text: take the last field; if the line ends with '_!_' (empty keyword field), fall back to the title field.
# Note that when keywords are present, the keyword string (not the title) is used as the text.
news_text = [x.strip().split('_!_')[-1] if x.strip()[-3:] != '_!_' else x.strip().split('_!_')[-2]
             for x in codecs.open('toutiao_cat_data.txt')]
print(news_text[:5])
Output:
[1, 1, 1, 1, 1]
['保利集团,马未都,中国科学技术馆,博物馆,新中国', '发酵床的垫料种类有哪些?哪种更好?', '上联:黄山黄河黄皮肤黄土高原。怎幺对下联?', '林徽因什幺理由拒绝了徐志摩而选择梁思成为终身伴侣?', '黄杨木是什幺树?']
3. Train/Validation Split
# Import the required packages
import torch
from sklearn.model_selection import train_test_split  # train/validation split
from torch.utils.data import Dataset, DataLoader, TensorDataset
import numpy as np
import pandas as pd
import random
import re
Note: with stratify the split is sampled according to the labels, so the training and validation sets share the same label distribution (a quick sanity check of this is sketched after the tokenizer demo below).
from transformers import BertTokenizer  # needed here for the tokenizer demo below

# Split into training and validation sets
x_train, x_test, train_label, test_label = train_test_split(news_text[:50000],
                                                             news_label[:50000],
                                                             test_size=0.2,
                                                             stratify=news_label[:50000])
# x_train, x_test, train_label, test_label = train_test_split(news_text[:50000],
#                                                             news_label[:50000],
#                                                             test_size=0.2,
#                                                             random_state=10)

# Load the tokenizer (i.e. the vocabulary) of the bert-base-chinese checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
sen_code = tokenizer.encode_plus('我不喜欢这世界', '我只喜欢你')  # a sentence pair
print(sen_code)
print(tokenizer.convert_ids_to_tokens(sen_code['input_ids']))
# input_ids:      token ids
# token_type_ids: marks whether a token belongs to the first or the second sentence
# attention_mask: marks real tokens vs. padding
{'input_ids': [101, 2769, 679, 1599, 3614, 6821, 686, 4518, 102, 2769, 1372, 1599, 3614, 872, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', '我', '不', '喜', '欢', '这', '世', '界', '[SEP]', '我', '只', '喜', '欢', '你', '[SEP]']
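As promised above, here is a quick sanity check that the stratified split keeps the training and validation label distributions aligned (a sketch reusing train_label and test_label from the split above; the helper name is mine):

```python
# Sketch: compare label proportions in the training and validation splits.
from collections import Counter

def label_ratios(labels):
    counts = Counter(labels)
    total = len(labels)
    return {k: round(v / total, 3) for k, v in sorted(counts.items())}

print(label_ratios(train_label))  # with stratify=... these two dicts
print(label_ratios(test_label))   # should be nearly identical
```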
4. Tokenizing the Data
Note: calling tokenizer(...) directly is equivalent to tokenizer.batch_encode_plus(...); it encodes the whole dataset in one call (a quick comparison is sketched right after the code below).
# pip install transformers
from transformers import BertTokenizer  # the tokenizer, essentially the vocabulary plus encoding logic

tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
train_encoding = tokenizer(x_train, truncation=True, padding=True, max_length=64)
test_encoding = tokenizer(x_test, truncation=True, padding=True, max_length=64)
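To illustrate the note above, the two calls can be compared directly on a couple of throwaway sentences (a minimal sketch; the sample sentences are arbitrary):

```python
# Sketch: tokenizer(...) and tokenizer.batch_encode_plus(...) yield the same encodings.
samples = ['黄杨木是什么树?', '上联:黄山黄河黄皮肤黄土高原,怎么对下联?']
enc_call = tokenizer(samples, truncation=True, padding=True, max_length=64)
enc_batch = tokenizer.batch_encode_plus(samples, truncation=True, padding=True, max_length=64)
print(enc_call['input_ids'] == enc_batch['input_ids'])  # True
```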
Note the structure of what the tokenizer returns in train_encoding. Suppose there were only 4 samples in total, 2 for training and 2 for testing; then print(train_encoding.items()) would print something like the following:
dict_items([
  ('input_ids', [[101, 1355, 6997, 2414, 4638, 1807, 3160, 4905, 5102, 3300, 1525, 763, 8043, 1525, 4905, 3291, 1962, 8043, 102, 0, 0, 0, 0],
                 [101, 677, 5468, 8038, 7942, 2255, 7942, 3777, 7942, 4649, 5502, 7942, 1759, 7770, 1333, 511, 2582, 720, 2190, 678, 5468, 8043, 102]]),
  ('token_type_ids', [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
  ('attention_mask', [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
                      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
])
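A handy sanity check is to map the ids of one encoded sample back to tokens, which also makes the padding visible (a sketch using the tokenizer and train_encoding defined above):

```python
# Sketch: decode one encoded training sample back into tokens.
tokens = tokenizer.convert_ids_to_tokens(train_encoding['input_ids'][0])
print(tokens)  # e.g. ['[CLS]', ..., '[SEP]', '[PAD]', ...]; trailing [PAD] tokens line up with the 0s in attention_mask
```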
5. Defining the Dataset (subclassing Dataset)
# Wrap the encodings and labels in a Dataset
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # Fetch a single sample by index
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NewsDataset(train_encoding, train_label)
test_dataset = NewsDataset(test_encoding, test_label)
print(train_dataset[1])

# From single-sample access to batched loading
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=True)
Printed output:
{'input_ids': tensor([ 101, 1075, 4343, 6121,  689,  117, 4343,  817,  117, 4495, 4343,  117,
                        817,  677, 3885,  117, 1075, 4343, 2787,  102,    0,    0,    0,    0,
                          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
                          0,    0,    0,    0]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'labels': tensor(15)}
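To see what the DataLoader actually yields, one batch can be pulled out and its tensor shapes inspected (a small sketch; with batch_size=64 and the padded length of 64 seen in the sample above, the expected shapes are noted in the comments):

```python
# Sketch: inspect the shapes of one batch produced by the DataLoader.
batch = next(iter(train_loader))
print(batch['input_ids'].shape)       # torch.Size([64, 64])  -> (batch_size, padded_length)
print(batch['attention_mask'].shape)  # torch.Size([64, 64])
print(batch['labels'].shape)          # torch.Size([64])
```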
6. Defining the Model and Optimizer
from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup

# Model: labels are the category code minus 100 and range from 0 to 16 (codes 105 and 111 are unused),
# so 17 output classes are needed
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=17)
device = torch.device("cuda:3" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer and linear learning-rate schedule
optimizer = AdamW(model.parameters(), lr=2e-5)
total_steps = len(train_loader) * 1  # one epoch's worth of steps; with more epochs the LR decays to 0 after the first epoch
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,  # default value in run_glue.py
                                            num_training_steps=total_steps)
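Before launching training, a single forward pass on one batch is a useful smoke test (a sketch reusing train_loader, model, and device from above): with labels supplied, the model returns (loss, logits), and the logits should have shape (batch_size, 17).

```python
# Sketch: run one batch through the model to check the output shapes.
batch = next(iter(train_loader))
with torch.no_grad():
    outputs = model(batch['input_ids'].to(device),
                    attention_mask=batch['attention_mask'].to(device),
                    labels=batch['labels'].to(device))
loss, logits = outputs[0], outputs[1]
print(loss.item())    # before any fine-tuning, roughly ln(17) ≈ 2.83
print(logits.shape)   # torch.Size([64, 17])
```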
7. Training, Evaluation, and Accuracy
from tqdm import tqdm  # progress bar used by the training loop below

#################### Accuracy helper
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)


#################### 7. Training function
def train(epoch):
    model.train()
    total_train_loss = 0
    iter_num = 0
    # total_iter = len(train_loader)
    epoch_iterator = tqdm(train_loader, desc=f"Epoch {epoch}", ncols=100, leave=True, position=0)
    for batch in epoch_iterator:
        # Forward pass
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        total_train_loss += loss.item()

        # Update the progress-bar description
        epoch_iterator.set_description(
            f"epoch:{epoch} " +
            f"loss: {loss.item():.4f} " +
            f"lr: {scheduler.get_last_lr()[0]:.1e}")

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping

        # Parameter update
        optimizer.step()
        scheduler.step()
        iter_num += 1
        # if (iter_num % 100 == 0):
        #     print("epoch: %d, iter_num: %d, loss: %.4f, %.2f%%" % (
        #         epoch, iter_num, loss.item(), iter_num / total_iter * 100))

    print("Epoch: %d, Average training loss: %.4f" % (epoch, total_train_loss / len(train_loader)))


#################### 8. Validation function
def validation():
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    for batch in test_dataloader:
        with torch.no_grad():
            # Forward pass only
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        loss = outputs[0]
        logits = outputs[1]
        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
    print("Accuracy: %.4f" % (avg_val_accuracy))
    print("Average testing loss: %.4f" % (total_eval_loss / len(test_dataloader)))
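As a tiny worked example of flat_accuracy (a sketch with made-up logits), three predictions of which two match the labels give an accuracy of 2/3:

```python
# Sketch: flat_accuracy on a hand-made 3x4 "logits" array.
import numpy as np

preds = np.array([[0.1, 0.9, 0.0, 0.0],   # argmax -> 1, label 1: correct
                  [0.8, 0.1, 0.1, 0.0],   # argmax -> 0, label 0: correct
                  [0.2, 0.2, 0.5, 0.1]])  # argmax -> 2, label 3: wrong
labels = np.array([1, 0, 3])
print(flat_accuracy(preds, labels))       # 0.6666...
```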
8. Full Code
A notebook version of the code is also provided. Baidu Cloud:
https://pan.baidu.com/s/1ctbMU8CZfg3M_8hPXL96eQ
Extraction code: jotg
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""=====================================
@author : kaifang zhang
@time   : 2021/12/17 10:28 AM
@contact: [email protected]
====================================="""
import codecs  # file reading
import torch
import random
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split  # train/validation split
from transformers import BertTokenizer
from torch.utils.data import Dataset, DataLoader
from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup

random.seed(1001)

#################### 1. Read the text: one line per sample, label = category code - 100
news_label = [int(x.split('_!_')[1]) - 100 for x in codecs.open('toutiao_cat_data.txt')]
# print(news_label[:5])
news_text = [x.strip().split('_!_')[-1] if x.strip()[-3:] != '_!_' else x.strip().split('_!_')[-2]
             for x in codecs.open('toutiao_cat_data.txt')]
# print(news_text[:5])

#################### 2. Split into training and validation sets
# stratify samples by label so the two splits share the same label distribution
x_train, x_test, train_label, test_label = train_test_split(news_text[:5000], news_label[:5000],
                                                             test_size=0.2, stratify=news_label[:5000])

#################### 3. Tokenizer: essentially the vocabulary, used to encode the characters
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
max_length = 64
train_encoding = tokenizer.batch_encode_plus(x_train, truncation=True, padding=True, max_length=max_length)
test_encoding = tokenizer.batch_encode_plus(x_test, truncation=True, padding=True, max_length=max_length)


#################### 4. Wrap the data in a Dataset
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    # Fetch a single sample by index
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NewsDataset(train_encoding, train_label)
test_dataset = NewsDataset(test_encoding, test_label)
# print(train_dataset[1])

#################### 5. From single-sample access to batched loading
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)


#################### 6. Accuracy helper
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)


#################### 7. Classification model
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=17)  # number of classes
device = torch.device("cuda:2" if torch.cuda.is_available() else "cpu")
model.to(device)

#################### 8. Optimizer and learning-rate schedule
optimizer = AdamW(model.parameters(), lr=2e-5)
total_steps = len(train_loader) * 1  # one epoch's worth of steps
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,  # default value in run_glue.py
                                            num_training_steps=total_steps)


#################### 9. Training function
def train(epoch):
    model.train()
    total_train_loss = 0
    iter_num = 0
    # total_iter = len(train_loader)
    epoch_iterator = tqdm(train_loader, desc=f"Epoch {epoch}", ncols=100, leave=True, position=0)
    for batch in epoch_iterator:
        # Forward pass
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        total_train_loss += loss.item()

        # Update the progress-bar description
        epoch_iterator.set_description(
            f"epoch:{epoch} " +
            f"loss: {loss.item():.4f} " +
            f"lr: {scheduler.get_last_lr()[0]:.1e}")

        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping

        # Parameter update
        optimizer.step()
        scheduler.step()
        iter_num += 1
        # if (iter_num % 100 == 0):
        #     print("epoch: %d, iter_num: %d, loss: %.4f, %.2f%%" % (
        #         epoch, iter_num, loss.item(), iter_num / total_iter * 100))

    print("Epoch: %d, Average training loss: %.4f" % (epoch, total_train_loss / len(train_loader)))


#################### 10. Validation function
def validation():
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    for batch in test_dataloader:
        with torch.no_grad():
            # Forward pass only
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        loss = outputs[0]
        logits = outputs[1]
        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)

    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
    print("Accuracy: %.4f" % (avg_val_accuracy))
    print("Average testing loss: %.4f" % (total_eval_loss / len(test_dataloader)))


if __name__ == '__main__':
    for epoch in range(4):
        print("------------Epoch: %d ----------------" % epoch)
        train(epoch)
        validation()  # validation() takes no arguments
Execution output:
ssh://[email protected]:22/dataNew/kaifang/miniconda3/envs/torch/bin/python -u /dataNew/kaifang/0_Codes/测试项目/test.py
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
epoch:0 loss: 0.9334 lr: 8.0e-08: 100%|███████████████████████████| 250/250 [00:38<00:00, 6.50it/s]
Epoch: 0, Average training loss: 1.4097
Accuracy: 0.8284
Average testing loss: 0.7634
epoch:1 loss: 0.5058 lr: 0.0e+00: 100%|███████████████████████████| 250/250 [00:37<00:00, 6.67it/s]
Epoch: 1, Average training loss: 0.7357
Accuracy: 0.8304
Average testing loss: 0.7614
9. References
Reading the BERT source code, using BERT text-classification code as an example: https://github.com/DA-southampton/Read_Bert_Code
BERT is all the rage but you still don't understand the Transformer? This one article is enough: https://zhuanlan.zhihu.com/p/54356280
Loading a BERT model in PyTorch and extracting word vectors: https://blog.csdn.net/znsoft/article/details/107725285
Extracting sentence features with BERT (pytorch_transformers): https://blog.csdn.net/weixin_41519463/article/details/100863313
Learning-rate warm-up (transformers.get_linear_schedule_with_warmup): https://blog.csdn.net/orangerfun/article/details/120400247