Press "Enter" to skip to content

『NLP Study Notes』BERT Text Classification in Practice


Contents

1. Dataset Introduction
2. Data Loading
3. Train/Validation Split
4. Tokenization
5. Dataset Definition (subclassing Dataset)
6. Model and Optimizer
7. Training, Evaluation, and Accuracy
8. Full Code

For a detailed introduction to BERT, see:
https://zhangkaifang.blog.csdn.net/article/details/120507302

GitHub link for this project's code:
https://github.com/zhangkaifang/NLP-Learning

The BERT classification model:

1. Dataset Introduction

The dataset used in the experiments comes from the Toutiao (今日头条) news client:
https://github.com/BenDerPan/toutiao-text-classfication-dataset

Baidu Cloud link:
https://pan.baidu.com/s/1yoUTdd91Dzv4c-WtHB9Teg
Extraction code: cnse

Data format: each line is one record, with fields separated by _!_. From left to right the fields are: news ID, category code (see below), category name (see below), news text (title only), and news keywords.

6552431613437805063_!_102_!_news_entertainment_!_谢娜为李浩菲澄清网络谣言,之后她的两个行为给自己加分_!_佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们
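To make the format concrete, here is a small sketch that splits the sample record above into its five fields (plain Python string handling, nothing project-specific):

# A minimal sketch: split the sample record above on the _!_ separator
line = '6552431613437805063_!_102_!_news_entertainment_!_谢娜为李浩菲澄清网络谣言,之后她的两个行为给自己加分_!_佟丽娅,网络谣言,快乐大本营,李浩菲,谢娜,观众们'
news_id, code, name, title, keywords = line.split('_!_')
print(news_id)     # 6552431613437805063
print(code, name)  # 102 news_entertainment
print(title)       # the news title
print(keywords)    # comma-separated keywords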

Category codes and names:

100  story (people's livelihood)  news_story
101  culture                      news_culture
102  entertainment                news_entertainment
103  sports                       news_sports
104  finance                      news_finance
106  real estate                  news_house
107  automobile                   news_car
108  education                    news_edu
109  technology                   news_tech
110  military                     news_military
112  travel                       news_travel
113  world                        news_world
114  securities (stocks)          stock
115  agriculture                  news_agriculture
116  e-sports (gaming)            news_game

Dataset size: 382,688 records in total, spread across 15 categories.

# Download the dataset (inside a notebook)
!wget https://mirror.coggle.club/dataset/toutiao_cat_data.txt.zip
# Unzip the dataset
!unzip toutiao_cat_data.txt.zip
# Show the first 5 lines
!head -n 5 toutiao_cat_data.txt
# 6551700932705387022_!_101_!_news_culture_!_京城最值得你来场文化之旅的博物馆_!_保利集团,马未都,中国科学技术馆,博物馆,新中国
# 6552368441838272771_!_101_!_news_culture_!_发酵床的垫料种类有哪些?哪种更好?_!_
# 6552407965343678723_!_101_!_news_culture_!_上联:黄山黄河黄皮肤黄土高原。怎么对下联?_!_
# 6552332417753940238_!_101_!_news_culture_!_林徽因什么理由拒绝了徐志摩而选择梁思成为终身伴侣?_!_
# 6552475601595269390_!_101_!_news_culture_!_黄杨木是什么树?_!_
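As a quick sanity check (a sketch, assuming the file has been downloaded and unzipped as above), the total line count and the per-category counts can be verified with collections.Counter:

import codecs
from collections import Counter

# Count records per category code; the total should be 382,688 spread over 15 codes
codes = [line.split('_!_')[1] for line in codecs.open('toutiao_cat_data.txt')]
print(len(codes))      # total number of records
print(Counter(codes))  # number of records per category code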

 

2. Data Loading

Code:

# pandas, for DataFrame-style data handling
import pandas as pd
# codecs, for file reading
import codecs
# Read the text file
# Labels: category code minus 100
news_label = [int(x.split('_!_')[1])-100 
                  for x in codecs.open('toutiao_cat_data.txt')]
print(news_label[:5])
# Text: the last field (the keywords when present, otherwise the title)
news_text = [x.strip().split('_!_')[-1] if x.strip()[-3:] != '_!_' else x.strip().split('_!_')[-2]
                 for x in codecs.open('toutiao_cat_data.txt')]
print(news_text[:5])

Result:

[1, 1, 1, 1, 1]
['保利集团,马未都,中国科学技术馆,博物馆,新中国', '发酵床的垫料种类有哪些?哪种更好?', '上联:黄山黄河黄皮肤黄土高原。怎么对下联?', '林徽因什么理由拒绝了徐志摩而选择梁思成为终身伴侣?', '黄杨木是什么树?']
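The first five records all belong to news_culture (code 101), so their labels are all 1: the label is simply the category code minus 100. The sketch below (the id2name dict just restates the table from section 1 with codes shifted by 100) turns a label back into a readable category name. Note that codes 105 and 111 do not exist, so labels 5 and 11 never occur.

# A sketch: map a label (category code - 100) back to its category name
id2name = {0: 'news_story', 1: 'news_culture', 2: 'news_entertainment', 3: 'news_sports',
           4: 'news_finance', 6: 'news_house', 7: 'news_car', 8: 'news_edu',
           9: 'news_tech', 10: 'news_military', 12: 'news_travel', 13: 'news_world',
           14: 'stock', 15: 'news_agriculture', 16: 'news_game'}
print(id2name[news_label[0]])  # news_culture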

 

3. Train/Validation Split

 

# Import the required packages
import torch
from sklearn.model_selection import train_test_split # train/test split
from torch.utils.data import Dataset, DataLoader, TensorDataset
import numpy as np
import pandas as pd
import random
import re

Note: stratify samples by label, so the training and validation sets share the same class distribution (a quick check is sketched at the end of this section).

# Split into training and validation sets
x_train, x_test, train_label, test_label = train_test_split(news_text[:50000], 
                                                             news_label[:50000], 
                                                             test_size=0.2, 
                                                             stratify=news_label[:50000])
# x_train, x_test, train_label, test_label = train_test_split(news_text[:50000], 
#                                                              news_label[:50000], 
#                                                              test_size=0.2, 
#                                                              random_state=10)
from transformers import BertTokenizer  # needed here for the demo below; imported again in the next section
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')  # load the bert-base-chinese tokenizer, essentially a vocabulary
sen_code = tokenizer.encode_plus('我不喜欢这世界', '我只喜欢你')  # encode a sentence pair
print(sen_code)
print(tokenizer.convert_ids_to_tokens(sen_code['input_ids']))
# input_ids: token ids
# token_type_ids: marks whether a token belongs to the first or the second sentence
# attention_mask: marks whether a position is real content or padding

 

{'input_ids': [101, 2769, 679, 1599, 3614, 6821, 686, 4518, 102, 2769, 1372, 1599, 3614, 872, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', '我', '不', '喜', '欢', '这', '世', '界', '[SEP]', '我', '只', '喜', '欢', '你', '[SEP]']
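Before moving on, a quick check (a sketch using collections.Counter) that the stratified split above really kept the label distribution the same in the training and validation parts:

from collections import Counter

# With stratify=..., the label proportions in the two parts should be almost identical
print(Counter(train_label))
print(Counter(test_label))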

 

4. Tokenization


Note: tokenizer() is equivalent to tokenizer.batch_encode_plus(); it encodes the entire dataset at once, in a single call.

# pip install transformers
# transformers: loading and using BERT models
from transformers import BertTokenizer
# Tokenizer, essentially a vocabulary
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
train_encoding = tokenizer(x_train, truncation=True, padding=True, max_length=64)
test_encoding = tokenizer(x_test, truncation=True, padding=True, max_length=64)


Note the structure of what train_encoding returns. If there were only 4 samples in total, 2 for training and 2 for testing, print(train_encoding.items()) would produce:

dict_items([
('input_ids', [[101, 1355, 6997, 2414, 4638, 1807, 3160, 4905, 5102, 3300, 1525, 763, 8043, 1525, 4905, 3291, 1962, 8043, 102, 0, 0, 0, 0], 
               [101, 677, 5468, 8038, 7942, 2255, 7942, 3777, 7942, 4649, 5502, 7942, 1759, 7770, 1333, 511, 2582, 720, 2190, 678, 5468, 8043, 102]]), 
('token_type_ids', [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
('attention_mask', [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], 
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
])
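To see what truncation and padding actually produced, a short sketch that decodes the first encoded training sample back into tokens, reusing convert_ids_to_tokens from the pair-encoding example above:

# Decode the first training sample to inspect [CLS], [SEP] and [PAD]
first_ids = train_encoding['input_ids'][0]
print(tokenizer.convert_ids_to_tokens(first_ids))
print(train_encoding['attention_mask'][0])  # 1 for real tokens, 0 for padding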

 

5. Dataset Definition (subclassing Dataset)

 

# Dataset wrapper for the encoded data
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    # Fetch a single sample
    def __getitem__(self, idx):  # idx is the sample index
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item
    def __len__(self):
        return len(self.labels)
train_dataset = NewsDataset(train_encoding, train_label)
test_dataset = NewsDataset(test_encoding, test_label)
print(train_dataset[1])
# From single-sample reads to batched reads
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=64, shuffle=True)

Output:

{'input_ids': tensor([ 101, 1075, 4343, 6121,  689,  117, 4343,  817,  117, 4495, 4343,  117,
          817,  677, 3885,  117, 1075, 4343, 2787,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'labels': tensor(15)}
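A quick sketch to confirm what the DataLoader yields: each batch is a dict of stacked tensors whose first dimension is the batch size (64 here):

# Inspect one batch from the DataLoader
batch = next(iter(train_loader))
print(batch['input_ids'].shape)       # [batch_size, sequence_length]
print(batch['attention_mask'].shape)  # same shape as input_ids
print(batch['labels'].shape)          # [batch_size]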

 

6. Model and Optimizer

 

from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
# Model
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=17) # number of output classes (labels 0-16)
device = torch.device("cuda:3" if torch.cuda.is_available() else "cpu")
model.to(device)
# Optimizer and learning-rate schedule
optimizer = AdamW(model.parameters(), lr=2e-5)
total_steps = len(train_loader) * 1
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,  # default value in run_glue.py
                                            num_training_steps=total_steps)
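Why num_labels=17 when there are only 15 categories? The labels were computed as category code minus 100, and the codes run from 100 to 116 with 105 and 111 missing, so the label values range from 0 to 16; labels 5 and 11 simply never occur. A one-line check (a sketch, assuming news_label from section 2 is still in scope):

# The largest label is 16, so 17 output classes are needed
print(min(news_label), max(news_label))  # 0 16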

 

7. Training, Evaluation, and Accuracy

 

from tqdm import tqdm  # progress bar used by the training loop below
#################### Accuracy calculation
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
#################### 7. Training function
def train(epoch):
    model.train()
    total_train_loss = 0
    iter_num = 0
    # total_iter = len(train_loader)
    epoch_iterator = tqdm(train_loader, desc=f"Epoch {epoch}", ncols=100, leave=True, position=0)
    for batch in epoch_iterator:
        # Forward pass
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        total_train_loss += loss.item()
        # Update the progress-bar description
        epoch_iterator.set_description(
            f"epoch:{epoch} " +
            f"loss: {loss.item():.4f} " +
            f"lr: {scheduler.get_last_lr()[0]:.1e}")
        # Backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
        # Parameter update
        optimizer.step()
        scheduler.step()
        iter_num += 1
        # if (iter_num % 100 == 0):
        #     print("epoch: %d, iter_num: %d, loss: %.4f, %.2f%%" % (
        #         epoch, iter_num, loss.item(), iter_num / total_iter * 100))
    print("Epoch: %d, Average training loss: %.4f" % (epoch, total_train_loss / len(train_loader)))
#################### 8. Validation function
def validation():
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    for batch in test_dataloader:
        with torch.no_grad():
            # Forward pass (no gradient tracking)
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        logits = outputs[1]
        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)
    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
    print("Accuracy: %.4f" % (avg_val_accuracy))
    print("Average testing loss: %.4f" % (total_eval_loss / len(test_dataloader)))

 

8. Full Code

A notebook version of the code is also provided. Baidu Cloud:
https://pan.baidu.com/s/1ctbMU8CZfg3M_8hPXL96eQ
Extraction code: jotg

# !/usr/bin/env python
# -*- encoding: utf-8 -*-
"""=====================================
@author : kaifang zhang
@time   : 2021/12/17 10:28 AM
@contact: [email protected]
====================================="""
import codecs  # file reading
import torch
import random
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import train_test_split  # train/test split
from transformers import BertTokenizer
from torch.utils.data import Dataset, DataLoader
from transformers import BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
random.seed(1001)
#################### 1. Read the text file
news_label = [int(x.split('_!_')[1]) - 100 for x in codecs.open('toutiao_cat_data.txt')]  # one label per line: category code minus 100
# print(news_label[:5])
news_text = [x.strip().split('_!_')[-1] if x.strip()[-3:] != '_!_' else x.strip().split('_!_')[-2]
             for x in codecs.open('toutiao_cat_data.txt')]  # text: last field (keywords when present, otherwise the title)
# print(news_text[:5])
#################### 2. Split into training and validation sets
x_train, x_test, train_label, test_label = train_test_split(news_text[:5000],
                                                            news_label[:5000],
                                                            test_size=0.2,
                                                            stratify=news_label[:5000])  # stratify: keep the same label distribution in train and validation
#################### 3. Tokenizer: essentially a vocabulary that maps characters to ids
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
max_length = 64
train_encoding = tokenizer.batch_encode_plus(x_train, truncation=True, padding=True, max_length=max_length)
test_encoding = tokenizer.batch_encode_plus(x_test, truncation=True, padding=True, max_length=max_length)
#################### 4. Wrap the data in Dataset objects
class NewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    # Fetch a single sample
    def __getitem__(self, idx):  # idx is the sample index
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item
    def __len__(self):
        return len(self.labels)
train_dataset = NewsDataset(train_encoding, train_label)
test_dataset = NewsDataset(test_encoding, test_label)
# print(train_dataset[1])
#################### 4. From single-sample reads to batched (batch_size) reads
batch_size = 16
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
#################### 4. Accuracy calculation
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
#################### 5. Define the classification model
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=17)  # num_labels: number of output classes (labels 0-16)
device = torch.device("cuda:2" if torch.cuda.is_available() else "cpu")
model.to(device)
#################### 6. Optimizer and learning-rate schedule
optimizer = AdamW(model.parameters(), lr=2e-5)
total_steps = len(train_loader) * 1
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0,  # Default value in run_glue.py
                                            num_training_steps=total_steps)
#################### 7. Training function
def train(epoch):
    model.train()
    total_train_loss = 0
    iter_num = 0
    # total_iter = len(train_loader)
    epoch_iterator = tqdm(train_loader, desc=f"Epoch {
   epoch}", ncols=100, leave=True, position=0)
    for batch in epoch_iterator:
        # 正向传播
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        total_train_loss += loss.item()
        # 这里添加进度条的描述信息
        epoch_iterator.set_description(
            f"epoch:{
   epoch} " +
            f"loss: {
   loss.item():.4f} " +
            f"lr: {
   scheduler.get_last_lr()[0]:.1e}")
        # 反向梯度信息
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # 梯度裁剪
        # 参数更新
        optimizer.step()
        scheduler.step()
        iter_num += 1
        # if (iter_num % 100 == 0):
        #     print("epoch: %d, iter_num: %d, loss: %.4f, %.2f%%" % (
        #         epoch, iter_num, loss.item(), iter_num / total_iter * 100))
    print("Epoch: %d, Average training loss: %.4f" % (epoch, total_train_loss / len(train_loader)))
#################### 8. Validation function
def validation():
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0
    for batch in test_dataloader:
        with torch.no_grad():
            # Forward pass (no gradient tracking)
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs[0]
        logits = outputs[1]
        total_eval_loss += loss.item()
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
        total_eval_accuracy += flat_accuracy(logits, label_ids)
    avg_val_accuracy = total_eval_accuracy / len(test_dataloader)
    print("Accuracy: %.4f" % (avg_val_accuracy))
    print("Average testing loss: %.4f" % (total_eval_loss / len(test_dataloader)))
if __name__ == '__main__':
    for epoch in range(4):
        print("------------Epoch: %d ----------------" % epoch)
        train(epoch)
        validation()

Output:

ssh://[email protected]:22/dataNew/kaifang/miniconda3/envs/torch/bin/python -u /dataNew/kaifang/0_Codes/测试项目/test.py
Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
epoch:0 loss: 0.9334 lr: 8.0e-08: 100%|███████████████████████████| 250/250 [00:38<00:00,  6.50it/s]
Epoch: 0, Average training loss: 1.4097
Accuracy: 0.8284
Average testing loss: 0.7634
epoch:1 loss: 0.5058 lr: 0.0e+00: 100%|███████████████████████████| 250/250 [00:37<00:00,  6.67it/s]
Epoch: 1, Average training loss: 0.7357
Accuracy: 0.8304
Average testing loss: 0.7614
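After training, the fine-tuned model can be used directly for prediction. The following is only a sketch and not part of the original script (the example headline is a placeholder taken from the sample record in section 1): encode a new title with the same tokenizer, take the argmax of the logits, and add 100 to recover the category code.

# A minimal inference sketch (not part of the original script)
model.eval()
text = '谢娜为李浩菲澄清网络谣言'  # any headline to classify (placeholder)
enc = tokenizer(text, truncation=True, padding=True, max_length=64, return_tensors='pt')
with torch.no_grad():
    out = model(enc['input_ids'].to(device), attention_mask=enc['attention_mask'].to(device))
logits = out[0]  # no labels are passed, so the first output is the logits
pred = logits.argmax(dim=-1).item()
print(pred, pred + 100)  # predicted label and the corresponding category code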

 

