### Prerequisites

https://zhuanlan.zhihu.com/p/384469908

Introduction to `yield`:

https://www.runoob.com/w3cnote/python-yield-used-analysis.html
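A minimal illustration of `yield` (the lazy-generation pattern the data iterator below is built on; the function name here is hypothetical, not from the repo):

```python
def batch_indices(n, batch_size):
    """Yield (start, end) index ranges one batch at a time, lazily."""
    for start in range(0, n, batch_size):
        # yield hands one value back to the caller and pauses here
        yield start, min(start + batch_size, n)

# Values are produced on demand instead of building the whole list up front.
ranges = list(batch_indices(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```

Calling a generator function returns a generator object immediately; the body only runs as the caller iterates, which is why the data iterator can stream batches without holding them all in memory.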

### Pipeline

#### Training

① Run `build_dataset_tags.py` to process the raw dataset and save it as txt files (producing the raw-dataset text).

② Data flow

the id corresponding to each token

the tag corresponding to each token

#### Data iterator

```python
# compute the number of batches (the last one may be partial)
if data['size'] % self.batch_size == 0:
    BATCH_NUM = data['size'] // self.batch_size
else:
    BATCH_NUM = data['size'] // self.batch_size + 1

# one pass over the data:
# fetch one batch, made up of batch_size sentences
for i in range(BATCH_NUM):
    # fetch sentences and tags
    if i * self.batch_size < data['size'] < (i + 1) * self.batch_size:
        # trailing partial batch: take everything that is left
        sentences = [data['data'][idx] for idx in order[i * self.batch_size:]]
        if not interMode:
            tags = [data['tags'][idx] for idx in order[i * self.batch_size:]]
    else:
        sentences = [data['data'][idx] for idx in order[i * self.batch_size:(i + 1) * self.batch_size]]
        if not interMode:
            tags = [data['tags'][idx] for idx in order[i * self.batch_size:(i + 1) * self.batch_size]]
```
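The batching logic above can be exercised standalone. This sketch uses a hypothetical toy `data` dict and plain `batch_size`/`order` variables in place of the iterator's attributes:

```python
# Hypothetical stand-ins for the iterator's state.
data = {'size': 10,
        'data': ['s%d' % i for i in range(10)],
        'tags': ['t%d' % i for i in range(10)]}
batch_size = 4
order = list(range(data['size']))  # shuffled in the real iterator

# Ceiling division gives the batch count, including a trailing partial batch.
BATCH_NUM = (data['size'] + batch_size - 1) // batch_size  # -> 3

batches = []
for i in range(BATCH_NUM):
    # Python slicing clamps past-the-end indices, so one slice handles
    # both full batches and the final partial batch.
    idxs = order[i * batch_size:(i + 1) * batch_size]
    sentences = [data['data'][idx] for idx in idxs]
    tags = [data['tags'][idx] for idx in idxs]
    batches.append((sentences, tags))
```

Note that because slicing clamps automatically, the two-branch `if` in the original is a defensive choice rather than a necessity; a single clamped slice produces the same batches.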

→ compute the longest sentence length in the batch → convert the data to a numpy array

```python
# prepare a numpy array for the data, initialised with the pad index
# batch_data has shape (batch_len, max_subwords_len), i.e. batch size x
# longest subword-sequence length, with every element set to token_pad_idx
batch_data = self.token_pad_idx * np.ones((batch_len, max_subwords_len))
batch_token_starts = []

# copy the data into the numpy array
for j in range(batch_len):
    cur_subwords_len = len(sentences[j][0])
    if cur_subwords_len <= max_subwords_len:
        # sentence fits: copy it and leave the tail as padding
        batch_data[j][:cur_subwords_len] = sentences[j][0]
    else:
        # sentence too long: truncate to max_subwords_len
        batch_data[j] = sentences[j][0][:max_subwords_len]
    # mark the subword positions where a real token starts,
    # dropping any start index that was truncated away
    token_start_idx = sentences[j][-1]
    token_starts = np.zeros(max_subwords_len)
    token_starts[[idx for idx in token_start_idx if idx < max_subwords_len]] = 1
    batch_token_starts.append(token_starts)
    # max_token_len is assumed to be initialised to 0 before this loop
    max_token_len = max(int(sum(token_starts)), max_token_len)
```
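The padding and token-start marking can be sketched end to end with a toy batch. The sentence tuples and the truncation limit here are hypothetical, chosen so that one sentence fits and one gets truncated:

```python
import numpy as np

# Toy batch: each sentence is (subword_ids, token_start_indices).
sentences = [([101, 5, 6, 102], [1, 2]),
             ([101, 7, 8, 9, 10, 102], [1, 2, 3, 4])]
token_pad_idx = 0
batch_len = len(sentences)
max_subwords_len = 5  # truncation limit, shorter than the longest sentence

# Pad matrix of shape (batch_len, max_subwords_len), filled with the pad index.
batch_data = token_pad_idx * np.ones((batch_len, max_subwords_len), dtype=np.int64)
batch_token_starts = []
for j in range(batch_len):
    subwords = sentences[j][0]
    cur_len = len(subwords)
    if cur_len <= max_subwords_len:
        batch_data[j][:cur_len] = subwords          # copy, tail stays padded
    else:
        batch_data[j] = subwords[:max_subwords_len]  # truncate overlong sentence
    # Binary mask: 1 where a real (word-initial) token starts.
    token_starts = np.zeros(max_subwords_len)
    token_starts[[idx for idx in sentences[j][-1] if idx < max_subwords_len]] = 1
    batch_token_starts.append(token_starts)
```

The `idx < max_subwords_len` filter matters: without it, a start index pointing into the truncated tail would raise an out-of-bounds error in the fancy-indexing assignment.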

→ convert all the index-format data (i.e. the numpy arrays) to torch LongTensors

→ return batch_data, batch_token_starts, batch_tags (this is the data fed directly into the model)
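A minimal sketch of the final conversion step, assuming torch is installed; the placeholder arrays below stand in for the real `batch_data` / `batch_token_starts` built above:

```python
import numpy as np
import torch

# Placeholder outputs of the padding step (np.ones-based init yields float64).
batch_data = np.zeros((2, 5))
batch_token_starts = [np.zeros(5), np.zeros(5)]

# LongTensor (int64) is required because these values are used as
# embedding / index lookups inside the model.
batch_data = torch.tensor(batch_data, dtype=torch.long)
batch_token_starts = torch.tensor(np.stack(batch_token_starts), dtype=torch.long)
```

`torch.tensor(..., dtype=torch.long)` both copies the numpy data and casts float64 to int64 in one step; stacking the per-sentence `token_starts` vectors first gives a single `(batch_len, max_subwords_len)` tensor.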