, as well as the merge strategy each algorithm uses to generate the final set of tokens.

## The Unigram Algorithm – a Probability-Based Model

Unigram is a fully probabilistic algorithm: in each iteration it relies on probabilities both to choose candidate subwords and to make the final decision about merging (or not merging) them.
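The scoring idea can be sketched as a toy in pure Python (this is an illustration, not the library's actual implementation): given unigram probabilities for subwords, each candidate segmentation of a word is scored by the product of its pieces' probabilities, and the most likely segmentation wins. The vocabulary and probability values below are made up for demonstration.

```python
import math

# Hypothetical unigram probabilities for a tiny vocabulary (made-up numbers)
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.10, "ug": 0.15, "hug": 0.20}

def segmentations(word):
    """Enumerate all ways to split `word` into in-vocabulary pieces."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in probs:
            for rest in segmentations(word[i:]):
                results.append([piece] + rest)
    return results

def log_likelihood(pieces):
    """Score a segmentation as the sum of log unigram probabilities."""
    return sum(math.log(probs[p]) for p in pieces)

best = max(segmentations("hug"), key=log_likelihood)
print(best)  # the single piece "hug" has the highest likelihood here
```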

## The WordPiece Algorithm

WordPiece is also a greedy algorithm. It relies on likelihood rather than raw count frequency to merge the best pair in each iteration, but the candidate characters to pair are chosen based on count frequency.

## How to Train the BPE, Unigram, and WordPiece Algorithms

### Downloading the training dataset

```
!wget http://www.gutenberg.org/cache/epub/16457/pg16457.txt
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
!unzip wikitext-103-raw-v1.zip
```

### Importing the required models and trainers

```
## importing the tokenizer and subword BPE trainer
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordLevel, WordPiece
from tokenizers.trainers import BpeTrainer, WordLevelTrainer, \
                                WordPieceTrainer, UnigramTrainer

## a pretokenizer to segment the text into words
from tokenizers.pre_tokenizers import Whitespace
```

### How to automate training and tokenization

#### Step 1 – Prepare the tokenizer

Depending on the algorithm we pick, we will need a different trainer: `BpeTrainer`, `WordLevelTrainer`, `WordPieceTrainer`, or `UnigramTrainer`.

```
unk_token = "<UNK>"  # token for unknown words
spl_tokens = ["<UNK>", "<SEP>", "<MASK>", "<CLS>"]  # special tokens

def prepare_tokenizer_trainer(alg):
    """
    Prepares the tokenizer and trainer with unknown & special tokens.
    """
    if alg == 'BPE':
        tokenizer = Tokenizer(BPE(unk_token=unk_token))
        trainer = BpeTrainer(special_tokens=spl_tokens)
    elif alg == 'UNI':
        tokenizer = Tokenizer(Unigram())
        trainer = UnigramTrainer(unk_token=unk_token, special_tokens=spl_tokens)
    elif alg == 'WPC':
        tokenizer = Tokenizer(WordPiece(unk_token=unk_token))
        trainer = WordPieceTrainer(special_tokens=spl_tokens)
    else:
        tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
        trainer = WordLevelTrainer(special_tokens=spl_tokens)

    tokenizer.pre_tokenizer = Whitespace()
    return tokenizer, trainer
```

(A trained tokenizer may, for example, learn a merged token when two words frequently appear next to each other.) You can test it with other special tokens as well.

#### Step 2 – Train the tokenizer

```
'WLV' - Word Level Algorithm
'WPC' - WordPiece Algorithm
'BPE' - Byte Pair Encoding
'UNI' - Unigram
```

```
def train_tokenizer(files, alg='WLV'):
    """
    Takes the files and trains the tokenizer.
    """
    tokenizer, trainer = prepare_tokenizer_trainer(alg)
    tokenizer.train(files, trainer)  # training the tokenizer
    tokenizer.save("./tokenizer-trained.json")
    tokenizer = Tokenizer.from_file("./tokenizer-trained.json")
    return tokenizer
```

#### Step 3 – Tokenize an input string

```
def tokenize(input_string, tokenizer):
    """Tokenizes the input string with the trained tokenizer."""
    output = tokenizer.encode(input_string)
    return output

small_file = ['pg16457.txt']
large_files = [f"./wikitext-103-raw/wiki.{split}.raw"
               for split in ["test", "train", "valid"]]

tokens_dict = {}  # collects the tokens produced by each algorithm
for files in [small_file, large_files]:
    print(f"========Using vocabulary from {files}=======")
    for alg in ['WLV', 'BPE', 'UNI', 'WPC']:
        trained_tokenizer = train_tokenizer(files, alg)
        input_string = "This is a deep learning tokenization tutorial. Tokenization is the first step in a deep learning NLP pipeline. We will be comparing the tokens generated by each tokenization model. Excited much?!:heart_eyes:"
        output = tokenize(input_string, trained_tokenizer)
        tokens_dict[alg] = output.tokens
        print("----", alg, "----")
        print(output.tokens, "->", len(output.tokens))
```

## Analyzing the Output

BPE

Unigram model

WordPiece created 52 tokens when trained on the smaller dataset and 48 when trained on the larger one. The generated tokens carry a double `##` prefix, which indicates that a token is used as the continuation (prefix/suffix) of a word.
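A quick illustration of what the `##` marker means: pieces prefixed with `##` attach to the preceding piece, so the original word can be recovered by stripping the marker and concatenating. The token list below is a hypothetical example, not actual tokenizer output.

```python
def detokenize_wordpiece(tokens):
    """Rejoin WordPiece-style pieces: '##'-prefixed pieces glue onto the previous one."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' and append to the previous piece
        else:
            words.append(tok)
    return " ".join(words)

print(detokenize_wordpiece(["token", "##ization", "tutorial"]))  # "tokenization tutorial"
```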

## How to Compare the Tokens

```
import pandas as pd

max_len = max(len(tokens_dict['UNI']), len(tokens_dict['WPC']), len(tokens_dict['BPE']))
# pad the shorter lists with '#' so that every column has the same length
diff_bpe = max_len - len(tokens_dict['BPE'])
diff_wpc = max_len - len(tokens_dict['WPC'])
tokens_dict['BPE'] = tokens_dict['BPE'] + ['#'] * diff_bpe
tokens_dict['WPC'] = tokens_dict['WPC'] + ['#'] * diff_wpc
del tokens_dict['WLV']
df = pd.DataFrame(tokens_dict)
```
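The reason for computing the length differences is that a DataFrame requires every column to have the same length, so the shorter token lists must be padded. A minimal pandas-free sketch with made-up token lists shows the idea:

```python
# Made-up token lists of unequal length (for illustration only)
tokens_dict = {
    "BPE": ["This", "is", "a", "deep", "learn", "ing"],
    "UNI": ["This", "is", "a", "deep", "learning"],
    "WPC": ["This", "is", "a", "de", "##ep", "learn", "##ing"],
}

# Pad the shorter lists with '#' so every column has the same length
max_len = max(len(v) for v in tokens_dict.values())
padded = {alg: toks + ["#"] * (max_len - len(toks)) for alg, toks in tokens_dict.items()}

for alg, toks in padded.items():
    print(alg, toks)
```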

## Closing Remarks and Next Steps

### References and Notes

1. "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates," by Taku Kudo

1. A research paper discussing different segmentation techniques based on the BPE compression algorithm.

The Hugging Face tokenizers package.