There are many tasks related to analyzing text information, for example: spam / not-spam classification, sentiment analysis of a text, dialogue systems, and so on.
Sometimes the original task can be reduced to a simple ML classification task on a given set of observations $D = \{(x_i, c_i)\}_{i=1}^{N}$, where $x_i$ is a feature vector and $c_i$ is the class of the $i$-th object. In such tasks the features may contain not only numeric or categorical values but also raw text (e.g. a tweet or a question).
A *text* is a string $t = (a_1, a_2, \ldots, a_n)$ of symbols $a_i \in A$, where $A$ is called the alphabet. For example, $A$ can be a set of English letters or Unicode symbols; $A$ can also have a more complex structure and contain sequences of symbols, also known as *tokens* — e.g. $A$ can be a set of English words or emoji.
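As a quick illustration of these two granularities (plain Python, independent of CatBoost):

```python
# The same string viewed as a sequence of Unicode symbols
# vs. a sequence of word-level tokens.
text = "Cats are cute"
letters = list(text)       # alphabet A = Unicode symbols
tokens = text.split(" ")   # alphabet A = English words ("tokens")
print(letters[:4])  # ['C', 'a', 't', 's']
print(tokens)       # ['Cats', 'are', 'cute']
```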
Note

- Text features must not contain NaN values; you need to convert them to strings manually.
- Training with text features is supported on GPU only.
- Training is possible only with classification losses and targets.
Text preprocessing
Usually we treat text as a sequence of Unicode symbols. Unless the task is something like DNA classification, we do not need that level of granularity; moreover, we need to extract more complex entities, such as words. The process of extracting tokens — words, numbers, punctuation marks or special symbols — from a sequence of symbols is called tokenization.
Tokenization is the first step of text preprocessing in CatBoost and is performed as a simple splitting of the sequence on a string pattern (e.g. a space).
Example
text_small = [ "Cats are so cute :)", "Mouse skare...", "The cat defeated the mouse", "Cute: Mice gather an army!", "Army of mice defeated the cat :(", "Cat offers peace", "Cat is skared :(", "Cat and mouse live in peace :)" ] target_small = [1, 0, 1, 1, 0, 1, 0, 1] from catboost.text_processing import Tokenizer simple_tokenizer = Tokenizer() def tokenize_texts(texts): return [simple_tokenizer.tokenize(text) for text in texts] tokenized_text = tokenize_texts(text_small) tokenized_text
```
[['Cats', 'are', 'so', 'cute', ':)'],
 ['Mouse', 'skare...'],
 ['The', 'cat', 'defeated', 'the', 'mouse'],
 ['Cute:', 'Mice', 'gather', 'an', 'army!'],
 ['Army', 'of', 'mice', 'defeated', 'the', 'cat', ':('],
 ['Cat', 'offers', 'peace'],
 ['Cat', 'is', 'skared', ':('],
 ['Cat', 'and', 'mouse', 'live', 'in', 'peace', ':)']]
```
Note that the Tokenizer can also handle Chinese text if you set its parameter languages='chinese' when instantiating it, but its Chinese support is not particularly friendly, so consider other approaches for Chinese text.
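If you do need to tokenize Chinese, a dedicated word-segmentation library is usually a better fit. A minimal sketch using the third-party jieba package (an assumption here — it is not part of CatBoost):

```python
# pip install jieba  -- a third-party Chinese word segmenter, not part of CatBoost.
import jieba

chinese_text = "猫打败了老鼠"  # "The cat defeated the mouse"
tokens = jieba.lcut(chinese_text)
print(tokens)  # e.g. ['猫', '打败', '了', '老鼠']
```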
Punctuation handling, lowercasing, and lemmatization
Look closer at the tokenization result of the small text example — the tokens contain many issues:

- They are glued to punctuation: 'Cute:', 'army!', 'skare...'.
- The words 'Cat' and 'cat', 'Mice' and 'mice' seem to have the same meaning; perhaps they should be the same token.
- The tokens 'are'/'is' have the same problem — they are inflected forms of the same token 'be'.

Punctuation handling and lemmatization help to solve these issues.
Punctuation handling
Depending on the task, punctuation handling may:

- remove all punctuation entirely;
- separate it with spaces;
- keep it as-is (e.g. for a more complex token set).
Example
```python
tokenizer = Tokenizer(
    lowercasing=True,
    separator_type='BySense',
    token_types=['Word', 'Number', 'Punctuation']
)

text_small_spaced = [' '.join(tokenizer.tokenize(text)) for text in text_small]
text_small_spaced
```
```
['cats are so cute :)',
 'mouse skare ...',
 'the cat defeated the mouse',
 'cute : mice gather an army !',
 'army of mice defeated the cat :(',
 'cat offers peace',
 'cat is skared :(',
 'cat and mouse live in peace :)']
```
Stop words removal
Stop words are words considered uninformative for the task at hand, e.g. function words such as *the, is, at, which, on*. Stop words are usually removed during text preprocessing to reduce the amount of information the algorithm has to consider. They are collected either manually (as a dictionary) or automatically, e.g. by taking the most frequent words (a sketch of the frequency-based variant follows the example below).
```python
stop_words = ['be', 'is', 'are', 'the', 'an', 'of', 'and', 'in']

def remove_words(texts, words):
    texts_copy = []
    words_set = set(words)
    for text in tokenize_texts(texts):
        text_copy = []
        for token in text:
            if token not in words_set:
                text_copy.append(token)
        texts_copy.append(' '.join(text_copy))
    return texts_copy

text_small_no_stop = remove_words(text_small_spaced, stop_words)
text_small_no_stop
```
```
['cats so cute :)',
 'mouse skare ...',
 'cat defeated mouse',
 'cute : mice gather army !',
 'army mice defeated cat :(',
 'cat offers peace',
 'cat skared :(',
 'cat mouse live peace :)']
```
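As mentioned above, stop words can also be collected automatically. A minimal sketch of the frequency-based variant, reusing `tokenize_texts` from earlier (the top-3 cutoff is an arbitrary assumption for this tiny corpus):

```python
from collections import Counter

# Frequency-based stop-word candidates: the most common tokens in the corpus.
token_counts = Counter(
    token for text in tokenize_texts(text_small) for token in text
)
auto_stop_words = [token for token, _ in token_counts.most_common(3)]
print(auto_stop_words)
```

In practice such a list would still be reviewed by hand, since frequent tokens are not always uninformative.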
Lemmatization
A lemma is the canonical, dictionary, or citation form of a set of words. For example, the lemma 'go' represents the inflected forms 'go', 'goes', 'going', 'went', and 'gone'. The process of converting a word into its lemma is called lemmatization.
```python
# Requires the `pattern` package (pip install pattern).
from pattern.en import lemma

def lemmatize_text(text):
    return " ".join([lemma(word) for word in text.split()])

def lemmatize_texts(texts):
    return [lemmatize_text(text) for text in texts]

text_small_lemmatized = lemmatize_texts(text_small_no_stop)
text_small_lemmatized = tokenize_texts(text_small_lemmatized)
text_small_lemmatized
```
```
[['cat', 'so', 'cute', ':)'],
 ['mouse', 'skare', '...'],
 ['cat', 'defeat', 'mouse'],
 ['cute', ':', 'mice', 'gather', 'army', '!'],
 ['army', 'mice', 'defeat', 'cat', ':('],
 ['cat', 'offer', 'peace'],
 ['cat', 'skare', ':('],
 ['cat', 'mouse', 'live', 'peace', ':)']]
```
Now words with the same meaning are represented by the same token, and tokens no longer contain punctuation. You should verify for your own task whether it is really necessary to remove punctuation, lowercase sentences, or perform lemmatization and/or tokenization.
In the result above, the tokens 'mice'/'mouse' are still a problem; below we handle it with the gensim lemmatizer.
```python
# Requires gensim < 4.0 (gensim.utils.lemmatize was removed in 4.x)
# together with the `pattern` package.
from gensim.utils import lemmatize

def lemmatize_text_gensim(text):
    result = []
    for token in simple_tokenizer.tokenize(text):
        lemmas = lemmatize(token)
        if len(lemmas) == 0:
            lemma = token.lower()
        else:
            # lemmatize returns entries like b'mouse/NN'; keep the lemma part.
            lemma = lemmas[0].decode('utf-8').split('/')[0]
        result.append(lemma)
    return ' '.join(result)
```
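Applying it to the small example should now map both 'mice' and 'mouse' to the same token (again assuming gensim < 4.0):

```python
text_small_lemmatized_gensim = [
    lemmatize_text_gensim(text) for text in text_small_no_stop
]
text_small_lemmatized_gensim
```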
Checking accuracy
Let's check the accuracy with the new text preprocessing. Since operations such as punctuation spacing, lowercasing, and lemmatization are not performed inside CatBoost, we need to preprocess the texts manually and then pass them to the learning algorithm.
Preparing the new data
```python
import pandas as pd
import numpy as np

from catboost import Pool, CatBoostClassifier
from catboost.datasets import rotten_tomatoes

learn, _ = rotten_tomatoes()
```
```python
auxiliary_columns = ['id', 'theater_date', 'dvd_date', 'rating', 'date']
cat_features = ['rating_MPAA', 'studio', 'fresh', 'critic', 'top_critic', 'publisher']
text_features = ['synopsis', 'genre', 'director', 'writer', 'review']

def get_processed_rotten_tomatoes():
    learn, test = rotten_tomatoes()

    def fill_na(df, features):
        for feature in features:
            df[feature].fillna('', inplace=True)

    def preprocess_data_part(data_part):
        data_part = data_part.drop(auxiliary_columns, axis=1)
        fill_na(data_part, cat_features)
        fill_na(data_part, text_features)

        X = data_part.drop(['rating_10'], axis=1)
        y = data_part['rating_10']
        return X, y

    X_learn, y_learn = preprocess_data_part(learn)
    X_test, y_test = preprocess_data_part(test)
    return X_learn, X_test, y_learn, y_test

X_train, X_test, y_train, y_test = get_processed_rotten_tomatoes()
```
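The helper `fit_catboost_on_rotten_tomatoes` called below is defined earlier in the original notebook and is not shown in this section. A minimal sketch consistent with how it is called here (the default hyperparameters are assumptions mirroring `fit_catboost_one_column` further below):

```python
def fit_catboost_on_rotten_tomatoes(X_train, X_test, y_train, y_test,
                                    catboost_params={}, verbose=100):
    # Mark categorical and text columns by name (lists defined above).
    learn_pool = Pool(X_train, y_train,
                      cat_features=cat_features, text_features=text_features)
    test_pool = Pool(X_test, y_test,
                     cat_features=cat_features, text_features=text_features)

    catboost_default_params = {
        'iterations': 1000,
        'learning_rate': 0.03,
        'eval_metric': 'Accuracy',
        'task_type': 'GPU',  # text features are supported on GPU only (see the note above)
    }
    catboost_default_params.update(catboost_params)

    model = CatBoostClassifier(**catboost_default_params)
    model.fit(learn_pool, eval_set=test_pool, verbose=verbose)
    return model
```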
Preprocessing
Since the only natural-language text features are 'synopsis' and 'review', we will preprocess just those two.
```python
def preprocess_data(X):
    X_preprocessed = X.copy()
    for feature in ['synopsis', 'review']:
        X_preprocessed[feature] = X[feature].apply(
            lambda x: lemmatize_text(' '.join(tokenizer.tokenize(x)))
        )
    return X_preprocessed

X_preprocessed_train = preprocess_data(X_train)
X_preprocessed_test = preprocess_data(X_test)

X_preprocessed_train['synopsis'].head(10)
```
```python
fit_catboost_on_rotten_tomatoes(X_preprocessed_train, X_preprocessed_test, y_train, y_test)
```
Dictionary
Having finished text preprocessing and tokenization, we now use the prepared texts to select a set of units that will be used to build new numeric features. This set of selected units is called a dictionary; it may contain words, word bigrams, or letters.
Example
Let's build a dictionary for the small text example:
```python
def build_dictionary(tokenized_texts):
    dictionary = {}
    for text in tokenized_texts:
        for token in text:
            if token not in dictionary:
                size = len(dictionary)
                dictionary[token] = size
    return dictionary

def print_dictionary(dictionary, n_items=5):
    dict_items = sorted(dictionary.items(), key=lambda x: x[1])
    for i in range(n_items):
        word, word_id = dict_items[i]
        print('word="{}" has id={}'.format(word, word_id))
    print('...')

dictionary = build_dictionary(text_small_lemmatized)
print_dictionary(dictionary)
```
word="cat" has word="so" has word="cute" has word=":)" has word="mouse" has ...
Converting into fixed-size vectors
Most classic ML algorithms compute and make predictions on a fixed number of features. This means the learning set $X = \{x_i\}$ contains vectors $x_i = (a_0, a_1, \ldots, a_n)$ where $n$ is a constant.
Since a text object is not a fixed-length vector, we need to preprocess the original data set. One of the simplest text-to-vector encoding techniques is **Bag of Words (BoW)**.
The bag-of-words algorithm
The algorithm takes a dictionary and a text as input and converts the text into the vector $x = (a_0, a_1, \ldots, a_{k-1})$, where $k$ is the dictionary size and $a_i \in \{0, 1\}$ indicates whether the $i$-th dictionary word occurs in the text.
```python
def bag_of_words(texts, dictionary):
    encoded_vectors = []
    dictionary_size = len(dictionary)
    for text in texts:
        vector = [0] * dictionary_size
        for token in text:
            if token in dictionary:
                token_id = dictionary[token]
                vector[token_id] = 1
        encoded_vectors.append(vector)
    return encoded_vectors

def print_bow_features(bag_of_words, dictionary):
    sorted_dict = sorted(dictionary.items(), key=lambda x: x[1])
    keys = [x[0] for x in sorted_dict]
    bow_df = pd.DataFrame(data=bag_of_words, columns=keys)
    print(bow_df)

bow_features = bag_of_words(text_small_lemmatized, dictionary)
print_bow_features(bow_features, dictionary)
```
```
   cat  so  cute  :)  mouse  skare  ...  defeat  :  mice  gather  army  !  :(  \
0    1   1     1   1      0      0    0       0  0     0       0     0  0   0
1    0   0     0   0      1      1    1       0  0     0       0     0  0   0
2    1   0     0   0      1      0    0       1  0     0       0     0  0   0
3    0   0     1   0      0      0    0       0  1     1       1     1  1   0
4    1   0     0   0      0      0    0       1  0     1       0     1  0   1
5    1   0     0   0      0      0    0       0  0     0       0     0  0   0
6    1   0     0   0      0      1    0       0  0     0       0     0  0   1
7    1   0     0   1      1      0    0       0  0     0       0     0  0   0

   offer  peace  live
0      0      0     0
1      0      0     0
2      0      0     0
3      0      0     0
4      0      0     0
5      1      1     0
6      0      0     0
7      0      1     1
```
With such vectors we can fit, for example, a linear or naive Bayes model.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from scipy.sparse import csr_matrix

def fit_linear_model(X, c):
    model = LogisticRegression()
    model.fit(X, c)
    return model

def fit_naive_bayes(X, c):
    clf = MultinomialNB()
    if isinstance(X, csr_matrix):
        X.eliminate_zeros()
    clf.fit(X, c)
    return clf

linear_model = fit_linear_model(bow_features, target_small)
naive_bayes = fit_naive_bayes(bow_features, target_small)
```
```
/home/d-kruchinin/.local/lib/python2.7/site-packages/sklearn/linear_model/logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
```
```python
from sklearn.metrics import log_loss

def evaluate_model_logloss(model, X, c):
    c_pred = model.predict_proba(X)[:, 1]
    metric = log_loss(c, c_pred)
    print('Logloss: ' + str(metric))

print('Linear model')
evaluate_model_logloss(linear_model, bow_features, target_small)
print('Naive bayes')
evaluate_model_logloss(naive_bayes, bow_features, target_small)
print('Comparing to constant prediction')
logloss_constant_prediction = log_loss(
    target_small, np.ones(shape=(len(text_small), 2)) * 0.5
)
print('Logloss: ' + str(logloss_constant_prediction))
```
```
Linear model
Logloss: 0.3314294362422291
Naive bayes
Logloss: 0.1667380176962438
Comparing to constant prediction
Logloss: 0.6931471805599453
```
Looking at sequences of letters / words
Consider the example texts 'The cat defeated the mouse' and 'Army of mice defeated the cat'. After simplification each sentence reduces to three tokens: 'cat defeat mouse' and 'mouse defeat cat'. After applying BoW we get two identical vectors for texts with opposite meanings:
cat | mouse | defeat |
---|---|---|
1 | 1 | 1 |
1 | 1 | 1 |
How can we distinguish them? Let's keep going and add word sequences to our dictionary as single tokens:
cat | mouse | defeat | cat_defeat | mouse_defeat | defeat_cat | defeat_mouse |
---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 0 | 0 | 1 |
1 | 1 | 1 | 0 | 1 | 1 | 0 |
An *N-gram* is a contiguous sequence of $n$ items from a given sample of text or speech. In the example above we used bigrams, i.e. 2-grams of words.
N-grams help to add more information about the text structure to the vector. Besides, some n-grams have no meaning when split into individual words, e.g. 'Mickey Mouse company'.
Example
```python
def build_bigram_dictionary(tokenized_texts):
    dictionary = {}
    for text in tokenized_texts:
        for i in range(len(text) - 1):
            token1, token2 = text[i], text[i + 1]
            bigram = token1 + ' ' + token2
            if bigram not in dictionary:
                dictionary_size = len(dictionary)
                dictionary[bigram] = dictionary_size
    return dictionary

bigram_word_dictionary = build_bigram_dictionary(text_small_lemmatized)
print_dictionary(bigram_word_dictionary)
```
word="cat so" has word="so cute" has word="cute :)" has word="mouse skare" has word="skare ..." has ...
Dictionaries in CatBoost
To specify which types of dictionaries CatBoost should build, pass the `dictionaries` parameter. It describes all dictionaries computed during text preprocessing and is specified as a list of strings, each a dictionary description in the following format: `'DictionaryName:[Param1=Value1,[Param2=Value2]]'`

The full list of parameters:
- `min_token_occurrence` — number; the minimum number of occurrences for a token to enter the dictionary.
- `max_dict_size` — number; the maximum dictionary size.
- `token_level_type` — `Word` or `Letter`; whether the dictionary units are words or letters.
- `gram_order` — number; the order n for building an n-gram dictionary.
Effect of the parameters on the model
- `min_token_occurrence` is useful for filtering out tokens that are too rare, which helps to avoid overfitting.
- `max_dict_size` helps to control the size of the model.
```python
fit_catboost_on_rotten_tomatoes(
    X_preprocessed_train,
    X_preprocessed_test,
    y_train,
    y_test,
    catboost_params={
        'dictionaries': [
            'Word:min_token_occurrence=5',
            'BiGram:gram_order=2'
        ],
        'text_processing': [
            'NaiveBayes+Word|BoW+Word,BiGram'
        ]
    }
)
```
```
0:   learn: 0.3855466  test: 0.3940580  best: 0.3940580 (0)   total: 107ms  remaining: 1m 46s
100: learn: 0.4497432  test: 0.4521335  best: 0.4529894 (97)  total: 4.49s  remaining: 39.9s
200: learn: 0.4622463  test: 0.4624037  best: 0.4637486 (189) total: 8.5s   remaining: 33.8s
300: learn: 0.4705307  test: 0.4636264  best: 0.4639932 (299) total: 12.5s  remaining: 29.1s
400: learn: 0.4780509  test: 0.4653381  best: 0.4671720 (339) total: 16.6s  remaining: 24.8s
500: learn: 0.4839203  test: 0.4666830  best: 0.4680279 (466) total: 20.7s  remaining: 20.6s
600: learn: 0.4906151  test: 0.4702286  best: 0.4707177 (592) total: 24.9s  remaining: 16.5s
700: learn: 0.4963928  test: 0.4703509  best: 0.4714513 (651) total: 29.2s  remaining: 12.5s
800: learn: 0.5022316  test: 0.4724294  best: 0.4729184 (795) total: 33.6s  remaining: 8.34s
900: learn: 0.5072450  test: 0.4740188  best: 0.4746302 (882) total: 38s    remaining: 4.17s
999: learn: 0.5120751  test: 0.4749969  best: 0.4758528 (946) total: 42.2s  remaining: 0us

bestTest = 0.4758527937
bestIteration = 946

Shrink model to first 947 iterations.
```
Feature calculation in CatBoost
Since the texts are converted into sequences of token indices, CatBoost can compute various numeric features from them:

- Bag of words: 0/1 features (whether the text sample contains a given token_id); the number of produced numeric features equals the dictionary size.
- NaiveBayes: a multinomial naive Bayes model; the number of produced features equals the number of classes.
- BM25: a function used by search engines for ranking purposes to estimate document relevance; it is likewise computed online. (A rough sketch of the classic formula follows this list.)
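For intuition, here is a minimal sketch of the classic Okapi BM25 score (the k1=1.5, b=0.75 defaults are the conventional choice and an assumption here; CatBoost's internal variant may differ in details):

```python
import math

def bm25_score(query_tokens, doc_tokens, doc_freqs, n_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Relevance of one document to a query under classic Okapi BM25."""
    score = 0.0
    for token in query_tokens:
        tf = doc_tokens.count(token)   # term frequency in this document
        df = doc_freqs.get(token, 0)   # number of documents containing the token
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        denom = tf + k1 * (1 - b + b * len(doc_tokens) / avg_doc_len)
        score += idf * tf * (k1 + 1) / denom
    return score
```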
The features to compute are specified in the `text_processing` parameter.
The text_processing parameter

The `text_processing` parameter specifies how text features are preprocessed. It is given as a list of strings, each describing the preprocessing of one feature in the following format:

`'FeatureId~[FeatureEstimator1+DictionaryName1[|FeatureEstimator2+DictionaryName2]]'`
Example: `'0~BoW+Word|NaiveBayes+Word,Bigram'` — for the 0th text feature, BoW and NaiveBayes features are computed using the `Word` and the `Word,Bigram` dictionaries respectively. You can also specify `default~...` (or an empty FeatureId); in that case all text features are preprocessed with the same procedure given in the parameter. Dictionary names are taken from the `dictionaries` parameter.
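For instance, a configuration applying the same processing to every text feature could look like this (a hypothetical illustration of the `default~` syntax described above, not taken from the original notebook):

```python
# Hypothetical: apply BoW over the Word dictionary to all text features.
catboost_params = {
    'dictionaries': ['Word:min_token_occurrence=5'],
    'text_processing': ['default~BoW+Word'],
}
```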
You can also pass parameters to the estimators; e.g. for bag-of-words you can set `top_tokens_count`, the maximum number of tokens used for vectorization in BoW (the most frequent tokens are taken). The `top_tokens_count` parameter strongly affects both CPU and GPU RAM usage of the BoW estimator.
```python
fit_catboost_on_rotten_tomatoes(
    X_preprocessed_train,
    X_preprocessed_test,
    y_train,
    y_test,
    catboost_params={
        'dictionaries': [
            'Word:min_token_occurrence=5',
            'BiGram:gram_order=2'
        ],
        'text_processing': [
            'NaiveBayes+Word|BoW:top_tokens_count=1000+Word,BiGram|BM25+Word'
        ]
    }
)
```
```
0:   learn: 0.3985388  test: 0.4054285  best: 0.4054285 (0)   total: 94.5ms remaining: 1m 34s
100: learn: 0.4494987  test: 0.4534784  best: 0.4534784 (100) total: 4.22s  remaining: 37.5s
200: learn: 0.4620934  test: 0.4587358  best: 0.4593471 (196) total: 8.16s  remaining: 32.5s
300: learn: 0.4711727  test: 0.4635041  best: 0.4639932 (265) total: 12.1s  remaining: 28.1s
400: learn: 0.4797628  test: 0.4675388  best: 0.4681501 (399) total: 16.1s  remaining: 24s
500: learn: 0.4874052  test: 0.4691283  best: 0.4696173 (496) total: 20.1s  remaining: 20s
600: learn: 0.4942834  test: 0.4710845  best: 0.4719403 (596) total: 24s    remaining: 15.9s
700: learn: 0.5011617  test: 0.4709622  best: 0.4719403 (596) total: 27.8s  remaining: 11.9s
800: learn: 0.5069393  test: 0.4708400  best: 0.4719403 (596) total: 31.8s  remaining: 7.89s
900: learn: 0.5133284  test: 0.4716958  best: 0.4723071 (836) total: 35.7s  remaining: 3.92s
999: learn: 0.5190144  test: 0.4721849  best: 0.4732852 (950) total: 39.6s  remaining: 0us

bestTest = 0.4732852427
bestIteration = 950

Shrink model to first 951 iterations.
```
Summary: text features in CatBoost
The algorithm:

1. The input text is loaded as a usual column: `text_column: [string]`.
2. Each text sample is tokenized by splitting on spaces: `tokenized_column: [[string]]`.
3. Dictionary estimation.
4. Each string in the tokenized column is converted into token ids from the dictionary: `text: [[token_id]]`.
5. Depending on the parameters, CatBoost produces features on top of the resulting text column: bag of words, multinomial naive Bayes, or BM25.
6. The computed float features are passed into the usual CatBoost learning algorithm.
method description | Accuracy |
---|---|
Without text features | 0.4562 |
With unpreprocessed text features | 0.4707 |
After punctuation handling and lemmatization (only review column) | 0.4719 |
After adding bigrams | 0.4759 |
A simplified comparison with classic approaches
Classic approaches: naive Bayes and logistic regression. We take only one text column to compare text classification quality.
```python
X_train_one_column = pd.DataFrame(X_preprocessed_train['review'])
X_test_one_column = pd.DataFrame(X_preprocessed_test['review'])

def fit_catboost_one_column(X_train, X_test, y_train, y_test,
                            catboost_params={}, verbose=0):
    learn_pool = Pool(X_train, y_train, text_features=[0])
    test_pool = Pool(X_test, y_test, text_features=[0])

    catboost_default_params = {
        'iterations': 1000,
        'learning_rate': 0.03,
        'eval_metric': 'Accuracy',
        'task_type': 'GPU'
    }
    catboost_default_params.update(catboost_params)

    model = CatBoostClassifier(**catboost_default_params)
    model.fit(learn_pool, eval_set=test_pool, verbose=verbose)
    return model

from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer

def vectorize(X, params):
    vectorizer = CountVectorizer(**params)
    vectorizer.fit(X)
    return vectorizer.transform(X), vectorizer

def eval_accuracy(model, X, c):
    c_pred = model.predict(X)
    return accuracy_score(c_pred, c)

def fit_and_compute_accuracy(X_train, X_test, y_train, y_test,
                             vectorizer_params={}, catboost_params={}):
    X_train_bow, vectorizer = vectorize(X_train.iloc[:, 0], vectorizer_params)
    X_test_bow = vectorizer.transform(X_test.iloc[:, 0])

    print('fitting linear model')
    linear_model = fit_linear_model(X_train_bow, y_train)
    print('fitting naive bayes model')
    naive_bayes = fit_naive_bayes(X_train_bow, y_train)
    print('fitting catboost model')
    cb_model = fit_catboost_one_column(X_train, X_test, y_train, y_test, catboost_params)

    linear_accuracy = eval_accuracy(linear_model, X_test_bow, y_test)
    naive_bayes_accuracy = eval_accuracy(naive_bayes, X_test_bow, y_test)
    cb_accuracy = eval_accuracy(cb_model, X_test, y_test)

    results = pd.DataFrame(
        data=[linear_accuracy, naive_bayes_accuracy, cb_accuracy],
        index=['Linear model', 'Naive bayes', 'CatBoost'],
        columns=['Accuracy']
    )
    print(results)
```
Experiment without bigrams
```python
fit_and_compute_accuracy(
    X_train_one_column,
    X_test_one_column,
    y_train,
    y_test,
    catboost_params={
        'dictionaries': ['Word:token_level_type=Word,min_token_occurrence=5'],
        'text_processing': ['NaiveBayes+Word|BoW+Word']
    }
)
```
```
fitting linear model
/home/d-kruchinin/.local/lib/python2.7/site-packages/sklearn/linear_model/logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
fitting naive bayes model
fitting catboost model
              Accuracy
Linear model  0.292945
Naive bayes   0.301871
CatBoost      0.325223
```
Experiment with bigrams
```python
fit_and_compute_accuracy(
    X_train_one_column,
    X_test_one_column,
    y_train,
    y_test,
    vectorizer_params={'ngram_range': (1, 2)},
    catboost_params={
        'dictionaries': [
            'Word:token_level_type=Word,min_token_occurrence=5',
            'BiGram:gram_order=2,min_token_occurrence=4'
        ],
        'text_processing': ['NaiveBayes+Word,BiGram|BoW+Word,BiGram']
    }
)
```
```
fitting linear model
fitting naive bayes model
fitting catboost model
              Accuracy
Linear model  0.302604
Naive bayes   0.295146
CatBoost      0.329747
```
method description | Linear model | Naive bayes | CatBoost |
---|---|---|---|
Without bigrams | 0.2929 | 0.3019 | 0.3252 |
With bigrams | 0.3026 | 0.2951 | 0.3294 |