
How Does the CatBoost Machine Learning Model Automatically Process Text Features?


 

Many tasks involve analyzing text, for example spam versus non-spam classification, sentiment analysis, and dialogue systems.

 

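Sometimes the original task can be reduced to a simple ML classification task on a given set of observations $\{(x_i, c_i)\}_{i=1}^{n}$, where $x_i$ is the feature vector and $c_i$ is the class of the $i$-th object. In such tasks a feature vector $x_i$ may contain not only numerical or categorical values but also source text (for example, tweets or questions).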

 

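Text is a string, that is, a sequence of symbols $s = (a_1, a_2, \dots, a_m)$ with each $a_i \in A$. The set $A$ is called the alphabet. It can be, for example, the set of English letters or Unicode symbols, or it can have a more complex structure made of sequences of symbols, also called tokens, such as the set of English words or emoji.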

 

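Note

Text features cannot contain NaN values; convert them to strings manually.

Training with text features is supported only on GPU (at least in the CatBoost version this walkthrough was written against).

Training is possible only with classification losses and targets.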
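The first point of the note can be handled with pandas before the data reaches CatBoost. The snippet below is a minimal sketch on a made-up one-column frame (the column name and values are illustrative only); it replaces missing values with empty strings so that every cell is a string.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one text column with a missing value.
df = pd.DataFrame({"review": ["Great movie", np.nan, "Too long"]})

# Replace missing values with an empty string and force string values,
# so that no NaN is left in the text column.
df["review"] = df["review"].fillna("").astype(str)

print(df["review"].tolist())
```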

 

 

Text preprocessing

 

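We usually receive text as a sequence of Unicode symbols. Unless the task is something like DNA classification, we do not need that granularity; instead we need to extract more complex entities, such as words. The process of extracting tokens (words, numbers, punctuation marks, or special symbols) from the sequence is called tokenization.

Tokenization is the first part of text preprocessing in CatBoost. It is performed as a simple split of the sequence on a string pattern, such as a space.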
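To make "a simple split on a space" concrete, here is a plain-Python sketch of that idea. The helper name split_on_space is made up for illustration, and this is not CatBoost's Tokenizer, only the same splitting rule written by hand.

```python
def split_on_space(text):
    # str.split() with no arguments splits on any run of whitespace
    # and never returns empty strings, so repeated spaces are harmless.
    return text.split()

print(split_on_space("Cats are so cute :)"))
print(split_on_space("Cute: Mice gather an army!"))
```

Note that punctuation stays attached to the neighbouring word ('Cute:', 'army!'), which is exactly the problem discussed in the next section.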

 

Example

 

text_small = [
    "Cats are so cute :)",
    "Mouse skare...",
    "The cat defeated the mouse",
    "Cute: Mice gather an army!",
    "Army of mice defeated the cat :(",
    "Cat offers peace",
    "Cat is skared :(",
    "Cat and mouse live in peace :)"
]
target_small = [1, 0, 1, 1, 0, 1, 0, 1]
from catboost.text_processing import Tokenizer
simple_tokenizer = Tokenizer()
def tokenize_texts(texts):
    return [simple_tokenizer.tokenize(text) for text in texts]
tokenized_text = tokenize_texts(text_small)
tokenized_text

 

[['Cats', 'are', 'so', 'cute', ':)'],
 ['Mouse', 'skare...'],
 ['The', 'cat', 'defeated', 'the', 'mouse'],
 ['Cute:', 'Mice', 'gather', 'an', 'army!'],
 ['Army', 'of', 'mice', 'defeated', 'the', 'cat', ':('],
 ['Cat', 'offers', 'peace'],
 ['Cat', 'is', 'skared', ':('],
 ['Cat', 'and', 'mouse', 'live', 'in', 'peace', ':)']]

 

Note that although Tokenizer can be instantiated with languages='chinese' to handle Chinese text, it does not handle Chinese particularly well, so consider other methods for Chinese text.

 

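Punctuation handling, lowercasing, and lemmatization

Take a closer look at the tokenization result for the small text example. The tokens contain several mistakes:

Some tokens are glued to punctuation: 'Cute:', 'army!', 'skare...'.

The words 'Cat' and 'cat', and 'Mice' and 'mice', seem to have the same meaning, so perhaps they should be the same token.

The tokens 'are' and 'is' have the same problem: they are inflected forms of the same token 'be'.

Punctuation handling, lowercasing, and lemmatization help to overcome these problems.

Punctuation handling

Depending on the task, punctuation handling can:

Remove all punctuation completely.

Separate punctuation from words with spaces.

Leave punctuation as is (for example, for more complex token sets).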
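The three options can be sketched with the standard library alone. The function names below are made up for illustration, and the regular expression only covers ASCII punctuation, which is an assumption that happens to fit the small example.

```python
import re
import string

# Character class that matches any single ASCII punctuation symbol.
PUNCT = "[" + re.escape(string.punctuation) + "]"

def remove_punctuation(text):
    # Option 1: drop every punctuation symbol.
    return re.sub(PUNCT, "", text)

def space_punctuation(text):
    # Option 2: surround each run of punctuation with spaces, so that a
    # later whitespace split turns it into a separate token.
    spaced = re.sub("(" + PUNCT + "+)", r" \1 ", text)
    return " ".join(spaced.split())

def keep_punctuation(text):
    # Option 3: leave the text untouched.
    return text

sample = "Cute: Mice gather an army!"
print(remove_punctuation(sample))
print(space_punctuation(sample))
print(keep_punctuation(sample))
```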

 

 

Example

 

tokenizer = Tokenizer(
    lowercasing=True,
    separator_type='BySense',
    token_types=['Word', 'Number', 'Punctuation']
)
text_small_spaced = [' '.join(tokenizer.tokenize(text)) for text in text_small]
text_small_spaced

 

['cats are so cute :)',
 'mouse skare ...',
 'the cat defeated the mouse',
 'cute : mice gather an army !',
 'army of mice defeated the cat :(',
 'cat offers peace',
 'cat is skared :(',
 'cat and mouse live in peace :)']

 

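Removing stop words

Stop words are words considered uninformative for the task at hand, for example function words such as the, is, at, which, on.

Stop words are usually removed during text preprocessing to reduce the amount of information the algorithm has to consider. They are collected either manually (as a dictionary) or automatically, for example by taking the most frequent words.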
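The automatic route, taking the most frequent words, can be sketched with collections.Counter. The helper name and the two-sentence corpus are illustrative assumptions, not part of the original walkthrough.

```python
from collections import Counter

def most_frequent_tokens(tokenized_texts, top_k):
    # Count every token over the whole corpus and keep the top_k most common.
    counts = Counter(token for text in tokenized_texts for token in text)
    return [token for token, _ in counts.most_common(top_k)]

corpus = [
    ["the", "cat", "defeated", "the", "mouse"],
    ["army", "of", "mice", "defeated", "the", "cat"],
]
print(most_frequent_tokens(corpus, 1))
```

On a corpus this small the next most frequent tokens are content words such as 'cat', so an automatically collected list should be reviewed before it is used.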

 

stop_words = ['be', 'is', 'are', 'the', 'an', 'of', 'and', 'in']
def remove_words(texts, words):
    texts_copy = []
    words_set = set(words)
    for text in tokenize_texts(texts):
        text_copy = []
        for token in text:
            if token not in words_set:
                text_copy.append(token)
        texts_copy.append(' '.join(text_copy))
            
    return texts_copy
    
text_small_no_stop = remove_words(text_small_spaced, stop_words)
text_small_no_stop

 

['cats so cute :)',
 'mouse skare ...',
 'cat defeated mouse',
 'cute : mice gather army !',
 'army mice defeated cat :(',
 'cat offers peace',
 'cat skared :(',
 'cat mouse live peace :)']

 

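Lemmatization

A lemma is the canonical form, dictionary form, or citation form of a set of words. For example, the lemma "go" represents the inflected forms "go", "goes", "going", "went", and "gone". The process of converting a word to its lemma is called lemmatization.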

 

from pattern.en import lemma
def lemmatize_text(text):
    # On Python 3 the text is already a str, so no .decode('utf-8') call is needed.
    return " ".join([lemma(word) for word in text.split()])
def lemmatize_texts(texts):
    return [lemmatize_text(text) for text in texts]
text_small_lemmatized = lemmatize_texts(text_small_no_stop)
text_small_lemmatized = tokenize_texts(text_small_lemmatized)
text_small_lemmatized

 

[['cat', 'so', 'cute', ':)'],
 ['mouse', 'skare', '...'],
 ['cat', 'defeat', 'mouse'],
 ['cute', ':', 'mice', 'gather', 'army', '!'],
 ['army', 'mice', 'defeat', 'cat', ':('],
 ['cat', 'offer', 'peace'],
 ['cat', 'skare', ':('],
 ['cat', 'mouse', 'live', 'peace', ':)']]

 

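Now words with the same meaning are represented by the same token, and tokens are no longer glued to punctuation.

Verify this for your own task: is it really necessary to remove punctuation, lowercase sentences, or perform lemmatization and/or word-level tokenization?

In the result above the tokens 'mice' and 'mouse' are still not merged. The code below handles them with the gensim lemmatizer (gensim.utils.lemmatize, which is available only in gensim releases before 4.0).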

 

from gensim.utils import lemmatize
def lemmatize_text_gensim(text):
    result = []
    for token in simple_tokenizer.tokenize(text):
        lemmas = lemmatize(token)
        if len(lemmas) == 0:
            lemma = token.lower()
        else:
            lemma = lemmas[0].decode('utf-8').split('/')[0]
            
        result.append(lemma)
    return ' '.join(result)
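If neither of the two lemmatizers above is installed, a small lookup table is enough to see the effect on this toy corpus. The table below is hand-written for illustration only; it is not taken from any library and would not generalize beyond these sentences.

```python
# Hand-written lemma table for the toy corpus (hypothetical, for illustration).
IRREGULAR_LEMMAS = {
    "mice": "mouse", "cats": "cat", "is": "be", "are": "be",
    "defeated": "defeat", "offers": "offer", "skared": "skare",
}

def lemmatize_with_table(tokens, table=IRREGULAR_LEMMAS):
    # Lowercase each token, then map it to its lemma if the table knows it.
    return [table.get(token.lower(), token.lower()) for token in tokens]

print(lemmatize_with_table(["Army", "of", "mice", "defeated", "the", "cat"]))
```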

 

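Checking accuracy

Let's check the accuracy with the new text preprocessing. CatBoost does not space out punctuation, lowercase letters, or lemmatize by itself during training, so we preprocess the text manually and then pass it to the learning algorithm.

Data preparation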

 

import pandas as pd
import numpy as np
from catboost import Pool, CatBoostClassifier
from catboost.datasets import rotten_tomatoes
learn, _ = rotten_tomatoes()

 

auxiliary_columns = ['id', 'theater_date', 'dvd_date', 'rating', 'date']
cat_features = ['rating_MPAA', 'studio', 'fresh', 'critic', 'top_critic', 'publisher']
text_features = ['synopsis', 'genre', 'director', 'writer', 'review']
def get_processed_rotten_tomatoes():
    learn, test = rotten_tomatoes()
    
    def fill_na(df, features):
        for feature in features:
            # Assign the result back instead of calling fillna(inplace=True) on a column selection.
            df[feature] = df[feature].fillna('')
    def preprocess_data_part(data_part):
        data_part = data_part.drop(auxiliary_columns, axis=1)
        
        fill_na(data_part, cat_features)
        fill_na(data_part, text_features)
        X = data_part.drop(['rating_10'], axis=1)
        y = data_part['rating_10']
        return X, y
    
    X_learn, y_learn = preprocess_data_part(learn)
    X_test, y_test = preprocess_data_part(test)
    return X_learn, X_test, y_learn, y_test
X_train, X_test, y_train, y_test = get_processed_rotten_tomatoes()

 

Preprocessing

Since the only natural-text features are 'synopsis' and 'review', we preprocess only those two.

 

def preprocess_data(X):
    X_preprocessed = X.copy()
    for feature in ['synopsis', 'review']:
        X_preprocessed[feature] = X[feature].apply(lambda x: lemmatize_text(' '.join(tokenizer.tokenize(x))))
    return X_preprocessed
X_preprocessed_train = preprocess_data(X_train)
X_preprocessed_test = preprocess_data(X_test)
X_preprocessed_train['synopsis'].head(10)
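The helper fit_catboost_on_rotten_tomatoes is called below but is not defined anywhere in this article. The following is one plausible sketch of it, not the author's original code: it reuses the default parameters of fit_catboost_one_column from the comparison section and the column lists from the data preparation step, and build_catboost_params is a made-up helper name.

```python
# Column lists copied from the data preparation step of this article.
CAT_FEATURES = ['rating_MPAA', 'studio', 'fresh', 'critic', 'top_critic', 'publisher']
TEXT_FEATURES = ['synopsis', 'genre', 'director', 'writer', 'review']

def build_catboost_params(overrides=None):
    # Assumed defaults: the same values used for the single-column experiment.
    params = {
        'iterations': 1000,
        'learning_rate': 0.03,
        'eval_metric': 'Accuracy',
        'task_type': 'GPU',
    }
    params.update(overrides or {})
    return params

def fit_catboost_on_rotten_tomatoes(X_train, X_test, y_train, y_test,
                                    catboost_params=None, verbose=100):
    # Imported here so the parameter helper works without CatBoost installed.
    from catboost import Pool, CatBoostClassifier

    learn_pool = Pool(X_train, y_train,
                      cat_features=CAT_FEATURES, text_features=TEXT_FEATURES)
    test_pool = Pool(X_test, y_test,
                     cat_features=CAT_FEATURES, text_features=TEXT_FEATURES)

    model = CatBoostClassifier(**build_catboost_params(catboost_params))
    model.fit(learn_pool, eval_set=test_pool, verbose=verbose)
    return model
```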

 

fit_catboost_on_rotten_tomatoes(X_preprocessed_train, X_preprocessed_test, y_train, y_test)

 

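Dictionary

With text preprocessing and tokenization done, we now use the prepared text to select a set of units that will be used to build new numerical features.

The selected set of units is called a dictionary. It may contain words, word n-grams, or characters.

Example

Build a dictionary for the small text example: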

 

def build_dictionary(tokenized_texts):
    dictionary = {}
    for text in tokenized_texts:
        for token in text:
            if token not in dictionary:
                size = len(dictionary)
                dictionary[token] = size
    return dictionary
def print_dictionary(dictionary, n_items=5):
    dict_items = sorted(dictionary.items(), key=lambda x: x[1])
    for i in range(n_items):
        word, word_id = dict_items[i]
        print('word="{}" has'.format(word, word_id))
    
    print('...')
dictionary = build_dictionary(text_small_lemmatized)
print_dictionary(dictionary)

 

word="cat" has
word="so" has
word="cute" has
word=":)" has
word="mouse" has
...

 

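Converting to fixed-size vectors

Most classic ML algorithms compute and predict on a fixed number of features $F$. This means that the learning set $X = \{x_i\}$ contains vectors $x_i = (a_1, a_2, \dots, a_F)$, where $F$ is a constant.

Since a text object is not a fixed-length vector, we need to preprocess the original set. One of the simplest text-to-vector encoding techniques is Bag of Words (BoW).

Bag of words algorithm

The algorithm takes a dictionary and a text. It converts the text $x = (a_1, a_2, \dots, a_k)$ into a vector $\tilde{x} = (b_0, b_1, \dots, b_{F-1})$, where $F$ is the dictionary size and $b_i$ is 1 if the dictionary word with id $i$ occurs in the text and 0 otherwise.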

 

def bag_of_words(texts, dictionary):
    encoded_vectors = []
    dictionary_size = len(dictionary)
    for text in texts:
        vector = [0] * dictionary_size
        for token in text:
            if token in dictionary:
                token_id = dictionary[token]
                vector[token_id] = 1
    
        encoded_vectors.append(vector)
    
    return encoded_vectors
def print_bow_features(bag_of_words, dictionary):
    sorted_dict = sorted(dictionary.items(), key=lambda x: x[1])
    keys = [x[0] for x in sorted_dict]
    bow_df = pd.DataFrame(data=bag_of_words, columns=keys)
    print(bow_df)
    
bow_features = bag_of_words(text_small_lemmatized, dictionary)
print_bow_features(bow_features, dictionary)

 


 

   cat  so  cute  :)  mouse  skare  ...  defeat  :  mice  gather  army  !  :(  \
0    1   1     1   1      0      0    0       0  0     0       0     0  0   0   
1    0   0     0   0      1      1    1       0  0     0       0     0  0   0   
2    1   0     0   0      1      0    0       1  0     0       0     0  0   0   
3    0   0     1   0      0      0    0       0  1     1       1     1  1   0   
4    1   0     0   0      0      0    0       1  0     1       0     1  0   1   
5    1   0     0   0      0      0    0       0  0     0       0     0  0   0   
6    1   0     0   0      0      1    0       0  0     0       0     0  0   1   
7    1   0     0   1      1      0    0       0  0     0       0     0  0   0   
   offer  peace  live  
0      0      0     0  
1      0      0     0  
2      0      0     0  
3      0      0     0  
4      0      0     0  
5      1      1     0  
6      0      0     0  
7      0      1     1

 

With such vectors we can fit, for example, a linear model or a naive Bayes model.

 

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from scipy.sparse import csr_matrix
def fit_linear_model(X, c):
    model = LogisticRegression()
    model.fit(X, c)
    return model
def fit_naive_bayes(X, c):
    clf = MultinomialNB()
    if isinstance(X, csr_matrix):
        X.eliminate_zeros()
    clf.fit(X, c)
    return clf
linear_model = fit_linear_model(bow_features, target_small)
naive_bayes = fit_naive_bayes(bow_features, target_small)

 


 

from sklearn.metrics import log_loss
def evaluate_model_logloss(model, X, c):
    c_pred = model.predict_proba(X)[:,1]
    metric = log_loss(c, c_pred)
    print('Logloss: ' + str(metric))
print('Linear model')
evaluate_model_logloss(linear_model, bow_features, target_small)
print('Naive bayes')
evaluate_model_logloss(naive_bayes, bow_features, target_small)
print('Comparing to constant prediction')
logloss_constant_prediction = log_loss(target_small, np.ones(shape=(len(text_small), 2)) * 0.5)
print('Logloss: ' + str(logloss_constant_prediction))

 

Linear model
Logloss: 0.3314294362422291
Naive bayes
Logloss: 0.1667380176962438
Comparing to constant prediction
Logloss: 0.6931471805599453
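The constant-prediction baseline can be checked by hand: predicting 0.5 for every object gives a log loss of $-\ln 0.5 = \ln 2 \approx 0.6931$, whatever the labels are. The short sketch below recomputes it with the standard library; binary_log_loss is a made-up helper, not the scikit-learn function.

```python
import math

def binary_log_loss(y_true, p_pred):
    # Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

target_small = [1, 0, 1, 1, 0, 1, 0, 1]
constant = binary_log_loss(target_small, [0.5] * len(target_small))
print(constant, math.log(2))
```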

 

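Looking at sequences of letters or words

Take the example texts 'The cat defeated the mouse' and 'Army of mice defeated the cat', and simplify each to three tokens: 'cat defeat mouse' and 'mouse defeat cat'. After applying BoW we get two identical vectors for sentences with opposite meanings: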

 

cat  mouse  defeat
  1      1       1
  1      1       1

 

So how can we tell them apart? Let's add sequences of words to the dictionary as single tokens:

 

 

cat  mouse  defeat  cat_defeat  mouse_defeat  defeat_cat  defeat_mouse
  1      1       1           1             0           0             1
  1      1       1           0             1           1             0
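The two rows above can be reproduced with a few lines of plain Python. This is a minimal sketch with made-up helper names; the vocabulary is typed in by hand in the same column order as the table.

```python
def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined with '_' as in the table above.
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def encode(tokens, vocabulary):
    # 0/1 vector: 1 if the unigram or bigram occurs in the text.
    present = set(tokens) | set(ngrams(tokens, 2))
    return [1 if item in present else 0 for item in vocabulary]

vocabulary = ['cat', 'mouse', 'defeat',
              'cat_defeat', 'mouse_defeat', 'defeat_cat', 'defeat_mouse']
first = encode(['cat', 'defeat', 'mouse'], vocabulary)
second = encode(['mouse', 'defeat', 'cat'], vocabulary)
print(first)
print(second)
```

The unigram part of both vectors is identical; only the bigram columns separate the two sentences.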

 

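An n-gram is a contiguous sequence of $n$ items from a given sample of text or speech. In the example above, a bi-gram (bigram) is a 2-gram of words.

N-grams add more information about text structure to the vectors. In addition, some n-grams lose their meaning when their words are separated, for example 'Mickey Mouse company'.

Example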

 

def build_bigram_dictionary(tokenized_texts):
    dictionary = {}
    for text in tokenized_texts:
        for i in range(len(text) - 1):
            token1, token2 = text[i], text[i + 1]
            bigram = token1 + ' ' + token2
            
            if bigram not in dictionary:
                dictionary_size = len(dictionary)
                dictionary[bigram] = dictionary_size
    return dictionary
bigram_word_dictionary = build_bigram_dictionary(text_small_lemmatized)
print_dictionary(bigram_word_dictionary)

 

word="cat so" has
word="so cute" has
word="cute :)" has
word="mouse skare" has
word="skare ..." has
...

 

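Dictionaries in CatBoost

To specify which kinds of dictionaries CatBoost creates, pass the dictionaries parameter. It lists all dictionaries computed during text preprocessing.

The dictionaries parameter is a list of strings. Each string describes one dictionary in the following format: 'DictionaryName:[Param1=Value1,[Param2=Value2]]'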
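Writing these description strings by hand is error-prone, so a tiny formatter can help. dictionary_spec below is a hypothetical convenience function, not part of CatBoost; it only assembles a string in the format shown above and assumes at least one parameter is given.

```python
def dictionary_spec(name, **params):
    # Build 'Name:key1=value1,key2=value2' from keyword arguments.
    if not params:
        raise ValueError('this sketch expects at least one parameter')
    joined = ','.join('{}={}'.format(key, value) for key, value in params.items())
    return '{}:{}'.format(name, joined)

print(dictionary_spec('Word', min_token_occurrence=5))
print(dictionary_spec('BiGram', gram_order=2, min_token_occurrence=4))
```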

 

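The parameters are:

min_token_occurrence: number; the minimum number of times a token must occur to enter the dictionary.

max_dict_size: number; the maximum dictionary size.

token_level_type: string; the token level, either Word or Letter.

gram_order: number; the n-gram order used to build the dictionary.

What the parameters are for

min_token_occurrence: useful for filtering out tokens that are too rare, which helps avoid overfitting.

max_dict_size: helps control the size of the model.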
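The effect of these two parameters can be imitated in plain Python. The function below is a hand-written illustration of the idea (frequency threshold, then a size cap on the most frequent tokens); it is an assumption about the general behaviour, not CatBoost's actual dictionary builder.

```python
from collections import Counter

def build_filtered_dictionary(tokenized_texts, min_token_occurrence=1, max_dict_size=None):
    counts = Counter(token for text in tokenized_texts for token in text)
    # Drop tokens that occur fewer than min_token_occurrence times.
    frequent = [(t, c) for t, c in counts.items() if c >= min_token_occurrence]
    # Most frequent first; ties broken alphabetically to keep the result deterministic.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    if max_dict_size is not None:
        frequent = frequent[:max_dict_size]
    return {token: token_id for token_id, (token, _) in enumerate(frequent)}

texts = [['cat', 'defeat', 'mouse'],
         ['cat', 'offer', 'peace'],
         ['cat', 'mouse', 'live', 'peace']]
print(build_filtered_dictionary(texts, min_token_occurrence=2))
print(build_filtered_dictionary(texts, min_token_occurrence=2, max_dict_size=2))
```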

 

fit_catboost_on_rotten_tomatoes(
    X_preprocessed_train,
    X_preprocessed_test, 
    y_train, 
    y_test,
    catboost_params={
        'dictionaries': [
            'Word:min_token_occurrence=5',
            'BiGram:gram_order=2'
        ],
        'text_processing': [
            'NaiveBayes+Word|BoW+Word,BiGram'
        ]
    }
)

 


 

0: learn: 0.3855466 test: 0.3940580 best: 0.3940580 (0) total: 107ms remaining: 1m 46s
100: learn: 0.4497432 test: 0.4521335 best: 0.4529894 (97) total: 4.49s remaining: 39.9s
200: learn: 0.4622463 test: 0.4624037 best: 0.4637486 (189) total: 8.5s remaining: 33.8s
300: learn: 0.4705307 test: 0.4636264 best: 0.4639932 (299) total: 12.5s remaining: 29.1s
400: learn: 0.4780509 test: 0.4653381 best: 0.4671720 (339) total: 16.6s remaining: 24.8s
500: learn: 0.4839203 test: 0.4666830 best: 0.4680279 (466) total: 20.7s remaining: 20.6s
600: learn: 0.4906151 test: 0.4702286 best: 0.4707177 (592) total: 24.9s remaining: 16.5s
700: learn: 0.4963928 test: 0.4703509 best: 0.4714513 (651) total: 29.2s remaining: 12.5s
800: learn: 0.5022316 test: 0.4724294 best: 0.4729184 (795) total: 33.6s remaining: 8.34s
900: learn: 0.5072450 test: 0.4740188 best: 0.4746302 (882) total: 38s remaining: 4.17s
999: learn: 0.5120751 test: 0.4749969 best: 0.4758528 (946) total: 42.2s remaining: 0us
bestTest = 0.4758527937
bestIteration = 946
Shrink model to first 947 iterations.

 

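Feature calculation in CatBoost

Since each text is converted to a sequence of token ids, CatBoost can compute several kinds of numerical features from it:

BoW (bag of words): 0/1 features (the text sample does or does not contain token_id). The number of features produced equals the dictionary size.

NaiveBayes: a multinomial naive Bayes model. The number of features produced equals the number of classes.

BM25: a function that search engines use for ranking, to estimate the relevance of documents. It is computed online.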
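For readers who have not met BM25, the sketch below scores one document against a query using the textbook Okapi BM25 formula. The values k1=1.5 and b=0.75 are common defaults chosen here as an assumption, and this is not how CatBoost computes its BM25 features internally; it only shows what the ranking function measures.

```python
import math
from collections import Counter

def bm25_score(query_tokens, document, corpus, k1=1.5, b=0.75):
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    term_counts = Counter(document)
    score = 0.0
    for term in query_tokens:
        # Number of documents that contain the term.
        doc_freq = sum(1 for doc in corpus if term in doc)
        # Smoothed inverse document frequency: rare terms weigh more.
        idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
        tf = term_counts[term]
        # Term frequency saturates with k1 and is normalized by document length via b.
        norm = tf + k1 * (1 - b + b * len(document) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = [['cat', 'defeat', 'mouse'], ['cat', 'offer', 'peace'], ['mouse', 'skare']]
with_term = bm25_score(['peace'], corpus[1], corpus)
without_term = bm25_score(['peace'], corpus[0], corpus)
print(with_term, without_term)
```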

 

 

The features to compute are specified in the "text_processing" parameter.

 

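The text_processing parameter

The text_processing parameter specifies how text features are preprocessed.

It is a list of strings. Each string describes the preprocessing of one feature in the following format:

'FeatureId~[FeatureEstimator1+DictionaryName1[|FeatureEstimator2+DictionaryName2]]'

Example: '0~BoW+Word|NaiveBayes+Word,Bigram'

Here, for text feature 0, the BoW features are computed with the Word dictionary and the NaiveBayes features with the Word and Bigram dictionaries. You can also specify default~... (or leave FeatureId empty), in which case all text features are preprocessed in the same way, as given in the parameter.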
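To see how such a description string breaks down, the following sketch splits it on the separators named in the format ('~', '|', '+', ','). parse_text_processing is a hypothetical helper written for this explanation; CatBoost parses the string itself and does not expose this function.

```python
def parse_text_processing(description):
    # Split off the optional feature id; no '~' or an empty id means "all text features".
    feature_id, _, body = description.rpartition('~')
    parsed = []
    for part in body.split('|'):
        # Each part is 'Estimator+Dictionary1,Dictionary2'.
        estimator, _, dictionaries = part.partition('+')
        parsed.append((estimator, dictionaries.split(',')))
    return feature_id or 'default', parsed

print(parse_text_processing('0~BoW+Word|NaiveBayes+Word,Bigram'))
print(parse_text_processing('NaiveBayes+Word|BoW+Word,BiGram'))
```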

 

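Dictionary names are taken from the dictionaries parameter.

You can also specify parameters for an estimator. For example, for BoW you can set top_tokens_count, the maximum number of tokens used for vectorization in the bag of words; the most frequent tokens are taken. The top_tokens_count parameter strongly affects CPU and GPU RAM usage in the BoW estimator.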

 

fit_catboost_on_rotten_tomatoes(
    X_preprocessed_train,
    X_preprocessed_test, 
    y_train, 
    y_test,
    catboost_params={
        'dictionaries': [
            'Word:min_token_occurrence=5',
            'BiGram:gram_order=2'
        ],
        'text_processing': [
            'NaiveBayes+Word|BoW:top_tokens_count=1000+Word,BiGram|BM25+Word'
        ]
    }
)

 


 

0: learn: 0.3985388 test: 0.4054285 best: 0.4054285 (0) total: 94.5ms remaining: 1m 34s
100: learn: 0.4494987 test: 0.4534784 best: 0.4534784 (100) total: 4.22s remaining: 37.5s
200: learn: 0.4620934 test: 0.4587358 best: 0.4593471 (196) total: 8.16s remaining: 32.5s
300: learn: 0.4711727 test: 0.4635041 best: 0.4639932 (265) total: 12.1s remaining: 28.1s
400: learn: 0.4797628 test: 0.4675388 best: 0.4681501 (399) total: 16.1s remaining: 24s
500: learn: 0.4874052 test: 0.4691283 best: 0.4696173 (496) total: 20.1s remaining: 20s
600: learn: 0.4942834 test: 0.4710845 best: 0.4719403 (596) total: 24s remaining: 15.9s
700: learn: 0.5011617 test: 0.4709622 best: 0.4719403 (596) total: 27.8s remaining: 11.9s
800: learn: 0.5069393 test: 0.4708400 best: 0.4719403 (596) total: 31.8s remaining: 7.89s
900: learn: 0.5133284 test: 0.4716958 best: 0.4723071 (836) total: 35.7s remaining: 3.92s
999: learn: 0.5190144 test: 0.4721849 best: 0.4732852 (950) total: 39.6s remaining: 0us
bestTest = 0.4732852427
bestIteration = 950
Shrink model to first 951 iterations.

 

Summary: text features in CatBoost

Algorithm:

 

 

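    1. The input text is loaded as a usual column.

text_column: [string]

    2. Each text sample is tokenized by splitting on spaces.

tokenized_column: [[string]]

    3. The dictionary is estimated.

    4. Each string in the tokenized column is converted to a token_id from the dictionary.

text: [[token_id]]

    5. Depending on the parameters, CatBoost produces features from the resulting text column: bag of words, multinomial naive Bayes, or BM25.

    6. The computed float features are passed to the usual CatBoost learning algorithm.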
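The steps above can be traced end to end in plain Python. This is a simplified sketch of the data flow only, with a made-up function name and bag of words as the single feature type; CatBoost performs these steps internally and differently in detail.

```python
def text_to_features(texts):
    # Steps 1-2: take the text column and tokenize by splitting on whitespace.
    tokenized = [text.split() for text in texts]
    # Step 3: estimate the dictionary, one id per distinct token in order of first appearance.
    dictionary = {}
    for tokens in tokenized:
        for token in tokens:
            dictionary.setdefault(token, len(dictionary))
    # Step 4: replace every token with its id.
    token_ids = [[dictionary[token] for token in tokens] for tokens in tokenized]
    # Step 5: compute features, here a 0/1 bag of words.
    features = []
    for ids in token_ids:
        row = [0.0] * len(dictionary)
        for token_id in ids:
            row[token_id] = 1.0
        features.append(row)
    # Step 6: these float rows are what a learning algorithm would receive.
    return dictionary, token_ids, features

dictionary, token_ids, features = text_to_features(['cat offer peace', 'cat skare'])
print(dictionary)
print(token_ids)
print(features)
```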

Method description                                                   Accuracy
Without text features                                                0.4562
With unpreprocessed text features                                    0.4707
After punctuation handling and lemmatization (only review column)    0.4719
After adding bigrams                                                 0.4759

 

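A simplified comparison with classic methods

Classic methods: naive Bayes and logistic regression

We take a single text column to compare text classification quality.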

 


 

X_train_one_column = pd.DataFrame(X_preprocessed_train['review'])
X_test_one_column = pd.DataFrame(X_preprocessed_test['review'])
def fit_catboost_one_column(X_train, X_test, y_train, y_test, catboost_params={}, verbose=0):
    learn_pool = Pool(X_train, y_train, text_features=[0])
    test_pool = Pool(X_test, y_test, text_features=[0])
    
    catboost_default_params = {
        'iterations': 1000,
        'learning_rate': 0.03,
        'eval_metric': 'Accuracy',
        'task_type': 'GPU'
    }
    
    catboost_default_params.update(catboost_params)
    
    model = CatBoostClassifier(**catboost_default_params)
    model.fit(learn_pool, eval_set=test_pool, verbose=verbose)
    return model
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
def vectorize(X, params):
    vectorizer = CountVectorizer(**params)
    vectorizer.fit(X)
    return vectorizer.transform(X), vectorizer
def eval_accuracy(model, X, c):
    c_pred = model.predict(X)
    return accuracy_score(c_pred, c)
def fit_and_compute_accuracy(X_train, X_test, y_train, y_test, vectorizer_params={}, catboost_params={}):
    X_train_bow, vectorizer = vectorize(X_train.iloc[:,0], vectorizer_params)
    X_test_bow = vectorizer.transform(X_test.iloc[:,0])
    
    print('fitting linear model')
    linear_model = fit_linear_model(X_train_bow, y_train)
    
    print('fitting naive bayes model')
    naive_bayes = fit_naive_bayes(X_train_bow, y_train)
    
    print('fitting catboost model')
    cb_model = fit_catboost_one_column(X_train, X_test, y_train, y_test, catboost_params)
    linear_accuracy = eval_accuracy(linear_model, X_test_bow, y_test)
    naive_bayes_accuracy = eval_accuracy(naive_bayes, X_test_bow, y_test)
    cb_accuracy = eval_accuracy(cb_model, X_test, y_test)
    results = pd.DataFrame(
        data=[linear_accuracy, naive_bayes_accuracy, cb_accuracy], 
        index=['Linear model', 'Naive bayes', 'CatBoost'],
        columns=['Accuracy']
    )
    print(results)

 

Experiment without bigrams

 

fit_and_compute_accuracy(
    X_train_one_column, 
    X_test_one_column, 
    y_train, 
    y_test,
    catboost_params = {
        'dictionaries': ['Word:token_level_type=Word,min_token_occurrence=5'],
        'text_processing': ['NaiveBayes+Word|BoW+Word']
    }
)

 

fitting linear model
fitting naive bayes model
fitting catboost model
              Accuracy
Linear model  0.292945
Naive bayes   0.301871
CatBoost      0.325223

 

Experiment with bigrams

 

fit_and_compute_accuracy(
    X_train_one_column,
    X_test_one_column, 
    y_train, 
    y_test, 
    vectorizer_params = {'ngram_range': (1, 2)},
    catboost_params = {
        'dictionaries': [
            'Word:token_level_type=Word,min_token_occurrence=5', 
            'BiGram:gram_order=2,min_token_occurrence=4'
        ],
        'text_processing': ['NaiveBayes+Word,BiGram|BoW+Word,BiGram']
    }
)

 

fitting linear model
fitting naive bayes model
fitting catboost model
              Accuracy
Linear model  0.302604
Naive bayes   0.295146
CatBoost      0.329747

Method description   Linear model   Naive bayes   CatBoost
Without bigrams      0.2929         0.3019        0.3252
With bigrams         0.3026         0.2951        0.3294

 
