
Training and Using Word Vectors with Word2Vec, FastText, and GloVe


Training and Using Word2Vec Word Vectors

 

Word2Vec training was already covered in detail in the earlier post Training word2vec on Chinese Wikipedia (使用word2vec训练中文维基百科), so it is not repeated here. The main flow is to compile the tool and then train:

 

git clone https://github.com/tmikolov/word2vec.git
cd word2vec
make
./word2vec -train "../data/output.txt" -output "../data/word2vec.model" -cbow 1 -size 300 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 32 -binary 1 -iter 15
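In the command above, -cbow 1 selects the CBOW architecture (predict the center word from its surrounding context), while -cbow 0 would select skip-gram (predict each context word from the center word). A minimal sketch of how the two architectures slice a token sequence into training examples (toy tokens, window size 1 for brevity):

```python
def cbow_pairs(tokens, window):
    """CBOW: (context words) -> center word."""
    pairs = []
    for i in range(len(tokens)):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window):
    """Skip-gram: center word -> each context word."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = ['the', 'quick', 'brown', 'fox']
print(cbow_pairs(tokens, 1)[1])       # (['the', 'brown'], 'quick')
print(skipgram_pairs(tokens, 1)[:2])  # [('the', 'quick'), ('quick', 'the')]
```

With -window 8 as above, each center word draws its context from up to 8 words on either side.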

 

Using the Word2vec Model

 

import gensim.models.keyedvectors as word2vec
word2vec_model = word2vec.KeyedVectors.load_word2vec_format('data/word2vec.model', binary=True, unicode_errors='ignore')
print(word2vec_model.most_similar('性价比'))

 

The unicode_errors='ignore' argument is mainly there to work around this error:

 

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 96-97: unexpected end of data
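gensim passes unicode_errors straight through to Python's bytes.decode, so 'ignore' simply drops the undecodable bytes instead of raising. A quick illustration with a deliberately truncated UTF-8 sequence:

```python
data = '性价比'.encode('utf-8')[:-1]  # deliberately cut a multi-byte character in half

try:
    data.decode('utf-8')  # strict decoding (the default) raises on the broken tail
except UnicodeDecodeError as exc:
    print(exc)  # ... unexpected end of data

print(data.decode('utf-8', errors='ignore'))  # 性价 -- the broken bytes are dropped
```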

 

Output:

 

[('性比价', 0.8243662118911743), ('性价', 0.7430107593536377), ('性价比特', 0.6778421401977539), ('信价', 0.5147293210029602), ('CP值', 0.5129910707473755), ('价比', 0.5119792819023132), ('92241473', 0.5006518363952637), ('物有所值', 0.4925231635570526), ('档次', 0.4839213192462921), ('性价', 0.4788089692592621)]
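Under the hood, most_similar ranks every vocabulary word by cosine similarity to the query vector. A minimal sketch of that ranking over a toy vocabulary (the 3-d vectors are invented purely for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(vectors, query, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    q = vectors[query]
    scores = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

# toy vectors, invented for illustration
vectors = {
    '性价比': [1.0, 0.9, 0.1],
    '物有所值': [0.9, 1.0, 0.2],
    '档次': [0.2, 0.1, 1.0],
}
print(most_similar(vectors, '性价比'))  # 物有所值 ranks first, 档次 last
```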

 

Training and Using FastText Word Vectors

 

Likewise, the official tool must be compiled and installed before training:

 

# Command-line tool
git clone https://github.com/facebookresearch/fastText.git
cd fastText && make
 
# Python package
git clone https://github.com/facebookresearch/fastText.git
cd fastText
python setup.py install

 

Before using it, let's first look at the example training script word-vector-example.sh:

 

#!/usr/bin/env bash
#
# Copyright (c) 2016-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
 
RESULTDIR=result
DATADIR=data
 
mkdir -p "${RESULTDIR}"
mkdir -p "${DATADIR}"
 
if [ ! -f "${DATADIR}/fil9" ]
then
  wget -c http://mattmahoney.net/dc/enwik9.zip -P "${DATADIR}"
  unzip "${DATADIR}/enwik9.zip" -d "${DATADIR}"
  perl wikifil.pl "${DATADIR}/enwik9" > "${DATADIR}"/fil9
fi
 
if [ ! -f "${DATADIR}/rw/rw.txt" ]
then
  wget -c https://nlp.stanford.edu/~lmthang/morphoNLM/rw.zip -P "${DATADIR}"
  unzip "${DATADIR}/rw.zip" -d "${DATADIR}"
fi
make
 
./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100
 
cut -f 1,2 "${DATADIR}"/rw/rw.txt | awk '{print tolower($0)}' | tr '\t' '\n' > "${DATADIR}"/queries.txt
 
cat "${DATADIR}"/queries.txt | ./fasttext print-word-vectors "${RESULTDIR}"/fil9.bin > "${RESULTDIR}"/vectors.txt
 
python eval.py -m "${RESULTDIR}"/vectors.txt -d "${DATADIR}"/rw/rw.txt

 

The core command here is:

 

./fasttext skipgram -input "${DATADIR}"/fil9 -output "${RESULTDIR}"/fil9 -lr 0.025 -dim 100 \
  -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 \
  -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

 

Meaning of the training parameters:

 

$ ./fasttext supervised
Empty input or output path.
 
The following arguments are mandatory:
  -input              training file path (mandatory)
  -output             output file path (mandatory)
 
  The following arguments are optional:
  -verbose            verbosity level [2]
 
  The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1] (word-representation modes skipgram and cbow use a default -minCount of 5)
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word n-grams [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]
 
  The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]
 
  The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

 

Reference: https://fasttext.cc/docs/en/options.html
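The -minn/-maxn options control FastText's character n-grams: each word is wrapped in boundary markers < and >, and every substring of length minn through maxn becomes a subword feature. A minimal sketch of the extraction (pure Python, not the actual fastText implementation):

```python
def char_ngrams(word, minn=3, maxn=6):
    """Wrap the word in boundary markers and enumerate its character n-grams."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams('where', minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

The boundary markers let the model distinguish a prefix like <wh from the same characters in the middle of a word.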

 

The final training command is:

 

./fasttext skipgram -input "../data/output.txt" -output "../data/fasttext.model" -lr 0.01 -dim 300 -bucket 2000000 -thread 32

 

Using FastText Word Vectors

 

from gensim.models import FastText
fasttext_model = FastText.load_fasttext_format('data/fasttext.model')
print(fasttext_model.most_similar('性价比'))

 

Output:

 

[('性价比:ok_hand:', 0.9393969178199768), ('性价比市', 0.932769775390625), ('性价比比', 0.9304042458534241), ('性价此', 0.9251571297645569), ('性价比底', 0.9238805174827576), ('x性价比', 0.9228106737136841), ('无性价比', 0.9195789694786072), ('性价比髙', 0.9189218878746033), ('w性价比', 0.9176821112632751), ('性价比赞', 0.9165310263633728)]
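FastText represents a word as the sum of its subword vectors, which is why it can produce a vector even for a word never seen during training, and why the neighbors above are dominated by strings sharing substrings with 性价比. A toy sketch of that composition (2-d subword vectors invented for illustration, n fixed at 2 for brevity):

```python
def char_ngrams(word, n=2):
    """Character n-grams of a word wrapped in boundary markers."""
    wrapped = '<' + word + '>'
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

def oov_vector(word, subword_vectors, n=2):
    """Build a word vector by summing the vectors of its known subwords."""
    total = [0, 0]
    for gram in char_ngrams(word, n):
        if gram in subword_vectors:
            total = [t + v for t, v in zip(total, subword_vectors[gram])]
    return total

# toy 2-d subword table, invented for illustration
subword_vectors = {'<性': [1, 2], '性价': [3, 1], '价比': [2, 4], '比>': [0, 1]}
print(oov_vector('性价比', subword_vectors))  # [6, 8]
```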

 


Training and Using GloVe Word Vectors

 

There are several ways to train GloVe; here we use the official C implementation. Before training, download the source and compile it:

 

git clone https://github.com/stanfordnlp/GloVe
cd GloVe && make

 

After compiling, a build directory is generated under the GloVe directory by default, containing the four tools needed for training:

 

build/
|-- cooccur
|-- glove
|-- shuffle
`-- vocab_count

 

Before going through these tools, let's look at the example training script demo.sh:

 

#!/bin/bash
set -e
 
# Makes programs, downloads sample data, trains a GloVe model, and then evaluates it.
# One optional argument can specify the language used for eval script: matlab, octave or [default] python
 
make
if [ ! -e text8 ]; then
  if hash wget 2>/dev/null; then
    wget http://mattmahoney.net/dc/text8.zip
  else
    curl -O http://mattmahoney.net/dc/text8.zip
  fi
  unzip text8.zip
  rm text8.zip
fi
 
CORPUS=text8
VOCAB_FILE=vocab.txt
COOCCURRENCE_FILE=cooccurrence.bin
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin
BUILDDIR=build
SAVE_FILE=vectors
VERBOSE=2
MEMORY=4.0
VOCAB_MIN_COUNT=5
VECTOR_SIZE=50
MAX_ITER=15
WINDOW_SIZE=15
BINARY=2
NUM_THREADS=8
X_MAX=10
 
echo
echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE"
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE"
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE"
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
if [ "$CORPUS" = 'text8' ]; then
   if [ "$1" = 'matlab' ]; then
       matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2 
   elif [ "$1" = 'octave' ]; then
       octave < ./eval/octave/read_and_evaluate_octave.m 1>&2
   else
       echo "$ python eval/python/evaluate.py"
       python eval/python/evaluate.py
   fi
fi

 

As the script shows, training consists of four steps, one per tool, run in the order vocab_count -> cooccur -> shuffle -> glove:

 

$BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
$BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
$BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE

 

What each step does:

vocab_count: counts word frequencies in the corpus $CORPUS (note: a Chinese corpus must be word-segmented first) and writes $VOCAB_FILE, one "word frequency" pair per line. -min-count 5 discards words occurring fewer than 5 times; -verbose 2 controls how much is printed to the console (0 for silent).
cooccur: counts word co-occurrences in the corpus and writes $COOCCURRENCE_FILE in a non-text binary format. -memory 4.0 sets the soft memory limit for the bigram_table buffer; -vocab-file is the file produced in the previous step; -verbose 2 as above; -window-size 15 sets the context window size.
shuffle: shuffles $COOCCURRENCE_FILE and writes $COOCCURRENCE_SHUF_FILE.
glove: trains the model and writes the word-vector files. -save-file, -threads, -input-file and -vocab-file should be self-explanatory; -iter is the number of training iterations; -vector-size is the vector dimensionality; -binary controls the output format (0: save as text files; 1: save as binary; 2: both).
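GloVe's cooccur tool weights each co-occurring pair by the inverse of the distance between the two words, so closer neighbors count more. A simplified pure-Python sketch of that counting step (symmetric window, toy tokens):

```python
from collections import defaultdict

def cooccurrences(tokens, window=2):
    """Count symmetric word-pair co-occurrences, weighted by 1/distance."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # each pair is recorded in both directions
                counts[(center, tokens[i + d])] += 1.0 / d
                counts[(tokens[i + d], center)] += 1.0 / d
    return dict(counts)

counts = cooccurrences(['a', 'b', 'a', 'c'], window=2)
print(counts[('a', 'b')])  # 2.0: 'a' and 'b' are adjacent twice, weight 1.0 each
print(counts[('b', 'c')])  # 0.5: distance 2, so weight 1/2
```

The real cooccur writes these weighted counts as binary records; shuffle then randomizes their order before the glove step fits vectors to them.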

Using GloVe Word Vectors

 

To use GloVe vectors with gensim, the model must first be converted to word2vec format:

 

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
 
glove_input_file = 'data/glove.model'
word2vec_output_file = 'data/glove2word2vec.model'
glove2word2vec(glove_input_file, word2vec_output_file)  # convert once before loading
glove_model = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(glove_model.most_similar('性价比'))

 

Output:

 

[('高', 0.8684605360031128), ('Ok喇??', 0.8263183832168579), ('超高', 0.8215326070785522), ('性价', 0.7322962880134583), ('价位', 0.7196157574653625), ('价格', 0.7166442275047302), ('实惠', 0.7093995809555054), ('总体', 0.6866426467895508), ('Q性', 0.6845536828041077), ('总之', 0.6692114472389221)]
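The glove2word2vec conversion itself is lightweight: GloVe's text output has one "word v1 v2 ..." line per word, and the word2vec text format only adds a "vocab_size vector_size" header line on top. A string-based sketch of the same transformation in plain Python:

```python
def glove_to_word2vec(glove_text):
    """Prepend the 'vocab_size vector_size' header expected by the word2vec text format."""
    lines = glove_text.strip().split('\n')
    vocab_size = len(lines)
    vector_size = len(lines[0].split()) - 1  # first token on each line is the word itself
    return '{} {}\n'.format(vocab_size, vector_size) + '\n'.join(lines) + '\n'

glove_text = '性价比 0.1 0.2 0.3\n实惠 0.4 0.5 0.6\n'
print(glove_to_word2vec(glove_text).split('\n')[0])  # 2 3
```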

 

Summary

 

On the same data, GloVe and FastText trained fairly quickly, while Word2Vec took noticeably longer. Judging from the results, GloVe's output looks a bit odd: the neighbors seem to reflect raw co-occurrence more than semantic similarity.

 

Other references:

Google's word-vector tool: Word2Vec

Facebook's word-vector tool: FastText

Stanford's word-vector tool: GloVe
