
Natural Language Processing with spaCy


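spaCy is a Python natural language processing toolkit that first appeared in mid-2014. Billed as "Industrial-Strength Natural Language Processing in Python", it is an NLP library built for production use. spaCy makes heavy use of Cython to speed up its core modules, which sets it apart from the more academically oriented NLTK and makes it practical for industry applications.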

 

Key features:


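Named entity recognition
Multi-language support (advertised as 53 languages)
23 statistical models for 11 languages
Pretrained word vectors
High performance
Easy integration with deep learning
Part-of-speech tagging
Dependency parsing
Syntax-driven sentence segmentation
Built-in visualizers for syntax and named entities
Convenient string-to-hash mapping
Export to numpy data arrays
Efficient binary serialization
Easy model packaging and deployment
Robust, rigorously evaluated accuracy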

Installing spaCy

 

First install the package with pip install spacy, then download the data and models you need.

 

Model locations:

https://github.com/explosion/spacy-models
https://spacy.io/models

For example, to install the English model, run: python -m spacy download en_core_web_sm

 

To use it, load the corresponding model:

 

import spacy
nlp = spacy.load("en_core_web_sm")
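spacy.load fails if the model package was never downloaded. Since spaCy models are installed as ordinary Python packages, one way to check beforehand is to ask Python whether the package can be found. The sketch below uses only the standard library; the helper names is_model_installed and download_hint are made up for illustration and are not part of spaCy.

```python
import importlib.util


def is_model_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


def download_hint(model_name):
    """Build the download command described above (hypothetical helper)."""
    return "python -m spacy download " + model_name


model = "en_core_web_sm"
if is_model_installed(model):
    print(model, "is installed")
else:
    print(model, "is missing; try:", download_hint(model))
```

This only confirms that the package is present. It says nothing about whether the model is compatible with the installed spaCy version.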

 

At the time of writing, the official site offered no Chinese model, so installing one takes a little more work.

 

Unofficial Chinese model: https://github.com/howl-anderson/Chinese_models_for_SpaCy

 

After downloading, run: pip install ./zh_core_web_sm-2.0.5.tar.gz

 

After installation, run:

 

import spacy
nlp = spacy.load("zh_core_web_sm")

 

This raises the following error:

 

Traceback (most recent call last):
  File "D:/CodeHub/NLP/test_new.py", line 7, in <module>
    nlp = spacy.load('zh_core_web_sm')
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 164, in load_model
    return load_model_from_package(name, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 185, in load_model_from_package
    return cls.load(**overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\zh_core_web_sm\__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 228, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 211, in load_model_from_path
    return nlp.from_disk(model_path)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\language.py", line 941, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 654, in from_disk
    reader(path / key)
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\language.py", line 936, in <lambda>
    p, exclude=["vocab"]
  File "pipes.pyx", line 661, in spacy.pipeline.pipes.Tagger.from_disk
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\util.py", line 654, in from_disk
    reader(path / key)
  File "pipes.pyx", line 641, in spacy.pipeline.pipes.Tagger.from_disk.load_model
  File "pipes.pyx", line 643, in spacy.pipeline.pipes.Tagger.from_disk.load_model
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 376, in from_bytes
    copy_array(dest, param[b"value"])
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\util.py", line 145, in copy_array
    dst[:] = src
ValueError: could not broadcast input array from shape (128) into shape (96)

 

This looks like a version mismatch, so reinstall spaCy at the model's version: pip install spacy==2.0.5
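Before reinstalling, it can help to compare the installed spaCy version against the range a model declares. spaCy model packages carry a meta.json with a spacy_version field. The sketch below checks a version against such a range using only the standard library. The range string in the example is illustrative and is not taken from the Chinese model, and the parser handles only simple comma-separated clauses; the packaging library is the more thorough option.

```python
import operator
import re

# Comparison operators supported by this simplified checker.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parse_version(text):
    """Turn '2.0.16' into (2, 0, 16); missing parts are padded with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(numbers + [0] * (3 - len(numbers)))


def satisfies(installed, spec):
    """Check a version against a spec such as '>=2.0.0,<2.1.0'."""
    for clause in spec.split(","):
        match = re.match(r"\s*(>=|<=|==|>|<)\s*(\S+)", clause)
        if match is None:
            raise ValueError("unsupported clause: %r" % clause)
        op, bound = match.groups()
        if not OPS[op](parse_version(installed), parse_version(bound)):
            return False
    return True


# Illustrative range only; read the real one from the model's meta.json.
print(satisfies("2.0.16", ">=2.0.0,<2.1.0"))
print(satisfies("2.2.3", ">=2.0.0,<2.1.0"))
```

A declared range is only a first filter. A version inside the range can still fail at runtime, so testing a short sentence after loading remains worthwhile.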

 

After reinstalling, the model loads normally, but running the code fails with the following error:

 

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\QWD312~1.TCE\AppData\Local\Temp\jieba.cache
Loading model cost 0.785 seconds.
Prefix dict has been built succesfully.
Traceback (most recent call last):
  File "D:/CodeHub/NLP/test_new.py", line 6, in <module>
    doc = nlp("王小明在北京的清华大学读书")
  File "D:\CodeHub\NLP\venv\lib\site-packages\spacy\language.py", line 333, in __call__
    doc = proc(doc)
  File "pipeline.pyx", line 390, in spacy.pipeline.Tagger.__call__
  File "pipeline.pyx", line 402, in spacy.pipeline.Tagger.predict
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 55, in predict
    X = layer(X)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 293, in predict
    X = layer(layer.ops.flatten(seqs_in, pad=pad))
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 55, in predict
    X = layer(X)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 161, in __call__
    return self.predict(x)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\model.py", line 125, in predict
    y, _ = self.begin_update(X)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 374, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\api.py", line 61, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\layernorm.py", line 51, in begin_update
    X, backprop_child = self.child.begin_update(X, drop=0.)
  File "D:\CodeHub\NLP\venv\lib\site-packages\thinc\neural\_classes\maxout.py", line 69, in begin_update
    output__boc = self.ops.batch_dot(X__bi, W)
  File "ops.pyx", line 338, in thinc.neural.ops.NumpyOps.batch_dot
  File "<__array_function__ internals>", line 6, in dot
ValueError: shapes (7,512) and (640,384) not aligned: 512 (dim 1) != 640 (dim 0)
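The last line of the traceback says the layer input has 512 features per row while the stored weight matrix expects 640. This would be consistent with the model having been saved with a different feature width than this library version builds, though that is a guess. The rule being violated is the usual one for a 2-D matrix product: the inner dimensions must match. A small pure-Python sketch of that rule follows; matmul_shape is a hypothetical helper, not a numpy or thinc function.

```python
def matmul_shape(left, right):
    """Return the result shape of a 2-D matrix product, or raise ValueError."""
    rows, inner_left = left
    inner_right, cols = right
    if inner_left != inner_right:
        raise ValueError(
            "shapes %s and %s not aligned: %d (dim 1) != %d (dim 0)"
            % (left, right, inner_left, inner_right)
        )
    return (rows, cols)


# A matching pair works: (7, 640) times (640, 384) gives (7, 384).
print(matmul_shape((7, 640), (640, 384)))

# The pair from the traceback does not.
try:
    matmul_shape((7, 512), (640, 384))
except ValueError as err:
    print("error:", err)
```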

 

This is presumably still a version problem. After testing versions one by one, version 2.0.16 finally ran without errors:

 

# -*- encoding:utf-8 -*-
import spacy
 
nlp = spacy.load("zh_core_web_sm")
 
doc = nlp("王小明在北京的清华大学读书")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, token.has_vector,
          token.ent_iob_, token.ent_type_,
          token.vector_norm, token.is_oov)
 
spacy.displacy.serve(doc)

 

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\QWD312~1.TCE\AppData\Local\Temp\jieba.cache
Loading model cost 0.730 seconds.
Prefix dict has been built succesfully.
王小明 王小明 X NNP nsubj xxx True False True B PERSON 14.44006 False
在 在 X VV acl x True True True O  9.84207 False
北京 北京 X NNP det xx True False True B GPE 18.310038 False
的 的 X DEC case:dec x True True True O  10.005628 False
清华大学 清华大学 X NNP obj xxxx True False True B ORG 21.960636 False
读书 读书 X VV ROOT xx True False True O  22.59519 False
 
    Serving on port 5000...
    Using the 'dep' visualizer

 

 

Using spaCy

 

Usage example:

 

import spacy
 
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")
 
text = "Rami Eid is studying at Stony Brook University in New York"
doc = nlp(text)
 
# Tokenization and part-of-speech tagging
for token in doc:
    print(token, token.pos_, token.pos)
 
# Named entity recognition (NER)
for ent in doc.ents:
    print(ent, ent.label_, ent.label)
 
# Noun phrase extraction
for chunk in doc.noun_chunks:
    print(chunk)
 
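# Dependency relations
for token in doc:
    print(token.text, token.dep_, token.head)
 
# Text similarity
# Note: en_core_web_sm ships without word vectors, so this score is only a
# rough estimate; a model with vectors (e.g. en_core_web_md) is more reliable.
doc1 = nlp("my fries were super gross")
doc2 = nlp("such disgusting fries")
similarity = doc1.similarity(doc2)
print(similarity)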
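By default, Doc.similarity is a cosine similarity between the two document vectors, and a document vector defaults to the average of its token vectors. The sketch below shows that computation in plain Python. The two vectors are invented toy values, not real spaCy vectors, and cosine_similarity here is an illustrative helper rather than a spaCy function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy document vectors (assumed values for illustration only).
vec1 = [0.2, 0.7, 0.1]
vec2 = [0.25, 0.6, 0.0]
print(cosine_similarity(vec1, vec2))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.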

 

Entity labels in spaCy and their meanings:

PERSON: People, including fictional.
NORP: Nationalities or religious or political groups.
FAC: Buildings, airports, highways, bridges, etc.
ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states.
LOC: Non-GPE locations, mountain ranges, bodies of water.
PRODUCT: Objects, vehicles, foods, etc. (Not services.)
EVENT: Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW: Named documents made into laws.
LANGUAGE: Any named language.
DATE: Absolute or relative dates or periods.
TIME: Times smaller than a day.
PERCENT: Percentage, including “%”.
MONEY: Monetary values, including unit.
QUANTITY: Measurements, as of weight or distance.
ORDINAL: “first”, “second”, etc.
CARDINAL: Numerals that do not fall under another type.
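When printing entities it is handy to show the meaning next to the label. spaCy offers spacy.explain for this. The sketch below is a standalone alternative that encodes the list above as a dictionary, so it runs without any model installed. The names ENTITY_LABELS and describe_label are made up for this example.

```python
# Label descriptions copied from the list above.
ENTITY_LABELS = {
    "PERSON": "People, including fictional.",
    "NORP": "Nationalities or religious or political groups.",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states.",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water.",
    "PRODUCT": "Objects, vehicles, foods, etc. (Not services.)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "LAW": "Named documents made into laws.",
    "LANGUAGE": "Any named language.",
    "DATE": "Absolute or relative dates or periods.",
    "TIME": "Times smaller than a day.",
    "PERCENT": "Percentage, including \"%\".",
    "MONEY": "Monetary values, including unit.",
    "QUANTITY": "Measurements, as of weight or distance.",
    "ORDINAL": "\"first\", \"second\", etc.",
    "CARDINAL": "Numerals that do not fall under another type.",
}


def describe_label(label):
    """Return the description for an entity label, or a fallback string."""
    return ENTITY_LABELS.get(label, "unknown label")


for label in ("PERSON", "GPE", "ORG"):
    print(label, "=", describe_label(label))
```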

 

References:

https://spacy.io/
https://github.com/explosion/spaCy
