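I suspect many of us, when building Chinese-language search, found the ik Chinese word-segmentation plugin on GitHub and then, when creating the mapping, quite naturally used parameters like these (following the example in the plugin's official documentation):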
{ "properties": { "title": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" } } }
So let's look at all the data (two documents, 打火车 and 火车):
curl 127.0.0.1:9200/test/_search | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 1,
    "hits": [
      { "_index": "test", "_type": "_doc", "_id": "Video_1", "_score": 1, "_source": { "id": 1, "title": "打火车" } },
      { "_index": "test", "_type": "_doc", "_id": "Video_2", "_score": 1, "_source": { "id": 2, "title": "火车" } }
    ]
  }
}
Now we search for 打火车:
curl 127.0.0.1:9200/test/_search?q=打火车 | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 0.21110919,
    "hits": [
      { "_index": "test", "_type": "_doc", "_id": "Video_2", "_score": 0.21110919, "_source": { "id": 2, "title": "火车" } },
      { "_index": "test", "_type": "_doc", "_id": "Video_1", "_score": 0.160443, "_source": { "id": 1, "title": "打火车" } }
    ]
  }
}
To our surprise, 火车 scores 0.21110919, which is actually higher than the 0.160443 that 打火车 gets.
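After a round of troubleshooting (thanks first to the https://github.com/mobz/elasticsearch-head plugin, which saved a lot of steps while inspecting the data), the document's term vectors gave the answer: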
curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": { "sum_doc_freq": 3, "doc_count": 2, "sum_ttf": 3 },
      "terms": {
        "打火": { "term_freq": 1, "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 2 } ] },
        "火车": { "term_freq": 1, "tokens": [ { "position": 1, "start_offset": 1, "end_offset": 3 } ] }
      }
    }
  }
}
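Surprisingly, 打火车 was split into two terms, 打火 and 火车, so something is clearly off here (although from the search engine's point of view nothing is wrong).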
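To see how each analyzer splits a string without indexing anything, Elasticsearch's `_analyze` endpoint can be called directly. Below is a minimal sketch using only the Python standard library. The host and index name are taken from the curl commands in this post, and the helper names (`build_analyze_request`, `token_texts`, `analyze`) are my own, not part of any client library.

```python
import json
import urllib.request

# Assumption: same host and index as the curl commands in this post.
BASE_URL = "http://127.0.0.1:9200/test"


def build_analyze_request(analyzer, text):
    """Build (but do not send) a POST to the index's _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(analyze_response):
    """Pull the token strings out of a parsed _analyze response body."""
    return [token["token"] for token in analyze_response["tokens"]]


def analyze(analyzer, text):
    """Send the request; needs a running node with the ik plugin installed."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as resp:
        return token_texts(json.load(resp))
```

With a node running, calling `analyze("ik_max_word", "打火车")` and `analyze("ik_smart", "打火车")` side by side should show the index-side and search-side tokens for the same text.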
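In the 打火车 document, the term 火车 earns a score, but the extra term 打火 drags the score down. The search side uses ik_smart, which splits the query 打火车 into 打 and 火车 (the ik_smart term vectors further down show this split), so the query term 打 matches nothing and 打火 only makes the field longer. Under BM25, Elasticsearch's default similarity, a longer field scores lower for the same match, which puts the 火车 document first.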
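The mismatch is easier to see when the query terms are laid next to the indexed terms. The sketch below is plain Python with no Elasticsearch involved. The indexed terms are copied from the two `_termvectors` outputs in this post; the single term for Video_2 is inferred from `sum_doc_freq` being 3, and the query split is an assumption that ik_smart treats the query the same way it treats the title.

```python
# Index-time terms per document under each analyzer, copied from the
# _termvectors outputs in this post. The single term for Video_2 is inferred
# from sum_doc_freq = 3 (two terms for Video_1 plus one for Video_2).
INDEX_TERMS = {
    "ik_max_word": {"Video_1": ["打火", "火车"], "Video_2": ["火车"]},
    "ik_smart": {"Video_1": ["打", "火车"], "Video_2": ["火车"]},
}

# Assumption: ik_smart splits the query the same way it splits the title.
QUERY_TERMS = ["打", "火车"]


def matched_terms(index_analyzer):
    """For each document, list the query terms found among its indexed terms."""
    return {
        doc_id: [term for term in QUERY_TERMS if term in terms]
        for doc_id, terms in INDEX_TERMS[index_analyzer].items()
    }


for analyzer in INDEX_TERMS:
    print(analyzer, matched_terms(analyzer))
```

With ik_max_word at index time, both documents match only 火车. With ik_smart on both sides, Video_1 also matches 打.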
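So I decided to set both analyzers to the same one (the analyzer of an existing field cannot be changed in place, so this means recreating the index and reindexing the documents):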
{ "properties": { "title": { "type": "text", "analyzer": "ik_smart", "search_analyzer": "ik_smart" } } }
Then look at the term vectors again (this time the segmentation really is what we expected):
curl 127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": { "sum_doc_freq": 3, "doc_count": 2, "sum_ttf": 3 },
      "terms": {
        "打": { "term_freq": 1, "tokens": [ { "position": 0, "start_offset": 0, "end_offset": 1 } ] },
        "火车": { "term_freq": 1, "tokens": [ { "position": 1, "start_offset": 1, "end_offset": 3 } ] }
      }
    }
  }
}
Searching once more, the ranking by score is now what we wanted.
curl 127.0.0.1:9200/test/_search?q=打火车 | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 },
  "hits": {
    "total": { "value": 2, "relation": "eq" },
    "max_score": 0.77041256,
    "hits": [
      { "_index": "test", "_type": "_doc", "_id": "Video_1", "_score": 0.77041256, "_source": { "id": 1, "title": "打火车" } },
      { "_index": "test", "_type": "_doc", "_id": "Video_2", "_score": 0.21110919, "_source": { "id": 2, "title": "火车" } }
    ]
  }
}
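As a sanity check, the scores in both search results can be recomputed by hand. The sketch below assumes the index uses the default BM25 similarity with k1 = 1.2 and b = 0.75 and a single shard, and that the query is split into 打 and 火车. The formula is BM25 as I understand Lucene to apply it, and `bm25_term` is my own helper name. The document frequencies and field lengths are read off the `field_statistics` and `terms` sections of the term vectors shown earlier.

```python
import math

K1, B = 1.2, 0.75  # assumed: Elasticsearch's default BM25 parameters


def bm25_term(doc_freq, doc_count, field_len, avg_len, term_freq=1):
    """Score contribution of one matching query term under BM25."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = K1 * (1 - B + B * field_len / avg_len)
    return (K1 + 1) * idf * term_freq / (term_freq + norm)


# Both mappings index 3 terms over 2 documents (sum_ttf / doc_count).
AVG_LEN = 3 / 2

# 火车 occurs in both documents: Video_2 has 1 term, Video_1 has 2.
huoche_in_video_2 = bm25_term(doc_freq=2, doc_count=2, field_len=1, avg_len=AVG_LEN)
huoche_in_video_1 = bm25_term(doc_freq=2, doc_count=2, field_len=2, avg_len=AVG_LEN)
# 打 is only indexed (in Video_1) once the index analyzer is ik_smart.
da_in_video_1 = bm25_term(doc_freq=1, doc_count=2, field_len=2, avg_len=AVG_LEN)

before = {"Video_1": huoche_in_video_1, "Video_2": huoche_in_video_2}
after = {"Video_1": huoche_in_video_1 + da_in_video_1, "Video_2": huoche_in_video_2}

print("ik_max_word index:", before)
print("ik_smart index:   ", after)
```

Under these assumptions the numbers land on the reported scores to within rounding. Video_2's score never changes. Video_1 starts lower because its field is longer, then jumps ahead once 打 is indexed and matches the query.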