
Search ranking problems in Elasticsearch caused by using different analyzers

I suspect many of us, when building Chinese-language search, found the ik Chinese analysis plugin on GitHub.

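Then, when creating the mapping, it feels natural to use parameters like these (following the example in the plugin's official documentation):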

{
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
}

Now let's look at all the data (two documents, 打火车 and 火车):

curl 127.0.0.1:9200/test/_search | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 1,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 1,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      }
    ]
  }
}

Now we search for 打火车:

curl '127.0.0.1:9200/test/_search?q=打火车' | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.21110919,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.160443,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      }
    ]
  }
}

To our surprise, 火车 scores 0.21110919, which is higher than the 0.160443 that 打火车 gets.

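After a round of troubleshooting (and first, thanks to the https://github.com/mobz/elasticsearch-head plugin, which saved a lot of steps while inspecting the data), the document's term vectors gave the answer: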

curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打火": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 2
            }
          ]
        },
        "火车": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

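Surprisingly, 打火车 was split into two terms, 打火 and 火车, so something is clearly off here (although from the search engine's point of view nothing is wrong).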
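
The same split can be checked without indexing anything. Below is a minimal Python sketch, assuming the IK plugin is installed and Elasticsearch listens on 127.0.0.1:9200 as in the commands above. It builds a request for the `_analyze` API; the helper names `analyze_request` and `tokens` are made up for this example.

```python
import json
import urllib.request

ES = "http://127.0.0.1:9200"  # host and port taken from the curl commands in this post


def analyze_request(analyzer, text, index="test"):
    """Build a POST /<index>/_analyze request for one analyzer and one text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens(analyzer, text):
    """Send the request and return just the token strings (needs a running cluster)."""
    with urllib.request.urlopen(analyze_request(analyzer, text)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


# Usage against a live cluster (not executed here):
#   tokens("ik_max_word", "打火车")
#   tokens("ik_smart", "打火车")
```

Comparing the two lists for the same text shows directly whether the index-time and search-time analyzers agree on how to split it.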

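The query 打火车 goes through `ik_smart`, which splits it into 打 and 火车, while the index built with `ik_max_word` holds 打火 and 火车. In the 打火车 document, 火车 earns a score, but the query term 打 matches nothing and the indexed term 打火 is never searched for. It only makes the field longer, which lowers the document's score under the default BM25 length normalization. As a result, the shorter 火车 document ranks first.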
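
The two scores can be reproduced by hand. The sketch below assumes Elasticsearch's default BM25 similarity with k1 = 1.2 and b = 0.75, and that the `ik_smart` query terms are 打 and 火车, of which only 火车 exists in this index. The function name `bm25_term` is made up for the example. The document count and field lengths come from the term vector output above.

```python
import math

K1, B = 1.2, 0.75  # assumed Elasticsearch/Lucene BM25 defaults


def bm25_term(doc_freq, doc_count, field_len, avg_len, freq=1):
    """Score of one query term against one document field (ES 7.x style BM25)."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = freq / (freq + K1 * (1 - B + B * field_len / avg_len))
    return (K1 + 1) * idf * tf


# ik_max_word index: 2 documents, 3 terms in total (sum_ttf), so avg_len = 1.5.
# The only matching query term, 火车, occurs in both documents.
avg_len = 3 / 2
score_huoche = bm25_term(doc_freq=2, doc_count=2, field_len=1, avg_len=avg_len)
score_dahuoche = bm25_term(doc_freq=2, doc_count=2, field_len=2, avg_len=avg_len)

print(f"火车:   {score_huoche:.6f}")    # about 0.211109
print(f"打火车: {score_dahuoche:.6f}")  # about 0.160443
```

The extra 打火 term contributes nothing to the match but doubles `field_len`, and that alone is enough to push the longer title below the shorter one.
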
So I decided to set both analyzers to the same one:

{
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "ik_smart",
                "search_analyzer": "ik_smart"
            }
        }
}
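
As far as I know, the `analyzer` of an existing field cannot be changed in place, so applying this mapping means recreating the index and indexing the documents again. Here is a rough Python sketch of the requests involved, assuming Elasticsearch 7.x on 127.0.0.1:9200 and the two documents shown earlier. `rebuild_plan` and `send` are hypothetical helper names, and note that the `DELETE` step discards everything in `test`.

```python
import json
import urllib.request

ES = "http://127.0.0.1:9200"  # same host as the curl commands in this post

DOCS = {  # the two documents from the first search response
    "Video_1": {"id": 1, "title": "打火车"},
    "Video_2": {"id": 2, "title": "火车"},
}


def rebuild_plan(analyzer, index="test"):
    """List the (method, path, body) steps that recreate the index with one analyzer."""
    mapping = {
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": analyzer,
                    "search_analyzer": analyzer,
                }
            }
        }
    }
    plan = [("DELETE", f"/{index}", None), ("PUT", f"/{index}", mapping)]
    plan += [("PUT", f"/{index}/_doc/{doc_id}", doc) for doc_id, doc in DOCS.items()]
    return plan


def send(method, path, body=None):
    """Execute one step against the cluster (needs a running Elasticsearch)."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        ES + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage against a live cluster (not executed here):
#   for step in rebuild_plan("ik_smart"):
#       send(*step)
```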

Then look at the term vectors again (this time the terms really are what we expected):

curl '127.0.0.1:9200/test/_doc/Video_1/_termvectors?fields=title' | jq
{
  "_index": "test",
  "_type": "_doc",
  "_id": "Video_1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "title": {
      "field_statistics": {
        "sum_doc_freq": 3,
        "doc_count": 2,
        "sum_ttf": 3
      },
      "terms": {
        "打": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 1
            }
          ]
        },
        "火车": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 1,
              "end_offset": 3
            }
          ]
        }
      }
    }
  }
}

Searching once more, the scores and the ranking are now what we wanted:

curl '127.0.0.1:9200/test/_search?q=打火车' | jq
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.77041256,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_1",
        "_score": 0.77041256,
        "_source": {
          "id": 1,
          "title": "打火车"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "Video_2",
        "_score": 0.21110919,
        "_source": {
          "id": 2,
          "title": "火车"
        }
      }
    ]
  }
}
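
The new top score is the sum of two term scores, which can be checked with a small scorer over the tokenized titles. This is only a sketch, assuming Elasticsearch's default BM25 similarity (k1 = 1.2, b = 0.75), not Lucene's actual implementation. The function `bm25_scores` is made up here, and the 火车 document is assumed to hold the single term 火车.

```python
import math

K1, B = 1.2, 0.75  # assumed Elasticsearch/Lucene BM25 defaults


def bm25_scores(docs, query_terms):
    """Score every document (a list of terms) against the query terms."""
    doc_count = len(docs)
    avg_len = sum(len(terms) for terms in docs.values()) / doc_count
    scores = {}
    for doc_id, terms in docs.items():
        total = 0.0
        for q in query_terms:
            freq = terms.count(q)
            if freq == 0:
                continue  # this query term does not occur in the document
            doc_freq = sum(1 for t in docs.values() if q in t)
            idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
            tf = freq / (freq + K1 * (1 - B + B * len(terms) / avg_len))
            total += (K1 + 1) * idf * tf
        scores[doc_id] = total
    return scores


# Titles as tokenized by ik_smart at index time; the query uses ik_smart too.
ik_smart_index = {"Video_1": ["打", "火车"], "Video_2": ["火车"]}
scores = bm25_scores(ik_smart_index, ["打", "火车"])

for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, f"{score:.4f}")  # Video_1 comes out near 0.7704, Video_2 near 0.2111
```

Because 打 occurs in only one of the two documents, its inverse document frequency is much higher than that of 火车, and matching it is what lifts 打火车 to the top.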
