Elasticsearch: Boosting relevance with the distance feature query

程序员文章站 2022-07-05 13:21:43

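The distance_feature query boosts the relevance score of documents that are closer to a provided origin date or location. For example, you can use it to give more weight to documents closer to a certain date or location.

You can use the distance_feature query to find the nearest neighbors of a location. You can also use it in the should clause of a bool query, so that the boost it produces is added to the bool query's score.

Below, we walk through a concrete example of how to use this API.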

 

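Preparing the data

We reuse the index from an earlier article, "Elasticsearch: Using field collapsing to reduce search results based on a single field". In that example we imported the data into Elasticsearch. Let's look at its mapping: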

GET best_games/_mapping
{
  "best_games" : {
    "mappings" : {
      "_meta" : {
        "created_by" : "ml-file-data-visualizer"
      },
      "properties" : {
        "critic_score" : {
          "type" : "long"
        },
        "developer" : {
          "type" : "text"
        },
        "genre" : {
          "type" : "keyword"
        },
        "global_sales" : {
          "type" : "double"
        },
        "id" : {
          "type" : "keyword"
        },
        "image_url" : {
          "type" : "keyword"
        },
        "name" : {
          "type" : "text"
        },
        "platform" : {
          "type" : "keyword"
        },
        "publisher" : {
          "type" : "keyword"
        },
        "user_score" : {
          "type" : "long"
        },
        "year" : {
          "type" : "long"
        }
      }
    }
  }
}

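As the mapping shows, year is of type long. That is not what we need: the distance_feature query only works on date, date_nanos, or geo_point fields, so we want year to be a date. To change it, we have to reindex the data. If you are not familiar with reindexing, see my earlier article "Elasticsearch: Reindex API".

Let's define a new index called best_games1: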

PUT best_games1
{
  "mappings": {
    "properties": {
      "critic_score": {
        "type": "long"
      },
      "developer": {
        "type": "text"
      },
      "genre": {
        "type": "keyword"
      },
      "global_sales": {
        "type": "double"
      },
      "id": {
        "type": "keyword"
      },
      "image_url": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      },
      "platform": {
        "type": "keyword"
      },
      "publisher": {
        "type": "keyword"
      },
      "user_score": {
        "type": "long"
      },
      "year": {
        "type": "date",
        "format": "strict_year"
      }
    }
  }
}

In the mapping above, year is now defined as a date. We can then copy the documents from best_games into best_games1 with the following reindex command:

POST _reindex
{
  "source": {
    "index": "best_games"
  },
  "dest": {
    "index": "best_games1"
  }
}

After running the command above, we can check the number of documents in the best_games1 index:

GET best_games1/_count

The result is:

{
  "count" : 500,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
}

The best_games1 index now contains the data we want.
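To see why the date type matters, it helps to look at what a year-only value becomes once it is parsed. Elasticsearch stores dates internally as milliseconds since the epoch in UTC, and a value such as 2016 parsed with strict_year resolves to the start of that year. The following is a minimal Python sketch of that conversion, not Elasticsearch code; the function name year_to_epoch_millis is made up for illustration.

```python
from datetime import datetime, timezone


def year_to_epoch_millis(year: int) -> int:
    """Return the instant a year-only date resolves to:
    midnight UTC on January 1 of that year, in epoch milliseconds."""
    start_of_year = datetime(year, 1, 1, tzinfo=timezone.utc)
    return int(start_of_year.timestamp() * 1000)


# Years taken from the sample documents in this article.
for year in (2004, 2013, 2016):
    print(year, year_to_epoch_millis(year))
```

Because every document now carries a real point in time, the distance between a document and an origin such as now can be measured in days, which is what the query in the next section relies on.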

distance_feature query

Next, let's run some queries. For example, let's find all documents whose critic_score is greater than or equal to 90:

GET /best_games1/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "critic_score": {
            "gte": 90
          }
        }
      }
    }
  }
}

The result is:

    "hits" : [
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "hnLfF28BjrINWI3xWOLh",
        "_score" : 0.0,
        "_source" : {
          "id" : "mario-kart-ds-ds-2005",
          "name" : "Mario Kart DS",
          "year" : 2005,
          "platform" : "DS",
          "genre" : "Racing",
          "publisher" : "Nintendo",
          "global_sales" : 23.21,
          "critic_score" : 91,
          "user_score" : 8,
          "developer" : "Nintendo",
          "image_url" : "https://upload.wikimedia.org/wikipedia/en/thumb/a/ad/Mario_Kart_DS_screenshot.png/220px-Mario_Kart_DS_screenshot.png"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "inLfF28BjrINWI3xWOLh",
        "_score" : 0.0,
        "_source" : {
          "id" : "grand-theft-auto-v-ps3-2013",
          "name" : "Grand Theft Auto V",
          "year" : 2013,
          "platform" : "PS3",
          "genre" : "Action",
          "publisher" : "Take-Two Interactive",
          "global_sales" : 21.04,
          "critic_score" : 97,
          "user_score" : 8,
          "developer" : "Rockstar North",
          "image_url" : "https://pmcvariety.files.wordpress.com/2013/09/gta-v-big.jpg?w=1000&h=563&crop=1"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "i3LfF28BjrINWI3xWOLh",
        "_score" : 0.0,
        "_source" : {
          "id" : "grand-theft-auto-san-andreas-ps2-2004",
          "name" : "Grand Theft Auto: San Andreas",
          "year" : 2004,
          "platform" : "PS2",
          "genre" : "Action",
          "publisher" : "Take-Two Interactive",
          "global_sales" : 20.81,
          "critic_score" : 95,
          "user_score" : 9,
          "developer" : "Rockstar North",
          "image_url" : "http://4.bp.blogspot.com/-IITyrVJdS50/Udvw7XLG-oI/AAAAAAAASwY/H1j2GYBjXng/s1600/GTA+SA+0.jpg"
        }
      },
 ...

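Clearly, some of these documents are quite old and may not be what we want. Every hit also has a _score of 0.0, because a filter clause does not contribute to scoring, so the order of the results says nothing about relevance. How can we rank the more recent documents first? The answer is the distance_feature query.

We run the query as follows: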

GET /best_games1/_search
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "critic_score": {
            "gte": 90
          }
        }
      },
      "should": {
        "distance_feature": {
          "field": "year",
          "pivot": "2555d",
          "origin": "now"
        }
      }
    }
  }
}

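With distance_feature in the should clause, every document that passes the filter receives an extra score that depends on how far its year is from now. The score is computed as boost * pivot / (pivot + distance), where boost defaults to 1.0. A document dated exactly now therefore gets the full boost, a document 2555 days (about 7 years) away gets exactly half of it, and the score keeps decreasing smoothly as the distance grows. There is no hard cutoff at 7 years. The search results are: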

    "hits" : [
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "-XLfF28BjrINWI3xWOLi",
        "_score" : 0.63837445,
        "_source" : {
          "id" : "uncharted-4-a-thiefs-end-ps4-2016",
          "name" : "Uncharted 4: A Thief's End",
          "year" : 2016,
          "platform" : "PS4",
          "genre" : "Shooter",
          "publisher" : "Sony Computer Entertainment",
          "global_sales" : 5.38,
          "critic_score" : 93,
          "user_score" : 7,
          "developer" : "Naughty Dog",
          "image_url" : "https://i.ytimg.com/vi/hh5HV4iic1Y/maxresdefault.jpg"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "T3LfF28BjrINWI3xWOPi",
        "_score" : 0.5850225,
        "_source" : {
          "id" : "the-witcher-3-wild-hunt-ps4-2015",
          "name" : "The Witcher 3: Wild Hunt",
          "year" : 2015,
          "platform" : "PS4",
          "genre" : "Role-Playing",
          "publisher" : "Namco Bandai Games",
          "global_sales" : 3.97,
          "critic_score" : 92,
          "user_score" : 9,
          "developer" : "CD Projekt Red Studio",
          "image_url" : "https://www.godisageek.com/wp-content/uploads/the-witcher-3-monster-1024x576.jpg"
        }
      },
      {
        "_index" : "best_games1",
        "_type" : "_doc",
        "_id" : "inLfF28BjrINWI3xWOPi",
        "_score" : 0.5850225,
        "_source" : {
          "id" : "metal-gear-solid-v-the-phantom-pain-ps4-2015",
          "name" : "Metal Gear Solid V: The Phantom Pain",
          "year" : 2015,
          "platform" : "PS4",
          "genre" : "Action",
          "publisher" : "Konami Digital Entertainment",
          "global_sales" : 3.41,
          "critic_score" : 93,
          "user_score" : 8,
          "developer" : "Kojima Productions, Moby Dick Studio",
          "image_url" : "https://i.ytimg.com/vi/gtgNUFSoHv8/maxresdefault.jpg"
        }
      },
  ...

As the results show, documents with the most recent years appear first and have higher scores.
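The ranking above can be reproduced by hand. The sketch below assumes the scoring formula from the Elasticsearch reference, boost * pivot / (pivot + distance), and assumes that a year-only date resolves to January 1 of that year in UTC. The value of now is a hypothetical date chosen for illustration, and the function names are not part of any Elasticsearch API.

```python
from datetime import datetime, timezone

PIVOT_DAYS = 2555  # same pivot as the query above ("2555d")


def distance_feature_score(distance: float, pivot: float, boost: float = 1.0) -> float:
    """Score contributed by distance_feature: boost * pivot / (pivot + distance)."""
    return boost * pivot / (pivot + distance)


def score_for_year(year: int, now: datetime) -> float:
    """Score for a document whose year field resolves to Jan 1 of `year` (UTC)."""
    released = datetime(year, 1, 1, tzinfo=timezone.utc)
    distance_days = abs((now - released).total_seconds()) / 86400
    return distance_feature_score(distance_days, PIVOT_DAYS)


# Hypothetical "now"; the real query uses the server's current time.
now = datetime(2019, 12, 18, tzinfo=timezone.utc)

for year in (2016, 2015, 2004):
    print(year, round(score_for_year(year, now), 4))
```

With this assumed date, the values for 2016 and 2015 come out near 0.638 and 0.585, close to the _score values in the results above. A larger pivot flattens the curve so that older documents lose less score, while a smaller pivot favors recent documents more strongly.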

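Similarly, we can boost documents based on location. The following bool search returns documents whose name field matches chocolate. It also uses a distance_feature query to increase the relevance score of documents whose location is closer to [-71.3, 41.15]. In this array form the longitude comes first, followed by the latitude.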

GET /items/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name": "chocolate"
        }
      },
      "should": {
        "distance_feature": {
          "field": "location",
          "pivot": "1000m",
          "origin": [-71.3, 41.15]
        }
      }
    }
  }
}
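To get a feel for what a 1000m pivot means, the sketch below applies the same formula, boost * pivot / (pivot + distance), to a few made-up shop locations. It assumes a great-circle (haversine) distance and a mean Earth radius, which may differ slightly from what Elasticsearch computes internally. The shop names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (assumption)


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two (lon, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_feature_score(distance: float, pivot: float, boost: float = 1.0) -> float:
    """Score contributed by distance_feature: boost * pivot / (pivot + distance)."""
    return boost * pivot / (pivot + distance)


ORIGIN = (-71.3, 41.15)  # [lon, lat], same origin as the query above
PIVOT_M = 1000.0         # same pivot as the query above ("1000m")

# Hypothetical shop locations as (lon, lat).
shops = {
    "at the origin": (-71.3, 41.15),
    "about 1 km north": (-71.3, 41.159),
    "far away": (-71.0, 41.5),
}

for name, (lon, lat) in shops.items():
    distance = haversine_m(ORIGIN[0], ORIGIN[1], lon, lat)
    print(name, round(distance), round(distance_feature_score(distance, PIVOT_M), 3))
```

A shop at the origin gets the full boost, a shop about one pivot away gets roughly half, and distant shops still match the must clause but gain almost nothing from the should clause.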

 

References:

【1】https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-distance-feature-query.html
