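Elasticsearch: Using the distance_feature query to boost relevance
The distance_feature query boosts the relevance score of documents that are closer to a provided origin date or location. For example, you can use this query to give more weight to documents that are closer to a certain date or location.
You can use the distance_feature query to find the nearest neighbors of a location. You can also use the query in the should clause of a bool search to add the boosted relevance score to the bool query's score.
Below we use a concrete example to show how this API is used.
Preparing the data
We will again use the index from my earlier article "Elasticsearch: Using field collapsing to reduce search results based on a single field" as the example. In that article we imported the data into Elasticsearch. Let's take a look at its mapping: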
GET best_games/_mapping
{
"best_games" : {
"mappings" : {
"_meta" : {
"created_by" : "ml-file-data-visualizer"
},
"properties" : {
"critic_score" : {
"type" : "long"
},
"developer" : {
"type" : "text"
},
"genre" : {
"type" : "keyword"
},
"global_sales" : {
"type" : "double"
},
"id" : {
"type" : "keyword"
},
"image_url" : {
"type" : "keyword"
},
"name" : {
"type" : "text"
},
"platform" : {
"type" : "keyword"
},
"publisher" : {
"type" : "keyword"
},
"user_score" : {
"type" : "long"
},
"year" : {
"type" : "long"
}
}
}
}
}
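As the mapping above shows, year is a long field. That is not what we need: the distance_feature query works on date, date_nanos, or geo_point fields, so we want year to be a date. We therefore need to reindex the data. Developers who are not very familiar with reindex can refer to my earlier article "Elasticsearch: The Reindex API".
Let's define a new index called best_games1: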
PUT best_games1
{
"mappings": {
"properties": {
"critic_score": {
"type": "long"
},
"developer": {
"type": "text"
},
"genre": {
"type": "keyword"
},
"global_sales": {
"type": "double"
},
"id": {
"type": "keyword"
},
"image_url": {
"type": "keyword"
},
"name": {
"type": "text"
},
"platform": {
"type": "keyword"
},
"publisher": {
"type": "keyword"
},
"user_score": {
"type": "long"
},
"year": {
"type": "date",
"format": "strict_year"
}
}
}
}
In the mapping above, we redefined year as a date type. We can now copy the documents from best_games into best_games1 with the following reindex command:
POST _reindex
{
"source": {
"index": "best_games"
},
"dest": {
"index": "best_games1"
}
}
After running the command above, we can check the number of documents in the best_games1 index:
GET best_games1/_count
The result is:
{
"count" : 500,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
The best_games1 index now contains the data we want.
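To make the effect of the new mapping more concrete: with a year-only date format, a value such as 2005 is expected to be read as the start of that year in UTC. The following Python sketch only mimics that interpretation outside Elasticsearch. The helper name year_to_epoch_millis is hypothetical and not part of any Elasticsearch API, and the January 1st UTC reading is an assumption stated in the code.

```python
from datetime import datetime, timezone


def year_to_epoch_millis(year: int) -> int:
    """Approximate how a four-digit year maps to a date value.

    Assumption: a year-only date is read as January 1st, 00:00:00 UTC
    of that year and expressed as milliseconds since the Unix epoch.
    """
    start_of_year = datetime(year, 1, 1, tzinfo=timezone.utc)
    return int(start_of_year.timestamp() * 1000)


if __name__ == "__main__":
    # Years taken from the sample documents in this article.
    for year in (2004, 2005, 2013):
        print(year, year_to_epoch_millis(year))
```

Once year is a point in time rather than a plain number, a distance from an origin such as now can be measured in time units like days.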
The distance_feature query
Next, let's run some queries. For example, let's search for all documents whose critic_score is greater than or equal to 90:
GET /best_games1/_search
{
"query": {
"bool": {
"filter": {
"range": {
"critic_score": {
"gte": 90
}
}
}
}
}
}
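The results are as follows (every _score is 0.0 because the range query runs in filter context, which does not contribute to scoring):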
"hits" : [
{
"_index" : "best_games1",
"_type" : "_doc",
"_id" : "hnLfF28BjrINWI3xWOLh",
"_score" : 0.0,
"_source" : {
"id" : "mario-kart-ds-ds-2005",
"name" : "Mario Kart DS",
"year" : 2005,
"platform" : "DS",
"genre" : "Racing",
"publisher" : "Nintendo",
"global_sales" : 23.21,
"critic_score" : 91,
"user_score" : 8,
"developer" : "Nintendo",
"image_url" : "https://upload.wikimedia.org/wikipedia/en/thumb/a/ad/Mario_Kart_DS_screenshot.png/220px-Mario_Kart_DS_screenshot.png"
}
},
{
"_index" : "best_games1",
"_type" : "_doc",
"_id" : "inLfF28BjrINWI3xWOLh",
"_score" : 0.0,
"_source" : {
"id" : "grand-theft-auto-v-ps3-2013",
"name" : "Grand Theft Auto V",
"year" : 2013,
"platform" : "PS3",
"genre" : "Action",
"publisher" : "Take-Two Interactive",
"global_sales" : 21.04,
"critic_score" : 97,
"user_score" : 8,
"developer" : "Rockstar North",
"image_url" : "https://pmcvariety.files.wordpress.com/2013/09/gta-v-big.jpg?w=1000&h=563&crop=1"
}
},
{
"_index" : "best_games1",
"_type" : "_doc",
"_id" : "i3LfF28BjrINWI3xWOLh",
"_score" : 0.0,
"_source" : {
"id" : "grand-theft-auto-san-andreas-ps2-2004",
"name" : "Grand Theft Auto: San Andreas",
"year" : 2004,
"platform" : "PS2",
"genre" : "Action",
"publisher" : "Take-Two Interactive",
"global_sales" : 20.81,
"critic_score" : 95,
"user_score" : 9,
"developer" : "Rockstar North",
"image_url" : "http://4.bp.blogspot.com/-IITyrVJdS50/Udvw7XLG-oI/AAAAAAAASwY/H1j2GYBjXng/s1600/GTA+SA+0.jpg"
}
},
...
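Clearly, some documents in this result are quite old and may not be what we want. So how can we rank the documents that are closer to the present higher? The answer is the distance_feature query.
We query as follows: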
GET /best_games1/_search
{
"query": {
"bool": {
"filter": {
"range": {
"critic_score": {
"gte": 90
}
}
},
"should": {
"distance_feature": {
"field": "year",
"pivot": "2555d",
"origin": "now"
}
}
}
}
}
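With distance_feature in the should clause, each document's score is boosted according to how close its year is to the origin (now): relevance_score = boost * pivot / (pivot + distance). The pivot of 2555d (about 7 years) is the distance from the origin at which a document receives half of the boost; documents closer than that receive more, and documents further away receive less. The search results are: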
"hits" : [
{
"_index" : "best_games1",
"_type" : "_doc",
"_id" : "-XLfF28BjrINWI3xWOLi",
"_score" : 0.63837445,
"_source" : {
"id" : "uncharted-4-a-thiefs-end-ps4-2016",
"name" : "Uncharted 4: A Thief's End",
"year" : 2016,
"platform" : "PS4",
"genre" : "Shooter",
"publisher" : "Sony Computer Entertainment",
"global_sales" : 5.38,
"critic_score" : 93,
"user_score" : 7,
"developer" : "Naughty Dog",
"image_url" : "https://i.ytimg.com/vi/hh5HV4iic1Y/maxresdefault.jpg"
}
},
{
"_index" : "best_games1",
"_type" : "_doc",
"_id" : "T3LfF28BjrINWI3xWOPi",
"_score" : 0.5850225,
"_source" : {
"id" : "the-witcher-3-wild-hunt-ps4-2015",
"name" : "The Witcher 3: Wild Hunt",
"year" : 2015,
"platform" : "PS4",
"genre" : "Role-Playing",
"publisher" : "Namco Bandai Games",
"global_sales" : 3.97,
"critic_score" : 92,
"user_score" : 9,
"developer" : "CD Projekt Red Studio",
"image_url" : "https://www.godisageek.com/wp-content/uploads/the-witcher-3-monster-1024x576.jpg"
}
},
{
"_index" : "best_games1",
"_type" : "_doc",
"_id" : "inLfF28BjrINWI3xWOPi",
"_score" : 0.5850225,
"_source" : {
"id" : "metal-gear-solid-v-the-phantom-pain-ps4-2015",
"name" : "Metal Gear Solid V: The Phantom Pain",
"year" : 2015,
"platform" : "PS4",
"genre" : "Action",
"publisher" : "Konami Digital Entertainment",
"global_sales" : 3.41,
"critic_score" : 93,
"user_score" : 8,
"developer" : "Kojima Productions, Moby Dick Studio",
"image_url" : "https://i.ytimg.com/vi/gtgNUFSoHv8/maxresdefault.jpg"
}
},
...
As the results above show, documents with the most recent years appear first and have higher scores.
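The scores in these results can be checked against the decay formula documented for distance_feature, relevance_score = boost * pivot / (pivot + distance). The Python sketch below is only an illustration under stated assumptions: the boost is left at its default of 1.0, the filter clause contributes nothing to the score, and the function names are hypothetical rather than part of any client library.

```python
def distance_feature_score(distance: float, pivot: float, boost: float = 1.0) -> float:
    """Score contribution: boost * pivot / (pivot + distance)."""
    if pivot <= 0 or distance < 0:
        raise ValueError("pivot must be positive and distance non-negative")
    return boost * pivot / (pivot + distance)


def implied_distance(score: float, pivot: float, boost: float = 1.0) -> float:
    """Invert the formula: the distance that would produce a given score."""
    return pivot * (boost / score - 1.0)


PIVOT_DAYS = 2555.0  # the "2555d" pivot from the query, roughly 7 years

if __name__ == "__main__":
    # How the score decays as a document moves away from the origin.
    for days in (0, 365, 2555, 5110):
        score = distance_feature_score(days, PIVOT_DAYS)
        print(f"{days:>5} days from origin -> score {score:.4f}")

    # Scores reported above for the 2016 and 2015 games.
    d_2016 = implied_distance(0.63837445, PIVOT_DAYS)
    d_2015 = implied_distance(0.5850225, PIVOT_DAYS)
    print(f"implied gap between the two hits: {d_2015 - d_2016:.1f} days")
```

At a distance equal to the pivot, the score is exactly half of the boost. The gap implied by the two reported scores comes out at about 365 days, which is consistent with the one-year difference between the 2016 and 2015 documents.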
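Similarly, we can boost documents based on location. The following bool search returns documents whose name matches chocolate. It also uses a distance_feature query to increase the relevance score of documents whose location value is closer to [-71.3, 41.15]. This assumes an items index in which location is mapped as a geo_point; the origin array is given in [longitude, latitude] order.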
GET /items/_search
{
"query": {
"bool": {
"must": {
"match": {
"name": "chocolate"
}
},
"should": {
"distance_feature": {
"field": "location",
"pivot": "1000m",
"origin": [-71.3, 41.15]
}
}
}
}
}
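To get a feel for what a pivot of 1000m means, the following Python sketch estimates the boost for a document near the origin. It is a rough illustration only: it assumes a haversine great-circle distance on a spherical Earth, which may differ slightly from the distance Elasticsearch computes internally, and the nearby document location and the function names are made up for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters (approximation)


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in meters between two (lon, lat) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def geo_score(doc_lon_lat, origin_lon_lat, pivot_m: float, boost: float = 1.0) -> float:
    """Apply boost * pivot / (pivot + distance) to a geographic distance."""
    distance = haversine_m(*doc_lon_lat, *origin_lon_lat)
    return boost * pivot_m / (pivot_m + distance)


if __name__ == "__main__":
    origin = (-71.3, 41.15)   # [lon, lat], same order as in the query
    nearby = (-71.3, 41.16)   # hypothetical document, 0.01 degrees further north
    print(f"distance: {haversine_m(*nearby, *origin):.0f} m")
    print(f"score:    {geo_score(nearby, origin, pivot_m=1000.0):.3f}")
```

A document exactly at the origin receives the full boost, while this hypothetical document, a little more than one pivot away, receives slightly less than half of it.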