Key Properties and Fields in Elasticsearch Explained

程序员文章站 (Programmer Article Station), 2022-03-01 12:54:32


Note: this article was written against Elasticsearch 6.4.2; there may be minor differences between versions.

Documents, Indices, and Types

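Term	Description
Document	The data to be stored. For example, when storing employee data, each employee record is one document.
Index	Storing a document in Elasticsearch is called indexing it. As a noun, an index is comparable to a database in a traditional relational database: it is where related documents are stored. A single Elasticsearch cluster can contain multiple indices.
Type	Specifies the kind of document being stored; it can be thought of as a single table in a relational database. Note: in 6.x an index can contain only one type; types are deprecated starting in 7.x and removed entirely in 8.x.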

Note: see the footnote for why types were removed.¹

Inverted Index

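Term	Description
Inverted index	An inverted index consists of a list of all the distinct terms that appear in any document and, for each term, a list of the documents that contain it.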
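
The structure described above can be sketched in a few lines of Python. This is only an illustration, not how Lucene stores its index: the function name build_inverted_index, the sample documents, and the whitespace-and-lowercase "analyzer" are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical analyzer: lowercase the text and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "The quick brown fox",
    2: "Quick brown dogs",
    3: "The lazy dog",
}
inverted = build_inverted_index(docs)
print(inverted["quick"])  # ids of documents containing the term "quick"
```

Looking up a term is then a single dictionary access, which is why searching by term is cheap compared with scanning every document.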

Search Response Fields

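{
  "took": 1,  // time the whole search request took, in milliseconds
  "timed_out": false,  // whether the search timed out
  "_shards": { // how many shards were searched, with counts of successful and failed shards
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": { // the actual search results
    "total": 1, // total number of matching documents
    "max_score": 0.83355963, // highest _score among the matching documents
    "hits": [ // the matching documents in full, sorted by _score in descending order
      {
        "_index": "test", // index
        "_type": "test", // type
        "_id": "xtoffH0BS_BXL2CbFNQ_", // document id
        "_score": 0.83355963, // how well the document matches the query
        "_source": { // the original document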
          "name": "新浪军事22"
        }
      }
    ]
  }
}
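
As a rough illustration of how a client might read these fields, the sketch below parses a copy of the response above (with the comments removed, since JSON itself has no comments) using only the Python standard library. The helper name summarize is hypothetical. It assumes the 6.x response shape, where hits.total is a plain number; from 7.x onward it is an object instead.

```python
import json

# The sample response from above, without the explanatory comments.
response_text = """
{
  "took": 1,
  "timed_out": false,
  "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
  "hits": {
    "total": 1,
    "max_score": 0.83355963,
    "hits": [
      {"_index": "test", "_type": "test", "_id": "xtoffH0BS_BXL2CbFNQ_",
       "_score": 0.83355963, "_source": {"name": "新浪军事22"}}
    ]
  }
}
"""


def summarize(response):
    """Return (total, [(id, score, source), ...]) from a 6.x-style search response."""
    hits = response["hits"]
    rows = [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]]
    return hits["total"], rows


response = json.loads(response_text)
total, rows = summarize(response)
# A search is only complete if no shard failed and it did not time out.
all_shards_ok = response["_shards"]["failed"] == 0 and not response["timed_out"]
print(total, rows, all_shards_ok)
```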

Creating Documents

# Let Elasticsearch generate a unique _id automatically:
POST /test/test
{
 "name":"我是谁，谁是我" 
}

# Response:
{
  "_index": "test",
  "_type": "test",
  "_id": "gLKe4X4BLI5cjpF7njkO", // auto-generated id
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

# Set the id manually
PUT /test/test/465?op_type=create
{
 "name":"我是谁，谁是我" 
}
or
PUT /test/test/465/_create
{
 "name":"我是谁，谁是我" 
}
If the request to create the new document succeeds, it returns:
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
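On the other hand, if a document with the same _index, _type, and _id already exists, Elasticsearch responds with 409 Conflict and an error message like the following. (This sample comes from an older release and refers to index website, type blog, id 123; on 6.x the error type is reported as version_conflict_engine_exception.)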
{
   "error": {
      "root_cause": [
         {
            "type": "document_already_exists_exception",
            "reason": "[blog][123]: document already exists",
            "shard": "0",
            "index": "website"
         }
      ],
      "type": "document_already_exists_exception",
      "reason": "[blog][123]: document already exists",
      "shard": "0",
      "index": "website"
   },
   "status": 409
}

Updating Documents

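Key point: documents in Elasticsearch are immutable; they cannot be modified in place. To update an existing document, it has to be reindexed or replaced. The update actually proceeds as follows: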

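Retrieve the JSON from the old document
Change that JSON
Delete the old document
Index a new document
Internally, Elasticsearch marks the old document as deleted and adds an entirely new document. The old version can no longer be accessed, but it does not disappear immediately; as you continue to index more data, Elasticsearch cleans up these deleted documents in the background.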
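
The four steps and the deferred cleanup can be mimicked with a toy in-memory store. This is a minimal sketch for intuition only: the class ToyStore and all of its methods are hypothetical and do not reflect how Elasticsearch or Lucene are implemented.

```python
class ToyStore:
    """In-memory stand-in for an index; not how Elasticsearch is implemented."""

    def __init__(self):
        self.segments = []  # append-only list of [doc_id, version, source, deleted]
        self.live = {}      # doc_id -> position of the current copy in segments

    def index(self, doc_id, source):
        """Store a full document; an existing copy is only marked as deleted."""
        version = 1
        if doc_id in self.live:
            old = self.segments[self.live[doc_id]]
            old[3] = True          # mark the old copy as deleted, keep it on "disk"
            version = old[1] + 1
        self.segments.append([doc_id, version, dict(source), False])
        self.live[doc_id] = len(self.segments) - 1
        return version

    def get(self, doc_id):
        return self.segments[self.live[doc_id]][2]

    def partial_update(self, doc_id, changes):
        doc = dict(self.get(doc_id))    # 1. retrieve the JSON of the old document
        doc.update(changes)             # 2. change it
        return self.index(doc_id, doc)  # 3 and 4. delete the old copy, index a new one

    def merge(self):
        """Background cleanup: physically drop the copies marked as deleted."""
        self.segments = [s for s in self.segments if not s[3]]
        self.live = {s[0]: i for i, s in enumerate(self.segments)}


store = ToyStore()
store.index("465", {"title": "My first blog entry"})
new_version = store.partial_update("465", {"tags": ["testing"]})
print(new_version, store.get("465"), len(store.segments))
```

After the partial update both copies still exist in store.segments; only a later call to merge() removes the one marked as deleted, which mirrors the deferred cleanup described above.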

# Replace the whole document; any existing field that is not in the request body is dropped
PUT /test/test/465
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}
# Response
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}

# Update only some fields of the document
POST /test/test/465/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "title": "天天向上"
   }
}
# Response
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 4,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 3,
  "_primary_term": 1
}

Deleting Documents

# Delete a document by id
DELETE /test/test/uZOYM34BLI5cjpF7dwsz
# Response
{
  "_index": "test",
  "_type": "test",
  "_id": "uZOYM34BLI5cjpF7dwsz",
  "_version": 6,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 8,
  "_primary_term": 1
}
# Delete the documents that match a query
POST test/test/_delete_by_query
{
  "query": {
    "match": {
      "name": "新浪军事22"
    }
  }
}
# Response
{
  "took": 477,
  "timed_out": false,
  "total": 4,
  "deleted": 4,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": []
}

Querying Documents

Getting a Document by ID

# Get by id
GET /test/test/465?pretty  // the pretty parameter only formats the response for readability; it has no other effect
# Response:
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 4,
  "found": true,
  "_source": {
    "title": "天天向上",
    "text": "I am starting to get the hang of this...",
    "date": "2014/01/02",
    "tags": [
      "testing"
    ]
  }
}

# Get by id and return only the specified fields
GET /test/test/465?_source=title,text
# Response
{
  "_index": "test",
  "_type": "test",
  "_id": "465",
  "_version": 4,
  "found": true,
  "_source": {
    "text": "I am starting to get the hang of this...",
    "title": "天天向上"
  }
}
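# Retrieve multiple documents, possibly from different indices
The mget API takes a docs array as its parameter. Each element contains the metadata of a document to retrieve: _index, _type, and _id. To retrieve only one or more specific fields, name them with the _source parameter.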
GET /_mget
{
  "docs" : [
    {
      "_index":"test",
      "_type":"test",
      "_id":"gLKe4X4BLI5cjpF7njkO"
    },
    {
      "_index":"test01",
      "_type":"test01",
      "_id":123
    }
  ]
}
# If all the documents you want are in the same _index (or even the same _type), you can specify a default /_index or /_index/_type in the URL.

You can still override these values in individual entries:
GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}
If all the documents have the same _index and _type, you can pass just an ids array instead of the full docs array:
GET /test/test/_mget
{
   "ids" : [ "gLKe4X4BLI5cjpF7njkO", "3ZNyM34BLI5cjpF7LQQ3" ]  
}

Querying All Documents

GET /test/test/_search

Querying by Condition

GET /test/test/_search
{
    "query" : {
        "match" : { 
            "name" : "我谁"
        }
    }
}
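# Notes: match: full-text match; the query text is analyzed (tokenized)
         match_phrase: matches an exact sequence of words, i.e. a phrase
         match_all: matches all documents, i.e. no filtering
         match_phrase_prefix: phrase match in which the last term is treated as a prefix
         multi_match: runs a match query against multiple fields
         "multi_match": {
             "query": "天天向上",
             "fields": ["title","text"]
         }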

Highlighting

GET /test/test/_search
{
    "query" : {
        "match_phrase" : {
            "name" : "我谁"
        }
    },
    "highlight": {
        "fields" : {
            "name" : {}
        }
    }
}

Pagination

GET test/_search
{
  "from": 0, // number of initial results to skip; defaults to 0
  "size": 2 // number of results to return; defaults to 10
}
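
To show how from and size relate to a page number, here is a small sketch. The helper names page_params and paginate are hypothetical, and the slicing runs on a local list purely for illustration; in a real search Elasticsearch applies from and size itself.

```python
def page_params(page, page_size=10):
    """Translate a 1-based page number into from/size values for a search body."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {"from": (page - 1) * page_size, "size": page_size}


def paginate(results, params):
    """Apply from/size to an already sorted result list (local illustration only)."""
    start = params["from"]
    return results[start:start + params["size"]]


ranked = ["a", "b", "c", "d", "e"]              # hypothetical, already sorted hits
first = paginate(ranked, page_params(1, 2))     # skips 0 results, returns 2
third = paginate(ranked, page_params(3, 2))     # skips 4 results, returns the rest
print(first, third)
```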

Query Clauses

match_all Query

The match_all query simply matches all documents. It is the default query when no query is specified:

{ "match_all": {}}

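It is often used together with a filter, for example to retrieve all emails in the inbox folder. All documents are considered equally relevant, so each receives a neutral _score of 1.

match Query

If you run a match query on a full-text field, it analyzes the query string with the correct analyzer for that field before executing the search: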

{ "match": { "tweet": "About Search" }}

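If you use it on a field that holds an exact value, such as a number, a date, a boolean, or a not_analyzed string field (a keyword field in 6.x), it searches for exactly that value: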

{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}

For exact-value lookups you will probably want a filter clause instead of a query, because filters can be cached.

multi_match Query

The multi_match query runs the same match query against multiple fields:
{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

range Query

The range query finds numbers or dates that fall within a specified range:

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

The allowed operators are:

gt greater than
gte greater than or equal to
lt less than
lte less than or equal to
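
The meaning of the four operators can be spelled out with a short sketch. The function in_range and the OPS table are hypothetical names used only for this illustration; the operator module is part of the Python standard library.

```python
import operator

# Each range operator mapped to the comparison it stands for.
OPS = {
    "gt": operator.gt,    # greater than
    "gte": operator.ge,   # greater than or equal to
    "lt": operator.lt,    # less than
    "lte": operator.le,   # less than or equal to
}


def in_range(value, bounds):
    """Return True if value satisfies every bound, e.g. {"gte": 20, "lt": 30}."""
    return all(OPS[name](value, limit) for name, limit in bounds.items())


bounds = {"gte": 20, "lt": 30}   # the same bounds as the age example above
print([age for age in (19, 20, 29, 30) if in_range(age, bounds)])
```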

term Query

The term query is used for exact-value matching, whether the value is a number, a date, a boolean, or a not_analyzed string:

{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}

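The term query performs no analysis on the input text, so it looks for exactly the value given.

terms Query

The terms query is the same as the term query, but lets you specify multiple values to match. A document matches if the field contains any of the specified values: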

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

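Like the term query, the terms query does not analyze the input text. It looks for exact matches, including differences in case, accents, and whitespace.

Caveat: using term or terms on a string field can give unexpected results, because string (text) fields are analyzed by default. For example: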

GET test/test/_search
{
  "query": {
    "term": {
      "name": "请注意"
    }
  }
}
# Response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

GET test/test/_search
{
  "query": {
    "term": {
      "name": "请"
    }
  }
}
# Response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.80259144,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "3ZNyM34BLI5cjpF7LQQ3",
        "_score": 0.80259144,
        "_source": {
          "name": "请注意",
          "age": "456",
          "url": "第二次更新"
        }
      }
    ]
  }
}

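One would expect the first query to return a hit, but it does not. The name field is analyzed by default, and the default standard analyzer splits Chinese text into single-character terms, so the index contains the terms 请, 注, and 意 but no term 请注意. That is also why the second query, for 请, finds the document.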
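
The difference between the two lookups can be reproduced with a toy index. This is a minimal sketch: the function analyze is an assumed stand-in that emits one term per character, approximating what the standard analyzer does with Chinese text, and index_field, term_query, and match_query are hypothetical helpers, not Elasticsearch APIs.

```python
def analyze(text):
    """Assumed stand-in for the standard analyzer on Chinese text:
    one lowercased term per character, skipping whitespace and punctuation."""
    return [ch.lower() for ch in text if not ch.isspace() and ch not in "，。,."]


def index_field(docs):
    """Build term -> set of document ids from analyzed field values."""
    inverted = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            inverted.setdefault(term, set()).add(doc_id)
    return inverted


def term_query(inverted, value):
    # term: the value is looked up exactly as given, with no analysis.
    return sorted(inverted.get(value, set()))


def match_query(inverted, text):
    # match: the query text is analyzed first; any resulting term may match.
    ids = set()
    for term in analyze(text):
        ids |= inverted.get(term, set())
    return sorted(ids)


inverted = index_field({"3ZNyM34BLI5cjpF7LQQ3": "请注意"})
print(term_query(inverted, "请注意"))   # no such term in the index
print(term_query(inverted, "请"))       # single-character term exists
print(match_query(inverted, "请注意"))  # analyzed query finds the document
```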

exists

The exists query finds documents in which the specified field has a value. It is conceptually similar to IS NOT NULL in SQL:

{
    "exists":   {
        "field":    "title"
    }
}

bool

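Combines multiple query clauses into a single boolean query. The must clauses behave like AND in SQL, must_not like NOT, and should like OR: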

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}
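
The way the three clause types interact can be sketched over a handful of in-memory documents. Everything here is hypothetical: the function bool_query, the sample documents, and the scoring rule (one point per matching must or should clause) are simplifications chosen for the example, not Elasticsearch's relevance scoring. The sketch does reflect one real behavior: when a must clause is present, should clauses are optional and only raise the score.

```python
def bool_query(docs, must=(), must_not=(), should=()):
    """Filter docs with must/must_not predicates and rank by matched clauses.

    Simplified scoring assumption: one point per matching must or should clause.
    """
    results = []
    for doc in docs:
        if not all(pred(doc) for pred in must):
            continue  # every must clause has to match
        if any(pred(doc) for pred in must_not):
            continue  # no must_not clause may match
        score = len(must) + sum(1 for pred in should if pred(doc))
        results.append((score, doc["id"]))
    # Highest score first; ties broken by id for a stable order.
    return [doc_id for score, doc_id in sorted(results, key=lambda r: (-r[0], r[1]))]


docs = [
    {"id": 1, "title": "how to make millions", "tag": "starred", "date": "2013-05-01"},
    {"id": 2, "title": "how to make millions", "tag": "spam", "date": "2014-02-01"},
    {"id": 3, "title": "how to make millions", "tag": "starred", "date": "2014-03-01"},
    {"id": 4, "title": "cooking basics", "tag": "starred", "date": "2014-03-01"},
]
ranked = bool_query(
    docs,
    must=[lambda d: "millions" in d["title"]],
    must_not=[lambda d: d["tag"] == "spam"],
    should=[lambda d: d["tag"] == "starred", lambda d: d["date"] >= "2014-01-01"],
)
print(ranked)
```

Document 3 ranks first because it satisfies both should clauses, document 1 satisfies only one, and documents 2 and 4 are excluded by must_not and must respectively.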

should

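A document matches if it satisfies any one of the clauses. This is similar to OR in SQL. A should clause is written inside a bool query:

{
    "bool": {
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}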

filter

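The clause must match, but it runs in non-scoring, filtering mode. Filter clauses contribute nothing to the score; they only include or exclude documents according to the filter criteria.

"filter": {
    "range": { "date": { "gte": "2014-01-01" }}
}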

constant_score

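It applies a constant score to every matching document. It is often used when you only need to run a filter and no other (scoring) query.
You can use it in place of a bool query that has only a filter clause. Performance is identical, but the query becomes simpler and clearer.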

{
    "constant_score":   {
        "filter": {
            "term": { "category": "ebooks" } 
        }
    }
}

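Here the term query is placed inside constant_score, which turns it into a non-scoring filter.

Validating Queries

Queries can become quite complex and, especially when combined with different analyzers and field mappings, hard to follow. The validate-query API can be used to check whether a query is valid.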

GET test/test/_validate/query
{
  "query": {
    "trod": {
      "filter": {
        "term": {
          "age": "456"
        } 
      }
    } 
  }
}

The response to the validate request above tells us that the query is invalid:

{
  "valid": false
}

Understanding Errors

To find out why a query is invalid, add the explain parameter to the query string:

GET /gb/tweet/_validate/query?explain 
{
   "query": {
      "tweet" : {
         "match" : "really powerful"
      }
   }
}

The explain parameter provides more information about why the query is invalid.
Evidently we mixed up the query type (match) with the field name (tweet):

{
  "valid" :     false,
  "_shards" :   { ... },
  "explanations" : [ {
    "index" :   "gb",
    "valid" :   false,
    "error" :   "org.elasticsearch.index.query.QueryParsingException:
                 [gb] No query registered for [tweet]"
  } ]
}

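Understanding Queries

For a valid query, the explain parameter returns a human-readable description of the query, which is very useful for understanding exactly how Elasticsearch interprets it: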

GET /_validate/query?explain
{
   "query": {
      "match" : {
         "tweet" : "really powerful"
      }
   }
}

An explanation is returned for each index queried, because each index can have its own mappings and analyzers:

{
  "valid" :         true,
  "_shards" :       { ... },
  "explanations" : [ {
    "index" :       "us",
    "valid" :       true,
    "explanation" : "tweet:really tweet:powerful"
  }, {
    "index" :       "gb",
    "valid" :       true,
    "explanation" : "tweet:realli tweet:power"
  } ]
}

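From the explanation you can see that the match query for really powerful has been rewritten as two single-term queries against the tweet field, one for each term produced from the query string.
For the us index the two terms are really and powerful, while for the gb index they are realli and power. The difference arises because the tweet field in the gb index was changed to use the english analyzer.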

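Sorting

Analyzed fields (string fields are analyzed by default) cannot be used for sorting directly; sort on the field's keyword sub-field instead.

GET test/test/_search
{
   "sort": [
    {"age": {"order": "desc"}},
    { "_score": { "order": "desc" }},
    {"name.keyword": {"order": "desc"}} // name is a string field
  ]
}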

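¹ Within an Elasticsearch index, fields that have the same name in different types are backed by the same Lucene field internally, so such a field must have the same mapping in every type.
For example, suppose one index has two types, user and person, and both contain a "deleted" field.
It is then impossible for "deleted" to be a number in one type and a boolean in the other. In addition, storing documents that have few or no fields in common in the same index makes the data sparse, which interferes with Lucene's ability to compress it efficiently, takes more space, and hurts indexing efficiency.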
