Elasticsearch is a real-time, distributed search and analytics engine. It makes it possible to work with big data at speeds never seen before. It is used for full-text search, structured search, analytics, and any combination of the three:
1. Wikipedia uses Elasticsearch to provide full-text search with highlighted keywords, as well as search-as-you-type and did-you-mean search suggestions.
2. The Guardian uses Elasticsearch to combine visitor logs with social-network data, giving its editors real-time feedback about the public's response to newly published articles.
3. Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions and answers.
4. GitHub uses Elasticsearch to search 130 billion lines of code.

Table of Contents
1. Introduction
2. Getting Started
3. Life Inside a Cluster
4. Data In, Data Out
5. Distributed Document Store
6. Searching: The Basic Tools
7. Mapping and Analysis
8. Full-Body Search and the Query DSL
9. Sorting and Relevance
10. Distributed Search Execution
11. Index Management

Elasticsearch: The Definitive Guide (Chinese translation project)

This is the community translation of Elasticsearch: The Definitive Guide by Clinton Gormley and Zachary Tong. Translation coordinated by Looly, with thanks to the contributing translators: @iridiumcao @cvvnx1 @conan007ai @sailxjx @wxlfight @xieyunzi @xdream86 @williamzhao @dingusxp. Contact: loolly@gmail.com. Project repositories: https://github.com/looly/elasticsearch-definitive-guide-cn and http://git.oschina.net/loolly/elasticsearch-definitive-guide-cn

Read online: http://es.xiaoleilu.com/

Translation notes: to keep the text consistent, key terms such as index, type, token, filter, and analyser are translated uniformly throughout. Contributions are welcome via GitHub issues and pull requests. The pull-request workflow follows the one used by @numbbbbb's translation of The Swift Programming Language:

1. Fork the project.
2. Clone your fork locally.
3. Run git remote add looly git@github.com:looly/elasticsearch-definitive-guide-cn.git to register the upstream repository as a remote.
4. Run git pull looly master to sync the latest changes.
5. Translate.
6. Commit and push to your fork: git push origin master.
7. Log in to GitHub, open your fork's page, and click the pull request button.

Steps 1 to 3 only need to be performed once. Run step 4 before each translation session so your copy stays current, then repeat steps 5 to 7.

Notes: 1. Finished chapters are published to GitBook as translation progresses. 2. Please translate rather than copy existing translations from elsewhere.

Getting Started

Elasticsearch is a real-time, distributed search and analytics engine that lets you explore your data at a speed and scale never before possible. It is used for full-text search, structured search, and analytics: Wikipedia uses it for full-text search with keyword highlighting, search-as-you-type, and did-you-mean suggestions; The Guardian combines user logs with social data to give its editors real-time feedback on how the public is responding to new articles; Stack Overflow combines full-text search with geolocation and more-like-this queries; GitHub searches 130 billion lines of code with it.

Elasticsearch is not only for big business. It has also helped startups like DataDog and Klout turn prototypes into scalable products. It runs happily on a laptop, or scales out to hundreds of servers handling petabytes of data.

None of the individual pieces is new: full-text search, analytics systems, and distributed databases have existed for years. What Elasticsearch does is combine these technologies into one coherent, real-time application, lowering the barrier to entry so that anyone can use them.

Getting started with Elasticsearch is easy, but doing more with your data raises questions your current data store probably cannot answer well. Can you filter full-text results by time? Can you handle synonyms? Can you rank documents by how relevant they are? Can you run analytics on live data in real time? These are the things Elasticsearch was built for.

A search engine that is clumsy to query makes the data it stores far less useful. Read on and get to know Elasticsearch.

Elasticsearch is built on top of Apache Lucene(TM), a full-text search-engine library that is widely regarded as the most advanced, highest-performance, and most fully featured search library in existence, whether open source or proprietary.

But Lucene is only a library. To use it you must work in Java and embed Lucene directly in your application, and you need a solid grounding in information retrieval to use it well: Lucene is very complex. Elasticsearch is also written in Java and uses Lucene internally for all of its indexing and searching, but it hides Lucene's complexity behind a simple RESTful API, making full-text search easy.

Elasticsearch is much more than Lucene, though. It can also be described as:
- a distributed real-time document store, where every field is indexed and searchable
- a distributed search engine with real-time analytics
- a service capable of scaling to hundreds of servers and petabytes of structured and unstructured data

All of this is packaged into a standalone server that your application talks to through a simple RESTful API, from any programming language, with many ready-made clients. It is easy to get started with, ships with sensible defaults that hide complexity from beginners, and remains fully configurable for production use. It is released under the Apache 2 license: download it, use it, and modify it free of charge, whether for learning or in production.

A bit of history: some years ago, a newly unemployed developer named Shay Banon followed his wife to London, where she was studying to become a chef. He started building a recipe search application for her. Working directly with Lucene proved hard, so he wrote an abstraction layer called Compass so that Java programmers could add search to their applications more easily. Later, needing a scalable, real-time search service for a high-throughput environment, he rewrote Compass from scratch as a standalone server: Elasticsearch. The first public release came in February 2010, and it has since become one of the most popular projects on GitHub, with more than 300 code contributors. A company has been founded around Elasticsearch to provide commercial support and develop new features. And Shay's wife is still waiting for her recipe search engine...

Installing Elasticsearch

The best way to understand Elasticsearch is to run it. Elasticsearch requires a recent version of Java; download it from www.java.com if needed. Then download the latest version of Elasticsearch from elasticsearch.org/download:

curl -L -O http://download.elasticsearch.org/PATH/TO/VERSION.zip <1>
unzip elasticsearch-$VERSION.zip
cd elasticsearch-$VERSION

1. Get the URL of the current version from elasticsearch.org/download.

For production use you will also find Debian and RPM packages on the downloads page, as well as a Puppet module and a Chef cookbook.

Installing Marvel

Marvel is a management and monitoring tool for Elasticsearch, free for development use. It ships with Sense, an interactive console that makes it easy to talk to Elasticsearch from your browser. Many of the code examples in the online version of this book include a View in Sense link that opens the example in Sense directly. Marvel is not required, but it must be installed into a running copy of Elasticsearch for those links to work. To install it, run the following in the Elasticsearch directory:

./bin/plugin -i elasticsearch/marvel/latest

You probably don't want Marvel to monitor your local cluster, so disable data collection with:

echo 'marvel.agent.enabled: false' >> ./config/elasticsearch.yml

Running Elasticsearch

Elasticsearch is now ready to run. Start it in the foreground with:

./bin/elasticsearch

Add -d to run it as a daemon in the background. Then open another terminal window and test it:

curl 'http://localhost:9200/?pretty'

10. { "status": 200, "name": "Shrunken Bones", "version": { "number": "1.4.0", "lucene_version": "4.10" }, "tagline": "You Know, for Search" } 这说 ELasticsearch 经启动 运 们 实验 节 节 (node) 运 Elasticsearch实 (cluster) 组 cluster.name 节 们协 转 扩 节 组 cluster.name 认值 这样 启动 节 络 另 过 config/ 录 elasticsearch.yml 启ELasticsearch 这 Elasticsearch 运 Ctrl-C 键终 调 shutdown API 闭 curl -XPOST 'http://localhost:9200/_shutdown' 查 Marvel Sense Marvel 为 监 浏览 过 访问 http://localhost:9200/_plugin/marvel/ Marvel 过 击 dashboards 单 访问Sense 发 访问 http://localhost:9200/_plugin/marvel/sense/

Talking to Elasticsearch

How you talk to Elasticsearch depends on whether you use Java.

Java API: Elasticsearch provides two built-in clients for Java users. The node client joins the cluster as a non-data node: it holds no data itself, but it knows where every document lives and can forward requests directly to the right node. The lighter transport client does not join the cluster; it simply forwards requests to a node in the cluster. Both Java clients talk to the cluster over port 9300, using the native Elasticsearch transport protocol, which cluster nodes also use to communicate with one another.

TIP: The Java client must be the same major version of Elasticsearch as the nodes; otherwise, they may not understand each other. See the Java API reference for details.

RESTful API with JSON over HTTP: all other languages can communicate with Elasticsearch over port 9200 using a RESTful API, accessible with any web client. You can even drive Elasticsearch from the command line with curl.

NOTE: Official Elasticsearch clients exist for several languages, including Groovy, JavaScript, .NET, PHP, Perl, Python, and Ruby, and there are numerous community-provided clients and integrations.

A request to Elasticsearch is composed like any other HTTP request:

curl -X<VERB> '<PROTOCOL>://<HOST>:<PORT>/<PATH>?<QUERY_STRING>' -d '<BODY>'

VERB: the HTTP method: GET, POST, PUT, HEAD, or DELETE.
PROTOCOL: http, or https if you have an https proxy in front of Elasticsearch.
HOST: the hostname of any node in the cluster, or localhost for a node on your local machine.
PORT: the port of the Elasticsearch HTTP service, 9200 by default.
QUERY_STRING: optional query-string parameters, such as ?pretty, which pretty-prints the JSON response to make it easier to read.
BODY: a JSON-encoded request body, if the request needs one.

For example, to count the number of documents in the cluster:

curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
    "query": {
        "match_all": {}
    }
}

'

Elasticsearch returns an HTTP status code such as 200 OK and, except for HEAD requests, a JSON response body:

{
    "count" : 0,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    }
}

The response above doesn't show the HTTP headers, because we didn't ask curl to display them. To see them, use the -i flag:

curl -i -XGET 'localhost:9200/'

For the rest of this book, we use an abbreviated format that omits the repetitive parts of the curl request. Instead of the full command:

curl -XGET 'localhost:9200/_count?pretty' -d '
{
    "query": {
        "match_all": {}
    }
}'

we write:

GET /_count
{
    "query": {
        "match_all": {}
    }
}

This is, in fact, the same format used by Sense.

Document oriented

Objects in an application are seldom simple lists of keys and values. More often they are complex data structures that may contain dates, geo locations, other objects, or arrays of values.

Sooner or later you will want to store these objects in a database. Doing so with a relational database means flattening the object to fit a schema of rows and columns, usually across several tables, and reassembling it on every query. Elasticsearch takes a different approach: it is document oriented, meaning it stores whole objects as documents, and it indexes the contents of every document so it is searchable. You index, search, sort, and filter documents, not rows and columns. This is a fundamentally different way of thinking about data, and it is one reason Elasticsearch can perform complex full-text search.

JSON

Elasticsearch uses JavaScript Object Notation (JSON) as the serialization format for documents. JSON has become the de facto standard in the NoSQL world; it is simple, concise, and easy to read. Consider this JSON document representing a user object:

{
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
        "bio": "Eco-warrior and defender of the weak",
        "age": 25,
        "interests": [ "dolphins", "whales" ]
    },
    "join_date": "2014/05/01"
}

Although the user object is complex, its structure and meaning are fully preserved in the JSON. Converting an object to JSON for indexing is much simpler than flattening it into a table schema.

NOTE: Almost every language has modules that convert arbitrary data structures into JSON; the details vary by language, so look for "serialization" or "marshalling" in your language's documentation. The official Elasticsearch clients handle this conversion for you automatically.

Finding your feet

To give you a feel for what Elasticsearch can do, this tutorial walks through basic indexing, search, and aggregations. We introduce new terminology and concepts along the way; don't worry if you don't absorb it all at once, as later chapters cover each concept in depth. Dive in and enjoy the ride.

Let's build an employee directory

We have just been hired by Megacorp, and our first task is to build an employee directory that supports real-time collaboration. The directory needs to:
- allow data to contain multi-value tags, numbers, and full text
- retrieve the complete record of any employee
- allow structured search, such as finding employees over 30
- allow simple full-text search and phrase search
- return highlighted snippets from the text of matching documents
- let management run analytics over the data

Indexing employee documents

The first order of business is storing employee data. In Elasticsearch, storing a document is called indexing, but first we must decide where the document lives. An Elasticsearch document belongs to a type, and types live inside an index. You can draw a rough parallel to a traditional relational database:

Relational DB  -> Databases -> Tables -> Rows      -> Columns
Elasticsearch  -> Indices   -> Types  -> Documents -> Fields

An Elasticsearch cluster can contain multiple indices (databases), each containing multiple types (tables), each holding multiple documents (rows) with multiple fields (columns).

"Index" means several things in Elasticsearch, so some disambiguation is needed:
- Index (noun): as described above, an index is like a database in a relational store, a place to keep related documents. The plural is indices or indexes.
- Index (verb): to index a document is to store it in an index (noun) so that it can be retrieved and queried. Much like SQL's INSERT, except that indexing a document with an existing ID replaces the old one.
- Inverted index: relational databases add an index, such as a B-tree, to specific columns to speed up retrieval. Elasticsearch and Lucene use a structure called an inverted index for the same purpose. By default, every field in a document is indexed (has an inverted index) and is therefore searchable. We discuss inverted indexes in detail later in the book.

So, to build the employee directory, we will:

- index a document per employee, containing all the details of that employee
- give each document the type employee
- store the employee type in the megacorp index
- store the megacorp index in our Elasticsearch cluster

In practice, all of this takes a single command:

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

Notice the path /megacorp/employee/1, which contains three pieces of information: megacorp is the index name, employee is the type name, and 1 is the ID of this particular employee.

The request body, the JSON document, contains everything about the employee: his name is John Smith, he is 25 years old, and he enjoys rock climbing.

Simple! We didn't have to perform any administrative steps first, such as creating an index or declaring field types. Elasticsearch ships with sensible defaults, so everything necessary happened automatically in the background.

Before moving on, let's add a few more employees to the directory:

PUT /megacorp/employee/2
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}

PUT /megacorp/employee/3
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}

Retrieving a document

Now that we have some data in Elasticsearch, we can work on the business requirements. The first is retrieving a single employee's record. That is easy in Elasticsearch: we simply execute an HTTP GET request and specify the "address" of the document, that is, its index, type, and ID:

GET /megacorp/employee/1

The response contains some metadata plus John Smith's original JSON document in the _source field:

{
  "_index" :   "megacorp",
  "_type" :    "employee",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "first_name" :  "John",
      "last_name" :   "Smith",
      "age" :         25,
      "about" :       "I love to go rock climbing",
      "interests":  [ "sports", "music" ]
  }
}

Just as we use HTTP GET to retrieve a document, we can use DELETE to delete it, HEAD to check whether it exists, and PUT to replace it with a new version. Getting data out is as simple as putting it in.

Search lite

A GET is fairly simple: you get back the document you asked for. Let's try something more advanced, such as a simple search for all employees:

GET /megacorp/employee/_search

We still use the megacorp index and the employee type, but instead of a document ID we use the _search endpoint. The response includes all three documents in the hits array. By default, a search returns the top 10 results:

{
   "took":      6,
   "timed_out": false,
   "_shards": { ... },
   "hits": {
      "total":      3,
      "max_score":  1,
      "hits": [
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "3",
            "_score":         1,
            "_source": {
               "first_name":  "Douglas",
               "last_name":   "Fir",
               "age":         35,
               "about":       "I like to build cabinets",
               "interests": [ "forestry" ]
            }
         },

17. { "_index": "megacorp", "_type": "employee", "_id": "1", "_score": 1, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } }, { "_index": "megacorp", "_type": "employee", "_id": "2", "_score": 1, "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } } ] } } 响应 仅 诉 们哪 这 — 们 给 户 结 时 让 们 “Smith” 员 这 们 轻 级 这 查询 (query string) 为 们 传递URL 样 传递查询语 GET /megacorp/employee/_search?q=last_name:Smith 们 请 _search 键 查询语 传递给 q= 这样 为Smith 结 { ... "hits": { "total": 2, "max_score": 0.30685282, "hits": [ { ... "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] } }, { ... "_source": { "first_name": "Jane", "last_name": "Smith", "age": 32, "about": "I like to collect rock albums", "interests": [ "music" ] } } ] } }

Search with Query DSL

Query-string search is handy for ad hoc searches from the command line, but it has its limitations (see the Search Lite chapter). Elasticsearch provides a richer, more flexible query language called the Query DSL (Domain Specific Language), which allows us to build more complex and robust queries as JSON request bodies. The "Smith" search looks like this in the DSL:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

This returns the same results as before. You can see that a number of things have changed: instead of a query string we pass a JSON request body, which uses a match query, one of several query types we will learn about.

More complicated searches

Let's make the search a little more demanding. We still want employees named Smith, but only those older than 30. Our query gains a filter, which allows us to execute structured comparisons efficiently:

GET /megacorp/employee/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "age" : { "gt" : 30 } <1>
                }
            },
            "query" : {
                "match" : {
                    "last_name" : "smith" <2>
                }
            }
        }
    }
}

<1> This part is a range filter, which finds all ages greater than 30; gt stands for "greater than".
<2> This is the same match query we used before.

Don't worry about the syntax for now; we cover it in detail later. Just recognize that we added a filter for a range comparison and reused the earlier match query. The results now show only one employee, the 32-year-old Jane Smith:

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.30685282,
      "hits": [
         {
            ...
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":          32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]

   }
}

Full-text search

The searches so far have been simple: filtering by name and age. Let's try something harder, the kind of search that traditional databases find difficult: all employees who enjoy "rock climbing":

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

We use the same match query as before on the about field, and we get back two results:

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.16273327,
      "hits": [
         {
            ...
            "_score":         0.16273327, <1>
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_score":         0.016878016, <2>
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

<1><2> The relevance scores. By default, Elasticsearch sorts results by relevance score, a measure of how well each document matches the query. John Smith's about field clearly mentions "rock climbing", so he ranks highest. Why is Jane Smith in the results at all? Her about field mentions "rock" ("I like to collect rock albums") but not "climbing". Because only one of the two words matched, her _score is much lower than John's.

This is a good example of Elasticsearch searching inside full-text fields and ranking results by relevance. The concept of relevance is foreign to traditional databases, in which a record either matches a query or it doesn't.

Phrase search

Finding individual words in a field is all very well, but sometimes you want to match an exact sequence of words, a phrase. For instance, to return only employees whose record contains the words "rock" and "climbing" next to each other, we change the match query into a match_phrase query:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}

This, not surprisingly, returns only John Smith:

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         }
      ]
   }
}

Highlighting our searches

Many applications like to highlight the matched keywords in each result, so the user can see why the document matched. That is easy in Elasticsearch: add a highlight parameter to the same query:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

When we run this query, the hit gains a highlight section containing a snippet of the about field, with the matched words wrapped in <em></em> tags:

21.{ ... "hits": { "total": 1, "max_score": 0.23013961, "hits": [ { ... "_score": 0.23013961, "_source": { "first_name": "John", "last_name": "Smith", "age": 25, "about": "I love to go rock climbing", "interests": [ "sports", "music" ] }, "highlight": { "about": [ "I love to go <em>rock</em> <em>climbing</em>" <1> ] } } ] } } <1> 节阅读

Analytics

One final business requirement: allow management to run analytics over the directory. Elasticsearch has a feature called aggregations, which allow sophisticated analytics over your data, similar to GROUP BY in SQL but much more powerful.

For example, let's find the most popular interests among our employees:

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

Ignore the syntax for now and just look at the results:

{
   ...
   "hits": { ... },
   "aggregations": {
      "all_interests": {
         "buckets": [
            { "key": "music",    "doc_count": 2 },
            { "key": "forestry", "doc_count": 1 },
            { "key": "sports",   "doc_count": 1 }
         ]
      }
   }
}

We can see that two employees are interested in music, one in forestry, and one in sports. These aggregations are not precalculated; they are generated on the fly, in real time, from the documents that match the current query. If we want to know the popular interests of people named "Smith", we just add the appropriate query:

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

The all_interests aggregation is now scoped to the documents matching the query:

...

23. "all_interests": { "buckets": [ { "key": "music", "doc_count": 2 }, { "key": "sports", "doc_count": 1 } ] } 许 级汇总 让 们统计 兴 职员 龄 GET /megacorp/employee/_search { "aggs" : { "all_interests" : { "terms" : { "field" : "interests" }, "aggs" : { "avg_age" : { "avg" : { "field" : "age" } } } } } } 虽 这 结 杂 ... "all_interests": { "buckets": [ { "key": "music", "doc_count": 2, "avg_age": { "value": 28.5 } }, { "key": "forestry", "doc_count": 1, "avg_age": { "value": 35 } }, { "key": "sports", "doc_count": 1, "avg_age": { "value": 25 } } ] } 该 结 结 们 兴 该兴 员 现 兴 额 拥 avg_age 显 该兴 员 龄 还 语 觉 过这 杂 处 类

Tutorial conclusion

Hopefully this short tutorial was a good demonstration of what is possible in Elasticsearch, and of how easy it is to get started: no configuration, no schema design, just add data and start searching. It really is only scratching the surface; many capabilities, such as suggestions, geolocation, and fuzzy and partial matching, were left out to keep the tutorial short.

It probably also raised more questions than it answered. The rest of this book addresses those questions and fills in the details, so that you understand not only what Elasticsearch does, but how and why it works.

Distributed nature

At the start of this chapter we said that Elasticsearch can scale out to hundreds of servers and petabytes of data. The tutorial never mentioned any of this, and that is the point: Elasticsearch was designed to be distributed, and it hides the complexity from you.

It doesn't matter whether Elasticsearch runs on your laptop or on a cluster of a hundred nodes; the way you work with it is the same. Under the covers, Elasticsearch automatically:
- partitions your documents into different containers called shards, which can live on one node or on many
- balances these shards across the nodes in the cluster to spread the indexing and search load
- duplicates each shard, providing redundant copies of your data so that nothing is lost if hardware fails
- routes requests from any node in the cluster to the nodes that hold the data you need
- seamlessly integrates new nodes as the cluster grows, and redistributes shards to recover from node loss

As you read this book, you will encounter supplemental chapters about the distributed nature of Elasticsearch: how the cluster scales and handles failover, how documents are stored and distributed, and how distributed search is executed. You don't need to read them to use Elasticsearch, since the cluster manages itself, but they will round out your understanding. Feel free to skim them and revisit them when you need more depth.

By now you should have a feel for what you can accomplish with Elasticsearch, and how easy it is to get going.

Elasticsearch tries hard to have a short learning curve: you can learn the basics quickly and be productive with minimal up-front knowledge, thanks to sensible defaults. But the more you know about how it works, the more value you can extract from it.

The rest of this book takes you from beginner toward expert. Each chapter explains the essentials, but also the expert-level details, for instance where Elasticsearch needs to be told about your data to behave as you intend. If you are just getting started, you can safely skip the advanced sections on a first pass and rely on the defaults, then return to those chapters as your needs grow.

Life inside a cluster (supplemental)

This is the first of the supplemental chapters about the distributed operation of Elasticsearch. Here we explain commonly used terminology, such as cluster, node, and shard, how Elasticsearch scales out, and how it deals with hardware failure.

This chapter is not required reading; you can use Elasticsearch for a long time without worrying about shards, replication, and failover. But it will help you build a mental model of how things work, and you can skim it now and come back to it whenever you need a refresher.

Elasticsearch is built to be always available and to scale with your needs. Scale can come from buying bigger servers (vertical scale, or scaling up) or from buying more servers (horizontal scale, or scaling out).

Elasticsearch can take advantage of more powerful hardware, but vertical scale has its limits. Real scalability comes from horizontal scale: adding more nodes to the cluster and spreading the load and reliability among them.

With most databases, scaling horizontally usually requires a major overhaul of your application to make it work across multiple machines. In contrast, Elasticsearch is distributed by nature: it knows how to manage multiple nodes to provide scale and high availability, which means your application doesn't need to care.

In this chapter we explain how clusters, nodes, and shards interact to provide scale as data grows, and to guarantee availability when hardware fails.

An empty cluster

If we start a single node with no data and no indices, the cluster looks like Figure 1.

Figure 1: A cluster with one empty node

A node is a running instance of Elasticsearch, and a cluster consists of one or more nodes with the same cluster.name that share their data and workload. As nodes are added to or removed from the cluster, the cluster reorganizes itself to spread the data evenly.

One node in the cluster is elected to be the master node, which is in charge of managing cluster-level changes, such as creating or deleting an index, or adding or removing a node. The master does not need to be involved in document-level changes or searches, so having a single master is not a bottleneck even as traffic grows. Any node can become the master; our example cluster has only one node, so that node plays the master role.

As users, we can talk to any node in the cluster, including the master. Every node knows where every document lives and can forward our request directly to the nodes that hold the data. Whichever node we talk to takes responsibility for gathering the responses from the data-holding nodes and returning the final result to the client. All of this is handled transparently by Elasticsearch.
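In practice, giving your cluster a unique cluster.name keeps a colleague's laptop from accidentally joining it. A minimal sketch of the relevant settings in config/elasticsearch.yml (the names here are illustrative):

# config/elasticsearch.yml
cluster.name: megacorp-production   # nodes with the same name form one cluster
node.name: node-1                   # optional human-friendly node name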

Cluster health

Many statistics can be monitored in an Elasticsearch cluster, but the single most important one is cluster health, which reports a status of green, yellow, or red:

GET /_cluster/health

On an empty cluster with no indices, this returns something like:

{
   "cluster_name":          "elasticsearch",
   "status":                "green", <1>
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 0,
   "active_shards":         0,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}

<1> The status field is the one we are interested in.

status provides an overall indication of how the cluster is functioning. The three colors mean:

green: all primary and replica shards are active
yellow: all primary shards are active, but not all replica shards are
red: not all primary shards are active

In the rest of this chapter, we explain what primary and replica shards are, and what these colors mean in a live environment.
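The health API can also block until the cluster reaches a desired state, which is useful in deployment scripts. A small sketch using the standard wait_for_status parameter:

GET /_cluster/health?wait_for_status=green&timeout=50s

This returns as soon as the cluster goes green, or after 50 seconds, whichever comes first; check the timed_out field of the response to see which happened.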

Add an index

To add data to Elasticsearch, we need an index, a place to store related data. In reality, an index is just a logical namespace that points to one or more physical shards.

A shard is a low-level worker unit that holds a slice of all the data in the index. We explain how shards work in detail later; for now it is enough to know that a shard is a single instance of Lucene, and a complete search engine in its own right. Our documents are stored and indexed in shards, but our applications never talk to shards directly; they talk to an index.

Shards are how Elasticsearch distributes data around the cluster. Think of shards as containers of data: documents are stored in shards, and shards are allocated to nodes. As the cluster grows or shrinks, Elasticsearch migrates shards between nodes automatically to keep the cluster balanced.

A shard is either a primary shard or a replica shard. Each document belongs to a single primary shard, so the number of primary shards determines the maximum amount of data the index can hold.

NOTE: There is no theoretical limit to how much data a primary shard can hold, but there are practical limits: hardware capacity, document size and complexity, indexing and query load, and your expected response times all play a part.

A replica shard is a copy of a primary shard. Replicas protect data against loss from hardware failure, and they serve read requests such as searching and document retrieval.

The number of primary shards is fixed when the index is created, but the number of replica shards can be changed at any time.

Let's create an index named blogs on our single-node cluster. By default, an index is assigned 5 primary shards, but for this demonstration we will use 3 primary shards and 1 replica (one replica of every primary shard):

PUT /blogs
{
   "settings" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 1
   }
}

Our single-node cluster with the blogs index now looks like the figure: Node 1 holds the three primary shards. If we check cluster-health, we see:

{
   "cluster_name":          "elasticsearch",
   "status":                "yellow", <1>
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 3,
   "active_shards":         3,

31. "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 3 <2> } <1> 态现 yellow <2> 们 还 节 态 yellow (primary shards)启动 运 —— 经 处 请 —— (replica shards)还 实 现 unassigned 态—— 们还 给 节 节 这 节 丢 现 们 经 备 导 丢 风险

Add failover

Running a single node means having a single point of failure: there is no redundancy. Fortunately, all we need to protect ourselves from data loss is to start another node.

Starting a second node

To try this out, you can start a second node exactly as you started the first (see Running Elasticsearch), even from the same directory; multiple nodes can share the same installation on one machine.

As long as the second node has the same cluster.name as the first (see ./config/elasticsearch.yml), it automatically discovers and joins the cluster. If it doesn't, check the logs to find out what went wrong: multicast may be disabled on your network, or a firewall may be blocking node communication.

With the second node running, all primary and replica shards are allocated: each of the three replica shards is now active, placed on the other node from its primary. Any newly indexed document is stored first on its primary shard and then copied in parallel to the associated replica shard(s), so the document can be retrieved from either copy, and we can lose either node without losing data.

cluster-health now shows a green status, with all 6 shards (3 primaries and 3 replicas) active:

{
   "cluster_name":          "elasticsearch",
   "status":                "green", <1>
   "timed_out":             false,
   "number_of_nodes":       2,
   "number_of_data_nodes":  2,
   "active_primary_shards": 3,
   "active_shards":         6,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}

<1> The cluster status is green.

Our cluster is now not only fully functional, but also always available.

Scale horizontally

What do we do when application demand grows? If we start a third node, the cluster reorganizes itself into a three-node cluster.

Shards are relocated automatically to spread the load: one shard each moves from Node 1 and Node 2 to the new Node 3, leaving two shards per node instead of three. Each node's hardware resources (CPU, RAM, I/O) are now shared between fewer shards, so every shard performs better.

A shard is a fully fledged search engine in its own right, capable of using all the resources of a node. With 6 shards in total (3 primaries and 3 replicas), our cluster can scale out to as many as 6 nodes, one shard per node, at which point every shard has 100% of its node's resources to itself.

Scaling even further

What if we want to scale beyond 6 nodes?

The number of primary shards is fixed at index-creation time, which effectively fixes the maximum amount of data the index can hold (the real maximum also depends on your data, hardware, and use case). However, read requests, that is, searches and document retrieval, can be served by a primary or by a replica shard, so the more copies of the data we have, the more search throughput we can handle.

The number of replica shards can be changed dynamically on a live index, allowing us to scale up or down as demand requires. Let's increase the number of replicas from the default of 1 to 2:

PUT /blogs/_settings
{
   "number_of_replicas" : 2
}

The blogs index now has 9 shards: 3 primaries and 6 replicas. This means we could scale out to a total of 9 nodes, one shard per node, for three times the search performance of the original 3-node cluster.

NOTE: Simply adding more replicas to the same number of nodes doesn't increase performance, because each shard then receives a smaller share of its node's resources; you need to add hardware as well to increase throughput. The extra replicas do increase availability, though: with this configuration we can lose two nodes without losing any data.

Coping with failure

We said that Elasticsearch can cope with node failure, so let's try it out by killing the first node.

The node we killed was the master. A cluster must have a master node in order to function, so the first thing that happens is that the remaining nodes elect a new master: Node 2.

Primary shards 1 and 2 were lost when Node 1 went down, and an index cannot function while any primary shard is missing. Had we checked cluster health at that exact moment, we would have seen status red: not all primary shards are active.

Fortunately, complete copies of the two lost primaries exist as replicas on the other nodes, so the new master's first action is to promote the replicas on Node 2 and Node 3 to primaries. Cluster health jumps to yellow as soon as the promotion completes, which is almost instant.

Why yellow and not green? We configured each primary shard to have 2 replicas, but currently only one replica of each primary is allocated. The cluster cannot reach green until every primary has both of its replicas. Note that even if we also killed Node 2, we still would not lose data, because Node 3 retains a copy of every shard.

If we restart Node 1, the cluster reallocates the missing replica shards, returning to a state much like the earlier three-node picture. Node 1 may reuse the shard copies it still has on disk, copying over only the changes made while it was down.

By now you should have a decent idea of how Elasticsearch provides scale and availability. Later chapters discuss the lifecycle of shards in more detail.

Data in, data out

Whatever program we write, the intent is the same: to organize data in a way that serves our purposes. But data isn't just random bits and bytes; we build relationships between data elements so that they represent entities, or "things", in the real world. A name and an email address mean more when we know they belong to the same person.

In the real world, though, not all entities of the same type look alike. One person has a home phone number, another only a cell phone, another both. One person has three email addresses, another none. Object-oriented languages handle this easily: objects are rich data structures that can nest other objects and can vary from instance to instance.

The trouble starts when we need to store these entities. Traditionally, we stored data in columns and rows in a relational database, the data equivalent of a spreadsheet, losing the flexibility of our rich objects by flattening them to fit a rigid schema.

Instead, we want to store objects as objects. This is why we serialize the object: we turn the data structure into a string representation suitable for storage and for sending over the network. The de facto standard is JSON (JavaScript Object Notation), which is readable and has become the common interchange format of the NoSQL world. An object serialized to JSON is called a JSON document.

Elasticsearch is a distributed document store that can store and retrieve complex, serialized JSON data structures in real time: as soon as a document is stored in Elasticsearch, it can be retrieved from any node in the cluster.

Of course, we don't only need to store data; we need to query it, at scale and at speed. While some NoSQL solutions also store objects as documents, they still require you to think about how you will query the data, and which fields need an index for fast retrieval. In Elasticsearch, every field of every document is indexed by default, and all of those indices can be used in a single query, with results returned at astonishing speed.

In this chapter we present the APIs for creating, retrieving, updating, and deleting documents. For the moment, we don't care about the data inside the documents or how to search it; all we care about is storing documents safely in a distributed environment and getting them back again.

What is a document?

Most entities or objects in programs can be serialized into a JSON object of key-value pairs. A key is the name of a field or property, and a value can be a string, a number, a Boolean, another object, an array of values, or some other specialized type such as a date string or a geolocation:

{
    "name":         "John Smith",
    "age":          42,
    "confirmed":    true,
    "join_date":    "2014-06-01",
    "home": {
        "lat": 51.5,
        "lon": 0.1
    },
    "accounts": [
        {
            "type": "facebook",
            "id":   "johnsmith"
        },
        {
            "type": "twitter",
            "id":   "johnsmith"
        }
    ]
}

The terms object and document are often used interchangeably, but there is a distinction. An object is just a JSON structure, similar to a hashmap, dictionary, or associative array, and objects can contain other objects. In Elasticsearch, document has a specific meaning: it refers to the top-level, or root object, serialized into JSON and stored in Elasticsearch under a unique ID.

Document metadata

A document doesn't consist only of its data. It also carries metadata, information about the document. The three required metadata elements are:

_index: where the document lives
_type: the class of object that the document represents
_id: the unique identifier of the document

Together, these three values uniquely identify a document.

_index

An index is similar to a database in a relational store: it is where we store and index related data. In practice, as we saw in the cluster chapter, documents actually live in shards, and an index is just a logical namespace grouping shards together. That is an internal detail our applications need not care about; as far as they are concerned, documents live in an index. Elasticsearch handles the rest.

TIP: Index names must be lowercase, cannot begin with an underscore, and cannot contain commas. We will use website as our example index name.

_type

In applications we use objects to represent "things": users, blog posts, comments, emails. Each object belongs to a class that defines its properties; a user object, for instance, might have a name, a gender, an age, and an email address. In a relational database we store objects of the same class in the same table, because they share a common structure.

For the same reason, a type in Elasticsearch represents a class of things that share a common structure. Every type has its own mapping, or schema definition, which defines the fields of the type, much as the columns of a table are defined. Documents of all types can be stored in the same index, and the mapping tells Elasticsearch how the fields of each type should be indexed. We look at mappings in detail later; for now, we rely on Elasticsearch to detect our documents' structure automatically.

TIP: A type name may be lowercase or uppercase, but cannot begin with an underscore or contain commas. We will use blog as our type name.

_id

The ID is a string that, combined with _index and _type, uniquely identifies a document in Elasticsearch. When creating a document, you can provide your own _id or let Elasticsearch generate one automatically.

There are a few other metadata elements, which we discuss in the mapping chapter. With the elements listed above, we can already store documents in Elasticsearch and retrieve them by ID; in other words, we can use Elasticsearch as a document store.

Indexing a document

Documents are indexed, that is, stored and made searchable, using the index API. But first we must decide where the document lives: as we just discussed, that is the combination of _index, _type, and _id. We can either supply our own _id or let Elasticsearch generate one.

Using our own ID

If your document has a natural identifier (for example, a user_account field, or some other value that uniquely identifies the document), pass it as the _id in the URL of the index API:

PUT /{index}/{type}/{id}
{
  "field": "value",
  ...
}

For example, with the index website, the type blog, and our chosen ID of 123, the request looks like this:

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "Just trying this out...",
  "date":  "2014/01/01"
}

Elasticsearch responds:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "123",
   "_version":  1,
   "created":   true
}

The response indicates that the document has been created, echoing its _index, _type, and _id, and adding a _version element. Every document in Elasticsearch has a version number; every time a change is made to it (including deleting it), the _version is incremented. We discuss the uses of _version in the versioning section.

Autogenerating IDs

If our data has no natural ID, we can let Elasticsearch generate one. The structure of the request changes: instead of PUT ("store this document at this URL"), we use POST ("store this document under this URL namespace"). The URL now contains only the _index and _type:

POST /website/blog/
{
  "title": "My second blog entry",
  "text":  "Still trying this out...",
  "date":  "2014/01/01"
}

The response is similar to before, except that the _id now contains an autogenerated value:

40.{ "_index": "website", "_type": "blog", "_id": "wM0OSFhDQXGZAWDf0-drSA", "_version": 1, "created": true } 动 ID 22 长 URL-safe, Base64-encoded string universally unique identifiers, UUIDs

Retrieving a document

To get a document out of Elasticsearch, we use the same _index, _type, and _id, but the HTTP verb changes to GET:

GET /website/blog/123?pretty

The response includes the familiar metadata plus a _source field, which contains the original JSON document that we sent to Elasticsearch when we indexed it:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out...",
      "date":  "2014/01/01"
  }
}

pretty: adding ?pretty to the query string of any request asks Elasticsearch to pretty-print the JSON response, which makes it easier to read.

When the document is found, the GET response contains {"found": true}. If we request a document that doesn't exist, the response body is still JSON, but found becomes false, and the HTTP status code changes from 200 OK to 404 Not Found. We can see this with curl -i, which displays the response headers:

curl -i -XGET http://localhost:9200/website/blog/124?pretty

The response now looks like this:

HTTP/1.1 404 Not Found
Content-Type: application/json; charset=UTF-8
Content-Length: 83

{
  "_index" : "website",
  "_type" :  "blog",
  "_id" :    "124",
  "found" :  false
}

Retrieving part of a document

By default, a GET request returns the whole document in the _source field. If you are interested only in certain fields, request them with the _source parameter, as a comma-separated list:

GET /website/blog/123?_source=title,text

The _source field now contains only the fields we asked for, filtering out the date field:

42.{ "_index" : "website", "_type" : "blog", "_id" : "123", "_version" : 1, "exists" : true, "_source" : { "title": "My first blog entry" , "text": "Just trying this out..." } } _source 这样请 GET /website/blog/123/_source 仅仅 : { "title": "My first blog entry", "text": "Just trying this out...", "date": "2014/01/01" }

Checking whether a document exists

If you want to check only whether a document exists, without caring about its content, use the HEAD method instead of GET. HEAD requests return no body, just HTTP headers:

curl -i -XHEAD http://localhost:9200/website/blog/123

Elasticsearch returns a 200 OK status if the document exists:

HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

and a 404 Not Found if it doesn't:

curl -i -XHEAD http://localhost:9200/website/blog/124

HTTP/1.1 404 Not Found
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

Of course, this only tells you that the document didn't exist at the moment you checked; another process could create it a millisecond later.

Updating a whole document

Documents in Elasticsearch are immutable: we cannot change them in place. If we need to update an existing document, we reindex or replace it, which we can do with the same index API we have already seen:

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text":  "I am starting to get the hang of this...",
  "date":  "2014/01/02"
}

In the response, we see that Elasticsearch has incremented the _version number:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 2,
  "created":   false <1>
}

<1> The created flag is false because a document with the same index, type, and ID already existed.

Internally, Elasticsearch has marked the old document as deleted and added a complete new one. The old version does not disappear immediately, but you can no longer access it; Elasticsearch cleans up deleted documents in the background as you continue to index data.

Later in this chapter we introduce the update API, which appears to change part of a document in place. In fact, Elasticsearch follows the same process internally:

1. retrieve the JSON of the old document
2. change it
3. delete the old document
4. index the new document

The only difference is that, with the update API, this process happens in a single shard-level operation, avoiding the network overhead of separate get and index requests.

Creating a new document

When we index a document, how can we be sure we are creating an entirely new document rather than overwriting an existing one?

Remember that the combination of _index, _type, and _id uniquely identifies a document. The easiest way to guarantee that our document is new is to let Elasticsearch autogenerate the _id, using a POST:

POST /website/blog/
{ ... }

If we must use our own _id, we have to tell Elasticsearch to accept the indexing request only if no document with the same _index, _type, and _id already exists. There are two equivalent ways to do this; use whichever you prefer. The first is to pass op_type in the query string:

PUT /website/blog/123?op_type=create
{ ... }

The second is to append /_create to the URL:

PUT /website/blog/123/_create
{ ... }

If the request succeeds in creating a new document, Elasticsearch returns the usual metadata with an HTTP status code of 201 Created.

If, on the other hand, a document with the same _index, _type, and _id already exists, Elasticsearch responds with a 409 Conflict status code and an error body like this:

{
  "error" : "DocumentAlreadyExistsException[[website][4] [blog][123]:
             document already exists]",
  "status" : 409
}

Deleting a document

The syntax for deleting a document follows the same pattern, now with the DELETE method:

DELETE /website/blog/123

If the document is found, Elasticsearch returns a 200 OK status code and a response body like the following. Note that _version has been incremented:

{
  "found" :    true,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 3
}

If the document isn't found, we get a 404 Not Found status code and a body like this:

{
  "found" :    false,
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "123",
  "_version" : 4
}

Even though the document didn't exist ("found" is false), the _version was still incremented. This is part of the internal bookkeeping that ensures changes are applied in the correct order across multiple nodes.

As mentioned in the update section, deleting a document does not remove it from disk immediately; it is merely marked as deleted, and Elasticsearch cleans it up in the background as you continue indexing.

Dealing with conflicts

When we update a document with the index API, we read the whole document, make our changes, and reindex it, and the most recent indexing request wins: whichever document was indexed last is the one stored in Elasticsearch. If somebody else changed the document in the meantime, their changes are silently lost.

Often, that's fine. Perhaps our primary data store is a relational database, and we merely copy the data into Elasticsearch to make it searchable. Perhaps it's rare for two people to change the same document at once, and an occasional lost change doesn't seriously hurt our application.

But sometimes a lost change matters very much. Imagine we use Elasticsearch to store the inventory of our online store: every time somebody buys an item, we decrement that item's stock count.

That works fine while sales are slow, but suppose we run a promotion and items start selling quickly. Picture two web processes, running in parallel, both handling a sale of the same item, as illustrated in the figure below.

Figure: The consequence of no concurrency control. web_1 and web_2 both retrieve the same document and its stock_count. web_1 decrements the count and reindexes the document, while web_2 is still working from its now-stale copy.

The change that web_1 made to stock_count is lost: web_2 doesn't know that its copy of the document is stale, and when it stores its own decremented stock_count, the value is wrong. The result is that we believe we have more stock than we actually do, and we will disappoint customers by selling them items that don't exist.

The more frequently data changes, and the longer the gap between reading a document and updating it, the more likely such lost updates become.

In the database world, two approaches are commonly used to prevent lost changes under concurrent updates:

Pessimistic concurrency control: widely used by relational databases, this approach assumes that conflicts are likely and blocks access to a resource to prevent them. A typical example is locking a row before reading it, ensuring that only the thread holding the lock can change the data.

Optimistic concurrency control: the approach used by Elasticsearch. It assumes conflicts are unlikely and never blocks the operations being attempted. If the underlying data was modified between read and write, the update fails, and it is then up to the application to resolve the conflict: retry the update with fresh data, or report the situation to the user.

Optimistic concurrency control

Elasticsearch is distributed. When documents are created, updated, or deleted, the new version of the document has to be replicated to other nodes in the cluster. Elasticsearch is also asynchronous and concurrent, meaning these replication requests are sent in parallel and may arrive at their destination out of sequence. Elasticsearch needs to guarantee that an older version of a document never overwrites a newer version.

Every document has a _version number that, as we saw earlier, is incremented on every change. Elasticsearch uses this _version to apply changes in the correct order: if an older version arrives after a newer one, it is simply ignored.

We can take advantage of _version ourselves, to ensure that conflicting changes made by our application do not result in data loss, by specifying the version of the document we intend to change. If that version is no longer current, our request fails.

Let's create a new blog post:

PUT /website/blog/1/_create
{
  "title": "My first blog entry",
  "text":  "Just trying this out..."
}

The response tells us this is a new document with _version 1. Now imagine we want to edit it: we load the document into a web form, make our changes, and save a new version. First we retrieve the document:

GET /website/blog/1

The response body includes the same _version of 1:

{
  "_index" :   "website",
  "_type" :    "blog",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "title": "My first blog entry",
      "text":  "Just trying this out..."
  }

}

Now, when we save our changes by reindexing the document, we specify the version the change should apply to:

PUT /website/blog/1?version=1 <1>
{
  "title": "My first blog entry",
  "text":  "Starting to get the hang of this..."
}

<1> We want the update to succeed only if the current _version of the document is 1.

This request succeeds, and the response body tells us that the _version has been incremented to 2:

{
  "_index":   "website",
  "_type":    "blog",
  "_id":      "1",
  "_version": 2,
  "created":  false
}

However, if we rerun the same request, still specifying version=1, Elasticsearch responds with a 409 Conflict HTTP status code and a body like this:

{
  "error" : "VersionConflictEngineException[[website][2] [blog][1]:
             version conflict, current [2], provided [1]]",
  "status" : 409
}

This tells us that the document's current _version is 2, but we claimed to be changing version 1.

What we do next depends on our application's requirements. We could tell the user that somebody else has already changed the document, and let them review the change before trying again. In the earlier stock_count scenario, we could simply retrieve the latest document and reapply the decrement.

All APIs that update or delete a document accept the version parameter, allowing you to apply optimistic locking to just those parts of your code where it makes sense.

Using versions from an external system

A common setup is to use some other database as the primary data store and Elasticsearch for search, which means all changes to the primary store must be copied across to Elasticsearch. If multiple processes perform this synchronization, you may run into the concurrency issues described above.

If your main database has its own version numbers, or a value such as a timestamp that can serve as one, you can reuse them in Elasticsearch by adding version_type=external to the query string. Version numbers must be integers greater than zero and less than about 9.2e+18, the range of Java's positive long.

Externally versioned requests behave a little differently: instead of checking that the supplied version matches the current _version, Elasticsearch checks that the supplied version is greater than the current one. If the request succeeds, the external version is stored as the document's new _version.

External version numbers can be specified not only on index and delete requests, but also when creating (create) new documents. For instance, to create a blog post with an external version number of 5, we can do the following:

PUT /website/blog/2?version=5&version_type=external

51. { "title": "My first external blog entry", "text": "Starting to get the hang of this..." } 响应 们 _version 码 5 { "_index": "website", "_type": "blog", "_id": "2", "_version": 5, "created": true } 现 们 这 version 码为 10 PUT /website/blog/2?version=10&version_type=external { "title": "My first external blog entry", "text": "This is a piece of cake..." } 请 设 _version 为 10 { "_index": "website", "_type": "blog", "_id": "2", "_version": 10, "created": false } 运 这 请 样 错误 为 Elasticsearch

Partial updates to documents

In Updating a whole document, we said that the way to update a document is to retrieve it, change it, and reindex the whole document. That is true. However, the update API gives us the convenience of making partial updates, such as incrementing a counter, in a single request.

We also said that documents are immutable: they cannot be changed, only replaced. The update API must obey the same rules. Externally, it appears as though we are partially updating a document in place; internally, the update API performs the same retrieve-change-reindex process we have already described. The difference is that this process happens within a shard, avoiding the network overhead of multiple requests. By reducing the gap between the retrieve and reindex steps, it also reduces the window in which conflicting changes can occur.

The simplest form of the update request accepts a partial document in the doc parameter, which is merged into the existing document: existing fields are overwritten and new fields are added. For instance, we can add a tags field and a views field to our blog post like this:

POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}

If the request succeeds, we get back an index-style response:

{
   "_index" :   "website",
   "_id" :      "1",
   "_type" :    "blog",
   "_version" : 3
}

Retrieving the document shows the updated _source:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "1",
   "_version":  3,
   "found":     true,
   "_source": {
      "title":  "My first blog entry",
      "text":   "Starting to get the hang of this...",
      "tags": [ "testing" ], <1>
      "views":  0 <1>
   }
}

<1> Our new fields have been merged into _source.

Using scripts to make partial updates

When the simple doc merge is not enough, you can use scripts in the update API to implement your own logic. Scripts may be included inline in the request, or stored in the special .scripts index or loaded from disk. The default scripting language is Groovy, a fast and expressive language with a syntax similar to JavaScript. It runs in a sandbox designed to prevent malicious users from attacking Elasticsearch or the server it runs on.

In the update API, scripts can modify the _source of a document, which is exposed as ctx._source. For example, we can use a script to increment the views counter of our blog post:

POST /website/blog/1/_update
{
   "script" : "ctx._source.views+=1"
}

We can also use a script to append a new tag to the tags array. Here the new tag is defined as a parameter rather than hardcoded in the script itself, which allows Elasticsearch to reuse the compiled script for future tag additions without recompiling it:

POST /website/blog/1/_update
{
   "script" : "ctx._source.tags+=new_tag",
   "params" : {
      "new_tag" : "search"
   }
}

Fetching the document after these last two requests shows:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "1",
   "_version":  5,
   "found":     true,
   "_source": {
      "title":  "My first blog entry",
      "text":   "Starting to get the hang of this...",
      "tags":  ["testing", "search"], <1>
      "views":  1 <2>
   }
}

<1> The search tag has been appended to the tags array.
<2> The views field has been incremented.

We can even delete a document based on its contents, by setting ctx.op to delete:

POST /website/blog/1/_update
{
   "script" : "ctx.op = ctx._source.views == count ? 'delete' : 'none'",
    "params" : {
        "count": 1
    }
}

Updating a document that may not yet exist

Imagine we use Elasticsearch to store page-view counters: every time a user views a page, we increment the counter for that page. If the page is new, there is no document to increment yet, and the update request fails.

In cases like this, we can use the upsert parameter to specify the document that should be created if it doesn't already exist:

POST /website/pageviews/1/_update
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 1
   }
}

The first time this request runs, the upsert value is indexed as a new document, initializing the views field to 1. On subsequent runs the document already exists, so the script is applied instead, incrementing views.

Updates and conflicts

In the introduction to this section, we said that the shorter the window between the retrieve and reindex steps, the smaller the chance of conflicting changes. But conflicts are not eliminated entirely: an update request could still retrieve the document just before another process changes it.

To avoid losing data, the update API retrieves the document's current _version in the retrieve step and passes it to the index request in the reindex step. If another process has changed the document between retrieve and reindex, the _version no longer matches and the update request fails.

For many uses of partial update, it doesn't matter that the document was changed in between. For instance, if two processes are both incrementing a page-view counter, the order of the increments is irrelevant; if a conflict occurs, the right thing to do is simply retry.

This can be done automatically with the retry_on_conflict parameter, which sets the number of times update should retry before failing. It defaults to 0:

POST /website/pageviews/1/_update?retry_on_conflict=5 <1>
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 0
   }
}

<1> Retry this update up to 5 times before failing.

This works well for operations such as incrementing a counter, where the order of changes doesn't matter. Where order does matter, remember your options: the index API follows "last write wins", while the update API accepts the version parameter for optimistic concurrency control, so that only the expected version of the document is changed.

Retrieving multiple documents

Elasticsearch is fast, but it can be made faster still: combining multiple requests into one avoids the per-request network overhead. If you know you need to retrieve several documents, fetching them in a single multi-get (mget) request is faster than issuing a series of individual gets.

The mget API expects a docs array, in which each element specifies the _index, _type, and _id of the document to retrieve. Individual elements may also include a _source parameter to restrict the fields returned:

GET /_mget
{
   "docs" : [
      {
         "_index" : "website",
         "_type" :  "blog",
         "_id" :    2
      },
      {
         "_index" : "website",
         "_type" :  "pageviews",
         "_id" :    1,
         "_source": "views"
      }
   ]
}

The response body also contains a docs array, with one response per requested document, in the same order as the request. Each of these responses looks just like the response to an individual get request:

{
   "docs" : [
      {
         "_index" :   "website",
         "_id" :      "2",
         "_type" :    "blog",
         "found" :    true,
         "_source" : {
            "text" :  "This is a piece of cake...",
            "title" : "My first external blog entry"
         },
         "_version" : 10
      },
      {
         "_index" :   "website",
         "_id" :      "1",
         "_type" :    "pageviews",
         "found" :    true,
         "_version" : 2,
         "_source" : {
            "views" : 2
         }
      }
   ]
}

If the documents you want to retrieve all live in the same _index, or even the same _type, you can specify a default /_index or /_index/_type in the URL, and override it per document when needed:

GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}

In fact, if all documents share the same _index and _type, you can replace the docs array with a plain ids array:

GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

Note that the second document we asked for doesn't exist: we requested type blog with ID 1, but the document with ID 1 actually has type pageviews. The response reports this:

{
  "docs" : [
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "2",
      "_version" : 10,
      "found" :    true,
      "_source" : {
        "title":   "My first external blog entry",
        "text":    "This is a piece of cake..."
      }
    },
    {
      "_index" :   "website",
      "_type" :    "blog",
      "_id" :      "1",
      "found" :    false <1>
    }
  ]
}

<1> This document was not found.

The fact that one document wasn't found does not affect the retrieval of the others: each document is retrieved and reported on individually.

NOTE: The HTTP status code of an mget request is still 200, even if none of the requested documents were found, because the mget request itself completed successfully. To find out whether the individual documents were found, check the found flag of each.

Cheaper in bulk

In the same way that mget lets us retrieve multiple documents at once, the bulk API allows us to make multiple create, index, update, or delete requests in a single step. This is particularly useful for things like activity streams, where many documents need to be indexed as a continuous flow of data.

The bulk request body has a slightly unusual format:

{ action: { metadata }}\n
{ request body        }\n
{ action: { metadata }}\n
{ request body        }\n
...

It is a stream of newline-delimited JSON: pairs of lines joined by "\n" characters, where every line, including the last, must end with a newline. Each line must be a complete JSON object; the lines cannot be pretty-printed, because interior newlines would confuse the parser. In Why the funny format? we explain why the bulk API uses this layout.

The action/metadata line specifies what action (what) should happen to which document (which).

The action must be one of the following:
create: create a document only if it does not already exist. See Creating a new document.
index: create a new document or replace an existing one. See Indexing a document and Updating a whole document.
update: make a partial update to a document. See Partial updates to documents.
delete: delete a document. See Deleting a document.

The metadata must specify the _index, _type, and _id of the document being created, indexed, updated, or deleted. A delete action, for instance, looks like this:

{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }}

The request body line contains the document _source itself, the fields and values that the document holds. It is required by index and create actions (you must supply the document to be indexed), and by update actions, where the body holds the same things accepted by the update API: doc, upsert, script, and so on. A delete action takes no request body.

{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title":  "My first blog post" }

If no _id is specified, an ID is autogenerated:

{ "index": { "_index": "website", "_type": "blog" }}

58. { "title": "My second blog post" } 为 这 bulk 请 单 这样 POST /_bulk { "delete": { "_index": "website", "_type": "blog", "_id": "123" }} <1> { "create": { "_index": "website", "_type": "blog", "_id": "123" }} { "title": "My first blog post" } { "index": { "_index": "website", "_type": "blog" }} { "title": "My second blog post" } { "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} } { "doc" : {"title" : "My updated blog post"} } <2> <1> delete 为(action) 请 紧 另 为(action) <2> 记 换 Elasticsearch响应 items 组 罗 请 结 结 顺 们请 顺 { "took": 4, "errors": false, <1> "items": [ { "delete": { "_index": "website", "_type": "blog", "_id": "123", "_version": 2, "status": 200, "found": true }}, { "create": { "_index": "website", "_type": "blog", "_id": "123", "_version": 3, "status": 201 }}, { "create": { "_index": "website", "_type": "blog", "_id": "EiwfApScQiiy7TIKFxRCTw", "_version": 1, "status": 201 }}, { "update": { "_index": "website", "_type": "blog", "_id": "123", "_version": 4, "status": 200 }} ] }} <1> 请 请 执 请 错误 响 请 请 败 顶层 error 标记 设 为 true 错误 细节 应 请 报 POST /_bulk { "create": { "_index": "website", "_type": "blog", "_id": "123" }} { "title": "Cannot create - it already exists" } { "index": { "_index": "website", "_type": "blog", "_id": "123" }} { "title": "But we can update it" } 响应 们 create 123 败 为 经 123 执 index 请

59. { "took": 3, "errors": true, <1> "items": [ { "create": { "_index": "website", "_type": "blog", "_id": "123", "status": 409, <2> "error": "DocumentAlreadyExistsException <3> [[website][4] [blog][123]: document already exists]" }}, { "index": { "_index": "website", "_type": "blog", "_id": "123", "_version": 5, "status": 200 <4> }} ] } <1> 请 败 <2> 这 请 HTTP 态码 报 为 409 CONFLICT <3> 错误 说 请 错误 <4> 请 态码 200 OK 这 说 bulk 请 —— 们 实现 务 请 时 请 扰 index type 为 mget API bulk 请 URL /_index /_index/_type : POST /website/_bulk { "index": { "_type": "log" }} { "event": "User logged in" } _index _type 时 URL 值 为 认值 POST /website/log/_bulk { "index": {}} { "event": "User logged in" } { "index": { "_type": "blog" }} { "title": "Overriding the default type" } 请 载 们请 节 请 给 请 bulk 请 过这 杂 负载 运 这 (sweetspot)还 试 标 长 说 1000~5000 间 较 请 1kB 1MB 5-15MB 间


Conclusion

You now know how to use Elasticsearch as a distributed document store: how to create, retrieve, update, and delete documents, individually and in bulk. The document CRUD APIs are simple on the surface, and their defaults let you accomplish a lot without deeper knowledge. In the next supplemental chapter we look at what these operations mean in a distributed environment.

Distributed document store (supplemental)

In the preceding chapter, we stored and retrieved documents while conveniently glossing over many technical details of how the cluster actually does this. That separation is deliberate: you don't need to know how data is distributed in order to use Elasticsearch. It just works.

In this chapter, we dive into those internals. The information here is for the curious; Elasticsearch does not require you to understand it. Read the chapter to build a complete picture of the system, skim it, or skip it entirely and come back when you want a deeper understanding. Don't be intimidated by the details: the big picture is what matters.

Routing a document to a shard

When you index a document, it is stored on a single primary shard. How does Elasticsearch know which shard the document belongs to: shard 1 or shard 2? The allocation cannot be random, because we must be able to find the document again later. In fact, it is determined by a simple formula:

shard = hash(routing) % number_of_primary_shards

The routing value is an arbitrary string, which defaults to the document's _id but can also be set to a custom value. The routing string is passed through a hash function, and the remainder after dividing by the number of primary shards gives the shard the document lives on. The remainder is always in the range 0 to number_of_primary_shards - 1, so this formula always yields a valid shard.

This explains why the number of primary shards can only be set when an index is created, and never changed afterwards: if the shard count changed later, all previous routing values would become invalid and documents could no longer be found.

NOTE: Users sometimes worry that fixing the number of primary shards up front makes it hard to scale the index later. In reality, there are techniques that make scaling straightforward when you need it; we discuss them in the chapter on designing for scale.

All document APIs (get, index, delete, bulk, update, and mget) accept a routing parameter, which can be used to customize the document-to-shard mapping. A custom routing value can ensure, for example, that all documents belonging to the same user are stored on the same shard. We explain why you might want that in the scaling chapter.
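A quick sketch of custom routing (the user_123 value here is illustrative): index and fetch the document with the same routing string; otherwise, the lookup would compute the shard from the default _id routing and look in the wrong place:

PUT /website/blog/1?routing=user_123
{
  "title": "Routed to user_123's shard"
}

GET /website/blog/1?routing=user_123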

How primary and replica shards interact

For explanation purposes, imagine a cluster of three nodes containing an index named blogs with two primary shards, each of which has two replicas. Copies of the same shard are never allocated to the same node, so the cluster looks like the figure: three nodes, each holding a mix of primary and replica shards.

We can send our requests to any node in the cluster. Every node is fully capable of serving any request, because every node knows where every document in the cluster lives and can forward the request to the right place. In the examples that follow, we send all of our requests to Node 1, which we will refer to as the requesting node.

TIP: When sending requests in practice, it is wise to round-robin through all the nodes of the cluster, in order to spread the load.

Creating, indexing, and deleting a document

Create, index, and delete requests are write operations: they must complete on the primary shard before they can be copied to the associated replica shards. The numbered steps, in order:

1. The client sends a create, index, or delete request to Node 1.
2. The node uses the document's _id to determine that the document belongs to shard 0, and forwards the request to Node 3, where the primary copy of shard 0 currently lives.
3. Node 3 executes the request on the primary shard. If it succeeds, Node 3 forwards the request in parallel to the replica shards on Node 1 and Node 2. Once all replicas report success, Node 3 reports success to the requesting node, which reports success to the client.

By the time the client receives its response, the change has been executed on the primary shard and on all replica shards: your change is safe.

There are optional request parameters that allow you to trade data safety for performance. You will probably not need them, because Elasticsearch is already fast, but they are explained here for completeness:

replication

The default value of replication is sync, which makes the primary shard wait for successful responses from the replica shards before responding.

If you set replication to async, the request returns to the client as soon as it has executed on the primary shard. The request is still forwarded to the replicas, but you won't know whether the replicas succeeded.

That option is mentioned so that we can advise against it: the default sync replication allows Elasticsearch to exert back pressure on whatever is sending requests, while async allows requests to keep arriving while the replicas are still busy, potentially overloading Elasticsearch.

consistency

By default, before even attempting a write operation, the primary shard requires a quorum, or majority, of shard copies to be available (a shard copy being a primary or a replica shard). This prevents writing to the "wrong side" of a network partition. A quorum is defined as:

int( (primary + number_of_replicas) / 2 ) + 1

The allowed values of consistency are one (just the primary shard), all (the primary and all replicas), or the default quorum.

Note that number_of_replicas in the formula is the number of replicas specified in the index settings, not the number currently active. If you have specified that your index should have three replicas, a quorum would be:

int( (primary + 3 replicas) / 2 ) + 1 = 3

But if you start only two nodes, there will not be enough active shard copies to satisfy the quorum, and you will be unable to index or delete any documents.

timeout

What happens when insufficient shard copies are available? Elasticsearch waits, in the hope that more shards will appear; by default, it waits up to one minute. If you need to, you can use the timeout parameter to make it abort sooner: 100 means 100 milliseconds and 30s means 30 seconds.

NOTE: A new index has 1 replica by default, which would mean two active shard copies are required for a quorum. These defaults would prevent us from doing anything useful on a single-node cluster, so the quorum requirement is enforced only when number_of_replicas is greater than 1.
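A minimal sketch of these parameters in use (the values are illustrative; on a one-node development cluster, consistency=all on an index with replicas would simply wait and then fail):

PUT /website/blog/1?consistency=quorum&timeout=30s
{
  "title": "Safely replicated post"
}

The write proceeds only if a quorum of shard copies is available, and gives up after 30 seconds instead of the default one minute.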

Retrieving a document

A document can be retrieved from a primary shard or from any of its replicas. The numbered steps for a get request, in order:

1. The client sends a get request to Node 1.
2. The node uses the document's _id to determine that the document belongs to shard 0. Copies of shard 0 exist on all three nodes; on this occasion, the node forwards the request to Node 2.
3. Node 2 returns the document to Node 1, which returns it to the client.

For read requests, the requesting node chooses a different shard copy on every request, in order to balance the load: it round-robins through all the copies.

One caveat: while a document is being indexed, it may already exist on the primary shard but not yet on the replicas. In that case, a replica might report that the document doesn't exist, while the primary would return it successfully. Once the indexing request has completed, the document is available on the primary and on all replicas.

Partial updates to a document

The update API combines the read and write patterns explained above. The numbered steps, in order:

1. The client sends an update request to Node 1.
2. Node 1 forwards the request to Node 3, where the primary shard lives.
3. Node 3 retrieves the document from the primary shard, changes the JSON in the _source field, and tries to reindex the document on the primary shard. If the document has been changed by another process in the meantime, it retries step 3 up to retry_on_conflict times before giving up.
4. If Node 3 succeeds in updating the document, it forwards the new version of the document, in parallel, to the replica shards on Node 1 and Node 2 to be reindexed. Once all replica shards report success, Node 3 reports success to the requesting node, which reports success to the client.

The update API also accepts the routing, replication, consistency, and timeout parameters described for single-document writes.

Document-based replication: when a primary shard forwards changes to its replicas, it does not forward the update request; it forwards the full new version of the document. Remember that these changes are forwarded asynchronously, with no guarantee that they arrive in the order they were sent. If Elasticsearch forwarded just the change itself, changes could be applied in the wrong order, resulting in a corrupted document.

Multi-document patterns

The mget and bulk APIs follow a pattern similar to the individual document APIs, with one difference: the requesting node knows which shard each document lives on, so it splits the multi-document request into a per-shard multi-document request and forwards these, in parallel, to the relevant nodes. Once it receives the answers, it collates the per-shard responses into a single response for the client.

The steps for dispatching an mget request, in order:

1. The client sends an mget request to Node 1.
2. Node 1 builds a multi-get request per shard and forwards these requests, in parallel, to the nodes hosting each relevant primary or replica shard. Once all replies are in, Node 1 assembles the responses and returns them to the client.

A routing parameter can be set for each document in the docs array.

The steps for dispatching a bulk request, which executes multiple create, index, delete, and update actions, in order:

1. The client sends a bulk request to Node 1.
2. Node 1 builds a bulk request per shard and forwards these, in parallel, to the nodes hosting each involved primary shard.
3. Each primary shard executes its actions serially, one after another. As each action succeeds, the primary forwards the new document (or the deletion) in parallel to its replica shards, then moves on to the next action. Once all replica shards report success for all actions, the node reports success to the requesting node, which collates the responses and returns them to the client.

The bulk API also accepts the replication and consistency parameters at the top level of the whole request, and the routing parameter in the metadata of each individual action.
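As a sketch of per-document routing inside a bulk request (the user_1 value is illustrative, and this assumes the documents were originally indexed with the same routing):

POST /_bulk
{ "index": { "_index": "website", "_type": "blog", "_id": "1", "_routing": "user_1" }}
{ "title": "A document routed explicitly in bulk" }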

Why the funny format?

When learning about bulk requests, you may have wondered: "Why does the bulk API require the funny newline-delimited format, instead of a plain JSON array like the mget API?"

To answer this, some background is needed. Each document referenced in a bulk request may live on a different primary shard, which may be allocated to any node in the cluster. Every action has to be forwarded to the correct shard on the correct node.

If the individual requests were wrapped in a JSON array, Elasticsearch would need to:
- parse the JSON into an array in memory, including the document data, which can be very large
- look at each request to determine which shard it should go to
- create an array of requests for each shard
- serialize these arrays into the internal transport format
- send the requests on to each shard

That would work, but it would require copying essentially the same data several times in RAM, and it would create many more data structures for the JVM to spend time garbage-collecting.

Instead, Elasticsearch reads the raw request straight off the network buffer. It uses the newline characters to identify and parse only the small action/metadata lines, decides which shard should handle each request, and forwards the raw request lines directly to the right shard. There is no redundant copying of data and no wasted data structures: the entire request is processed using the smallest amount of memory possible.

Search: the basic tools

So far, we have used Elasticsearch as a simple NoSQL-style distributed document store: we throw JSON documents at Elasticsearch and retrieve them by ID. But the real power of Elasticsearch lies in search: moving from merely storing data to asking questions of it.

Every field in a document is indexed and can be queried. And it is not just that Elasticsearch can execute these queries; the results come back at a speed that lets you stop thinking like a traditional-database user altogether.

A search can be:
- a structured query on concrete fields such as gender or age, sorted by a field such as join_date, similar to what you could build in SQL
- a full-text query that finds all documents matching the search keywords, returned sorted by relevance
- a combination of the two

While many search options have sensible defaults, building powerful searches requires understanding three subjects:
Mapping: how the data in each field is interpreted
Analysis: how full text is processed to make it searchable
Query DSL: the flexible, powerful query language used by Elasticsearch

Each of these is a big topic in its own right, covered in depth in later chapters. This chapter introduces just enough of all three to give you a working understanding, starting with the search API in its simplest form.

Test data: the documents used by the examples in this chapter can be found in this gist: https://gist.github.com/clintongormley/8579281. Copy the commands it contains and run them to follow along.

The empty search

The most basic form of the search API is the empty search, which returns all documents of all indices in the cluster:

GET /_search

The response (edited for brevity) looks something like this:

{
   "hits" : {
      "total" :       14,
      "hits" : [
        {
          "_index":   "us",
          "_type":    "tweet",
          "_id":      "7",
          "_score":   1,
          "_source": {
             "date":    "2014-09-17",
             "name":    "John Smith",
             "tweet":   "The Query DSL is really powerful and flexible",
             "user_id": 2
          }
        },
        ... 9 RESULTS REMOVED ...
      ],
      "max_score" :   1
   },
   "took" :           4,
   "_shards" : {
      "failed" :      0,
      "successful" :  10,
      "total" :       10
   },
   "timed_out" :      false
}

hits

The most important section of the response is hits, which contains total, the total number of documents that matched the query, and a hits array holding the first 10 of those documents: the results.

Each result in the hits array contains the _index, _type, _id, and the complete _source of the document. This means the whole document is directly available from the search results, with no need for a separate round trip to fetch it by ID.

Each element also has a _score, the relevance score, which measures how well the document matches the query. By default, results are returned with the most relevant first. Since we haven't specified any query here, all documents are equally relevant and receive a neutral _score of 1. max_score is the highest _score of any matching document.

took

took tells us how many milliseconds the entire request took.

shards

The _shards element tells us the total number of shards involved in the query and how many of them were successful or failed. Shard failure is unusual, but possible: if a disaster took out both the primary and the replica copies of a shard, no copy would be available to respond. In that case, Elasticsearch reports the shard as failed but continues to return results from the remaining shards.

timed_out

The timed_out value tells us whether the query timed out. By default, search requests do not time out. If getting a response within a certain period matters more than getting complete results, you can specify a timeout as 10 (10 milliseconds), 10ms, or 1s (1 second):

GET /_search?timeout=10ms

Elasticsearch returns whatever results it has managed to gather from each shard before the time ran out.

WARNING: The timeout is not a circuit breaker on query execution. It does not halt the executing query; it merely tells the coordinating node to return the results gathered so far and close the connection. The shards keep executing the query in the background, and their late results are discarded. Use a timeout because it matters to your SLA (Service-Level Agreement), not as a mechanism to abort long-running queries.

Multi-index, multi-type

The empty search named no index and no type, so it searched all documents of all types, such as user and tweet, in all indices, such as us and gb. Elasticsearch forwarded the request in parallel to a primary or replica of every shard in the cluster, gathered the results to select the overall top 10, and returned them.

Usually, though, you want to search one or more specific indices, and perhaps one or more specific types. You can do so by specifying them in the URL:

/_search: search all types in all indices
/gb/_search: search all types in the gb index
/gb,us/_search: search all types in the gb and us indices
/g*,u*/_search: search all types in any index beginning with g or u
/gb/user/_search: search type user in the gb index
/gb,us/user,tweet/_search: search types user and tweet in the gb and us indices
/_all/user,tweet/_search: search types user and tweet in all indices

When you search within a single index, Elasticsearch forwards the request to a primary or replica of every shard in that index and gathers the results. Searching multiple indices works exactly the same way; there are simply more shards involved.

TIP: Searching one index with 5 primary shards is exactly equivalent to searching five indices with 1 primary shard each. This simple fact matters later, when we see how it makes applications easy to adapt as they scale out.
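As a concrete example of the path patterns above, here is a search restricted to the tweet type across both of our sample indices (using the match_all query introduced in the next chapter; gb and us come from the test-data gist):

GET /gb,us/tweet/_search
{
    "query": {
        "match_all": {}
    }
}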

Pagination

Our earlier empty search told us that 14 documents matched, yet only 10 appeared in the hits array. How do we see the other documents?

In the same way that SQL uses the LIMIT keyword to return one page of results, Elasticsearch accepts the from and size parameters:

size: the number of results to return, defaults to 10
from: the number of initial results to skip, defaults to 0

To show five results per page, pages 1 through 3 can be requested as:

GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10

Beware of paging too deep or requesting too many results at once. Results are sorted before being returned, and a search request usually spans multiple shards: each shard produces its own sorted results, which then have to be combined centrally to ensure that the overall order is correct.

Deep paging in distributed systems

To understand why deep paging is problematic, imagine searching a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 and sends them to the requesting node, which then sorts all 50 results to select the overall top 10.

Now imagine requesting page 1,000, results 10,001 to 10,010. Everything works the same way, except that each shard now has to produce its top 10,010 results. The requesting node then sorts through all 50,050 results and discards 50,040 of them.

In a distributed system, the cost of sorting results grows steeply the deeper we page. There is a good reason that web search engines don't return more than about 1,000 results for any query.

TIP: In a later chapter on reindexing, we explain how to retrieve large numbers of documents efficiently.

Search lite

There are two forms of the search API: a "lite" query-string version that passes the whole query in the URL, and the full request-body version that expects a JSON body and uses a rich search language called the Query DSL.

The query-string search is useful for running ad hoc queries from the command line. For instance, this query finds all documents of type tweet whose tweet field contains the word elasticsearch:

GET /_all/tweet/_search?q=tweet:elasticsearch

The next query finds documents whose name field contains "john" and whose tweet field contains "mary". The actual query is:

+name:john +tweet:mary

but percent encoding (URL encoding) of the special characters turns it into:

GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary

A + prefix marks a condition that must be satisfied for a document to match; a - prefix marks a condition that must not be satisfied. Conditions without a + or - are optional: the more of them that match, the more relevant the document.

The _all field

This simple search returns all documents containing the word mary:

GET /_search?q=mary

In the previous examples we searched the tweet and name fields. Yet this query returns "mary" matches in three places:
- a user whose name is Mary
- six tweets sent by Mary
- one tweet addressed to @mary

How can Elasticsearch search every field at once? When you index a document, Elasticsearch takes the string values of all of its fields and concatenates them into one big string, which it indexes as the special _all field. For example, for this document:

{
    "tweet":    "However did I manage before Elasticsearch?",
    "date":     "2014-09-14",
    "name":     "Mary Jones",
    "user_id":  1
}

it is as if an extra field had been added with this value:

"However did I manage before Elasticsearch? 2014-09-14 Mary Jones 1"

The query string defaults to searching the _all field unless another field name is specified.

TIP: The _all field is handy while you are getting started with a new application. Later, you will find you have more control over your search results by querying specific fields instead. When the _all field is no longer useful, you can disable it, as explained in the mapping chapter.

More complicated queries

The next query looks for:
- the name field containing "mary" or "john"
- the date field later than 2014-09-10
- the _all field containing "aggregations" or "geo"

+name:(mary john) +date:>2014-09-10 +(aggregations geo)

As a percent-encoded query string, it becomes rather less readable:

?q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo)

As this tiny example shows, the query-string syntax allows surprisingly complex queries to be expressed very concisely, which makes it great for quick ad hoc searches during development.

Its terseness, however, can also make it cryptic and hard to debug. And it is fragile: a small syntax slip in the query string, such as a misplaced -, :, / or ", returns an error instead of results.

Finally, query-string searches allow any user to run potentially slow, heavy queries on any field of your index, possibly exposing private information or even bringing your cluster to its knees.

TIP: For these reasons, we recommend against exposing query-string searches directly to your users, unless they are power users you trust with your data and your cluster.

In production we instead rely on the full-featured request-body search API, which does everything the lite version can do and much more. Before we get there, though, we first need to look at how data is indexed in Elasticsearch.

Mapping and analysis

Two concepts underpin everything in this chapter:

Mapping: how the data in each field is interpreted, classified into field types such as string, number, boolean, and date.
Analysis: how full text is processed into searchable terms.

While playing with the data in our index, we notice something odd: our test data contains 12 tweets, and exactly one of them has the date 2014-09-15, yet look at the total hits of these queries:

GET /_search?q=2014              # 12 results
GET /_search?q=2014-09-15        # 12 results !
GET /_search?q=date:2014-09-15   # 1  result
GET /_search?q=date:2014         # 0  results !

Why does querying the full date return all tweets, while querying the date field for just the year returns nothing? Why do the results differ between searching the _all field and searching the date field? (Recall that a query string without a field name runs against the _all field.)

Presumably, the data in the date field has been indexed differently from the data in the _all field. Let's see how Elasticsearch has interpreted our document structure, by asking for the mapping (or schema definition) of the tweet type in the gb index:

GET /gb/_mapping/tweet

{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type": "date",
                  "format": "dateOptionalTime"
               },
               "name": {
                  "type": "string"
               },
               "tweet": {
                  "type": "string"
               },
               "user_id": {
                  "type": "long"
               }
            }
         }
      }
   }
}

Elasticsearch generated this mapping dynamically, guessing each field's type from the documents. In it, the date field is recognized as type date. The _all field isn't listed, because it is a default field, but we know it has type string.

Fields of type date and fields of type string are indexed differently, so they can be searched differently, which accounts for the results above. That may not come as a complete surprise: each of the core data types (strings, numbers, Booleans, and dates) may be indexed slightly differently. But the biggest distinction is not between data types. It is between fields that represent exact values (which can include string fields) and fields that represent full text. This distinction is crucial: it is what separates a search engine from every other kind of database.

Exact values vs full text

Data in Elasticsearch can be broadly divided into two types: exact values and full text.

Exact values are exactly what they sound like: a date or a user ID are examples, but so are exact strings such as a username or an email address. The exact value "Foo" is not the same as the exact value "foo", and the exact value 2014 is not the same as the exact value 2014-09-15.

Full text, on the other hand, refers to textual data, usually written in some human language, like the text of a tweet or the body of an email.

Full text is often called unstructured data, which is a misnomer: natural language is highly structured. The trouble is that its rules are so complex that computers struggle to parse it correctly. Take this sentence: "May is fun but June bores me." Does it refer to months, or to people?

Exact values are easy to query. The decision is binary: a value either matches the query or it doesn't. Such queries are easy to express in SQL:

WHERE name    = "John Smith"
  AND user_id = 2
  AND date    > "2014-09-15"

Querying full text is far more subtle. We are not only asking "does this document match the query?" but "how well does this document match the query?": how relevant is this document to the given query?

We seldom want to match a whole full-text field exactly; we want to search within it. And we expect search to understand our intent:
- a search for "UK" should also return documents mentioning "United Kingdom"
- a search for "jump" should also match "jumped", "jumps", "jumping", and perhaps even "leap"
- "johnny walker" should match "Johnnie Walker", and "johnnie depp" should match "Johnny Depp"
- "fox news hunting" should return stories about hunting on Fox News, while "fox hunting news" should return news stories about fox hunting

To make such queries possible on full-text fields, Elasticsearch first analyzes the text, and then uses the results to build an inverted index. We discuss both in the next two sections.

Inverted index

Elasticsearch uses a structure called an inverted index for fast full-text search. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

For example, take these two documents, each with a content field:

1. The quick brown fox jumped over the lazy dog
2. Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first split the content of each document into separate words (which we call terms, or tokens), build a sorted list of all the unique terms, and then note which documents each term appears in:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |

Now, to search for "quick brown", we just look up the documents in which each term appears:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Both documents match, but the first document has more matching terms than the second. If we apply a naive similarity algorithm that just counts matching terms, we can say that the first document is a better match, that is, more relevant to our query, than the second.

But our current inverted index has a few problems:

1. "Quick" and "quick" appear as separate terms, while the user probably considers them the same word.
2. "fox" and "foxes" are similar, as are "dog" and "dogs"; they share the same root word.
3. "jumped" and "leap", while not from the same root, are similar in meaning: they are synonyms.

82. "+Quick +fox" 记 缀 + 单词 须 "Quick" "fox" 查询 "quick fox" "Quick foxes" 译 这 罗嗦 说 单 义词 户 两 查询 们 们 词为统 为标 这样 查询 联 1. "Quick" 转为 为 "quick" 2. "foxes" 转为 ""fox "dogs" 转为 "dog" 3. "jumped" "leap" 义 为单 词 "jump" 现 Term Doc_1 Doc_2 brown X X dog X X fox X X in X jump X X lazy X X over X X quick X X summer X the X X 们还 们 "+Quick +fox" 败 为 "Quick" 值 经 过 们 标 规则处 查询 content 查询 变 "+quick +fox" 这样 两 IMPORTANT 这 实 词 查询 标 为 这 标 过 词(analysis) 这 节 们讨论

Analysis and analyzers

Analysis is the process of first tokenizing a block of text into individual terms suitable for an inverted index, and then normalizing those terms into a standard form to improve their searchability. This job is done by analyzers, which package together three functions:

Character filters: first, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could strip out HTML markup, or convert "&" characters into the word "and".

Tokenizer: next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.

Token filters: last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing "Quick"), remove terms (stopwords such as "a", "and", "the"), or add terms (synonyms such as "jump" and "leap").

Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined into custom analyzers suitable for different purposes, which we discuss in the custom analyzers section.

Built-in analyzers

Elasticsearch also ships with prepackaged analyzers that can be used directly. The most important ones are listed below; to show the differences in behavior, we show which terms each produces from this string:

"Set the shape to semi-transparent by calling set_trans(5)"

Standard analyzer: the standard analyzer is Elasticsearch's default, and the best general choice for text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, removes most punctuation, and lowercases the terms. It produces:

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

Simple analyzer: the simple analyzer splits the text on anything that isn't a letter and lowercases the terms. It produces:

set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace analyzer: the whitespace analyzer splits the text on whitespace, with no lowercasing. It produces:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

Language analyzers: language-specific analyzers are available for many languages and take the peculiarities of each language into account. The english analyzer, for instance, comes with a list of English stopwords (common words such as "and" and "the" that have little impact on relevance), which it removes, and it stems English words, reducing them to their root forms. The english analyzer produces:

set, shape, semi, transpar, call, set_tran, 5

Note how "transparent", "calling", and "set_trans" have been reduced to their root forms.

When analyzers are used

When we index a document, its full-text fields are analyzed into terms that are used to build the inverted index. However, when we search within a full-text field, the query string must pass through the same analysis process, so that the terms we search for take the same form as the terms in the index.

Full-text queries, which we discuss later, understand how each field is defined and so can do the right thing:
- when you query a full-text field, the query string is passed through the same analyzer to produce the correct list of terms to search for
- when you query an exact-value field, the query string is not analyzed; the single exact value is searched for as-is

Now we can explain the curious results from the start of the chapter:
- the date field holds an exact value: the single term "2014-09-15"
- the _all field is a full-text field, so analysis has turned the date into three terms: "2014", "09", and "15"

When we query the _all field for 2014, it matches all 12 tweets, because they all contain the term 2014:

GET /_search?q=2014 # 12 results

When we query the _all field for 2014-09-15, the query string is first analyzed into the terms 2014, 09, and 15, which again match all 12 tweets, because each contains the term 2014:

GET /_search?q=2014-09-15 # 12 results !

When we query the date field for 2014-09-15, it looks for that exact date and finds only one tweet:

GET /_search?q=date:2014-09-15 # 1 result

When we query the date field for 2014, it finds no documents, because none contains that exact date:

GET /_search?q=date:2014 # 0 results !

Testing analyzers

Especially when you are new to Elasticsearch, it can be hard to tell exactly what is being tokenized and stored in the index. To see what is going on,

you can use the analyze API, specifying the analyzer in the query string and the text to analyze in the request body:

GET /_analyze?analyzer=standard
Text to analyze

Each element in the result represents a single term:

{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      {
         "token":        "to",
         "start_offset": 5,
         "end_offset":   7,
         "type":         "<ALPHANUM>",
         "position":     2
      },
      {
         "token":        "analyze",
         "start_offset": 8,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3
      }
   ]
}

The token is the actual term stored in the index. position indicates where each term appeared in the original text, while start_offset and end_offset mark the character positions the term occupied in the original string.

The analyze API is a very useful tool for understanding what happens inside Elasticsearch indices, and we return to it as we progress.

Specifying analyzers

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.

You don't always want that. Sometimes you want an analyzer better suited to the language of your data; and sometimes you want a string field to be just a string, indexed as the exact value given, without analysis, such as a user ID, an internal status field, or a tag.

To achieve this, we must configure these fields manually by specifying their mapping.

Mapping

As explained earlier, each document in an index has a type, and every type has its own mapping, or schema definition. A mapping defines the fields within a type, the data type of each field, and how the field should be handled by Elasticsearch. Mappings are also used to configure metadata associated with the type, but in this chapter we cover only the essentials.

Core simple field types

Elasticsearch supports the following simple field types:

String:         string
Whole number:   byte, short, integer, long
Floating point: float, double
Boolean:        boolean
Date:           date

When you index a document containing a new field, one that has not been seen before, Elasticsearch uses dynamic mapping to guess the field type from the basic JSON data types, applying these rules:

JSON type                         Field type
--------------------------------------------
Boolean: true or false            "boolean"
Whole number: 123                 "long"
Floating point: 123.45            "double"
String, valid date: "2014-09-15"  "date"
String: "foo bar"                 "string"

NOTE: This means that if you index a number in quotes, "123", it is mapped as type string, not long. However, if the field is already mapped as type long, Elasticsearch tries to convert the string into a long, and throws an exception if it can't.

Viewing the mapping

We can view the mapping that Elasticsearch has created, using the _mapping endpoint. As at the start of this chapter, the following retrieves the mapping for the tweet type in the gb index:

GET /gb/_mapping/tweet

This shows the mapping for the fields (called properties) that Elasticsearch generated dynamically from the documents we indexed:

{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type": "date",
                  "format": "dateOptionalTime"
               },

87. "name": { "type": "string" }, "tweet": { "type": "string" }, "user_id": { "type": "long" } } } } } } 错误 age 为 string 类 integer 类 查询结 检查 类 设 义 type string 类 type { "number_of_clicks": { "type": "integer" } } string 类 认 虑 们 值 经过 查询 语 处 对 string 两 index analyer index index 值 值 释 analyzed 这 换 not_analyzed 这 值 样 no 这 这 为 string 类 认值 analyzed 们 为 值 们 设 为 not_analyzed { "tag": { "type": "string", "index": "not_analyzed" } } 简单类 —— long double date —— index 应 值 no not_analyzed 们 值

analyzer

For analyzed string fields, the analyzer attribute specifies which analyzer to apply at both index time and search time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english:

{
    "tweet": {
        "type":     "string",
        "analyzer": "english"
    }
}

In the custom analyzers section we show how to define and use custom analyzers.

Updating a mapping

You can specify the mapping for a type when you first create the index. You can also add a mapping for a new type, or update the mapping for an existing type, later with the _mapping endpoint.

IMPORTANT: Although you can add to an existing mapping, you can't change existing field mappings. If a field's mapping already exists, data for that field has probably already been indexed; changing the mapping would make the already-indexed data wrong and unsearchable.

We can update a mapping to add a new field, but we cannot change an existing field from analyzed to not_analyzed.

To demonstrate both, let's first delete the gb index:

DELETE /gb

Then create a new gb index, specifying that the tweet field should use the english analyzer:

PUT /gb <1>
{
  "mappings": {
    "tweet" : {
      "properties" : {
        "tweet" : {
          "type" :    "string",
          "analyzer": "english"
        },
        "date" : {
          "type" :   "date"
        },
        "name" : {
          "type" :   "string"
        },
        "user_id" : {
          "type" :   "long"
        }
      }
    }
  }
}

<1> This creates the index, with the mappings specified in the request body.

Later on, we decide to add a new not_analyzed text field called tag to the tweet mapping, using the _mapping endpoint:

PUT /gb/_mapping/tweet
{
  "properties" : {
    "tag" : {

89. "type" : "string", "index": "not_analyzed" } } } 们 经 为 们 们 们 经 测试 过 analyze API测试 对 这两 请 输 GET /gb/_analyze?field=tweet Black-cats <1> GET /gb/_analyze?field=tag Black-cats <1> <1> 们 请 tweet 产 两 词 "black" "cat" , tag 产 单 词 "Black-cats" 换 们

Complex core field types

Besides the simple scalar types already mentioned, JSON also has null values, arrays, and objects, all of which Elasticsearch supports.

Multi-value fields

It is quite possible that we want our tag field to contain more than one tag. Instead of a single string, we could index an array of tags:

{ "tag": [ "search", "nosql" ]}

No special mapping is required for arrays. Any field can contain zero, one, or more values, in the same way that a full-text field is analyzed to produce multiple terms.

This implies that all the values of an array must be of the same data type: you can't mix dates with strings. If you create a new field by indexing an array, Elasticsearch uses the data type of the first value in the array to determine the type of the field.

The elements inside an array are not ordered in the index; you cannot ask for "the first element" or "the last element" at search time. Rather, think of an array as a bag of values. The original, ordered array is of course preserved in the _source field, so it comes back intact when you retrieve the document.

Empty fields

Arrays can, of course, be empty. That is equivalent to having zero values. In fact, Lucene has no way to store a null value, so a field with a null value is also considered an empty field.

All four of the following are recognized as empty and would not be indexed:

"empty_string":          "",
"null_value":            null,
"empty_array":           [],
"array_with_null_value": [ null ]

Multi-level objects

The last native JSON data type we need to discuss is the object, known in other languages as a hash, hashmap, dictionary, or associative array.

Inner objects are often used to embed one entity or object inside another. For instance, instead of fields called user_name and user_id inside our tweet document, we could write it like this:

{
    "tweet":            "Elasticsearch is very flexible",
    "user": {
        "id":           "@johnsmith",
        "gender":       "male",
        "age":          26,
        "name": {
            "full":     "John Smith",
            "first":    "John",
            "last":     "Smith"
        }
    }
}

Mapping for inner objects

Elasticsearch detects new inner objects dynamically and maps them as type object, listing each inner field under properties:

{
  "gb": {
    "tweet": { <1>
      "properties": {
        "tweet":            { "type": "string" },
        "user": { <2>
          "type":             "object",
          "properties": {
            "id":           { "type": "string" },
            "gender":       { "type": "string" },
            "age":          { "type": "long"   },
            "name":   { <2>
              "type":         "object",
              "properties": {
                "full":     { "type": "string" },
                "first":    { "type": "string" },
                "last":     { "type": "string" }
              }
            }
          }
        }
      }
    }
  }
}

<1> Root object.
<2> Inner objects.

The mappings for the user and name fields have a similar structure to the mapping for the tweet type itself. In fact, the type mapping is just a special type of object mapping, which we refer to as the root object. It is just the same as any other object, except that it has some special top-level fields for document metadata, such as _source and the _all field.

How inner objects are indexed

Lucene doesn't understand inner objects. A Lucene document consists of a flat list of key-value pairs. In order to index inner objects usefully, Elasticsearch converts our document into something like this:

{
    "tweet":            [elasticsearch, flexible, very],
    "user.id":          [@johnsmith],
    "user.gender":      [male],
    "user.age":         [26],
    "user.name.full":   [john, smith],
    "user.name.first":  [john],
    "user.name.last":   [smith]
}

Inner fields can be referred to by name, e.g. "first". To distinguish between two fields with the same name, we can use the full path, e.g. "user.name.first", or even the type name plus the path: "tweet.user.name.first".

NOTE: In the simple flattened document above, there is no field called user and no field called user.name. Lucene indexes only scalar or simple values, not complex data structures.

Arrays of inner objects

Finally, consider how an array containing inner objects would be indexed. Let's say we have a followers array that looks

like this:

{
    "followers": [
        { "age": 35, "name": "Mary White"},
        { "age": 26, "name": "Alex Jones"},
        { "age": 19, "name": "Lisa Smith"}
    ]
}

This document will be flattened as described above, but the result looks like this:

{
    "followers.age":    [19, 26, 35],
    "followers.name":   [alex, jones, lisa, smith, mary, white]
}

The correlation between {age: 35} and {name: Mary White} has been lost, because each multi-value field is just a bag of values, not an ordered array. This is sufficient to ask: "Is there a follower who is 26 years old?" But we cannot get an accurate answer to: "Is there a follower who is 26 years old and who is called Alex Jones?"

Correlated inner objects, which are able to answer queries like these, are called nested objects, and we discuss them later, in the chapter on nested objects.


Full-body search

The lightweight query-string search is useful for ad hoc queries, but it has its limits. For anything beyond the simplest searches, you need the request body search API, which expresses the query as JSON and uses the full Query DSL.

Request body search doesn't just handle the query itself; it can also return things like highlighted snippets in the results, aggregations over all results for analytics, and did-you-mean suggestions, all of which help your users find the best results faster.

Empty search

Let's start with the simplest form of the search API, the empty search, which returns all documents in all indices:

GET /_search
{} <1>

<1> An empty request body.

Just as with query-string search, you can search one, several, or all indices, and one or more types:

GET /index_2014*/type1,type2/_search
{}

And you can paginate with the from and size parameters:

GET /_search
{
  "from": 30,
  "size": 10
}

A GET request with a body?

Some languages (notably JavaScript) and some HTTP libraries don't allow GET requests to carry a request body. In fact, some users are surprised that GET requests are allowed a body at all.

The truth is that RFC 7231, the RFC dealing with HTTP semantics and content, does not define what should happen to a GET request with a body! As a result, some HTTP servers allow it, while others, especially caching proxies, don't.

The authors of Elasticsearch prefer GET for search requests because they feel it describes the action, retrieving information, better than POST does. However, because GET with a body is not universally supported, the search API also accepts POST requests:

POST /_search
{
  "from": 30,
  "size": 10
}

The same workaround applies to any other GET API that requires a request body.

We discuss aggregations in later chapters; for now we focus on the query. Instead of the bare-bones searches we have seen so far, request body search lets us use the structured query language: the Query DSL (Query Domain Specific Language).


Structured query: Query DSL

The Query DSL is a flexible, expressive search language that Elasticsearch exposes through a simple JSON interface. It surfaces most of the power of Lucene, and it is what your application should use for its queries. It makes queries more flexible, more precise, easier to read, and easier to debug.

To use the Query DSL, pass a query in the query parameter:

GET /_search
{
    "query": YOUR_QUERY_HERE
}

The empty search, {}, is functionally equivalent to using the match_all query clause, which, as its name suggests, matches all documents:

GET /_search
{
    "query": {
        "match_all": {}
    }
}

Structure of a query clause

A query clause typically has this structure:

{
    QUERY_NAME: {
        ARGUMENT: VALUE,
        ARGUMENT: VALUE,...
    }
}

If it references one particular field, it has this structure:

{
    QUERY_NAME: {
        FIELD_NAME: {
            ARGUMENT: VALUE,
            ARGUMENT: VALUE,...
        }
    }
}

For instance, you can use a match query clause to find tweets whose tweet field mentions elasticsearch:

{
    "match": {
        "tweet": "elasticsearch"
    }
}

The full search request would look like this:

GET /_search
{
    "query": {
        "match": {

97. "tweet": "elasticsearch" } } } 查询 积 样 简单 为 杂 查询语 简单 (leaf clauses)( match ) 查询 ( )进 较 (compound) bool 许 论 must must_not 还 should { "bool": { "must": { "match": { "tweet": "elasticsearch" }}, "must_not": { "match": { "name": "mary" }}, "should": { "match": { "tweet": "full text" }} } } 查询 这 实现 杂 逻辑 实 查询 inbox 标记spam 邮 "business opportunity" 标(starred)邮 { "bool": { "must": { "match": { "email": "business opportunity" }}, "should": [ { "match": { "starred": true }}, { "bool": { "must": { "folder": "inbox" }}, "must_not": { "spam": true }} }} ], "minimum_should_match": 1 } } 这 细节 们 详细 释 为 单 查询 论 简单 还

Queries and filters

(This chapter translated by williamzhao.)

Although we refer to the Query DSL as a single language, it really consists of two dialects: the structured query DSL and the structured filter DSL. Query clauses and filter clauses are similar in nature, but serve different purposes.

A filter asks a yes-or-no question of every document, and is used for fields holding exact values:
- Is the created date in the range 2013 to 2014?
- Does the status field contain the exact word "published"?
- Is the lat_lon field within 10 km of a specified point?

A query is similar to a filter, but it also asks: How well does this document match? A query is typically used for full-text search, such as finding documents that:
- best match the words run, runs, running, jog, or sprint
- contain the words quick, brown, and fox, the closer together the better
- are tagged with lucene, search, or java, where more matching tags mean more relevance

A query calculates how relevant each document is, assigning it a relevance _score that is later used to sort the matching documents.

Performance differences

The output of a filter clause is simple: the list of documents that match. It is quick to compute and easy to cache in memory, using just 1 bit per document, and the cached result can be reused very efficiently by subsequent requests.

Queries not only have to find the matching documents but also to calculate how relevant each one is, which typically makes queries heavier than filters. Query results are also not cacheable.

Thanks to the inverted index, a simple query that matches only a few documents may do as well as, or better than, a cached filter over millions of documents. In general, though, a cached filter will outperform a query, and will do so consistently.

The goal of a filter is to shrink the set of documents that the query then has to examine in detail; the query performs its relevance scoring only on the documents the filters let through.

When to use which

As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filter clauses for everything else.
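To make the distinction concrete, here is an exact-value condition expressed as a filter alongside a relevance-scored condition expressed as a query (the status and title field names are illustrative):

A filter, an exact yes-or-no check whose result can be cached:

{ "term": { "status": "published" }}

A query, whose matching documents are also scored by how well they match:

{ "match": { "title": "quick brown fox" }}

The next sections introduce the most important clauses of each kind and show how to combine them.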

Most important queries and filters

While Elasticsearch comes with many queries and filters, you will use only a few of them frequently. We discuss them in far greater detail in later chapters, but here is a quick introduction to the most important ones.

term filter

The term filter is used to filter by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields:

{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}

terms filter

The terms filter is the same as the term filter, but allows multiple values to be specified. If the field contains any of the listed values, the document matches:

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

range filter

The range filter finds numbers or dates that fall into a specified range:

{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}

The operators it accepts are:
gt: greater than
gte: greater than or equal to
lt: less than
lte: less than or equal to

exists and missing filters

The exists and missing filters find documents in which the specified field either has one or more values (exists) or has none (missing). They are conceptually similar to NOT IS_NULL (exists) and IS_NULL (missing) in SQL:

{
    "exists":   {

100. "field": "title" } } 这两 过滤 针对 经查 时 bool 过滤 bool 过滤 过滤 查询结 尔逻辑 must :: 查询 , and must_not :: 查询 not should :: 查询 , or 这 别继 过滤 过滤 组 { "bool": { "must": { "term": { "folder": "inbox" }}, "must_not": { "term": { "tag": "spam" }}, "should": [ { "term": { "starred": true }}, { "term": { "unread": true }} ] } } match_all 查询 match_all 查询 查询 认语 { "match_all": {} } 查询 过滤 说 检 邮 , _score 为1 match 查询 match 查询 标 查询 查询还 查询 match 查询 查询 match 查询 { "match": { "tweet": "About Search" } } match 值 尔值 not_analyzed 时 为 给 值 { "match": { "age": 26 }} { "match": { "date": "2014-09-01" }} { "match": { "public": true }} { "match": { "tag": "full_text" }}

TIP: For exact-value searches, you probably want to use a filter instead of a query, because a filter will be cached.

Unlike the query-string search we showed earlier, the match query does not use a special query syntax: there are no operators such as +user_id:2 +tweet:search. It simply accepts the text, number, or date to look for, and treats whatever it is given as plain query text, which also makes it safe from syntax errors.

multi_match query

The multi_match query allows running the same match query against multiple fields:

{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}

bool query

Like the bool filter, the bool query combines multiple query clauses. There is one difference: whereas the bool filter simply answers yes or no, the bool query also computes a relevance _score for each matching document.

must: clauses that must match for the document to be included
must_not: clauses that must not match for the document to be included
should: if these clauses match, they increase the _score; otherwise, they have no effect. They simply refine the relevance score of each document.

The following query finds documents whose title field matches "how to make millions" and that are not tagged spam. Documents tagged starred or dated 2014 or later rank higher than they otherwise would:

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }},
            { "range": { "date": { "gte": "2014-01-01" }}}
        ]
    }
}

TIP: If a bool query has no must clauses, at least one should clause must match; if it has at least one must clause, no should clause is required to match.

Combining queries with filters

Queries can be used in query context, and filters in filter context. Throughout the Elasticsearch API, you will see parameters with query or filter in their names: these expect a single argument containing a single query or filter clause, respectively, establishing the outer context as query context or filter context.

Compound clauses can wrap other query or filter clauses: a query clause can contain a filter clause (and vice versa), as long as the right kind of clause is used in the right context. We may therefore need to translate between the two, wrapping a clause so that it matches the context around it. Seen this way, complex searches stay readable and easy to construct: combine queries and filters so that each does the job it is best at.

Filtering a query

Let's say we have this query:

{ "match": { "email": "business opportunity" }}

and we want to restrict it to emails in the inbox folder with this term filter:

{ "term": { "folder": "inbox" }}

The search API accepts only a single query parameter, so we need a way to apply both the query and the filter. The answer is the filtered query, which accepts both:

{
    "filtered": {
        "query":  { "match": { "email": "business opportunity" }},
        "filter": { "term":  { "folder": "inbox" }}
    }
}

As a full search request, placed in the top-level query parameter:

GET /_search
{
    "query": {
        "filtered": {
            "query":  { "match": { "email": "business opportunity" }},
            "filter": { "term": { "folder": "inbox" }}
        }
    }
}

Just a filter

While in query context, if you need only a filter and no full-text search, for example all emails in the inbox, you can simply omit the query:

GET /_search
{
    "query": {
        "filtered": {
            "filter":   { "term": { "folder": "inbox" }}
        }
    }
}

If no query is specified, it defaults to the match_all query, so the request above is equivalent to:

GET /_search
{
    "query": {
        "filtered": {
            "query":    { "match_all": {}},
            "filter":   { "term": { "folder": "inbox" }}
        }
    }
}

A query as a filter

Occasionally, you may want to use a query while you are in filter context. This is possible with the query filter, which simply wraps a query. The following example shows one way to exclude emails that look like spam:

GET /_search
{
    "query": {
        "filtered": {
            "filter":   {
                "bool": {
                    "must":     { "term":  { "folder": "inbox" }},
                    "must_not": {
                        "query": { <1>
                            "match": { "email": "urgent business proposal" }
                        }
                    }
                }
            }
        }
    }
}

<1> Note the query filter, which allows us to use the match query inside the bool filter.

NOTE: You rarely need to use a query as a filter, but it is included here for completeness. The only time you need it is when full-text matching is required while you are in filter context.

Validating queries

Queries can become quite complex, and, especially when combined with different analyzers and field mappings, somewhat hard to follow. The validate API can be used to check whether a query is valid before executing it:

GET /gb/tweet/_validate/query
{
   "query": {
      "tweet" : {
         "match" : "really powerful"
      }
   }
}

The response to this request tells us that the query is invalid:

{
  "valid" :         false,
  "_shards" : {
    "total" :       1,
    "successful" :  1,
    "failed" :      0
  }
}

Understanding errors

To find out why the query is invalid, add the explain parameter to the query string:

GET /gb/tweet/_validate/query?explain <1>
{
   "query": {
      "tweet" : {
         "match" : "really powerful"
      }
   }
}

<1> The explain flag provides more information about why the query is invalid.

Apparently, we mixed up the name of the query type (match) with the name of the field (tweet):

{
  "valid" :     false,
  "_shards" :   { ... },
  "explanations" : [ {
    "index" :   "gb",
    "valid" :   false,
    "error" :   "org.elasticsearch.index.query.QueryParsingException:
                 [gb] No query registered for [tweet]"
  } ]
}

Understanding queries

For a valid query, the explain parameter returns a human-readable description of the query, useful for seeing exactly how the query has been interpreted and will be executed by Elasticsearch:

GET /_validate/query?explain
{
   "query": {
      "match" : {
         "tweet" : "really powerful"
      }
   }
}

An explanation is returned for each index present, because the query may be interpreted differently in each:

{
  "valid" :         true,
  "_shards" :       { ... },
  "explanations" : [ {
    "index" :       "us",
    "valid" :       true,
    "explanation" : "tweet:really tweet:powerful"
  }, {
    "index" :       "gb",
    "valid" :       true,
    "explanation" : "tweet:really tweet:power"
  } ]
}

From the explanation we can see how the match query for "really powerful" has been rewritten into two single-term queries against the tweet field, one per term.

For the us index, the two terms are "really" and "powerful"; for the gb index they are "really" and "power". The reason is that in the gb index the tweet field uses the english analyzer, which stems "powerful" to its root "power".

Conclusion

This chapter introduced the two most common ways to search, by query string and with the structured Query DSL, along with the most frequently used query and filter clauses. It is worth spending time on the Query DSL: it is central to everything that follows, and we return to each query in greater depth in the advanced chapters.

The next chapter looks at how to work with your query results: sorting them, and understanding how relevance scores are calculated.


Sorting

By default, results are returned sorted by relevance; what relevance means, and how it is calculated, is discussed later in this chapter. First, let's look at the sort parameter and how to use it.

Sorting by field values

In order to sort by relevance, we need a value that represents relevance. In Elasticsearch, the relevance score is represented by the floating-point number returned with each result as the _score, and the default sort order is _score descending.

Sometimes, though, you don't have a meaningful relevance score. For example, the following query returns all tweets whose user_id field has the value 1:

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "user_id" : 1
                }
            }
        }
    }
}

Filters have no bearing on _score, and the missing (implied match_all) query gives every document a neutral score of 1: no document is more relevant than any other.

In this case, it probably makes sense to sort tweets by recency, newest first, which we can do with the sort parameter:

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}

You will notice two differences in the results:

"hits" : {
    "total" :           6,
    "max_score" :       null, <1>
    "hits" : [ {
        "_index" :      "us",
        "_type" :       "tweet",
        "_id" :         "14",
        "_score" :      null, <1>
        "_source" :     {
             "date":    "2014-09-24",
             ...
        },
        "sort" :        [ 1411516800000 ] <2>
    },
    ...
}

<1> The _score is not calculated, because it is not being used for sorting.
<2> The date field has been converted into its millisecond representation, used internally as the sort value.

The first difference is that each result carries a new sort element containing the value that was used for sorting. Here we sorted by date, which is internally indexed as milliseconds since the epoch: the long value 1411516800000 is equivalent to the date string 2014-09-24 00:00:00 UTC.

The second difference is that _score and max_score are both null. Calculating the _score can be expensive, and it is usually used only for sorting; since we are sorting by date, computing scores would be wasted effort. If you want the scores to be calculated regardless, set the track_scores parameter to true.

TIP: As a shortcut, you can specify just the name of the field to sort on, and the order defaults to ascending:

"sort": "number_of_children"

Multilevel sorting

Perhaps we want to combine a query with sorting on several criteria: results sorted first by date, then by _score where dates are equal:

GET /_search
{
    "query" : {
        "filtered" : {
            "query":   { "match": { "tweet": "manage text search" }},
            "filter" : { "term" : { "user_id" : 2 }}
        }
    },
    "sort": [
        { "date":   { "order": "desc" }},
        { "_score": { "order": "desc" }}
    ]
}

Order matters: results are sorted by the first criterion first, and the second criterion is used only for documents whose first sort values are identical, and so on for further sort levels.

Multilevel sorting doesn't have to involve _score: you can sort on several different fields, on geo-distance, or on a custom value computed in a script.

NOTE: Query-string search also supports custom sorting, via the sort parameter in the query string:

GET /_search?sort=date:desc&sort=_score&q=search

Sorting on multivalue fields

Sorting on a field with more than one value can be puzzling: which of the values should be used to sort the document? There is no intrinsic order to the values; each document simply offers a bag of values to choose from.

For numbers and dates, you can reduce a multivalue field to a single value with the min, max, avg, or sum sort modes. For instance, to sort on the earliest date in each dates field:

"sort": {
    "dates": {
        "order": "asc",
        "mode":  "min"
    }
}

String sorting and multifields

Analyzed string fields are also multivalue fields, and sorting on them is unreliable: a string such as "fine old art" is analyzed into multiple terms, leaving us with a bag of tokens rather than a single sort key. What we really want is to sort on the whole, unanalyzed string.

If Elasticsearch is asked to sort on a multi-token field anyway, it picks one token according to the sort mode, min by default; for "fine old art" that means sorting on the term "art", which is rarely what you want.

So, to sort on a string field, that field should contain exactly one term: the whole not_analyzed string. But of course we still need the same field analyzed for full-text search.

The naive solution would be to index the same string twice, in two separate fields: one analyzed for searching and one not_analyzed for sorting. But storing the same string twice in the _source field is wasteful.

What we really want is to index the same field in two different ways, and all of the core field types support this through the multifield mapping feature (fields). Instead of the plain mapping:

"tweet": {
    "type":     "string",
    "analyzer": "english"
}

we use a multifield mapping:

"tweet": { <1>
    "type":     "string",
    "analyzer": "english",
    "fields": {
        "raw": { <2>
            "type":  "string",
            "index": "not_analyzed"
        }
    }
}

<1> The main tweet field is exactly as before: an analyzed full-text field.
<2> The new tweet.raw subfield is not_analyzed.

Now, or at least as soon as we have reindexed our data, we can search on the tweet field for full text and sort on the tweet.raw field:

GET /_search
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    },
    "sort": "tweet.raw"
}

WARNING: Sorting on an analyzed full-text field can use a great deal of memory. See the fielddata section for more.

What is relevance?

We have said that, by default, results are returned in descending order of relevance. But what is relevance, and how is it calculated?

The relevance of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.

Each query clause generates a _score for each matching document, and how that score is calculated depends on the type of query clause: a fuzzy query might score by how similar a matched word is to the search keyword, while a terms query might score by the number of keywords that matched. What we usually mean by relevance, though, is the algorithm used to calculate how similar the contents of a full-text field are to a full-text query string.

The standard similarity algorithm used by Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

Term frequency: how often does the term appear in this field of this document? The more often, the more relevant. A field containing five mentions of a term is more likely to be relevant than a field containing only one mention.

Inverse document frequency: how often does the term appear in this field across all documents in the index? The more often, the less weight the term carries. A term that appears in many documents is less important than a rare one.

Field-length norm: how long is the field? The shorter the field, the more weight a term carries. A term appearing in a short field such as title says more about the document than the same term appearing in a long field such as content.

Individual queries may combine the TF/IDF score with other factors, such as the term proximity in phrase queries or term similarity in fuzzy queries.

Relevance is not restricted to full-text search; it applies to yes/no clauses as well: the more clauses that match, the higher the _score. When several query clauses are combined in a compound query such as bool, the _scores from the individual clauses are combined to calculate the overall _score for the document.

Understanding the score

When debugging a complex query, it can be hard to understand how a _score was calculated. Elasticsearch can produce an explanation of the score with every result, by setting the explain parameter to true:

GET /_search?explain <1>
{
   "query"   : { "match" : { "tweet" : "honeymoon" }}
}

<1> The explain parameter adds, to every result, an explanation of how its _score was calculated.

NOTE: Don't worry if parts of the following are unclear on a first read; revisit this section once you have more context. The first part of the returned result is the usual document metadata:

{
    "_index" :      "us",
    "_type" :       "tweet",
    "_id" :         "12",
    "_score" :      0.076713204,
    "_source" :     { ... trimmed ... },

}

The result also tells us which shard and node the document came from, which can be useful because term and document frequencies are calculated per shard, rather than per index:

    "_shard" :      1,
    "_node" :       "mzIVYCsqSWCG_M_ZffSs9Q",

Then comes the _explanation, which tells us which calculations were performed and the result of each:

"_explanation": { <1>
   "description": "weight(tweet:honeymoon in 0)
                  [PerFieldSimilarity], result of:",
   "value":       0.076713204,
   "details": [
      {
         "description": "fieldWeight in 0, product of:",
         "value":       0.076713204,
         "details": [
            {  <2>
               "description": "tf(freq=1.0), with freq of:",
               "value":       1,
               "details": [
                  {
                     "description": "termFreq=1.0",
                     "value":       1
                  }
               ]
            },
            { <3>
               "description": "idf(docFreq=1, maxDocs=1)",
               "value":       0.30685282
            },
            { <4>
               "description": "fieldNorm(doc=0)",
               "value":        0.25
            }
         ]
      }
   ]
}

<1> Summary of the score calculation for honeymoon.
<2> Term frequency.
<3> Inverse document frequency.
<4> Field-length norm.

WARNING: Producing the explain output is expensive. It is a debugging tool only; never leave it on in production.

The first part is the summary of the calculation. It tells us that it has calculated the weight, the TF/IDF score, of the term "honeymoon" in the tweet field of this document. (The 0 here is the internal document ID; it is used internally only and can be ignored.)

It then gives details of how the weight was calculated:

Term frequency: how many times did the term "honeymoon" appear in the tweet field of this document?

Inverse document frequency: how many times did the term "honeymoon" appear in the tweet field of all documents in the index?

Field-length norm: how long is the tweet field in this document? The longer the field, the smaller this number.

Explanations for more complicated queries can look very long, but they are built from the same components, and working through an explanation is enormously helpful for understanding how a score was produced.

TIP: The JSON output of explain is hard to read. It is easier to digest when converted to YAML, by adding format=yaml to the request.

Understanding why a document matched

While the explain option produces an explanation for every result, the explain API tells you why one particular document matched a query and, more usefully, why it didn't match. The path takes the form /index/type/id/_explain, as in the following request:

GET /us/tweet/12/_explain
{
   "query" : {
      "filtered" : {
         "filter" : { "term" :  { "user_id" : 2           }},
         "query" :  { "match" : { "tweet" :   "honeymoon" }}
      }
   }
}

Along with the full explanation we saw before, we also get a description like this:

"failure to match filter: cache(user_id:[2 TO 2])"

which tells us that the document did not pass the user_id filter.

Fielddata

This chapter closes with a look at a piece of Elasticsearch internals called fielddata. You will meet it again later, so it pays to know the basics now.

When you sort on a field, Elasticsearch needs access to the value of that field for every document that matches the query. The inverted index, which performs so well when searching, is not the ideal structure for this:
- when searching, we have a term and want to find the documents containing it
- when sorting, we have a document and want to know the value of one of its fields

To solve this, Elasticsearch effectively un-inverts the inverted index: it loads into memory a mapping from every document to the values of the field. This in-memory structure is called fielddata.

Elasticsearch loads fielddata for all documents in the index, not just for the documents matching the current query, because it is very likely that the same field will be needed by subsequent requests; loading the values once and keeping them in memory makes those requests fast.

Fielddata is used in several places in Elasticsearch:
- sorting on a field
- aggregations on a field
- certain filters (for example, geolocation filters)
- scripts that refer to field values

Clearly, fielddata can consume a lot of memory, especially for high-cardinality string fields, that is, string fields with many unique values, such as the body of an email. Fortunately, insufficient memory is a problem that can usually be solved by adding more nodes to the cluster.

For now, just know that fielddata exists, and that its memory use is worth keeping an eye on. Later we discuss where fielddata memory goes, how to control how much memory Elasticsearch allocates to it, and how to preload fielddata to improve the user experience.
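As a small sketch of how you might watch fielddata memory grow as you sort and aggregate, the indices stats API can report fielddata usage per field (treat the exact endpoint as an assumption for your version):

GET /_stats/fielddata?fields=*

The response lists, per index, how much memory fielddata is using for each field.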

Distributed Search Execution

Before moving on, let's take a detour and look at how search is executed in a distributed environment. It is a bit more complicated than the basic create-read-update-delete (CRUD) requests we discussed earlier. The detail in this chapter is for your interest; you don't need to remember all of it to use Elasticsearch. Read it to get a feel for how things work, and to know where to find this information if you need it later, but don't be overwhelmed.

A CRUD operation deals with a single document, whose unique combination of _index, _type, and routing value (by default the document's _id) tells us exactly which shard holds it.

Search is harder, because we don't know in advance which documents will match the query: they could be on any shard in the cluster. A search request therefore has to consult a copy of every shard in the index (or indices) of interest.

Finding all matching documents is only half the story, though: results from multiple shards must be combined into a single, globally sorted list before the search API can return a "page" of results. For this reason, search is executed in a two-phase process called query then fetch.

Query Phase

During the initial query phase, the query is broadcast to a shard copy (a primary or replica shard) of every shard in the index. Each shard executes the search locally and builds a priority queue of matching documents.

A priority queue is just a sorted list that holds the top-n matching documents. Its size depends on the pagination parameters from and size. For example, the following request would require a priority queue big enough to hold 100 documents:

GET /_search
{
    "from": 90,
    "size": 10
}

The query phase, depicted in Figure 1, proceeds as follows:

1. The client sends a search request to Node 3, which creates an empty priority queue of size from + size.
2. Node 3 forwards the request to a primary or replica copy of every shard in the index. Each shard executes the query locally and adds its results to a local sorted priority queue of size from + size.
3. Each shard returns the doc IDs and sort values of the documents in its priority queue to the coordinating node, Node 3, which merges them into its own priority queue to produce a globally sorted result set.

The node that receives a search request becomes the coordinating node for that request. Its job is to broadcast the request to all involved shards and to gather their responses into a globally sorted result set to return to the client.

Just as with document GET requests, search requests can be handled by a primary shard or by any of its replicas; this is how adding replicas (combined with more hardware) can increase search throughput. The coordinating node round-robins through shard copies on subsequent requests to spread the load.

Each shard builds its own sorted priority queue of length from + size--enough to satisfy the global request all by itself--and returns a lightweight list of results to the coordinating node: just the doc IDs and any values required for sorting, such as the _score. The coordinating node merges these into its own queue, which represents the globally sorted results, and the query phase ends.

Note: an index can consist of one or more primary shards, so a search request against a single index already has to combine the results of multiple shards. A search against multiple indices, or even all indices, works in exactly the same way--there are simply more shards involved.
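To illustrate, the same request body can be sent to several indices at once; the index names here reuse the placeholder names from earlier examples:

GET /index_one,index_two/_search
{
    "query": { "match_all": {} }
}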

Fetch Phase

The query phase identifies which documents satisfy the request, but we still need to retrieve the documents themselves. This is the job of the fetch phase, shown in Figure 2, which consists of the following steps:

1. The coordinating node identifies which documents need to be fetched and issues a multi GET request to the relevant shards.
2. Each shard loads the documents and enriches them if required, then returns them to the coordinating node.
3. Once all documents have been fetched, the coordinating node returns the results to the client.

The coordinating node first decides which documents actually need to be fetched. For instance, if the query specified { "from": 90, "size": 10 }, the first 90 results are discarded and only the next 10 are retrieved, possibly from one, several, or all of the shards involved.

The coordinating node builds a multi-get request for each shard holding a pertinent document and sends it to the same shard copy that handled the query phase. The shard loads the document bodies--the _source field--and, if requested, enriches the results with metadata and highlighting. The coordinating node then assembles everything into a single response for the client.

Deep Pagination

Query-then-fetch supports pagination with from and size, but only within limits. Remember that each shard must build a priority queue of length from + size, all of which is passed back to the coordinating node, and the coordinating node must then sort through number_of_shards * (from + size) documents to find the correct size documents to return.

Depending on your documents, shard count, and hardware, paging 10,000 to 50,000 results deep (1,000 to 5,000 pages) may be perfectly feasible, but with big-enough from values the sorting process becomes very heavy indeed, consuming vast amounts of CPU, memory, and bandwidth. We strongly advise against deep paging.

In practice, "deep pagers" are seldom human: a person gives up after two or three pages and refines the search instead. The usual culprits are bots and web spiders that keep fetching page after page until your servers are on the edge of collapse. If you do need to fetch large numbers of documents, you can do so efficiently by disabling sorting with the scan search type, discussed later in this chapter.

Search Options

A few optional query-string parameters can influence the search process.

preference

The preference parameter controls which shards or nodes handle the request. It accepts values such as _primary, _primary_first, _local, _only_node:xyz, _prefer_node:xyz, and _shards:2,3, which are described in detail in the search preference documentation. The most generally useful value, however, is an arbitrary string, used to avoid the bouncing results problem.

Bouncing Results

Imagine sorting results by a timestamp field when two documents share the same timestamp. Because search requests are round-robined between shard copies, the two documents may come back in one order from the primary and in the other order from a replica. Each time the user refreshes the page, the result order appears to change: this is the bouncing results problem. It can be avoided by always routing the same user to the same shards, for instance by setting preference to the user's session ID.

timeout

By default, the coordinating node waits for a response from every shard, so a single struggling node can slow down all searches. The timeout parameter tells the coordinating node how long to wait before giving up and returning whatever results it already has--some results can be better than none. The response indicates whether the search timed out and how many shards responded successfully:

    ...
    "timed_out":     true,  (1)
    "_shards": {
       "total":      5,
       "successful": 4,
       "failed":     1 (2)
    },
    ...

(1) The search request timed out.
(2) One shard out of five failed to respond in time. A shard may also be reported as failed for other reasons, such as hardware failure.

routing

In the distributed document store chapters, we explained how a custom routing value at index time can ensure that all related documents--all of a single user's documents, say--land on one shard. At search time, instead of querying all shards of an index, you can supply one or more routing values to limit the search to just the relevant shards:

GET /_search?routing=user_1,user2

This technique comes in handy when designing very large systems, and we discuss it in detail in the chapter on designing for scale.
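Putting two of these options together, a request that pins a user's searches to the same shard copies and caps the wait might look like this sketch; the session string and the timeout value are illustrative assumptions:

GET /_search?preference=session_xyz&timeout=10ms
{
    "query": { "match": { "tweet": "elasticsearch" }}
}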

search_type

Although query_then_fetch is the default search type, other search types can be specified for particular purposes, for example:

GET /_search?search_type=count

count

The count search type has only a query phase. Use it when you don't need the search results themselves, just the number of matching documents (or aggregations over them).

query_and_fetch

The query_and_fetch search type combines the query and fetch phases into a single step. It is an internal optimization used when a request targets a single shard, for instance when a routing value has been specified. Although you can select this search type manually, there is essentially never a good reason to do so.

dfs_query_then_fetch and dfs_query_and_fetch

The dfs search types add a prequery phase that fetches term and document frequencies from all involved shards in order to calculate global term frequencies. We discuss these further in the relevance-is-broken section.

scan

The scan search type is used with the scroll API to retrieve large numbers of results efficiently, by disabling sorting. We discuss scan-and-scroll in the next section.
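As a quick illustration, count accepts an ordinary query body; only the hit count is computed, and no fetch phase runs:

GET /_search?search_type=count
{
    "query": { "match": { "tweet": "elasticsearch" }}
}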

scan and scroll

The scan search type and the scroll API are used together to retrieve large numbers of documents from Elasticsearch efficiently, without paying the penalty of deep pagination.

scroll

A scrolled search lets us make an initial search and then keep pulling batches of results until none are left, rather like a cursor in a traditional database. A scrolled search takes a snapshot in time: it doesn't see changes made to the index after the initial request, because it keeps the old data files around to preserve its "view" of the index as it looked at search time.

scan

The costly part of deep pagination is the global sorting of results; if we disable sorting, we can return all documents quite cheaply. That is what the scan search type does: it tells Elasticsearch to skip sorting and just return the next batch of results from every shard that still has results.

To use scan-and-scroll, execute a search request with search_type set to scan, passing a scroll parameter that tells Elasticsearch how long the scroll should stay open:

GET /old_index/_search?search_type=scan&scroll=1m (1)
{
    "query": { "match_all": {}},
    "size":  1000
}

(1) Keep the scroll open for 1 minute.

The response to this request contains no hits, but does include a _scroll_id, a long Base-64-encoded string. We pass the _scroll_id to the _search/scroll endpoint to retrieve the first batch of results:

GET /_search/scroll?scroll=1m (1)
c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0 <2>
NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1Vy
UVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YW
xfaGl0czoxOw==

(1) Keep the scroll open for another minute.
(2) The _scroll_id can be passed in the body, in the URL, or as a query-string parameter.

Note that we specify ?scroll=1m again: the scroll expiry time is refreshed every time we run a scroll request, so it needs to cover only the time required to process the current batch of results, not all matching documents.

The response to this scroll request contains the first batch of results. Although we specified a size of 1,000, we may get back many more documents: when scanning, size is applied per shard, so each batch can contain up to size * number_of_primary_shards documents.

Each scroll response also carries a new _scroll_id, which must be passed to the next scroll request. When no more hits are returned, all matching documents have been processed.

Some of the official Elasticsearch clients provide scan-and-scroll helpers that make this pattern even simpler to use.

Index Management

We have seen how Elasticsearch lets you start developing a new application without any advance planning or setup. It doesn't take long, however, before you want to fine-tune indexing and search for your particular use case, and almost all of these customizations relate to the index and the types it contains. In this chapter we introduce the APIs for managing indices and type mappings, along with the most important settings.

Creating an Index

So far we have created indices simply by indexing a document into them, letting Elasticsearch apply default settings and deduce field types with dynamic mapping. Now we want more control: to ensure the index is created with the appropriate number of primary shards, and with analyzers and mappings in place before any data is indexed.

To do this, create the index manually, passing any settings or type mappings in the request body:

PUT /my_index
{
    "settings": { ... any settings ... },
    "mappings": {
        "type_one": { ... any mappings ... },
        "type_two": { ... any mappings ... },
        ...
    }
}

In fact, you can forbid the automatic creation of indices altogether by adding the following to config/elasticsearch.yml on each node:

action.auto_create_index: false

NOTE: Later we discuss index templates, which let you preconfigure automatically created indices. They are particularly useful for log data, where a new index is created every day.

Deleting an Index

To delete an index, use:

DELETE /my_index

You can delete multiple indices:

DELETE /index_one,index_two
DELETE /index_*

You can even delete all indices:

DELETE /_all

Index Settings

There are many knobs for customizing index behavior, all described in the Index Modules reference documentation, but Elasticsearch ships with good defaults: don't twiddle these knobs until you understand what they do and why you would change them.

Two of the most important settings are:

number_of_shards
The number of primary shards in the index, which defaults to 5. It cannot be changed after the index is created.

number_of_replicas
The number of replica shards per primary, which defaults to 1. It can be changed at any time on a live index.

For instance, we could create a small index with just one primary shard and no replicas:

PUT /my_temp_index
{
    "settings": {
        "number_of_shards" :   1,
        "number_of_replicas" : 0
    }
}

Later, we can change the number of replicas dynamically with the update-index-settings API:

PUT /my_temp_index/_settings
{
    "number_of_replicas": 1
}

Configuring Analyzers

The third important index setting is the analysis section, used to configure existing analyzers or to create custom analyzers specific to your index.

As introduced earlier, the standard analyzer is the default for full-text string fields and a reasonable choice for most Western languages. It consists of:

The standard tokenizer, which splits text on word boundaries
The standard token filter, intended to tidy up the emitted tokens
The lowercase token filter, which lowercases all tokens
The stop token filter, which removes stopwords--common words with little impact on relevance, such as a, the, and, is

By default the stopwords filter is disabled. You can enable it by creating a custom analyzer based on standard and setting the stopwords parameter, either to a list of words or to a predefined language list. In this example we create an analyzer called es_std that uses the predefined Spanish stopwords list:

PUT /spanish_docs
{
    "settings": {
        "analysis": {
            "analyzer": {
                "es_std": {
                    "type":      "standard",
                    "stopwords": "_spanish_"
                }
            }
        }
    }
}

The es_std analyzer is not global--it exists only in the spanish_docs index where it is defined--so to test it with the analyze API we must name the index:

GET /spanish_docs/_analyze?analyzer=es_std
El veloz zorro marrón

The abbreviated results show that the Spanish stopword El has been removed correctly:

{
  "tokens" : [
    { "token" :    "veloz",   "position" : 2 },
    { "token" :    "zorro",   "position" : 3 },
    { "token" :    "marrón",  "position" : 4 }
  ]
}

Custom Analyzers

While Elasticsearch ships with many analyzers out of the box, the real power comes from building your own, by combining character filters, tokenizers, and token filters in a configuration that suits your data. These three functions are executed in sequence:

Character filters

Character filters "tidy up" the string before tokenization. For instance, HTML text contains tags such as <p> or <div> that we don't want indexed; the html_strip character filter removes all HTML tags and converts HTML entities such as &Aacute; into the corresponding Unicode character Á. An analyzer may have zero or more character filters.

Tokenizers

An analyzer must have exactly one tokenizer, which breaks the string into individual terms or tokens. The standard tokenizer splits on word boundaries and removes most punctuation; the keyword tokenizer emits the input unchanged as a single token; the whitespace tokenizer splits on whitespace only; and the pattern tokenizer splits on a regular expression.

Token filters

After tokenization, the token stream passes through the specified token filters, in order. Token filters may change, add, or remove tokens. We have already met the lowercase and stop filters; among many others, stemmer filters reduce words to their root form, the ascii_folding filter removes diacritics (converting très into tres), and the ngram and edge_ngram filters produce tokens suitable for partial matching and autocomplete.

Creating a Custom Analyzer

Just as we configured es_std previously, we configure character filters, tokenizers, and token filters in their respective sections under analysis:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

As an example, let's set up a custom analyzer that will:

1. Strip out HTML, using the html_strip character filter.
2. Replace & characters with " and ", using a custom mapping character filter:

"char_filter": {
    "&_to_and": {
        "type":       "mapping",
        "mappings": [ "&=> and "]
    }
}

3. Tokenize words, using the standard tokenizer.
4. Lowercase terms, using the lowercase token filter.
5. Remove a custom list of stopwords, using a custom stop token filter:

"filter": {
    "my_stopwords": {
        "type":        "stop",
        "stopwords": [ "the", "a" ]
    }
}

Our analyzer definition combines the predefined tokenizer and filters with the custom ones:

"analyzer": {
    "my_analyzer": {
        "type":          "custom",
        "char_filter": [ "html_strip", "&_to_and" ],
        "tokenizer":     "standard",
        "filter":      [ "lowercase", "my_stopwords" ]
    }
}

Putting it all together, the whole create-index request looks like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}

After creating the index, test the new analyzer with the analyze API:

GET /my_index/_analyze?analyzer=my_analyzer
The quick & brown fox

The following abbreviated results show that the analyzer is working correctly:

{

  "tokens" : [
      { "token" :   "quick",    "position" : 2 },
      { "token" :   "and",      "position" : 3 },
      { "token" :   "brown",    "position" : 4 },
      { "token" :   "fox",      "position" : 5 }
    ]
}

The analyzer isn't much use unless we tell Elasticsearch where to use it. We can apply it to a string field with a mapping like this:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":      "string",
            "analyzer":  "my_analyzer"
        }
    }
}
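Once the mapping is in place, you can double-check which analyzer the field will use by analyzing sample text against the field itself; the field-based form of the analyze API picks up whatever analyzer the mapping assigns. Reusing the test sentence from above:

GET /my_index/_analyze?field=title
The quick & brown fox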

Types and Mappings

A type in Elasticsearch represents a class of similar documents, such as a user or a blogpost. It consists of a name and a mapping which, like a database schema, describes the fields a document of that type may have, the datatype of each field--string, integer, date, and so on--and how each field should be indexed and stored by Lucene.

We have said that types are like tables in a relational database, and that analogy is convenient for getting started, but it is worth understanding how types are actually implemented, because the differences matter.

How Lucene Sees Documents

A document in Lucene is a simple list of field-value pairs. A field must have at least one value, but may have several, and a single string value may be converted into multiple values by analysis. Lucene doesn't care whether a value is a string, a number, or a date: all values are opaque bytes. When Lucene indexes a document, each field's values are added to the inverted index for that field; optionally, the original value can also be stored unchanged for later retrieval.

How Types Are Implemented

Elasticsearch types are built on this simple foundation. An index may have several types, and the documents of all types are stored together in the same index. Because Lucene has no notion of document types, each document's type name is stored in a metadata field called _type; when we search a particular type, Elasticsearch simply applies a filter on _type.

Lucene also has no notion of mappings. Mappings are the layer Elasticsearch uses to map complex JSON documents onto the flat documents Lucene expects. For instance, the mapping for the name field of the user type may declare it as a string analyzed with the whitespace analyzer:

"name": {
    "type":     "string",
    "analyzer": "whitespace"
}

Avoiding Type Gotchas

Because documents of different types share the same index, unexpected complications can arise. Imagine two types in one index: blog_en for English blog posts and blog_es for Spanish ones. Both have a title field, but one uses the english analyzer and the other the spanish analyzer. The problem shows up in this query:

GET /_search
{
    "query": {
        "match": {
            "title": "The quick brown fox"
        }
    }
}

We are searching the title field across both types. Which analyzer should analyze the query string, spanish or english? Elasticsearch will use the analyzer of the first title field it finds, which will be correct for some documents and incorrect for the others.

We can avoid this problem either by naming the fields differently--for example, title_en and title_es--or by querying each type's field explicitly:

GET /_search
{
    "query": {
        "multi_match": { <1>
            "query":    "The quick brown fox",
            "fields": [ "blog_en.title", "blog_es.title" ]
        }
    }
}

<1> The multi_match query runs a match query on each field and combines the results. The query string is analyzed with the english analyzer for blog_en.title and with the spanish analyzer for blog_es.title, and the scores from both are merged.

A worse situation arises when two types declare fields with the same name but conflicting datatypes. Consider these two documents:

Type: user

{ "login": "john_smith" }

Type: event

{ "login": "2014-06-01" }

Lucene doesn't care that one login value is a string and the other a date: it happily indexes the bytes of both. But if we try to sort on event.login, Elasticsearch must load the login values into memory, and, as described in the fielddata section, it loads the values for all documents in the index regardless of type. Attempting to load them all as either a string or a date will fail or produce unexpected results.

To avoid these conflicts, it is strongly recommended that fields with the same name be mapped in the same way in every type within an index.

The Root Object

The uppermost level of a mapping is known as the root object. It may contain:

A properties section, listing the mapping for each field the document may contain
Metadata fields, all starting with an underscore, such as _type, _id, and _source
Settings controlling how new fields are detected dynamically, such as analyzer, dynamic_date_formats, and dynamic_templates
Settings that can be applied both to the root object and to fields of type object, such as enabled, dynamic, and include_in_all

Properties

We have already covered the three most important field attributes in earlier chapters:

type
The datatype of the field, such as string or date.

index
Whether the field should be full-text searchable (analyzed), searchable as an exact value (not_analyzed), or not searchable at all (no).

analyzer
Which analyzer to apply to an analyzed string field, at both index time and search time.

We cover other field types, such as ip, geo_point, and geo_shape, in the relevant sections later in the book.
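Putting the three attributes together, a field mapping might look like the following sketch; the field names are hypothetical examples, not taken from the book's dataset:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "properties": {
                "status":  { "type": "string", "index": "not_analyzed" },
                "body":    { "type": "string", "analyzer": "english" },
                "created": { "type": "date" }
            }
        }
    }
}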

Metadata: _source Field

By default, Elasticsearch stores the JSON string representing the document body in the _source field. Like all stored fields, _source is compressed before being written to disk. Keeping it is almost always what you want, because it means:

The full document is available directly from the search results--no separate round-trip to fetch it from another store.
Partial update requests do not work without it.
When your mapping changes and you need to reindex, you can do so directly from Elasticsearch rather than refetching all your documents from a (usually slower) primary data store.
Individual fields can be extracted from _source and returned in get or search requests, without retrieving the whole document.
Debugging is easier, because you can see exactly what each document contains rather than guessing from a list of IDs.

That said, storing _source does use disk space. If none of the above matters to you, it can be disabled:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "_source": {
                "enabled":  false
            }
        }
    }
}

In a search request, you can ask for only certain fields of the source by specifying the _source parameter:

GET /_search
{
    "query":   { "match_all": {}},
    "_source": [ "title", "created" ]
}

Values for these fields are extracted from _source and returned instead of the full document.

Stored Fields

Besides indexing a field's values, you can also choose to store the original value for later retrieval. Users with a Lucene background use stored fields to choose which field values they can return in results. In fact, _source itself is a stored field, and in Elasticsearch, marking individual fields as stored is usually a false optimization: the whole document is already stored as _source, and it is almost always better to extract just the fields you need with the _source parameter.

Metadata: _all Field

Earlier we introduced the _all field: a special field that indexes the values of all other fields as one big string. The query_string query (and ?q= searches) defaults to the _all field when no field is specified.

The _all field is useful in the exploration phase of a new application, while the final structure of your documents is still unclear. You can throw any query string at it and have a good chance of finding what you're after:

GET /_search
{
    "match": { "_all": "john smith marketing" }
}

As your application evolves and your search requirements become more exacting, you will use _all less and less: it is a shotgun approach. Querying individual fields gives finer control over relevance. One important relevance factor is field length: a term in a short title field carries more weight than the same term in a long content field, and that distinction disappears inside _all.

If you no longer need _all, you can disable it:

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "_all": { "enabled": false }
    }
}

Inclusion in _all can also be controlled per field with the include_in_all setting, which defaults to true. Setting include_in_all on an object (or the root object) changes the default for all fields within it. You might keep _all as a catchall that contains only specific fields such as title, overview, summary, and tags: instead of disabling _all entirely, disable include_in_all by default and re-enable it on the chosen fields:

PUT /my_index/my_type/_mapping
{
    "my_type": {
        "include_in_all": false,
        "properties": {
            "title": {
                "type":           "string",
                "include_in_all": true
            },
            ...
        }
    }
}

Remember that _all is just an analyzed string field. It uses the default analyzer for its values, regardless of the analyzers set on the source fields, and like any string field it can be given its own analyzer:

PUT /my_index/my_type/_mapping
{
    "my_type": {
        "_all": { "analyzer": "whitespace" }
    }
}


Metadata: Document Identity

Four metadata fields are associated with document identity:

_id: the string ID of the document
_type: the type of the document
_index: the index the document belongs to
_uid: the _type and _id concatenated as type#id

By default, _uid is both stored and indexed; _type is indexed but not stored; and _id and _index are neither indexed nor stored--they don't really exist as fields. Nevertheless, you can query _id as though it were a real field, because Elasticsearch derives it from _uid.

Although the index and store settings of these fields can be changed, you almost never need to do so. One setting you may find useful is the path setting on _id, which tells Elasticsearch to extract the _id value from a field within the document itself:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "_id": {
                "path": "doc_id" <1>
            },
            "properties": {
                "doc_id": {
                    "type":   "string",
                    "index":  "not_analyzed"
                }
            }
        }
    }
}

<1> Extract the document _id from the doc_id field.

Then, when you index a document:

POST /my_index/my_type
{
    "doc_id": "123"
}

the _id value is taken from doc_id:

{
    "_index":   "my_index",
    "_type":    "my_type",
    "_id":      "123", <1>
    "_version": 1,
    "created":  true
}

<1> The _id has been extracted correctly.

Warning: while convenient, this has a slight performance impact on bulk requests (see the note on the bulk format). The node handling the request can no longer use the optimized bulk format, which reads only the metadata line to decide which shard should receive each request; instead it has to parse the document body as well.

Dynamic Mapping

When Elasticsearch encounters a previously unknown field in a document, it uses dynamic mapping to determine its datatype and adds the new field to the type mapping automatically.

Sometimes that is what you want; sometimes it isn't. Perhaps you don't yet know which fields your documents will contain and want new fields added automatically. Or perhaps you want full control, and would prefer that unknown fields be ignored, or that indexing a document with an unknown field raise an exception. You can control this behavior with the dynamic setting, which accepts:

true: add new fields dynamically--the default
false: ignore new fields
strict: throw an exception when an unknown field is encountered

The dynamic setting may be applied to the root object or to any field of type object. You could make strict the default but enable dynamic fields for one inner object:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "dynamic":      "strict", <1>
            "properties": {
                "title":  { "type": "string"},
                "stash":  {
                    "type":     "object",
                    "dynamic":  true <2>
                }
            }
        }
    }
}

<1> The my_type object throws an exception on unknown fields.
<2> The stash object creates new fields dynamically.

With this mapping, you can add new searchable fields inside stash:

PUT /my_index/my_type/1
{
    "title":   "This doc adds a new field",
    "stash": { "new_field": "Success!" }
}

But doing the same at the top level fails:

PUT /my_index/my_type/1
{
    "title":     "This throws a StrictDynamicMappingException",
    "new_field": "Fail!"
}

NOTE: Setting dynamic to false does not alter the _source field at all: the _source still contains the complete JSON you indexed. The unknown fields simply aren't added to the mapping and aren't searchable.

Customizing Dynamic Mapping

If you know you will be adding new fields on the fly, you probably want to leave dynamic mapping enabled. At times, though, its rules can be a bit blunt. Fortunately, there are settings to customize them.

date_detection

When Elasticsearch encounters a new string field, it checks whether the string looks like a date, such as 2014-01-01. If it does, the field is added as type date; otherwise as type string. Sometimes this causes problems. Imagine the first document with a note field looks like this:

{ "note": "2014-01-01" }

The note field is added as a date. But the next document looks like this:

{ "note": "Logged out" }

This is clearly not a date, but it is too late: the field is already a date field, and this "malformed date" causes an exception. Date detection can be turned off on the root object:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "date_detection": false
        }
    }
}

With this mapping, new strings are always mapped as type string; date fields must then be added to the mapping manually. Elasticsearch's idea of which strings look like dates can also be adjusted with the dynamic_date_formats setting.

dynamic_templates

With dynamic_templates, you can take full control of the mapping generated for newly detected fields, even applying different mappings depending on the field name or detected datatype. Each template has a name, a mapping to apply, and at least one matching rule that decides which fields the template applies to. Templates are checked in order, and the first matching template wins.

For example, we could define two templates for string fields: es, for field names ending in _es, which should use the spanish analyzer, and en, for all other string fields, which should use the english analyzer. We list es first, because it is more specific than the catchall en template:

PUT /my_index
{
    "mappings": {
        "my_type": {
            "dynamic_templates": [

                { "es": {
                      "match":              "*_es", <1>
                      "match_mapping_type": "string",
                      "mapping": {
                          "type":           "string",
                          "analyzer":       "spanish"
                      }
                }},
                { "en": {
                      "match":              "*", <2>
                      "match_mapping_type": "string",
                      "mapping": {
                          "type":           "string",
                          "analyzer":       "english"
                      }
                }}
            ]
}}}

<1> Match string fields whose name ends in _es.
<2> Match all other string fields.

The match_mapping_type parameter restricts a template to fields of the specified detected type, such as string or long. The match parameter matches the field name, while path_match matches the full path to a field within an object: the pattern address.*.name would match a field like this:

{
    "address": {
        "city": {
            "name": "New York"
        }
    }
}

The unmatch and path_unmatch patterns can be used to exclude fields that would otherwise match. More options can be found in the reference documentation for the root object.

Default Mapping

Often, all the types in an index share similar fields and settings. Rather than repeating them for every new type, it is more convenient to put these common settings in the _default_ mapping, which acts as a template for new types: every type created afterward includes the default settings, unless it overrides them explicitly.

For instance, we can disable the _all field for all types via _default_, while enabling it just for the blog type:

PUT /my_index
{
    "mappings": {
        "_default_": {
            "_all": { "enabled":  false }
        },
        "blog": {
            "_all": { "enabled":  true  }
        }
    }
}

The _default_ mapping is also a good place to define index-wide dynamic templates.

Reindexing Your Data

Although you can add new types to an index, or new fields to a type, you cannot add new analyzers or change existing fields: the data already indexed would be wrong, and searches would misbehave. The simplest way to apply such changes is to reindex: create a new index with the new settings and copy all documents from the old index into it.

One advantage of the _source field is that the complete document is already available inside Elasticsearch; you don't have to rebuild the index from your original, usually slower, data store.

To reindex efficiently, use scan-and-scroll to pull batches of documents from the old index, and the bulk API to push them into the new one. For a large data set, it can be convenient to break the job into chunks by filtering on a date or timestamp field, and run the chunks one after another:

GET /old_index/_search?search_type=scan&scroll=1m
{
    "query": {
        "range": {
            "date": {
                "gte":  "2014-01-01",
                "lt":   "2014-02-01"
            }
        }
    },
    "size":  1000
}

If you keep writing to the old index while reindexing is in progress, rerun the process afterward with a filter that matches only the documents added since the previous pass, so that the new index ends up complete.
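The write side of this loop is not shown above. As a hedged sketch, each batch returned by the scroll request can be fed to the bulk API of the new index, reusing the old documents' IDs; the index name and the elided field placeholders are illustrative:

POST /new_index/_bulk
{ "index": { "_type": "old_type", "_id": "1" }}
{ "date": "2014-01-02", ... other fields copied from the old document ... }
{ "index": { "_type": "old_type", "_id": "2" }}
{ "date": "2014-01-03", ... }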

Index Aliases and Zero Downtime

The problem with the reindexing process just described is that your application has to start using the new index name. Index aliases to the rescue!

An index alias is like a shortcut or symbolic link: it can point to one or more indices, and can be used in almost any API that expects an index name. Aliases give us enormous flexibility. They allow us to:

Switch transparently between one index and another on a running cluster
Group multiple indices together (for example, last_three_months)
Create "views" on a subset of the documents in an index

We return to other uses of aliases later in the book; here we show how to use them to switch from an old index to a new one with zero downtime.

There are two endpoints for managing aliases: _alias for single operations, and _aliases to perform multiple operations atomically.

In this scenario, the application talks to an index named my_index. In reality, my_index will be an alias that points to the current real index, whose name carries a version number: my_index_v1, my_index_v2, and so on. To start, create the index my_index_v1 and point the alias my_index at it:

PUT /my_index_v1 <1>
PUT /my_index_v1/_alias/my_index <2>

<1> Create the index my_index_v1.
<2> Set the my_index alias to point to my_index_v1.

You can check which index the alias points to:

GET /*/_alias/my_index

or which aliases point to the index:

GET /my_index_v1/_alias/*

Both return the following:

{
    "my_index_v1" : {
        "aliases" : {
            "my_index" : { }
        }
    }
}

Later, we decide to change the mapping of a field. Existing mappings can't be changed, so we have to reindex, and we begin by creating my_index_v2 with the new mapping:

PUT /my_index_v2
{
    "mappings": {
        "my_type": {
            "properties": {
                "tags": {
                    "type":   "string",
                    "index":  "not_analyzed"

                }
            }
        }
    }
}

Next we reindex the data from my_index_v1 to my_index_v2, following the process described in the reindexing section. Once we are satisfied that the documents have been reindexed correctly, we switch the alias to the new index.

An alias can point to multiple indices, so we must remove the alias from the old index at the same moment as we add it to the new index. The change has to be atomic, which is why we use the _aliases endpoint:

POST /_aliases
{
    "actions": [
        { "remove": { "index": "my_index_v1", "alias": "my_index" }},
        { "add":    { "index": "my_index_v2", "alias": "my_index" }}
    ]
}

Your application switches from the old index to the new one transparently, with zero downtime.

TIP: Even when you think your index design is perfect, you will probably need to change something once the index is in production. Be prepared: use aliases instead of raw index names in your application, and you will be able to reindex whenever you need to. Aliases are cheap; use them liberally.
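To confirm the switch, the application keeps issuing exactly the same request it always did; after the atomic swap above, a search through the alias is served by my_index_v2 (the query body here is just a minimal illustration):

GET /my_index/_search
{
    "query": { "match_all": {} }
}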

[[inside-a-shard]]
== Inside a Shard

In <>, we introduced the shard, and described((("shards"))) it as a low-level worker unit. But what exactly is a shard and how does it work? In this chapter, we answer these questions:

* Why is search near real-time?
* Why are document CRUD (create-read-update-delete) operations real-time?
* How does Elasticsearch ensure that the changes you make are durable, that they won't be lost if there is a power failure?
* Why does deleting documents not free up space immediately?
* What do the refresh, flush, and optimize APIs do, and when should you use them?

The easiest way to understand how a shard functions today is to start with a history lesson. We will look at the problems that needed to be solved in order to provide a distributed durable data store with near real-time search and analytics.

.Content Warning

The information presented in this chapter is for your interest. You are not required to understand and remember all the detail in order to use Elasticsearch. Read this chapter to gain a taste for how things work, and to know where the information is in case you need to refer to it in the future, but don't be overwhelmed by the detail.

[[making-text-searchable]]
=== Making Text Searchable

The first challenge that had to be solved was how to((("text", "making it searchable"))) make text searchable. Traditional databases store a single value per field, but this is insufficient for full-text search. Every word in a text field needs to be searchable, which means that the database needs to be able to index multiple values--words, in this case--in a single field.

The data structure that best supports the multiple-values-per-field requirement is the inverted index, which((("inverted index"))) we introduced in <>. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.

Term  | Doc 1 | Doc 2 | Doc 3 | ...
------------------------------------
brown |   X   |       |   X   | ...
fox   |   X   |   X   |   X   | ...
quick |   X   |   X   |       | ...
the   |   X   |       |   X   | ...

[NOTE]
When discussing inverted indices, we talk about indexing documents because, historically, an inverted index was used to index whole unstructured text documents. A document in Elasticsearch is a structured JSON document with fields and values. In reality, every indexed field in a JSON document has its own inverted index.

The inverted index may hold a lot more information than the list of documents that contain a particular term. It may store a count of the number of documents that contain each term, the number of times a term appears in a particular document, the order of terms in each document, the length of each document, the average length of all documents, and more. These statistics allow Elasticsearch to determine which terms are more important than others, and which documents are more important than others, as described in <>.

The important thing to realize is that the inverted index needs to know about all documents in the collection in order for it to function as intended.

In the early days of full-text search, one big inverted index was built for the entire document collection and written to disk. As soon as the new index was ready, it replaced the old index, and recent changes became searchable.

[role="pagebreak-before"]
==== Immutability

The inverted index that is written to disk is immutable: it doesn't change.((("inverted index", "immutability"))) Ever. This immutability has important benefits:

* There is no need for locking. If you never have to update the index, you never have to worry about multiple processes trying to make changes at the same time.
* Once the index has been read into the kernel's filesystem cache, it stays there, because it never changes. As long as there is enough space in the filesystem cache, most reads will come from memory instead of having to hit disk. This provides a big performance boost.
* Any other caches (like the filter cache) remain valid for the life of the index. They don't need to be rebuilt every time the data changes, because the data doesn't change.
* Writing a single large inverted index allows the data to be compressed, reducing costly disk I/O and the amount of RAM needed to cache the index.

Of course, an immutable index has its downsides too, primarily the fact that it is immutable! You can't change it. If you want

to make new documents searchable, you have to rebuild the entire index. This places a significant limitation either on the amount of data that an index can contain, or the frequency with which the index can be updated.

[[dynamic-indices]]
=== Dynamically Updatable Indices

The next problem that needed to be ((("indices", "dynamically updatable")))solved was how to make an inverted index updatable without losing the benefits of immutability. The answer turned out to be: use more than one index.

Instead of rewriting the whole inverted index, add new supplementary indices to reflect more-recent changes. Each inverted index can be queried in turn--starting with the oldest--and the results combined.

Lucene, the Java libraries on which Elasticsearch is based, introduced the concept of per-segment search. ((("per-segment search")))((("segments")))((("indices", "in Lucene"))) A segment is an inverted index in its own right, but now the word index in Lucene came to mean a collection of segments plus a commit point—a file((("commit point"))) that lists all known segments, as depicted in <>. New documents are first added to an in-memory indexing buffer, as shown in <>, before being written to an on-disk segment, as in <>.

[[img-index-segments]]
.A Lucene index with a commit point and three segments
image::images/elas_1101.png["A Lucene index with a commit point and three segments"]

.Index Versus Shard

To add to the confusion, a Lucene index is what we call a shard in Elasticsearch, while an index in Elasticsearch((("indices", "in Elasticsearch")))((("shards", "indices versus"))) is a collection of shards. When Elasticsearch searches an index, it sends the query out to a copy of every shard (Lucene index) that belongs to the index, and then reduces the per-shards results to a global result set, as described in <>.

A per-segment search works as follows:

1. New documents are collected in an in-memory indexing buffer. See <>.
2. Every so often, the buffer is committed:
** A new segment--a supplementary inverted index--is written to disk.
** A new commit point is written to disk, which includes the name of the new segment.
** The disk is fsync'ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
3. The new segment is opened, making the documents it contains visible to search.
4. The in-memory buffer is cleared, and is ready to accept new documents.

[[img-memory-buffer]]
.A Lucene index with new documents in the in-memory buffer, ready to commit
image::images/elas_1102.png["A Lucene index with new documents in the in-memory buffer, ready to commit"]

[[img-post-commit]]
.After a commit, a new segment is added to the commit point and the buffer is cleared
image::images/elas_1103.png["After a commit, a new segment is added to the index and the buffer is cleared"]

When a query is issued, all known segments are queried in turn. Term statistics are aggregated across all segments to ensure that the relevance of each term and each document is calculated accurately. In this way, new documents can be added to the index relatively cheaply.

[[deletes-and-updates]]
==== Deletes and Updates

Segments are immutable, so documents cannot be removed from older segments, nor can older segments be updated to reflect a newer version of a document. Instead, every ((("deleted documents")))commit point includes a .del file that lists which documents in which segments have been deleted.

When a document is "deleted," it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.
Document updates work in a similar way: when a document is updated, the old version of the document is marked as

deleted, and the new version of the document is indexed in a new segment. Perhaps both versions of the document will match a query, but the older deleted version is removed before the query results are returned.

In <>, we show how deleted documents are purged from the filesystem.

[[near-real-time]]
=== Near Real-Time Search

With the development of per-segment search, the ((("searching", "near real-time search")))delay between indexing a document and making it visible to search dropped dramatically. New documents could be made searchable within minutes, but that still isn't fast enough.

The bottleneck is the disk. ((("committing segments to disk")))((("fsync")))((("segments", "committing to disk"))) Committing a new segment to disk requires an http://en.wikipedia.org/wiki/Fsync[`fsync`] to ensure that the segment is physically written to disk and that data will not be lost if there is a power failure. But an fsync is costly; it cannot be performed every time a document is indexed without a big performance hit.

What was needed was a more lightweight way to make new documents visible to search, which meant removing fsync from the equation.

Sitting between Elasticsearch and the disk is the filesystem cache.((("filesystem cache"))) As before, documents in the in-memory indexing buffer (<>) are written to a new segment (<>). But the new segment is written to the filesystem cache first--which is cheap--and only later is it flushed to disk--which is expensive. But once a file is in the cache, it can be opened and read, just like any other file.

[[img-pre-refresh]]
.A Lucene index with new documents in the in-memory buffer
image::images/elas_1104.png["A Lucene index with new documents in the in-memory buffer"]

Lucene allows new segments to be written and opened--making the documents they contain visible to search--without performing a full commit. This is a much lighter process than a commit, and can be done frequently without ruining performance.

[[img-post-refresh]]
.The buffer contents have been written to a segment, which is searchable, but is not yet committed
image::images/elas_1105.png["The buffer contents have been written to a segment, which is searchable, but is not yet commited"]

[[refresh-api]]
==== refresh API

In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh.((("shards", "refreshes")))((("refresh API"))) By default, every shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within 1 second.

This can be confusing for new users: they index a document and try to search for it, and it just isn't there. The way around this is to perform a manual refresh, with the refresh API:

[source,json]
POST /_refresh <1>
POST /blogs/_refresh <2>

<1> Refresh all indices.
<2> Refresh just the blogs index.

[TIP]
While a refresh is much lighter than a commit, it still has a performance cost.((("indices", "refresh_interval"))) A manual refresh can be useful when writing tests, but don't do a manual refresh every time you index a document in production; it will hurt your performance. Instead, your application needs to be aware of the near

real-time nature of Elasticsearch and make allowances for it.

Not all use cases require a refresh every second. Perhaps you are using Elasticsearch to index millions of log files, and you would prefer to optimize for index speed rather than near real-time search. You can reduce the frequency of refreshes on a per-index basis by ((("refresh_interval setting")))setting the refresh_interval:

[source,json]
PUT /my_logs
{
  "settings": {
    "refresh_interval": "30s" <1>
  }
}

<1> Refresh the my_logs index every 30 seconds.

The refresh_interval can be updated dynamically on an existing index. You can turn off automatic refreshes while you are building a big new index, and then turn them back on when you start using the index in production:

[source,json]
POST /my_logs/_settings
{ "refresh_interval": -1 } <1>

POST /my_logs/_settings
{ "refresh_interval": "1s" } <2>

<1> Disable automatic refreshes.
<2> Refresh automatically every second.

CAUTION: The refresh_interval expects a duration such as 1s (1 second) or 2m (2 minutes). An absolute number like 1 means 1 millisecond--a sure way to bring your cluster to its knees.

[[translog]]
=== Making Changes Persistent

Without an fsync to flush data in the filesystem cache to disk, we cannot be sure that the data will still ((("persistent changes, making")))((("changes, persisting")))be there after a power failure, or even after exiting the application normally. For Elasticsearch to be reliable, it needs to ensure that changes are persisted to disk.

In <>, we said that a full commit flushes segments to disk and writes a commit point, which lists all known segments. ((("commit point"))) Elasticsearch uses this commit point during startup or when reopening an index to decide which segments belong to the current shard.

While we refresh once every second to achieve near real-time search, we still need to do full commits regularly to make sure that we can recover from failure. But what about the document changes that happen between commits? We don't want to lose those either.

Elasticsearch added a translog, or transaction log,((("translog (transaction log)"))) which records every operation in Elasticsearch as it happens. With the translog, the process now looks like this:

1. When a document is indexed, it is added to the in-memory buffer and appended to the translog, as shown in <>.
+
[[img-xlog-pre-refresh]]
.New documents are added to the in-memory buffer and appended to the transaction log
image::images/elas_1106.png["New documents are added to the in-memory buffer and appended to the transaction log"]

2. Once every second, the shard is refreshed:
** The docs in the in-memory buffer are written to a new segment, without an fsync.
** The segment is opened to make it visible to search.
** The in-memory buffer is cleared.
+
The refresh leaves the shard in the state depicted in <>.
+
[[img-xlog-post-refresh]]
.After a refresh, the buffer is cleared but the transaction log is not
image::images/elas_1107.png["After a refresh, the buffer is cleared but the transaction log is not"]

3. This process continues with more documents being added to the in-memory buffer and appended to the transaction log (see <>).
+
[[img-xlog-pre-flush]]
.The transaction log keeps accumulating documents
image::images/elas_1108.png["The transaction log keeps accumulating documents"]

4. Every so often--such as when the translog is getting too big--the index is flushed; a new translog is created, and a full commit is performed (see <>):
** Any docs in the in-memory buffer are written to a new segment.
** The buffer is cleared.
** A commit point is written to disk.
** The filesystem cache is flushed with an fsync.
** The old translog is deleted.

The translog provides a persistent record of all operations that have not yet been flushed to disk. When starting up, Elasticsearch will use the last commit point to recover known segments from disk, and will then replay all operations in the translog to add the changes that happened after the last commit.

The translog is also used to provide real-time CRUD. When you try to retrieve, update, or delete a document by ID, it first

checks the translog for any recent changes before trying to retrieve the document from the relevant segment. This means that it always has access to the latest known version of the document, in real-time.

[[img-xlog-post-flush]]
.After a flush, the segments are fully committed and the transaction log is cleared
image::images/elas_1109.png["After a flush, the segments are fully commited and the transaction log is cleared"]

[[flush-api]]
==== flush API

The action of performing a commit and truncating the translog is known in Elasticsearch as a flush. ((("flushes"))) Shards are flushed automatically every 30 minutes, or when the translog becomes too big. See the http://bit.ly/1E3HKbD[`translog` documentation] for settings that can be used((("translog (transaction log)", "flushes and"))) to control these thresholds.

The http://bit.ly/1ICgxiU[`flush` API] can ((("indices", "flushing")))((("flush API")))be used to perform a manual flush:

[source,json]
POST /blogs/_flush <1>
POST /_flush?wait_for_ongoing <2>

<1> Flush the blogs index.
<2> Flush all indices and wait until all flushes have completed before returning.

You seldom need to issue a manual flush yourself; usually, automatic flushing is all that is required. That said, it is beneficial to <> your indices before restarting a node or closing an index. When Elasticsearch tries to recover or reopen an index, it has to replay all of the operations in the translog, so the shorter the log, the faster the recovery.

.How Safe Is the Translog?

The purpose of the translog is to ensure that operations are not lost. This begs the question: how safe((("translog (transaction log)", "safety of"))) is the translog?

Writes to a file will not survive a reboot until the file has been +fsync+'ed to disk. By default, the translog is +fsync+'ed every 5 seconds. Potentially, we could lose 5 seconds' worth of data--if the translog were the only mechanism that we had for dealing with failure.

Fortunately, the translog is only part of a much bigger system. Remember that an indexing request is considered successful only after it has completed on both the primary shard and all replica shards. Even if the node holding the primary shard were to suffer catastrophic failure, it would be unlikely to affect the nodes holding the replica shards at the same time.

While we could force the translog to fsync more frequently (at the cost of indexing performance), it is unlikely to provide more reliability.

[[merge-process]]
=== Segment Merging

With the automatic refresh process creating a new segment((("segments", "merging"))) every second, it doesn't take long for the number of segments to explode. Having too many segments is a problem. Each segment consumes file handles, memory, and CPU cycles. More important, every search request has to check every segment in turn; the more segments there are, the slower the search will be.

Elasticsearch solves this problem by merging segments in the background.((("merging segments"))) Small segments are merged into bigger segments, which, in turn, are merged into even bigger segments.

This is the moment when those old deleted documents((("deleted documents", "purging of"))) are purged from the filesystem. Deleted documents (or old versions of updated documents) are not copied over to the new bigger segment.

There is nothing you need to do to enable merging. It happens automatically while you are indexing and searching. The process works as depicted in <>:

1. While indexing, the refresh process creates new segments and opens them for search.
2. The merge process selects a few segments of similar size and merges them into a new bigger segment in the background. This does not interrupt indexing and searching.
+
[[img-merge]]
.Two committed segments and one uncommitted segment in the process of being merged into a bigger segment
image::images/elas_1110.png["Two commited segments and one uncommited segment in the process of being merged into a bigger segment"]

3. <> illustrates activity as the merge completes:
** The new segment is flushed to disk.
** A new commit point is written that includes the new segment and excludes the old, smaller segments.
** The new segment is opened for search.
** The old segments are deleted.

[[img-post-merge]]
.Once merging has finished, the old segments are deleted
image::images/elas_1111.png["Once merging has finished, the old segments are deleted"]

The merging of big segments can use a lot of I/O and CPU, which can hurt search performance if left unchecked. By default, Elasticsearch throttles the merge process so that search still has enough resources available to perform well.

TIP: See <> for advice about tuning merging for your use case.

[[optimize-api]]
==== optimize API

The optimize API is best ((("merging segments", "optimize API and")))((("optimize API")))((("segments", "merging", "optimize API")))described as the forced merge API. It forces a shard to be merged down to the number of segments specified in the max_num_segments parameter. The intention is to reduce the number of segments (usually to one) in order to speed up search performance.

WARNING: The optimize API should not be used on a dynamic index--an index that is being actively updated. The background merge process does a very good job, and optimizing will hinder the process. Don't interfere!

In certain specific circumstances, the optimize API can be beneficial. The typical use case is for logging, where logs are stored in an index per day, week, or month. Older indices are essentially read-only; they are unlikely to change.

In this case, it can be useful to optimize the shards of an old index down to a single segment each; it will use fewer resources and searches will be quicker:

[source,json]
POST /logstash-2014-10/_optimize?max_num_segments=1 <1>

<1> Merges each shard in the index down to a single segment.

[WARNING]
Be aware that merges triggered by the optimize API are not throttled at all. They can consume all of the I/O on your nodes, leaving nothing for search and potentially making your cluster unresponsive. If you plan on optimizing an index, you should use shard allocation (see <>) to first move the index to a node where it is safe to run.

Structured Search

Structured search is about interrogating data that has inherent structure. Dates, times, and numbers are all structured: they have precise formats on which you can perform logical operations, such as comparing ranges or determining which of two values is larger.

Text can be structured too. A box of crayons has a discrete set of colors: red, green, blue. A blog post may be tagged with keywords. Products in an ecommerce store have Universal Product Codes (UPCs) or other identifiers with strict, structured formats.

With structured search, the answer to your question is always a yes or a no: a document either matches or it doesn't. Structured search does not concern itself with document relevance or scoring; it simply includes or excludes documents.

This should make sense logically. A number can't be "more" within a range than other numbers in the same range: it is either in the range or it isn't. Similarly, for structured text, a value is either equal to the search term or it isn't; there is no notion of "more similar".

Finding Exact Values

When working with exact values, you will be working with filters. Filters are important because they are very fast: they do not calculate relevance (skipping the entire scoring phase), and they are easily cached. We discuss the performance benefits of filters later in the caching section; for now, just remember to use filters as often as you can.

term Filter with Numbers

We'll explore the term filter first, because you will use it often. It can handle numbers, Booleans, dates, and text. Let's start with numbers by indexing some products, each with a price and a productID:

POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

Our goal is to find all products with a certain price. In SQL, the query would look like this:

SELECT document
FROM   products
WHERE  price = 20

In the Elasticsearch query DSL, we use a term filter, which looks for the exact value we specify. By itself, it is simple:

{
    "term" : {
        "price" : 20
    }
}

But a term filter isn't a query on its own; as introduced in the query DSL chapter, it must be passed to the search API inside a filtered query:

GET /my_store/products/_search
{
    "query" : {
        "filtered" : { <1>
            "query" : {
                "match_all" : {} <2>
            },
            "filter" : {
                "term" : { <3>
                    "price" : 20
                }
            }
        }
    }
}

<1> The filtered query accepts both a query and a filter.
<2> A match_all query matches all documents; it is the default, so later examples may omit the query section.
<3> The term filter we saw previously, placed inside the filter clause.

Once executed, the results are exactly what you would expect: only document 2 is returned, since only it has a price of 20:

"hits" : [
    {
        "_index" : "my_store",
        "_type" :  "products",
        "_id" :    "2",
        "_score" : 1.0, <1>
        "_source" : {
          "price" :     20,
          "productID" : "KDKE-B-9947-#kL5"
        }
    }
]

<1> Filters do not perform scoring. The score comes from the match_all query, which treats all docs as equal, so every result gets a neutral score of 1.

term Filter with Text

The term filter matches strings just as easily as numbers. Instead of price, let's find products with a certain UPC identification code. In SQL:

SELECT product
FROM   products
WHERE  productID = "XHDK-A-1293-#fJ3"

Translated into the query DSL, we try a similar query with the term filter:

GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

Except there is a little hiccup: we get no results back! Why? The problem isn't with the term filter; it is with how the data has been indexed. If we use the analyze API, we can see that the UPC has been tokenized into smaller tokens:

GET /my_store/_analyze?field=productID
XHDK-A-1293-#fJ3

{
  "tokens" : [ {
    "token" :        "xhdk",
    "start_offset" : 0,
    "end_offset" :   4,
    "type" :         "<ALPHANUM>",
    "position" :     1
  }, {
    "token" :        "a",
    "start_offset" : 5,
    "end_offset" :   6,
    "type" :         "<ALPHANUM>",
    "position" :     2
  }, {
    "token" :        "1293",
    "start_offset" : 7,
    "end_offset" :   11,

    "type" :         "<NUM>",
    "position" :     3
  }, {
    "token" :        "fj3",
    "start_offset" : 13,
    "end_offset" :   16,
    "type" :         "<ALPHANUM>",
    "position" :     4
  } ]
}

There are a few important points here: we have four distinct tokens instead of one token representing the whole UPC, all letters have been lowercased, and we lost the hyphens and the hash (#) sign. So when our term filter looks for the exact value XHDK-A-1293-#fJ3, it finds nothing, because that token does not exist in the inverted index.

Obviously, this is not what we want when dealing with identification codes or any kind of precise enumeration. To prevent it, we need to tell Elasticsearch that this field contains an exact value by marking it not_analyzed. We must first delete the old index (because it has the incorrect mapping) and re-create it with the correct mapping:

DELETE /my_store <1>

PUT /my_store <2>
{
    "mappings" : {
        "products" : {
            "properties" : {
                "productID" : {
                    "type" : "string",
                    "index" : "not_analyzed" <3>
                }
            }
        }
    }
}

<1> Deleting the index first is required, since we cannot change mappings that already exist.
<2> With the index deleted, we can re-create it with our custom mapping.
<3> Here we explicitly say that we don't want productID to be analyzed.

Now we can reindex our documents:

POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

Only now will our term filter work as expected. Let's try it again on the newly indexed data:

GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}

Since the productID field is not analyzed, and the term filter performs no analysis, the query finds the exact match and returns document 1 as a hit.

Internal Filter Operation

Internally, Elasticsearch performs several steps when executing a filter:

1. Find matching docs. The term filter looks up the term XHDK-A-1293-#fJ3 in the inverted index and retrieves the list of documents containing that term; in this case, only document 1.

2. Build a bitset. The filter then builds a bitset--an array of 1s and 0s--describing which documents contain the term. Matching documents get a 1 bit, so in our example the bitset is [1,0,0,0].

3. Cache the bitset. Finally, the bitset is stored in memory, so that future uses of the same filter can skip steps 1 and 2. This makes filters very fast.

When a filtered query is executed, the filter runs before the query. The resulting bitset is handed to the query, which uses it to skip over any documents the filter has already excluded--another way filters improve performance, since excluded documents never need to be scored.

Combining Filters

The previous examples used a single filter. In practice, you will often need to filter on multiple values or fields. For example, how would you express this SQL in Elasticsearch?

SELECT product
FROM   products
WHERE  (price = 20 OR productID = "XHDK-A-1293-#fJ3")
  AND  (price != 30)

For this, you need the bool filter: a compound filter that accepts other filters as arguments and combines them with Boolean logic.

Bool Filter

The bool filter is composed of three sections:

{
   "bool" : {
      "must" :     [],
      "should" :   [],
      "must_not" : []
   }
}

must: all of these clauses must match. The equivalent of AND.
must_not: all of these clauses must not match. The equivalent of NOT.
should: at least one of these clauses must match. The equivalent of OR.

And that's it! When you need multiple filters, simply place them into the appropriate sections of a bool filter. Each section is optional (you could use just a must clause, for example), and each can contain a single filter or an array of filters.

To replicate the preceding SQL, we place the two term filters inside the should clause of a bool filter, and add a must_not clause for the NOT condition:

GET /my_store/products/_search
{
   "query" : {
      "filtered" : { <1>
         "filter" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, <2>
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} <2>
              ],
              "must_not" : {
                 "term" : {"price" : 30} <3>
              }
           }
         }
      }
   }
}

<1> Note that we still need a filtered query to wrap everything.
<2> These two term filters are children of the bool filter; since they are inside the should clause, at least one of them must match.

<3> Any product with a price of 30 is automatically excluded, because it matches a must_not clause.

Our search returns two hits, each satisfying a different clause of the bool filter:

"hits" : [
    {
        "_id" :     "1",
        "_score" :  1.0,
        "_source" : {
          "price" :     10,
          "productID" : "XHDK-A-1293-#fJ3" <1>
        }
    },
    {
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
          "price" :     20, <2>
          "productID" : "KDKE-B-9947-#kL5"
        }
    }
]

<1> Matches the term filter for productID = "XHDK-A-1293-#fJ3"
<2> Matches the term filter for price = 20

Nesting Boolean Filters

Although bool is a compound filter that accepts child filters, it is itself just a filter. That means bool filters can be nested inside other bool filters, letting you express arbitrarily complex Boolean logic.

Given this SQL statement:

SELECT document
FROM   products
WHERE  productID      = "KDKE-B-9947-#kL5"
  OR (     productID = "JODL-X-1937-#pV7"
       AND price     = 30 )

We can translate it into a pair of nested bool filters:

GET /my_store/products/_search
{
   "query" : {
      "filtered" : {
         "filter" : {
            "bool" : {
              "should" : [
                { "term" : {"productID" : "KDKE-B-9947-#kL5"}}, <1>
                { "bool" : { <1>
                   "must" : [
                     { "term" : {"productID" : "JODL-X-1937-#pV7"}}, <2>
                     { "term" : {"price" : 30}} <2>
                   ]
                }}
              ]
           }
         }
      }
   }
}

<1> The term and the nested bool are sibling clauses inside the outer should, so at least one of them must match.
<2> These two term clauses are siblings in a must clause, so both must match for a document to be returned.

The results show two documents, one matching each of the should clauses:

"hits" : [
    {
        "_id" :     "2",
        "_score" :  1.0,
        "_source" : {
          "price" :     20,
          "productID" : "KDKE-B-9947-#kL5" <1>
        }
    },
    {
        "_id" :     "3",
        "_score" :  1.0,
        "_source" : {
          "price" :      30, <2>
          "productID" : "JODL-X-1937-#pV7" <2>
        }
    }
]

<1> This productID matches the term in the outer bool.
<2> These two fields match the term filters inside the nested bool.

This was a simple example, but it shows how Boolean filters serve as building blocks for constructing complex logical conditions.

Finding Multiple Exact Values

The term filter is useful for finding a single value, but often you want to match several. What if you want documents with a price of 20 or 30? Rather than using multiple term filters, use a single terms filter (note the s at the end): the plural version of term, which accepts an array of values:

{
    "terms" : {
        "price" : [20, 30]
    }
}

Like the term filter, we place it inside the filter clause of a filtered query:

GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "terms" : { <1>
                    "price" : [20, 30]
                }
            }
        }
    }
}

<1> The terms filter, placed inside the filtered query.

The query returns the documents with a price of either 20 or 30:

"hits" : [
    {
        "_id" :    "2",
        "_score" : 1.0,
        "_source" : {
          "price" :     20,
          "productID" : "KDKE-B-9947-#kL5"
        }
    },
    {
        "_id" :    "3",
        "_score" : 1.0,
        "_source" : {
          "price" :     30,
          "productID" : "JODL-X-1937-#pV7"
        }
    },
    {
        "_id":     "4",
        "_score":  1.0,
        "_source": {
           "price":     30,
           "productID": "QQPX-R-3956-#aD8"
        }
    }
]

Contains, but Does Not Equal

It is important to understand that term and terms are contains operations, not equals. What does that mean? A term filter such as:

{ "term" : { "tags" : "search" } }

will match both of these documents:

{ "tags" : ["search"] }
{ "tags" : ["search", "open_source"] } <1>

<1> This document is returned even though it has terms other than search.

Recall how the term filter works: it looks the term up in the inverted index and builds a bitset from the documents listed there. In our simple example, the inverted index looks like this:

Token         | DocIDs
-----------------------
open_source   | 2
search        | 1, 2

When a term filter runs for the token search, it goes straight to that row and extracts the doc IDs. Both documents 1 and 2 contain search, so both are returned.

The nature of an inverted index also makes whole-field equality hard to compute: to determine that a document contains only your term, you would have to find the term, extract the doc IDs, and then scan every other row of the inverted index looking for those IDs. That would be tremendously inefficient and expensive.

If you do want whole-field behavior, the practical approach is to index a secondary field recording how many values the field contains. Using our two previous documents, we now include a tag_count field:

{ "tags" : ["search"], "tag_count" : 1 }
{ "tags" : ["search", "open_source"], "tag_count" : 2 }

Once the tag count is indexed, we can filter on both the tag and the count:

GET /my_index/my_type/_search
{
    "query": {
        "filtered" : {
            "filter" : {
                 "bool" : {
                    "must" : [
                        { "term" : { "tags" : "search" } }, <1>
                        { "term" : { "tag_count" : 1 } } <2>
                    ]
                }
            }
        }
    }
}

<1> Find all documents that have the term search.
<2> But make sure the document has only one tag.

This query now matches only the document whose single tag is search, rather than any document that merely contains search.
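The indexing side of this trick is implied rather than shown: the application has to maintain tag_count itself when writing documents. As a sketch, using the same index and type names as the query above:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "tags" : ["search"], "tag_count" : 1 }
{ "index": { "_id": 2 }}
{ "tags" : ["search", "open_source"], "tag_count" : 2 }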


Ranges

So far we have searched for exact values, but in the real world you often want ranges: all products priced above 20 and below 40, say. In SQL, a range is expressed like this:

SELECT document
FROM   products
WHERE  price BETWEEN 20 AND 40

Elasticsearch has a range filter, which, unsurprisingly, filters on ranges:

"range" : {
    "price" : {
        "gt" : 20,
        "lt" : 40
    }
}

The range filter supports both inclusive and exclusive ranges, through combinations of these options:

gt:  > greater than
lt:  < less than
gte: >= greater than or equal to
lte: <= less than or equal to

Here is a full example:

GET /my_store/products/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "price" : {
                        "gte" : 20,
                        "lt"  : 40
                    }
                }
            }
        }
    }
}

For an unbounded range (for example, just >20), omit one of the boundaries:

"range" : {
    "price" : {
        "gt" : 20
    }
}

Ranges on Dates

The range filter also works on date fields:

"range" : {
    "timestamp" : {

        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-07 00:00:00"
    }
}

When used on date fields, the range filter supports date math. For example, to find all documents with a timestamp in the last hour:

"range" : {
    "timestamp" : {
        "gt" : "now-1h"
    }
}

This filter always matches documents with a timestamp newer than the current time minus one hour, making it a sliding window across your documents; the calculation is simply applied relative to now.

Date math can also be applied to actual dates rather than the now placeholder. Add a double pipe (||) after the date and follow it with a date math expression:

"range" : {
    "timestamp" : {
        "gt" : "2014-01-01 00:00:00",
        "lt" : "2014-01-01 00:00:00||+1M" <1>
    }
}

<1> Less than January 1, 2014 plus one month.

Date math is calendar aware, so it knows the number of days in each month, the days in a year, and so forth. More details can be found in the date format reference documentation.

Ranges on Strings

The range filter can also operate on string fields. String ranges are calculated lexicographically, or alphabetically. For example, these values are sorted in lexicographic order:

5, 50, 6, B, C, a, ab, abb, abc, b

Terms in the inverted index are stored in this order, which is why string ranges use it. If we want a range from a up to but not including b, we can use the same range filter syntax:

"range" : {
    "title" : {
        "gte" : "a",
        "lt" :  "b"
    }
}

Be careful of cardinality, though. Numbers and dates are indexed in ways that make range calculations fast, but to execute a range over strings, Elasticsearch effectively performs a term operation for every term that falls inside the range. The more unique terms in the range, the slower the filter, so string ranges are fine on low-cardinality fields but can be slow when there are many unique terms.

Dealing with Null Values

Think back to our earlier example of documents with a multivalue tags field. A document may have one tag, many tags, or no tags at all. If a field has no values, how is it stored in an inverted index?

That's a trick question: it isn't stored at all. Look at the inverted index from the previous section:

Token         | DocIDs
-----------------------
open_source   | 2
search        | 1, 2

How would you store a field that doesn't exist in that data structure? You can't! An inverted index is simply a list of tokens and the documents that contain them. If a field holds no tokens, it has no representation in the inverted index. Ultimately, this means that a null, an empty array [], and an array of nulls [null] are all equivalent: none of them exists in the inverted index.

Clearly, the world is not that simple: data is often missing fields or contains explicit nulls and empty arrays. To cope, Elasticsearch provides tools for working with null or missing values.

exists Filter

The first tool in our arsenal is the exists filter, which returns any documents that have a value in the specified field. Let's index some example documents with the tags field:

POST /my_index/posts/_bulk
{ "index": { "_id": "1"              }}
{ "tags" : ["search"]                }  <1>
{ "index": { "_id": "2"              }}
{ "tags" : ["search", "open_source"] }  <2>
{ "index": { "_id": "3"              }}
{ "other_field" : "some data"        }  <3>
{ "index": { "_id": "4"              }}
{ "tags" : null                      }  <4>
{ "index": { "_id": "5"              }}
{ "tags" : ["search", null]          }  <5>

<1> The tags field has one value.
<2> The tags field has two values.
<3> The tags field is missing altogether.
<4> The tags field is set to null.
<5> The tags field has one real value and one null.

The resulting inverted index for the tags field looks like this:

Token         | DocIDs
-----------------------
open_source   | 2
search        | 1, 2, 5

Our objective is to find all documents where a tag is set--any tag, so long as it exists. In SQL we would use an IS NOT NULL query:

SELECT tags
FROM   posts

WHERE  tags IS NOT NULL

In Elasticsearch, we use the exists filter:

GET /my_index/posts/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "exists" : { "field" : "tags" }
            }
        }
    }
}

The query returns three documents:

"hits" : [
    {
      "_id" :     "1",
      "_score" :  1.0,
      "_source" : { "tags" : ["search"] }
    },
    {
      "_id" :     "5",
      "_score" :  1.0,
      "_source" : { "tags" : ["search", null] } <1>
    },
    {
      "_id" :     "2",
      "_score" :  1.0,
      "_source" : { "tags" : ["search", "open source"] }
    }
]

<1> Document 5 is returned even though it contains a null: the field also holds a real tag, so the null has no effect on the filter.

The results are easy to understand: any document with at least one term in the tags field is returned. The only two documents excluded are documents 3 and 4.

missing Filter

The missing filter is essentially the inverse of exists: it returns documents where the specified field has no values, just like this SQL:

SELECT tags
FROM   posts
WHERE  tags IS NULL

Let's swap the exists filter for a missing filter in our previous example:

GET /my_index/posts/_search
{
    "query" : {
        "filtered" : {
            "filter": {
                "missing" : { "field" : "tags" }
            }
        }
    }
}

And, as expected, we get back the two docs that have no real values in the tags field:

"hits" : [
    {
      "_id" :     "3",
      "_score" :  1.0,
      "_source" : { "other_field" : "some data" }
    },
    {
      "_id" :     "4",
      "_score" :  1.0,
      "_source" : { "tags" : null }
    }
]

When null Means null

Sometimes you need to distinguish between a field that has no value and a field that has been explicitly set to null. With the default behavior seen above, that is impossible: the distinction is lost. Fortunately, you can replace explicit null values with a placeholder of your choosing.

When mapping a string, numeric, Boolean, or date field, you can set a null_value to be used whenever an explicit null is encountered; a field with no value at all is still ignored. A suitable null_value also lets you explicitly search and filter for "null" documents. The null_value must be of the same datatype as the field--you can't put a string null_value on a date field--and it affects only how data is indexed; it never changes the _source document.

exists/missing on Objects

The exists and missing filters also work on inner objects, not just core fields. With the following document:

{
   "name" : {
      "first" : "John",
      "last" :  "Smith"
   }
}

you can check for the existence not only of name.first and name.last but also of name. However, as discussed in the mapping chapter, an inner object is flattened internally into a simple field-value structure:

{
   "name.first" : "John",
   "name.last"  : "Smith"
}

So how can an exists or missing filter work on the name field, which doesn't really exist in the inverted index? The reason is that a filter like

{
    "exists" : { "field" : "name" }
}

is really executed as

{
    "bool": {
        "should": [
            { "exists": { "field": "name.first" }},
            { "exists": { "field": "name.last"  }}
        ]
    }
}

That also means that if name.first and name.last were both empty, the name field would be considered nonexistent.

All About Caching

Earlier in this chapter we described how filters are calculated: at their heart is a bitset representing which documents match. Elasticsearch aggressively caches these bitsets, so that whenever the same filter is used again, the bitset can be reused without reevaluating the filter.

These cached bitsets are "smart": they are updated incrementally. As you index new documents, only those new documents need to be added to the existing bitsets; the cached filter is never recomputed from scratch. Filters are real-time like the rest of the system, and you don't need to worry about cache expiry.

Independent Filter Caching

Each filter is calculated and cached independently of where it is used. If two different queries use the same filter, the same bitset is reused; if one query uses the same filter in several places, only one bitset is computed. Consider this example, which looks for emails that are either of the following:

In the inbox and not marked as read
Not in the inbox but marked as important

"bool": {
   "should": [
      { "bool": {
         "must": [
            { "term": { "folder": "inbox" }}, <1>
            { "term": { "read": false }}
         ]
      }},
      { "bool": {
         "must_not": {
            "term": { "folder": "inbox" } <1>
         },
         "must": {
            "term": { "important": true }
         }
      }}
   ]
}

<1> These two filters are identical and will use the same bitset.

Even though one inbox clause is in a must and the other in a must_not, the clauses themselves are identical: the bitset is calculated once for whichever clause executes first and then reused by the other. By the second time this query runs, the inbox filter is already cached, so both clauses use the cached bitset.

This ties in nicely with the composability of the query DSL: filters can be moved around or reused in multiple places within a query, and that isn't just convenient for the developer--it has direct performance benefits.

Controlling Caching

Most leaf filters--those dealing directly with fields, like the term filter--are cached, while compound filters, like bool, are not. Leaf filters must consult the inverted index on disk, so caching them makes sense. Compound filters merely combine the bitsets of their inner clauses with fast bit logic, so recomputing them every time is cheap.

Certain leaf filters, however, are not cached by default, because caching them makes no sense:

Script filters

The results of script filters cannot be cached, because the meaning of the script is opaque to Elasticsearch.

Geo filters

The geolocation filters, covered in detail in the geolocation chapters, are usually used to filter results relative to a specific user's location. Since each user's location is unique, geo filter bitsets are unlikely ever to be reused, so they are not cached.

Date ranges

Date ranges that use the now function (for example, "now-1h") produce values accurate to the millisecond. Every time the filter runs, now is a new time, so older bitsets would never be reused; caching is disabled by default. However, when now is used with rounding (for example, now/d rounds to the nearest day), caching is enabled by default.

Sometimes the default caching behavior is wrong for you. Perhaps a complicated bool expression is reused many times in the same query, or a filter on a date field will never be reused. The default can be overridden on almost any filter with the _cache flag:

{
    "range" : {
        "timestamp" : {
            "gt" : "2014-01-02 16:15:14" <1>
        },
        "_cache": false <2>
    }
}

<1> It is unlikely that we will reuse this exact timestamp.
<2> Disable caching of this filter.

Later chapters give examples of when it makes sense to override the default caching behavior.

Filter Order

The order of filters in a bool clause matters for performance. More-specific filters should be placed before less-specific filters, so that as many documents as possible are excluded as early as possible: if Clause A could match 10 million documents and Clause B only 100, Clause B should come first.

Cached filters are very fast, so they should also come before noncached filters. Imagine an index holding one month's worth of log events, when we are mostly interested in events from the previous hour:

GET /logs/2014-01/_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "range" : {
                    "timestamp" : {
                        "gt" : "now-1h"
                    }
                }
            }
        }
    }
}

This filter is not cached, because it uses now, whose value changes every millisecond. That means we examine a full month of log events every time we run this query!

We can make this far more efficient by combining it with a cached filter that excludes most of the month's data using a fixed point in time, such as midnight last night:

"bool": {
    "must": [
        { "range" : {
            "timestamp" : {
                "gt" : "now-1h/d" <1>
            }
        }},
        { "range" : {
            "timestamp" : {
                "gt" : "now-1h" <2>
            }
        }}
    ]
}

<1> This filter is cached, because it uses now rounded to midnight.
<2> This filter is not cached, because it uses now without rounding.

The now-1h/d clause rounds to the previous midnight, excluding all documents older than today; its bitset is cached because the rounded value changes only once a day. The now-1h clause is not cached, because now is accurate to the millisecond--but thanks to the first filter, it only has to check today's documents.

The order of these clauses is important: this works only because the since-midnight clause comes first, so that all older documents have already been excluded by the time the more costly clause runs. Clauses are checked one at a time, in order, rather than each being evaluated independently against every document--which is exactly why ordering matters.

[[full-text-search]]
== Full-Text Search

Now that we have covered the simple case of searching for structured data, it is time to ((("full text search")))explore full-text search: how to search within full-text fields in order to find the most relevant documents.

The two most important aspects of ((("relevance")))full-text search are as follows:

Relevance::
The ability to rank results by how relevant they are to the given query, whether relevance is calculated using TF/IDF (see <<relevance-intro>>), proximity to a geolocation, fuzzy similarity, or some other algorithm.

Analysis::
The process of converting a block of text into distinct, normalized tokens (see <<analysis-intro>>) in order to (a) create an inverted index and (b) query the inverted index.

As soon as we talk ((("analysis")))about either relevance or analysis, we are in the territory of queries, rather than filters.

[[term-vs-full-text]]
=== Term-Based Versus Full-Text

While all queries perform some sort of relevance calculation, not all queries have an analysis phase.((("full text search", "term-based versus")))((("term-based queries"))) Besides specialized queries like the bool or function_score queries, which don't operate on text at all, textual queries can be broken down into two families:

Term-based queries::
+
Queries like the term or fuzzy queries are low-level queries that have no analysis phase.((("fuzzy queries"))) They operate on a single term. A term query for the term Foo looks for that exact term in the inverted index and calculates the TF/IDF relevance _score for each document that contains the term.
+
It is important to remember that the term query looks in the inverted index for the exact term only; it won't match any variants like foo or FOO. It doesn't matter how the term came to be in the index, just that it is. If you were to index ["Foo","Bar"] into an exact value not_analyzed field, or Foo Bar into an analyzed field with the whitespace analyzer, both would result in having the two terms Foo and Bar in the inverted index.

Full-text queries::
+
Queries like the match or query_string queries are high-level queries that understand the mapping of a field:
+
* If you use them to query a date or integer field, they will treat the query string as a date or integer, respectively.
* If you query an exact value (not_analyzed) string field,((("not_analyzed string fields", "match or query-string queries on"))) they will treat the whole query string as a single term.
* But if you query a full-text (analyzed) field,((("analyzed fields", "match or query-string queries on"))) they will first pass the query string through the appropriate analyzer to produce the list of terms to be queried.

Once the query has assembled a list of terms, it executes the appropriate low-level query for each of these terms, and then combines their results to produce the final relevance score for each document. We will discuss this process in more detail in the following chapters.

You seldom need to use the term-based queries directly. Usually you want to query full text, not individual terms, and this is easier to do with the high-level full-text queries (which end up using term-based queries internally).

[NOTE]
If you do find yourself wanting to use a query on an exact value not_analyzed field, ((("exact values", "not_analyzed fields, querying")))think about whether you really want a query or a filter. Single-term queries usually represent binary yes/no questions and are almost always better expressed as a ((("filters", "single-term queries better expressed as")))filter, so that they can benefit from <>:

[source,js]
GET /_search
{
    "query": {
        "filtered": {
            "filter": {
                "term": { "gender": "female" }
            }
        }
    }
}
====

[[match-query]]
=== The match Query

The match query is the go-to query--the first query that you should reach for whenever you need to query any field.((("match query")))((("full text search", "match query"))) It is a high-level full-text query, meaning that it knows how to deal with both full-text fields and exact-value fields.

That said, the main use case for the match query is for full-text search. So let's take a look at how full-text search works with a simple example.

[[match-test-data]]
==== Index Some Data

First, we'll create a new index and index some((("full text search", "match query", "indexing data"))) documents using the <>:

[source,js]
DELETE /my_index <1>

PUT /my_index
{ "settings": { "number_of_shards": 1 }} <2>

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "The quick brown fox" }
{ "index": { "_id": 2 }}
{ "title": "The quick brown fox jumps over the lazy dog" }
{ "index": { "_id": 3 }}
{ "title": "The quick brown fox jumps over the quick dog" }
{ "index": { "_id": 4 }}
{ "title": "Brown fox brown dog" }

// SENSE: 100_Full_Text_Search/05_Match_query.json

<1> Delete the index in case it already exists.
<2> Later, in <>, we explain why we created this index with only one primary shard.

==== A Single-Word Query

Our first example explains what((("full text search", "match query", "single word query")))((("match query", "single word query"))) happens when we use the match query to search within a full-text field for a single word:

[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": { "title": "QUICK!" }
    }
}

// SENSE: 100_Full_Text_Search/05_Match_query.json

Elasticsearch executes the preceding match query((("analysis", "in single term match query"))) as follows:

1. Check the field type.
+
The title field is a full-text (analyzed) string field, which means that the query string should be analyzed too.

2. Analyze the query string.
+
The query string QUICK! is passed through the standard analyzer, which results in the single term quick. Because we have just a single term, the match query can be executed as a single low-level term query.

3. Find matching docs.
+
The term query looks up quick in the inverted index and retrieves the list of documents that

contain that term--in this case, documents 1, 2, and 3.

4. Score each doc.
+
The term query calculates the relevance _score for each matching document, by combining the((("relevance scores", "calculating for single term match query results"))) term frequency (how often quick appears in the title field of each document), with the inverse document frequency (how often quick appears in the title field in all documents in the index), and the length of each field (shorter fields are considered more relevant). See <>.

This process gives us the following (abbreviated) results:

[source,js]
"hits": [
 {
    "_id":      "1",
    "_score":   0.5, <1>
    "_source": { "title": "The quick brown fox" }
 },
 {
    "_id":      "3",
    "_score":   0.44194174, <2>
    "_source": { "title": "The quick brown fox jumps over the quick dog" }
 },
 {
    "_id":      "2",
    "_score":   0.3125, <2>
    "_source": { "title": "The quick brown fox jumps over the lazy dog" }
 }
]

<1> Document 1 is most relevant because its title field is short, which means that quick represents a large portion of its content.
<2> Document 3 is more relevant than document 2 because quick appears twice.

[[match-multi-word]]
=== Multiword Queries

If we could search for only one word at a time, full-text search would be pretty inflexible. Fortunately, the match query((("full text search", "multi-word queries")))((("match query", "multi-word query"))) makes multiword queries just as simple:

[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": { "title": "BROWN DOG!" }
    }
}

// SENSE: 100_Full_Text_Search/05_Match_query.json

The preceding query returns all four documents in the results list:

[source,js]
{
  "hits": [
     {
        "_id":      "4",
        "_score":   0.73185337, <1>
        "_source": { "title": "Brown fox brown dog" }
     },
     {
        "_id":      "2",
        "_score":   0.47486103, <2>
        "_source": { "title": "The quick brown fox jumps over the lazy dog" }
     },
     {
        "_id":      "3",
        "_score":   0.47486103, <2>
        "_source": { "title": "The quick brown fox jumps over the quick dog" }
     },
     {
        "_id":      "1",
        "_score":   0.11914785, <3>
        "_source": { "title": "The quick brown fox" }
     }
  ]
}

<1> Document 4 is the most relevant because it contains "brown" twice and "dog" once.
<2> Documents 2 and 3 both contain brown and dog once each, and the title field is the same length in both docs, so they have the same score.
<3> Document 1 matches even though it contains only brown, not dog.

Because the match query has to look for two terms— ["brown","dog"] —internally it has to execute two term queries and combine their individual results into the overall result. To do this, it wraps the two term queries in a bool query, which we examine in detail in <>.

The important thing to take away from this is that any document whose title field contains at least one of the specified terms will match the query. The more terms that match, the more relevant the document.

[[match-improving-precision]]
==== Improving Precision

Matching any document that contains any of the query terms may result in a long tail of seemingly irrelevant results.((("full text search", "multi-word queries", "improving precision")))((("precision", "improving for full text search multi-word queries"))) It's a shotgun approach to search. Perhaps we want to show only documents that contain all of the query terms. In other words, instead of brown OR dog, we want to return only documents that match brown AND dog.

The match query accepts an operator parameter((("match query", "operator parameter")))((("or operator", "in match queries")))((("and operator", "in match queries"))) that defaults to or. You can change it to and to require that all specified terms must match:

[source,js]
GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "title": { <1>
                "query":    "BROWN DOG!",
                "operator": "and"
            }
        }
    }

}

// SENSE: 100_Full_Text_Search/05_Match_query.json

<1> The structure of the match query has to change slightly in order to accommodate the operator parameter.

This query would exclude document 1, which contains only one of the two terms.

[[match-precision]]
==== Controlling Precision

The choice between all and any is a bit((("full text search", "multi-word queries", "controlling precision"))) too black-or-white. What if the user specified five query terms, and a document contains only four of them? Setting operator to and would exclude this document.

Sometimes that is exactly what you want, but for most full-text search use cases, you want to include documents that may be relevant but exclude those that are unlikely to be relevant. In other words, we need something in-between.

The match query supports((("match query", "minimum_should_match parameter")))((("minimum_should_match parameter"))) the minimum_should_match parameter, which allows you to specify the number of terms that must match for a document to be considered relevant. While you can specify an absolute number of terms, it usually makes sense to specify a percentage instead, as you have no control over the number of words the user may enter:

[source,js]
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query":                "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}

// SENSE: 100_Full_Text_Search/05_Match_query.json

When specified as a percentage, minimum_should_match does the right thing: in the preceding example with three terms, 75% would be rounded down to 66.6%, or two out of the three terms. No matter what you set it to, at least one term must match for a document to be considered a match.

[NOTE]
The minimum_should_match parameter is flexible, and different rules can be applied depending on the number of terms the user enters. For the full documentation see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html#query-dsl-minimum-should-match

To fully understand how the match query handles multiword queries, we need to look at how to combine multiple queries with the bool query.

[[bool-query]]
=== Combining Queries

In <> we discussed how((("full text search", "combining queries"))) to use the `bool` filter to combine multiple filter clauses with `and`, `or`, and `not` logic. In query land, the `bool` query does a similar job but with one important difference.

Filters make a binary decision: should this document be included in the results list or not? Queries, however, are more subtle. They decide not only whether to include a document, but also how relevant that document is.

Like the filter equivalent, the `bool` query accepts((("bool query"))) multiple query clauses under the `must`, `must_not`, and `should` parameters. For instance:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must":     { "match": { "title": "quick" }},
      "must_not": { "match": { "title": "lazy"  }},
      "should": [
                  { "match": { "title": "brown" }},
                  { "match": { "title": "dog"   }}
      ]
    }
  }
}
--------------------------------------------------
// SENSE: 100_Full_Text_Search/15_Bool_query.json

The results from the preceding query include any document whose `title` field contains the term `quick`, except for those that also contain `lazy`. So far, this is pretty similar to how the `bool` filter works.

The difference comes in with the two `should` clauses, which say that a document is not required to contain((("should clause", "in bool queries"))) either `brown` or `dog`, but if it does, then it should be considered more relevant:

[source,js]
--------------------------------------------------
{
  "hits": [
     {
        "_id":      "3",
        "_score":   0.70134366, <1>
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id":      "1",
        "_score":   0.3312608,
        "_source": {
           "title": "The quick brown fox"
        }
     }
  ]
}
--------------------------------------------------
<1> Document 3 scores higher because it contains both `brown` and `dog`.

==== Score Calculation

The `bool` query calculates((("relevance scores", "calculation in bool queries")))((("bool query", "score calculation"))) the relevance `_score` for each document by adding together the `_score` from all of the matching `must` and `should` clauses, and then dividing by the total number of `must` and `should` clauses. The `must_not` clauses do not affect((("must_not clause", "in bool queries"))) the score; their only purpose is to exclude documents that might otherwise have been included.

==== Controlling Precision

All the `must` clauses must match, and all the `must_not` clauses must not match, but how many `should` clauses((("bool query", "controlling precision")))((("full text search", "combining queries", "controlling precision")))((("precision", "controlling for bool query"))) should match? By default, none of the `should` clauses are required to match, with one exception: if there are no `must` clauses, then at least one `should` clause must match.

Just as we can control the <<match-precision,precision of the `match` query>>, we can control how many `should` clauses need to match by using the `minimum_should_match` parameter,((("minimum_should_match parameter", "in bool queries"))) either as an absolute number or as a percentage:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "brown" }},
        { "match": { "title": "fox"   }},
        { "match": { "title": "dog"   }}
      ],
      "minimum_should_match": 2 <1>
    }
  }
}
--------------------------------------------------
// SENSE: 100_Full_Text_Search/15_Bool_query.json
<1> This could also be expressed as a percentage.

The results would include only documents whose `title` field contains `"brown" AND "fox"`, `"brown" AND "dog"`, or `"fox" AND "dog"`. If a document contains all three, it would be considered more relevant than those that contain just two of the three.
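To make the score calculation described above concrete, here is a rough worked example for the preceding query, following the simplified description in "Score Calculation." The individual clause scores are made up for illustration; real scores come from TF/IDF and other factors. A document matching the `brown` and `fox` clauses but not the `dog` clause would score roughly:

--------------------------------------------------
score("brown") = 0.3     (clause matches)
score("fox")   = 0.4     (clause matches)
score("dog")   = 0.0     (clause does not match)

_score = (0.3 + 0.4 + 0.0) / 3 = 0.233...
--------------------------------------------------

The sum of the matching clause scores is divided by the total number of `must` and `should` clauses, here three, so documents matching more clauses end up with a higher `_score`.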

=== How match Uses bool

By now, you have probably realized that <<match-multi-word,multiword `match` queries>> simply wrap((("match query", "use of bool query in multi-word searches")))((("bool query", "use by match query in multi-word searches")))((("full text search", "how match query uses bool query"))) the generated `term` queries in a `bool` query.

With the default `or` operator, each `term` query is added as a `should` clause, so at least one clause must match. These two queries are equivalent:

[source,js]
--------------------------------------------------
{
    "match": { "title": "brown fox"}
}
--------------------------------------------------

[source,js]
--------------------------------------------------
{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }}
    ]
  }
}
--------------------------------------------------

With the `and` operator, all the `term` queries are added as `must` clauses, so all clauses must match. These two queries are equivalent:

[source,js]
--------------------------------------------------
{
    "match": {
        "title": {
            "query":    "brown fox",
            "operator": "and"
        }
    }
}
--------------------------------------------------

[source,js]
--------------------------------------------------
{
  "bool": {
    "must": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }}
    ]
  }
}
--------------------------------------------------

And if the `minimum_should_match` parameter is((("minimum_should_match parameter", "match query using bool query"))) specified, it is passed directly through to the `bool` query, making these two queries equivalent:

[source,js]
--------------------------------------------------
{
    "match": {
        "title": {
            "query":                "quick brown fox",
            "minimum_should_match": "75%"
        }
    }
}
--------------------------------------------------

[source,js]
--------------------------------------------------
{
  "bool": {
    "should": [
      { "term": { "title": "brown" }},
      { "term": { "title": "fox"   }},
      { "term": { "title": "quick" }}
    ],
    "minimum_should_match": 2 <1>
  }
}
--------------------------------------------------
<1> Because there are only three clauses, the `minimum_should_match` value of `75%` in the `match` query is rounded down to `2`. At least two out of the three `should` clauses must match.

Of course, we would normally write these types of queries by using the `match` query, but understanding how the `match` query works internally lets you take control of the process when you need to. Some things can't be done with a single `match` query, such as giving more weight to some query terms than to others. We will look at an example of this in the next section.
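If you want to see this rewriting for yourself, one quick sketch: the validate-query API, which we use again later in this chapter, can explain how a query is parsed. The index and type names here are the same hypothetical ones used above:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_validate/query?explain
{
  "query": {
    "match": {
      "title": "brown fox"
    }
  }
}
--------------------------------------------------

The `explanation` in the response shows the low-level term queries that the `match` query generated, along the lines of `title:brown title:fox`.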

=== Boosting Query Clauses

Of course, the `bool` query isn't restricted((("full text search", "boosting query clauses"))) to combining simple one-word `match` queries. It can combine any other query, including other `bool` queries.((("relevance scores", "controlling weight of query clauses"))) It is commonly used to fine-tune the relevance `_score` for each document by combining the scores from several distinct queries.

Imagine that we want to search for documents((("bool query", "boosting weight of query clauses")))((("weight", "controlling for query clauses"))) about "full-text search," but we want to give more weight to documents that also mention "Elasticsearch" or "Lucene." By more weight, we mean that documents mentioning "Elasticsearch" or "Lucene" will receive a higher relevance `_score` than those that don't, which means that they will appear higher in the list of results.

A simple `bool` query allows us to write this fairly complex logic as follows:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "content": { <1>
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [ <2>
                { "match": { "content": "Elasticsearch" }},
                { "match": { "content": "Lucene"        }}
            ]
        }
    }
}
--------------------------------------------------
// SENSE: 100_Full_Text_Search/25_Boost.json
<1> The `content` field must contain all of the words `full`, `text`, and `search`.
<2> If the `content` field also contains `Elasticsearch` or `Lucene`, the document will receive a higher `_score`.

The more `should` clauses that match, the more relevant the document. So far, so good.

But what if we want to give more weight to the docs that contain `Lucene` and even more weight to the docs containing `Elasticsearch`? We can control((("boost parameter"))) the relative weight of any query clause by specifying a `boost` value, which defaults to `1`. A `boost` value greater than `1` increases the relative weight of that clause. So we could rewrite the preceding query as follows:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "bool": {
            "must": {
                "match": {  <1>
                    "content": {
                        "query":    "full text search",
                        "operator": "and"
                    }
                }
            },
            "should": [
                { "match": {
                    "content": {
                        "query": "Elasticsearch",
                        "boost": 3 <2>
                    }
                }},
                { "match": {
                    "content": {
                        "query": "Lucene",
                        "boost": 2 <3>
                    }
                }}
            ]
        }
    }
}
--------------------------------------------------
// SENSE: 100_Full_Text_Search/25_Boost.json
<1> These clauses use the default `boost` of `1`.
<2> This clause is the most important, as it has the highest `boost`.
<3> This clause is more important than the default, but not as important as the `Elasticsearch` clause.

[NOTE]
[[boost-normalization]]
====
The `boost` parameter is used to increase((("boost parameter", "score normalized after boost applied"))) the relative weight of a clause (with a `boost` greater than `1`) or decrease the relative weight (with a `boost` between `0` and `1`), but the increase or decrease is not linear. In other words, a `boost` of `2` does not result in double the `_score`.

Instead, the new `_score` is normalized after((("normalization", "score normalized after boost applied"))) the `boost` is applied. Each type of query has its own normalization algorithm, and the details are beyond the scope of this book. Suffice to say that a higher `boost` value results in a higher `_score`.

If you are implementing your own scoring model not based on TF/IDF and you need more control over the boosting process, you can use the <> to((("function_score query"))) manipulate a document's boost without the normalization step.
====

We present other ways of combining queries in the next chapter, <>. But first, let's take a look at the other important feature of queries: text analysis.
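One brief aside before we do. The note above mentions the `function_score` query, which is covered in detail later in the book; the following is only a forward-looking sketch of what it might look like, reusing the `content` field from the examples above and assuming a release recent enough to support the `weight` function:

[source,js]
--------------------------------------------------
GET /_search
{
    "query": {
        "function_score": {
            "query": {
                "match": { "content": "full text search" }
            },
            "functions": [
                {
                    "filter": { "term": { "content": "elasticsearch" }},
                    "weight": 3
                }
            ],
            "boost_mode": "multiply"
        }
    }
}
--------------------------------------------------

Note that the `term` filter looks up the exact term in the inverted index, which is why `elasticsearch` appears here in lowercase: the analyzed field stores lowercased terms.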

=== Controlling Analysis

Queries can find only terms that actually((("full text search", "controlling analysis")))((("analysis", "controlling"))) exist in the inverted index, so it is important to ensure that the same analysis process is applied both to the document at index time, and to the query string at search time so that the terms in the query match the terms in the inverted index.

Although we say _document_, analyzers are determined per field.((("analyzers", "determined per-field"))) Each field can have a different analyzer, either by configuring a specific analyzer for that field or by falling back on the type, index, or node defaults. At index time, a field's value is analyzed by using the configured or default analyzer for that field.

For instance, let's add a new field to `my_index`:

[source,js]
--------------------------------------------------
PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "english_title": {
                "type":     "string",
                "analyzer": "english"
            }
        }
    }
}
--------------------------------------------------
// SENSE: 100_Full_Text_Search/30_Analysis.json

Now we can compare how values in the `english_title` field and the `title` field are analyzed at index time by using the analyze API to analyze the word `Foxes`:

[source,js]
--------------------------------------------------
GET /my_index/_analyze?field=my_type.title   <1>
Foxes

GET /my_index/_analyze?field=my_type.english_title <2>
Foxes
--------------------------------------------------
// SENSE: 100_Full_Text_Search/30_Analysis.json
<1> Field `title`, which uses the default `standard` analyzer, will return the term `foxes`.
<2> Field `english_title`, which uses the `english` analyzer, will return the term `fox`.

This means that, were we to run a low-level `term` query for the exact term `fox`, the `english_title` field would match but the `title` field would not.

High-level queries like the `match` query understand field mappings and can apply the correct analyzer for each field being queried.((("match query", "applying appropriate analyzer to each field"))) We can see this in action with((("validate query API"))) the validate-query API:

[source,js]
--------------------------------------------------
GET /my_index/my_type/_validate/query?explain
{
    "query": {
        "bool": {
            "should": [
                { "match": { "title":         "Foxes"}},
                { "match": { "english_title": "Foxes"}}
            ]
        }
    }
}
--------------------------------------------------
// SENSE: 100_Full_Text_Search/30_Analysis.json

which returns this `explanation`:

    (title:foxes english_title:fox)

The `match` query uses the appropriate analyzer for each field to ensure that it looks for each term in the correct format for that field.

==== Default Analyzers

While we can specify an analyzer at the field level,((("full text search", "controlling analysis", "default analyzers")))((("analyzers", "default"))) how do we determine which analyzer is used for a field if none is specified at the field level?

Analyzers can be specified at several levels. Elasticsearch works through each level until it finds an analyzer that it can use. At index time, the order((("indexing", "applying analyzers"))) is as follows:

* The `analyzer` defined in the field mapping, else
* _The analyzer defined in the `_analyzer` field of the document_, else
* The default `analyzer` for the `type`, which defaults to
* The analyzer named `default` in the index settings, which defaults to
* The analyzer named `default` at node level, which defaults to
* The `standard` analyzer

At search time, the((("searching", "applying analyzers"))) sequence is slightly different:

* _The `analyzer` defined in the query itself_, else
* The `analyzer` defined in the field mapping, else
* The default `analyzer` for the `type`, which defaults to
* The analyzer named `default` in the index settings, which defaults to
* The analyzer named `default` at node level, which defaults to
* The `standard` analyzer

[NOTE]
====
The two lines in italics in the preceding lists highlight differences in the index time sequence and the search time sequence. The `_analyzer` field allows you to specify a default analyzer for each document (for example, `english`, `french`, `spanish`) while the `analyzer` parameter in the query specifies which analyzer to use on the query string. However, this is not the best way to handle multiple languages in a single index because of the pitfalls highlighted in <>.
====

Occasionally, it makes sense to use a different analyzer at index and search time.((("analyzers", "using different analyzers at index and search time"))) For instance, at index time we may want to index synonyms (for example, for every occurrence of `quick`, we also index `fast`, `rapid`, and `speedy`). But at search time, we don't need to search for all of these synonyms. Instead we can just look up the single word that the user has entered, be it `quick`, `fast`, `rapid`, or `speedy`.

To enable this distinction, Elasticsearch also supports((("index_analyzer parameter")))((("search_analyzer parameter"))) the `index_analyzer` and `search_analyzer` parameters, and((("default_search parameter")))((("default_index analyzer"))) analyzers named `default_index` and `default_search`.

Taking these extra parameters into account, the full sequence at index time really looks like this:

* The `index_analyzer` defined in the field mapping, else
* The `analyzer` defined in the field mapping, else
* The analyzer defined in the `_analyzer` field of the document, else
* The default `index_analyzer` for the `type`, which defaults to
* The default `analyzer` for the `type`, which defaults to
* The analyzer named `default_index` in the index settings, which defaults to
* The analyzer named `default` in the index settings, which defaults to
* The analyzer named `default_index` at node level, which defaults to
* The analyzer named `default` at node level, which defaults to
* The `standard` analyzer

And at search time:

* The `analyzer` defined in the query itself, else
* The `search_analyzer` defined in the field mapping, else
* The `analyzer` defined in the field mapping, else
* The default `search_analyzer` for the `type`, which defaults to
* The default `analyzer` for the `type`, which defaults to
* The analyzer named `default_search` in the index settings, which defaults to
* The analyzer named `default` in the index settings, which defaults to
* The analyzer named `default_search` at node level, which defaults to
* The analyzer named `default` at node level, which defaults to
* The `standard` analyzer

==== Configuring Analyzers in Practice

The sheer number of places where you can specify an analyzer is quite overwhelming.((("full text search", "controlling analysis", "configuring analyzers in practice")))((("analyzers", "configuring in practice"))) In practice, though, it is pretty simple.

===== Use index settings, not config files

The first thing to remember is that, even though you may start out using Elasticsearch for a single purpose or a single application such as logging, chances are that you will find more use cases and end up running several distinct applications on the same cluster. Each index needs to be independent and independently configurable. You don't want to set defaults for one use case, only to have to override them for another use case later.

This rules out configuring analyzers at the node level. Additionally, configuring analyzers at the node level requires changing the config file on every node and restarting every node, which becomes a maintenance nightmare. It's a much better idea to keep Elasticsearch running and to manage settings only via the API.

===== Keep it simple

Most of the time, you will know what fields your documents will contain ahead of time. The simplest approach is to set the analyzer for each full-text field when you create your index or add type mappings. While this approach is slightly more verbose, it enables you to easily see which analyzer is being applied to each field.

Typically, most of your string fields will be exact-value `not_analyzed` fields such as tags or enums, plus a handful of full-text fields that will use some default analyzer like `standard` or `english` or some other language. Then you may have one or two fields that need custom analysis: perhaps the `title` field needs to be indexed in a way that supports find-as-you-type.

You can set the `default` analyzer in the index to the analyzer you want to use for almost all full-text fields, and just configure the specialized analyzer on the one or two fields that need it. If, in your model, you need a different default analyzer per type, then use the type-level `analyzer` setting instead.

[NOTE]
====
A common workflow for time-based data like logging is to create a new index per day on the fly by just indexing into it. While this workflow prevents you from creating your index up front, you can still use http://bit.ly/1ygczeq[index templates] to specify the settings and mappings that a new index should have.
====
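As an illustrative sketch of that workflow (the template name, index pattern, and field names here are hypothetical, not from the original text), an index template might look like this:

[source,js]
--------------------------------------------------
PUT /_template/my_logging_template
{
    "template": "logs-*",
    "settings": {
        "number_of_shards": 1
    },
    "mappings": {
        "log": {
            "properties": {
                "message": {
                    "type":     "string",
                    "analyzer": "english"
                },
                "level": {
                    "type":  "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}
--------------------------------------------------

Any new index whose name matches `logs-*` (for example, one created automatically by indexing into `logs-2014-09-14`) would pick up these settings and mappings at creation time.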

[[relevance-is-broken]]
=== Relevance Is Broken!

Before we move on to discussing more-complex queries in <>, let's make a quick detour to explain why we <> with just one primary shard.

Every now and again a new user opens an issue claiming that sorting by relevance((("relevance", "differences in IDF producing incorrect results"))) is broken and offering a short reproduction: the user indexes a few documents, runs a simple query, and finds apparently less-relevant results appearing above more-relevant results.

To understand why this happens, let's imagine that we create an index with two primary shards and we index ten documents, six of which contain the word `foo`. It may happen that shard 1 contains three of the `foo` documents and shard 2 contains the other three. In other words, our documents are well distributed.

In <>, we described the default similarity algorithm used in Elasticsearch,((("Term Frequency/Inverse Document Frequency (TF/IDF) similarity algorithm"))) called term frequency / inverse document frequency or TF/IDF. Term frequency counts the number of times a term appears within the field we are querying in the current document. The more times it appears, the more relevant is this document. The inverse document frequency takes((("inverse document frequency")))((("IDF", see="inverse document frequency"))) into account how often a term appears as a percentage of all the documents in the index. The more frequently the term appears, the less weight it has.

However, for performance reasons, Elasticsearch doesn't calculate the IDF across all documents in the index.((("shards", "local inverse document frequency (IDF)"))) Instead, each shard calculates a local IDF for the documents contained in that shard.

Because our documents are well distributed, the IDF for both shards will be the same. Now imagine instead that five of the `foo` documents are on shard 1, and the sixth document is on shard 2. In this scenario, the term `foo` is very common on one shard (and so of little importance), but rare on the other shard (and so much more important). These differences in IDF can produce incorrect results.

In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.

For testing purposes, there are two ways we can work around this issue. The first is to create an index with one primary shard, as we did in the section introducing the <>. If you have only one shard, then the local IDF _is_ the global IDF.

The second workaround is to add `?search_type=dfs_query_then_fetch` to your search requests. The `dfs` stands((("search_type", "dfs_query_then_fetch")))((("dfs_query_then_fetch search type")))((("DFS (Distributed Frequency Search)"))) for _Distributed Frequency Search_, and it tells Elasticsearch to first retrieve the local IDF from each shard in order to calculate the global IDF across the whole index.

TIP: Don't use `dfs_query_then_fetch` in production. It really isn't required. Just having enough data will ensure that your term frequencies are well distributed. There is no reason to add this extra DFS step to every query that you run.
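To make the two workarounds concrete, here is a minimal sketch (the index name `my_test_index` is hypothetical): either create the test index with a single primary shard, or keep the default sharding and add the `dfs_query_then_fetch` search type to the request:

[source,js]
--------------------------------------------------
PUT /my_test_index
{
    "settings": {
        "number_of_shards": 1
    }
}

GET /my_test_index/_search?search_type=dfs_query_then_fetch
{
    "query": {
        "match": { "title": "foo" }
    }
}
--------------------------------------------------

With one primary shard, the first request alone is enough; the `dfs_query_then_fetch` parameter on the second request is only needed when the test index has multiple shards.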