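Elasticsearch is an open-source, highly scalable full-text search and analytics engine. It can store large volumes of data, search and analyze them in near real time, and support complex queries.
Typical use cases for Elasticsearch include:
Online store search
Log analysis (the ELK stack)
Monitoring product price changes
Fast investigation, analysis, visualization, and ad hoc querying of large data sets
Elasticsearch is powerful yet simple to use. The rest of this article covers how to set up an Elasticsearch cluster and perform basic operations, so you can get started quickly.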
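Basic Concepts
Cluster
A cluster consists of one or more nodes and is identified by a unique name, which defaults to elasticsearch. If several Elasticsearch clusters run in the same network environment, give each one a different name: nodes join a cluster by its name, so a shared name can cause a node to join the wrong cluster.
Node
A node is a single server in a cluster. A node is also identified by a name, which defaults to a random UUID assigned at startup; the name can be set in the configuration. A cluster can contain any number of nodes, and a single node can form a cluster by itself.
Index
An index is a collection of documents. A cluster can hold as many indices as its resources allow.
Type
An index can define one or more types. A type is a logical category within an index, and documents that share a common set of fields are usually placed in the same type.
Document
A document is the basic unit of information in an index.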
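Shards and Replicas
An index can hold more data than the hardware of a single node can handle. For example, an index of one billion documents that occupies 1 TB of disk may not fit on one node, or may be too slow to serve search requests from one node alone.
To solve this problem, Elasticsearch can split an index into several pieces called shards. The number of shards is defined when the index is created. Each shard is a fully functional, independent index in its own right and can be stored on any node in the cluster.
Sharding matters for two reasons:
1. It splits the data horizontally, so an index can grow beyond a single node.
2. Operations run on the shards in parallel, which increases throughput.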
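To make the horizontal split concrete, here is a minimal sketch of how a document ID could be mapped to a shard. Elasticsearch documents its routing rule as shard = hash(routing) % number_of_primary_shards, with a Murmur3 hash of the routing value. The sketch substitutes zlib.crc32 as a stand-in hash, so the shard numbers it prints will not match a real cluster, and the function name pick_shard is hypothetical.

```python
import zlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a shard number in [0, num_primary_shards)."""
    # Stand-in hash (assumption): Elasticsearch applies Murmur3 to the
    # routing value, which defaults to the document ID. crc32 is used here
    # only because it is deterministic and in the standard library.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    for doc_id in ["1", "2", "3", "4"]:
        print(doc_id, "->", pick_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, the same ID would generally land on a different shard if that number changed.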
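In a network or cloud environment, failures can happen at any time, so a failover mechanism is essential. Elasticsearch can keep one or more copies of each shard, called replica shards, or replicas for short.
Replicas have two benefits:
1. High availability when a shard or node fails.
2. Higher search throughput, because searches can also run on replicas.
To summarize, each index can be split into multiple shards, and each shard can have multiple replicas; the original shards are called primary shards and the copies are called replica shards. Both the number of shards and the number of replicas can be specified when the index is created. The difference is that the number of primary shards cannot be changed afterwards, whereas the number of replicas can be changed dynamically.
By default in Elasticsearch 5.x, each index has 5 primary shards and 1 replica, that is, 5 primary shards plus one replica of each, for 10 shards in total.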
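The shard arithmetic can be checked with a few lines of Python. This is only a sketch of the counting rule described above; the function name total_shard_copies is hypothetical.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Return the number of shard copies for one index.

    Each primary shard exists once, plus `replicas` extra copies of it.
    """
    if primaries < 1 or replicas < 0:
        raise ValueError("primaries must be >= 1 and replicas >= 0")
    return primaries * (1 + replicas)


if __name__ == "__main__":
    # Default settings in Elasticsearch 5.x: 5 primaries, 1 replica.
    print(total_shard_copies(5, 1))
    # Raising the replica count changes the total without touching primaries.
    print(total_shard_copies(5, 2))
```

With two indices at the default settings the total is 2 × 10 = 20, which is consistent with the shards and pri columns of the _cat/health output shown later in this article.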
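Each Elasticsearch shard is a Lucene index, and a single Lucene index has an upper limit on the number of documents it can contain. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes with the _cat/shards API: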
curl -XGET gd01:9200/_cat/shards/20171229
20171229 1 p STARTED 1904509 369.5mb 132.98.16.178 data-178
20171229 1 r STARTED 1902986 383.6mb 132.98.16.176 master-176
20171229 3 r STARTED 1898048 349.7mb 132.98.16.178 data-178
20171229 3 p STARTED 1898595 492.2mb 132.98.16.177 data-177
20171229 2 r STARTED 1903094 481.2mb 132.98.16.178 data-178
20171229 2 p STARTED 1904497 526.9mb 132.98.16.176 master-176
20171229 4 p STARTED 1902180 487mb 132.98.16.178 data-178
20171229 4 r STARTED 1900635 586.9mb 132.98.16.176 master-176
20171229 0 p STARTED 1902472 421.6mb 132.98.16.177 data-177
20171229 0 r STARTED 1901511 511.8mb 132.98.16.176 master-176
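As a rough illustration of how this output could be checked against the Lucene document limit, the sketch below parses the sample rows above and reports the document count of each primary shard. It assumes the default _cat/shards column order (index, shard, prirep, state, docs, store, ip, node) and that every shard is assigned; unassigned shards have fewer columns and would need extra handling. The function names are hypothetical.

```python
LUCENE_MAX_DOCS = 2_147_483_519  # Integer.MAX_VALUE - 128

# Sample rows copied from the _cat/shards output above.
SAMPLE = """\
20171229 1 p STARTED 1904509 369.5mb 132.98.16.178 data-178
20171229 1 r STARTED 1902986 383.6mb 132.98.16.176 master-176
20171229 3 r STARTED 1898048 349.7mb 132.98.16.178 data-178
20171229 3 p STARTED 1898595 492.2mb 132.98.16.177 data-177
20171229 2 r STARTED 1903094 481.2mb 132.98.16.178 data-178
20171229 2 p STARTED 1904497 526.9mb 132.98.16.176 master-176
20171229 4 p STARTED 1902180 487mb 132.98.16.178 data-178
20171229 4 r STARTED 1900635 586.9mb 132.98.16.176 master-176
20171229 0 p STARTED 1902472 421.6mb 132.98.16.177 data-177
20171229 0 r STARTED 1901511 511.8mb 132.98.16.176 master-176
"""


def parse_shards(text):
    """Parse _cat/shards rows (default columns, assigned shards only)."""
    rows = []
    for line in text.strip().splitlines():
        index, shard, prirep, state, docs, store, ip, node = line.split()
        rows.append({
            "index": index, "shard": int(shard), "prirep": prirep,
            "state": state, "docs": int(docs), "store": store,
            "ip": ip, "node": node,
        })
    return rows


def primary_doc_counts(rows):
    """Return {shard number: document count} for primary shards only."""
    return {r["shard"]: r["docs"] for r in rows if r["prirep"] == "p"}


if __name__ == "__main__":
    counts = primary_doc_counts(parse_shards(SAMPLE))
    for shard, docs in sorted(counts.items()):
        share = docs / LUCENE_MAX_DOCS
        print(f"shard {shard}: {docs} docs ({share:.2%} of the Lucene limit)")
```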
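Installing an Elasticsearch Cluster
Elasticsearch 5.x requires JDK 1.8, so install JDK 1.8 before installing Elasticsearch.
Download the installation package
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.5.tar.gz
Extract the archive
tar -xvf elasticsearch-5.6.5.tar.gz
Start a single node
elasticsearch-5.6.5/bin/elasticsearch
Cluster configuration
Sample elasticsearch.yml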
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: es-gotcha
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: node-${HOSTNAME}
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
node.master: true
node.data: false
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/data/es
#
# Path to log files:
#
path.logs: /var/log/es
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 132.98.16.176
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: ["132.98.16.176", "132.98.16.177", "132.98.16.179", "132.98.16.180", "132.98.16.182", "132.98.16.183", "132.98.16.184"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 3
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
action.destructive_requires_name: true
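The settings that need to be configured are:
cluster.name, the cluster name, which must be identical on every node of the same cluster
node.name
node.master, whether the node is master-eligible, that is, whether it may be elected master
node.data, whether the node stores data
network.host, the address the node binds to
discovery.zen.ping.unicast.hosts, the list of hosts a node contacts to discover the cluster
discovery.zen.minimum_master_nodes, the minimum number of master-eligible nodes that must be visible before a master can be elected
Deploy Elasticsearch on several machines and start the nodes one after another. The nodes discover each other automatically and form a cluster.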
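The comment in the sample configuration gives the rule for discovery.zen.minimum_master_nodes as (total number of master-eligible nodes / 2 + 1), using integer division. A small sketch of that formula follows; the function name is hypothetical.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible // 2 + 1


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} master-eligible node(s) -> minimum_master_nodes = "
              f"{minimum_master_nodes(n)}")
```

By this formula, the value 3 used in the sample configuration corresponds to 4 or 5 master-eligible nodes.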
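Trying Out the Cluster
Elasticsearch provides a REST API and a Java API. The examples below use the REST API. With the API you can:
Check the health, status, and statistics of the cluster, nodes, and indices
Manage the cluster, nodes, index data, and metadata
Perform CRUD operations
Run advanced searches with paging, sorting, filtering, scripting, aggregations, and more
Cluster health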
curl -XGET gd01:9200/_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1514533722 15:48:42 esbds green 3 3 20 10 0 0 0 0 - 100.0%
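When the ?v flag is used, the first line of a _cat response is a header row, so the output can be turned into dictionaries with very little code. The sketch below parses the sample health output above; it assumes that no cell is empty or contains spaces, which holds for this sample but is not guaranteed for every _cat endpoint. The function name parse_cat_table is hypothetical.

```python
# Sample copied from the _cat/health?v output above.
HEALTH = """\
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1514533722 15:48:42 esbds green 3 3 20 10 0 0 0 0 - 100.0%
"""


def parse_cat_table(text):
    """Turn whitespace-separated _cat output with a header into dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        if len(cells) != len(header):
            raise ValueError(f"column count mismatch in row: {line!r}")
        rows.append(dict(zip(header, cells)))
    return rows


if __name__ == "__main__":
    health = parse_cat_table(HEALTH)[0]
    print("cluster:", health["cluster"])
    print("status:", health["status"])
    print("unassigned shards:", int(health["unassign"]))
```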
List nodes
curl -XGET gd01:9200/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
132.98.16.176 64 26 5 1.45 1.49 1.43 mdi * master-176
132.98.16.177 84 19 8 1.27 1.43 1.60 di - data-177
132.98.16.178 57 78 16 2.24 2.40 2.45 di - data-178
List indices
curl -XGET gd01:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open 20171228 ij-Y05EEQIimzEDYPyzvjw 5 1 7810000 3922446 2.6gb 1.3gb
green open 20171229 FUabFhc5TYyi4K_y81GJ9w 5 1 9905546 6122165 4.2gb 2.2gb
Create an index
curl -XPUT gd01:9200/test_idx?pretty
Response:
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "test_idx"
}
Create a document
Create a document of type external with ID 1 in the test_idx index.
curl -XPUT gd01:9200/test_idx/external/1?pretty -d '
{
"name": "John Doe"
}'
Response:
{
"_index" : "test_idx",
"_type" : "external",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 2,
"failed" : 0
},
"created" : true
}
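The same request can be prepared from Python with only the standard library. The sketch below builds the PUT request without sending it, so it runs without a cluster; the base URL http://gd01:9200 reuses the host from the curl examples, the function name build_index_request is hypothetical, and the explicit Content-Type header is an addition that the curl command above does not send.

```python
import json
import urllib.request
from urllib.parse import quote


def build_index_request(base_url, index, doc_type, doc_id, source):
    """Build (but do not send) a PUT /{index}/{type}/{id} request."""
    url = f"{base_url}/{quote(index, safe='')}/{quote(doc_type, safe='')}/{quote(doc_id, safe='')}"
    body = json.dumps(source).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_index_request(
        "http://gd01:9200", "test_idx", "external", "1", {"name": "John Doe"}
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To send it to a reachable cluster (not executed here):
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```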
Get a document
curl -XGET gd01:9200/test_idx/external/1?pretty
Response:
{
"_index" : "test_idx",
"_type" : "external",
"_id" : "1",
"_version" : 1,
"found" : true,
"_source" : {
"name" : "John Doe"
}
}
Bulk operations
Create documents in bulk
curl -XPOST gd01:9200/test_idx/external/_bulk?pretty -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'
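A single bulk request can mix different operations. Here the update action is followed by a partial document, while the delete action has no body line: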
curl -XPOST gd01:9200/test_idx/external/_bulk?pretty -d '
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}
'
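The bulk body is newline-delimited JSON: one action line, optionally followed by one source line, and the final line must end with a newline. A minimal sketch of a helper that assembles such a body is shown below; the function name build_bulk_body and the tuple layout of its input are hypothetical.

```python
import json


def build_bulk_body(actions):
    """Build a bulk request body.

    `actions` is an iterable of (action, metadata, payload) tuples.
    `payload` is None for actions without a body line, such as delete.
    """
    lines = []
    for action, metadata, payload in actions:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:
            lines.append(json.dumps(payload))
    # The bulk API requires the last line to be terminated by a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_bulk_body([
        ("update", {"_id": "1"}, {"doc": {"name": "John Doe becomes Jane Doe"}}),
        ("delete", {"_id": "2"}, None),
    ])
    print(body, end="")
```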
Search
In Elasticsearch, search criteria can be passed in the URL or in the request body.
Search criteria in the URL
curl -XGET gd01:9200/test_idx/external/_search?q=John
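Response. The output below has the shape of a single-document GET response; a _search response normally returns matches inside a hits.hits array, together with took, timed_out, and _shards: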
{
"_index" : "test_idx",
"_type" : "external",
"_id" : "1",
"_version" : 2,
"found" : true,
"_source" : {
"name" : "John Doe"
}
}
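Search criteria in the request body
A term query does not analyze its input, so against the analyzed text field name it would look up the exact term "John Doe", which the index does not contain (the standard analyzer indexes john and doe). A match query analyzes the search text first:
curl -XPOST gd01:9200/test_idx/external/_search?pretty -d '
{
"query": {
"match": {
"name": "John Doe"
}
}
}'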
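The difference between term and match queries on an analyzed text field can be illustrated with a toy inverted index. This is a simplified model, not Elasticsearch's implementation: the analyze function only lowercases and splits on non-alphanumeric characters, which approximates what the standard analyzer does for plain English text, and all names are hypothetical.

```python
import re


def analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_index(docs):
    """Build {token: set of doc IDs} from {doc ID: field text}."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """Look up the value exactly as given, without analysis."""
    return index.get(value, set())


def match_query(index, text):
    """Analyze the text, then return docs matching any token (OR)."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


if __name__ == "__main__":
    index = build_index({"1": "John Doe", "2": "Jane Doe"})
    print("term 'John Doe':", sorted(term_query(index, "John Doe")))
    print("term 'john':", sorted(term_query(index, "john")))
    print("match 'John Doe':", sorted(match_query(index, "John Doe")))
```

In this model the term lookup for "John Doe" finds nothing because only the tokens john, jane, and doe were indexed, while the match lookup finds both documents because each contains doe.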
Beyond simple queries, Elasticsearch also supports:
Filtering: see https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_executing_filters.html
Aggregations: see https://www.elastic.co/guide/en/elasticsearch/reference/5.6/_executing_aggregations.html