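Cluster Lifecycle Management

Pre-launch
- Assess user requirements and usage scenarios / data modeling / capacity planning / choose a suitable deployment architecture / performance testing
In production
- Monitor traffic / check regularly for potential problems (catch misuse early and add machines in time, before problems escalate)
- Optimize indices (Index Lifecycle Management); check for imbalance that leaves some nodes overheated
- Back up data regularly / perform rolling upgrades
Before taking a cluster offline, monitor its traffic and carry out a staged decommission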

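Deployment Recommendations

Choose a deployment model and hardware configuration that fit the actual workload
- Search
- Logs / metrics
Take anti-affinity into account when deploying
- Spread machines across different racks wherever possible. For example, 3 master nodes must sit on different racks
- Make good use of Shard Filtering in the configuration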
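The rack rule above can be checked mechanically. The following is a minimal sketch, assuming you already have a mapping from master node name to rack label; the node names, rack labels, and the function name are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def racks_shared_by_masters(master_racks):
    """Return {rack: [nodes]} for every rack that hosts more than one master node.

    master_racks maps a master node name to its rack label. Both are
    hypothetical here; in practice the label could come from a custom
    node attribute that you define yourself.
    """
    by_rack = defaultdict(list)
    for node, rack in master_racks.items():
        by_rack[rack].append(node)
    # Keep only the racks that violate the "one master per rack" rule.
    return {rack: sorted(nodes) for rack, nodes in by_rack.items() if len(nodes) > 1}


layout = {"master-1": "rack-a", "master-2": "rack-b", "master-3": "rack-a"}
print(racks_shared_by_masters(layout))
```

An empty result means every master node sits on its own rack; any entry names a rack whose failure would take out more than one master.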

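Usage Should Follow Certain Conventions

Mapping
- In production, consider disabling dynamic mapping on indices, so that an excessive number of fields does not bloat the Cluster State
- Disable automatic index creation; an index must be created with an explicit Mapping or be configured through an Index Template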

PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": false
  }
}

#alternatively, allow automatic creation only for index names that match these patterns
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": ".monitoring-*,logstash-*"
  }
}
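With automatic creation restricted, mappings are usually supplied through an index template. The sketch below builds the body of a legacy index template whose mapping rejects unmapped fields via `"dynamic": "strict"`. The template name, index pattern, and field list are assumptions made for illustration, and the function only builds the JSON body without sending it.

```python
import json


def strict_template(index_patterns, properties):
    """Build a legacy index template body whose mapping rejects unknown fields."""
    return {
        "index_patterns": list(index_patterns),
        "mappings": {
            "dynamic": "strict",  # documents with unmapped fields are rejected
            "properties": dict(properties),
        },
    }


body = strict_template(
    ["logstash-*"],
    {"@timestamp": {"type": "date"}, "message": {"type": "text"}},
)
# Would be sent as: PUT _template/logs_strict  (hypothetical template name)
print(json.dumps(body, indent=2))
```

Setting `"dynamic"` to `false` instead would accept such documents but leave the extra fields unindexed; which of the two fits better depends on how strict the ingestion pipeline should be.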

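Usage Should Follow Certain Conventions

Set up slowlogs to uncover usage patterns that perform poorly or are simply wrong
- For example: mapping a URL as keyword and then querying it with wildcards. Use Text instead, combined with a URL-aware tokenizer (such as uax_url_email)
- Wildcard queries that start with "*" are strictly forbidden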
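The leading-wildcard rule can also be enforced before a query ever reaches the cluster. The following is a minimal sketch of such a check, assuming queries are available as parsed JSON (Python dicts); the function name is hypothetical, and it also flags a leading "?", which goes slightly beyond the rule stated above.

```python
def leading_wildcards(query, path=""):
    """Yield (path, pattern) for each wildcard clause whose pattern starts with * or ?."""
    if isinstance(query, dict):
        for key, value in query.items():
            here = f"{path}.{key}" if path else key
            if key == "wildcard" and isinstance(value, dict):
                for field, spec in value.items():
                    # Both forms are handled: {"field": "pat"} and {"field": {"value": "pat"}}.
                    pattern = spec.get("value", "") if isinstance(spec, dict) else spec
                    if isinstance(pattern, str) and pattern[:1] in ("*", "?"):
                        yield f"{here}.{field}", pattern
            else:
                yield from leading_wildcards(value, here)
    elif isinstance(query, list):
        for i, item in enumerate(query):
            yield from leading_wildcards(item, f"{path}[{i}]")


query = {
    "bool": {
        "must": [
            {"wildcard": {"url": {"value": "*elastic.co"}}},
            {"wildcard": {"host": "web-*"}},
        ]
    }
}
for where, pattern in leading_wildcards(query):
    print(where, pattern)
```

A check like this could run in an application gateway or a code review hook; slowlogs remain the way to find offenders that are already in production.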

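Back Up Important Data

Upgrade Regularly to New Versions

New ES releases keep improving performance and add new features
- Improvements to the Circuit breaker implementation
They also fix known bugs and security vulnerabilities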

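ES Versions

Elasticsearch version numbers follow the format X.Y.Z
- X: Major
- Y: Minor
- Z: Patch
Elasticsearch can use indices created in the previous major version
- 7.x can use 6.x indices / 7.x cannot use 5.x indices
- 5.x can use 2.x indices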
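The compatibility rule above can be written as a small lookup. This is a simplified sketch covering only the major versions mentioned in the text; the function name is hypothetical. Note that there were no 3.x or 4.x releases, which is why 2.x counts as the major version before 5.x.

```python
# Major versions in release order. There were no 3.x or 4.x releases,
# so 2.x is the major version immediately before 5.x.
MAJOR_SEQUENCE = [2, 5, 6, 7]


def can_open_index(cluster_major, index_created_major):
    """True if a cluster on cluster_major can use an index created on index_created_major."""
    if cluster_major not in MAJOR_SEQUENCE or index_created_major not in MAJOR_SEQUENCE:
        raise ValueError("major version outside the range covered by this sketch")
    pos = MAJOR_SEQUENCE.index(cluster_major)
    # Allowed: the cluster's own major version and the one directly before it.
    allowed = MAJOR_SEQUENCE[max(pos - 1, 0): pos + 1]
    return index_created_major in allowed


print(can_open_index(7, 6))
print(can_open_index(7, 5))
print(can_open_index(5, 2))
```

An index that is two major versions behind therefore has to be reindexed (or upgraded step by step) before the cluster moves on.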

Rolling Upgrade vs. Full Cluster Restart

Rolling Upgrade
- No downtime
- https://www.elastic.co/guide/en/elasticsearch/reference/7.1/rolling-upgrades.html
Full Cluster Restart
- The cluster is unavailable during the upgrade
- The upgrade finishes faster

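Steps for a Full Restart

Stop indexing data and back up the cluster
Disable Shard Allocation (persistent)
Run a Synced Flush
Shut down and upgrade all nodes
Start all master nodes first / then start the other nodes
Once the cluster turns yellow, re-enable Shard Allocation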

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

POST _flush/synced
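The full restart sequence can be laid out as an ordered plan that mixes manual steps with the REST calls involved. The following is a sketch under stated assumptions: it sends nothing, the tuple layout and function name are made up for illustration, and it assumes that allocation is re-enabled by resetting the setting to `null` and that cluster health is polled with `wait_for_status=yellow`.

```python
import json


def full_restart_plan():
    """Return the ordered steps as (kind, detail) tuples; nothing is sent anywhere."""
    disable = {"persistent": {"cluster.routing.allocation.enable": "primaries"}}
    # A JSON null resets the setting to its default, which re-enables allocation.
    enable = {"persistent": {"cluster.routing.allocation.enable": None}}
    return [
        ("manual", "stop indexing and back up the cluster"),
        ("request", ("PUT", "_cluster/settings", disable)),
        ("request", ("POST", "_flush/synced", None)),
        ("manual", "shut down and upgrade every node"),
        ("manual", "start the master nodes first, then the remaining nodes"),
        ("request", ("GET", "_cluster/health?wait_for_status=yellow", None)),
        ("request", ("PUT", "_cluster/settings", enable)),
    ]


for kind, detail in full_restart_plan():
    if kind == "request":
        method, path, body = detail
        print(method, path, json.dumps(body) if body is not None else "")
    else:
        print("#", detail)
```

Keeping the plan as data makes the order explicit: allocation is disabled before any node stops, and it is re-enabled only after the cluster has reported yellow.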

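Ops Cheat Sheet: Moving Shards

Move a shard from one node to another
Use case:
- A data node holds too many hot shards; manually assigning shards to specific nodes relieves the pressure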

POST _cluster/reroute
{
 "commands": [
   {
     "move": {
       "index": "index_name",
       "shard": 0,
       "from_node": "node1",
       "to_node": "node2"
     }
   }
 ]
}

Ops Cheat Sheet: Removing a Node from the Cluster

Use case: you want to remove a node, or take a machine down for maintenance, without the cluster turning yellow or red

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "the ip of your node"
  }
}
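After the exclusion is set, shards migrate off the node gradually, so it is worth confirming that the node is empty before shutting it down. The sketch below assumes the rows come from `GET _cat/shards?format=json`, which returns one object per shard copy including an `ip` column; the function name and the sample rows are hypothetical.

```python
def shards_left_on(cat_shards_rows, ip):
    """Return (index, shard, prirep) for every shard copy still located on the given IP."""
    return [
        (row["index"], row["shard"], row["prirep"])
        for row in cat_shards_rows
        # Unassigned shards carry no IP, so .get() keeps them out of the result.
        if row.get("ip") == ip
    ]


# Sample rows shaped like the JSON output of the cat shards API (values are strings).
rows = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED", "ip": "10.0.0.1", "node": "node1"},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "STARTED", "ip": "10.0.0.2", "node": "node2"},
    {"index": "logs-1", "shard": "1", "prirep": "r", "state": "UNASSIGNED", "ip": None, "node": None},
]
print(shards_left_on(rows, "10.0.0.1"))
```

Once the list is empty the node can be stopped safely. After maintenance, setting `cluster.routing.allocation.exclude._ip` back to `null` lets shards return to the node.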

Ops Cheat Sheet: Controlling Allocation and Recovery

Use case: tune how aggressively shards are rebalanced and recovered

#change the number of moving shards to balance the cluster
PUT _cluster/settings
{
  "transient": {
    "cluster": {
      "routing": {
        "allocation.cluster_concurrent_rebalance": 2
      }
    }
  }
}

#change the number of shards being recovered simultaneously per node
PUT _cluster/settings
{
  "transient": {
    "cluster": {
      "routing": {
        "allocation.node_concurrent_recoveries": 2
      }
    }
  }
}

#change the recovery speed
PUT _cluster/settings
{
  "transient": {
    "indices": {
      "recovery.max_bytes_per_sec": "80mb"
    }
  }
}

#change the number of concurrent streams for a recovery on a single node
#(this setting dates from older releases; verify that your version still supports it)
PUT _cluster/settings
{
  "transient": {
    "indices": {
      "recovery.concurrent_streams": 6
    }
  }
}

Ops Cheat Sheet: Synced Flush

Use case: a node needs to be restarted
- A synced flush places a sync ID on the index, which shortens the recovery time of those shards

POST _flush/synced

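Ops Cheat Sheet: Clearing the Caches on a Node

Use case: a node shows high memory usage. Clearing the caches hurts cluster performance, but it can keep the cluster from running into OOM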

POST _cache/clear

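Ops Cheat Sheet: Controlling the Search Queue

Use case: when search response times are long and the number of rejected requests keeps rising, this value can be raised moderately

#pre-5.0 syntax; from 5.0 on the setting is named thread_pool.search.queue_size and is configured per node in elasticsearch.yml
PUT _cluster/settings
{
  "transient": {
    "threadpool.search.queue_size": 2000
  }
}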

Configuring Circuit Breakers

Use case: configure the various Circuit Breakers to prevent OOM

PUT _cluster/settings
{
  "persistent": {
    "indices": {
      "breaker": {
        "total.limit": "40%"
      }
    }
  }
}

70 - Some Operations Recommendations for ES
