Best Practices for Restarting an Elasticsearch Cluster

Production cluster size

Data nodes: 12

Size per index: 20 GB to 40 GB

Replicas: 1

Number of indices (one index per day): 600

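Problems with an unsafe restart

Killing a node directly may cause data loss.

The cluster treats the node as failed and starts reallocating its shards to other nodes (shard rebalance), which causes large data transfers between nodes.

After the node restarts, recovering its data again generates heavy disk and network traffic, consuming machine and network resources.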

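Safe restart procedure

Pause the programs that write data.

Disable cluster shard allocation.

Manually run POST /_flush/synced.

Restart the node.

Re-enable cluster shard allocation.

Wait for recovery to finish and the cluster health status to turn green.

Resume the programs that write data.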
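The steps above can be sketched as two shell helpers, one to run before the node restart and one after it. This is a minimal sketch, not the author's script: ES_URL, DRY_RUN and all function names are assumptions made for this example, and pausing the writers and restarting the node are left out because they depend on the environment.

```shell
#!/usr/bin/env bash
# Sketch of the calls around a node restart.
# ES_URL, DRY_RUN and the function names are assumptions for this example.
ES_URL="${ES_URL:-http://localhost:9200}"

# Send one request to the cluster; with DRY_RUN=1 only print what would be sent.
es_request() {
  local method="$1" path="$2" body="${3:-}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$method $path${body:+ $body}"
    return 0
  fi
  if [ -n "$body" ]; then
    curl -sS -X "$method" "$ES_URL$path" \
      -H 'Content-Type: application/json' -d "$body"
  else
    curl -sS -X "$method" "$ES_URL$path"
  fi
}

# Set cluster.routing.allocation.enable to "none" or "all".
set_allocation() {
  es_request PUT /_cluster/settings \
    "{\"persistent\":{\"cluster.routing.allocation.enable\":\"$1\"}}"
}

# Run after the writers are paused and before the node is restarted.
before_restart() {
  set_allocation none
  es_request POST /_flush/synced
}

# Run once the restarted node has rejoined the cluster.
after_restart() {
  set_allocation all
  # Returns when the cluster is green or after the timeout; the response
  # field "timed_out" tells which of the two happened.
  es_request GET '/_cluster/health?wait_for_status=green&timeout=60s'
}
```

Running `DRY_RUN=1 before_restart` prints the two requests without touching the cluster, which is a cheap way to review the sequence before using it on production.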

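Speed tuning

Temporarily raise max_bytes_per_sec, then change it back afterwards.

Several nodes can be restarted at the same time.

Temporarily set the replica count of historical indices to 0, then restore it once the cluster has recovered.

Use _forcemerge.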
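Two of the tuning knobs above, the replica count and the recovery bandwidth limit, can be wrapped in small helpers so that the temporary value and its later reset go through the same code path. This is a sketch under assumptions: ES_URL, DRY_RUN, the function names and the index pattern in the usage note are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the temporary tuning changes.
# ES_URL, DRY_RUN and the function names are assumptions for this example.
ES_URL="${ES_URL:-http://localhost:9200}"

# PUT a JSON body to a path; with DRY_RUN=1 only print the request.
es_put() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "PUT $1 $2"
  else
    curl -sS -X PUT "$ES_URL$1" -H 'Content-Type: application/json' -d "$2"
  fi
}

# Set the replica count of the indices matching an index name or pattern.
set_replicas() {
  es_put "/$1/_settings" "{\"index\":{\"number_of_replicas\":$2}}"
}

# Set the recovery bandwidth limit, e.g. "200mb".
# Passing "null" removes the transient override again.
set_recovery_rate() {
  local value="\"$1\""
  if [ "$1" = "null" ]; then
    value="null"
  fi
  es_put /_cluster/settings \
    "{\"transient\":{\"indices.recovery.max_bytes_per_sec\":$value}}"
}
```

For example, `set_replicas 'logs-2019.*' 0` before the restart and `set_replicas 'logs-2019.*' 1` after recovery, where `logs-2019.*` is a hypothetical pattern for the historical daily indices; `set_recovery_rate 200mb` and `set_recovery_rate null` bracket the restart in the same way.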

Related APIs

Synced flush: curl -XPOST 'localhost:9200/_flush/synced' (synced flush was deprecated in Elasticsearch 7.6 and removed in 8.0; on those versions use POST /_flush instead)

_forcemerge: curl -XPOST 'localhost:9200/{index}/_forcemerge?max_num_segments=1'

Disable shard allocation: curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.enable": "none"}}'

Enable shard allocation: curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'

Raise max_bytes_per_sec: PUT http://localhost:port/_cluster/settings?flat_settings=true with body {"transient": {"indices.recovery.max_bytes_per_sec": "200mb"}}

Reset max_bytes_per_sec: PUT http://localhost:port/_cluster/settings?flat_settings=true with body {"transient": {"indices.recovery.max_bytes_per_sec": null}}

APIs for checking recovery speed (quote the URLs so the shell does not interpret the & characters):

curl 'localhost:9200/{index}/_stats?level=shards&pretty'

curl 'localhost:9200/{index}/_recovery?pretty&human&detailed=true'

curl 'localhost:9200/_cat/recovery'
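To follow recovery without reading the full output each time, the _cat/recovery listing can be reduced to a single number. The sketch below counts shards whose stage is not "done"; ES_URL, INTERVAL and the function names are assumptions for this example, and the column selection relies on the `h=index,shard,stage` parameter of the cat API.

```shell
#!/usr/bin/env bash
# Sketch of a recovery progress check.
# ES_URL, INTERVAL and the function names are assumptions for this example.
ES_URL="${ES_URL:-http://localhost:9200}"

# Read "_cat/recovery?h=index,shard,stage" rows on stdin and print how many
# shards have a recovery stage other than "done". Blank lines are ignored.
count_unfinished() {
  awk 'NF >= 3 && $3 != "done" { n++ } END { print n + 0 }'
}

# Poll the cluster until no shard is still recovering.
watch_recovery() {
  local rows remaining
  while :; do
    # -f makes curl fail on HTTP errors so a broken request is not
    # mistaken for "nothing left to recover".
    rows="$(curl -fsS "$ES_URL/_cat/recovery?h=index,shard,stage")" || return 1
    remaining="$(printf '%s\n' "$rows" | count_unfinished)"
    echo "shards still recovering: $remaining"
    if [ "$remaining" -eq 0 ]; then
      break
    fi
    sleep "${INTERVAL:-30}"
  done
}
```

Calling `watch_recovery` after shard allocation is re-enabled prints one progress line per polling interval and returns once every listed shard reports "done".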

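Summary

Following the steps above, restarting the cluster and recovering to about 80% took 1 hour. Data writes were then resumed; recovery slowed down considerably after that, but normal use was not affected.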
