记一次线上cluster_block_exception blocked by forbidden/12/index read-only / allow delete (api)问题的排查记录

背景

用户在使用线上某服务时(涉及到es已有索引写入操作), 反馈服务总是失败. 而其他服务(涉及es的读/新索引的写入)均正常.

排查过程

经过对后台日志的分析, 发现如下错误日志:

{'index': {'_index': 'bi_4769dcd50ba048cd163bfd48e2f500f7', '_type': 'doc', '_id': 'evoOy24BgwL6ozngNCWG', 'status': 403, 'error': {'type':'cluster_block_exception', 'reason': 'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}}

经过Google, 发现此问题是由于es数据存储磁盘剩余空间过少导致的. 官网对此的说明如下:

cluster.routing.allocation.disk.watermark.flood_stage

Controls the flood stage watermark. It defaults to 95%, meaning that Elasticsearch enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on the node that has at least one disk exceeding the flood stage. This is a last resort to prevent nodes from running out of disk space. The index block must be released manually once there is enough disk space available to allow indexing operations to continue.

即es存在一种flood_stage的机制. 默认磁盘空间设置为95%, 当磁盘占用超过此值时, 将会触发flood_stage机制, es将强制将各索引index.blocks.read_only_allow_delete设置为true, 即仅允许只读只删, 不允许新增.

以上排查结果表明: 线上服务出问题的原因在于es索引均被设置为只读只删模式. 所以导致索引数据写入时失败.

问题原因

前两周, 在线上服务所在机器上, 进行了涉及MySQL大量数据(涉及数据量达1亿量级)的读写操作, 当时磁盘空间被耗尽. 触发es的flood_stage, 全部索引被设置为只读只删.

但因近期(一个月左右), 没有用户使用对es索引写入数据服务, 导致一直未发现此问题.

解决方法

解决方法很简单, 仅需将对应es节点上的索引设置进行如下设置即可.
PUT _settings
{
  "index": {
      "blocks": {
          "read_only_allow_delete": "false"
      }
  }
}

反思!!!

在线上进行某些操作时, 应对服务器资源占用有个粗略的估计或对服务器资源进行监控.

避免在执行某些操作时, 耗尽服务器资源而导致其他一些服务的异常

引用

ES报错 [FORBIDDEN/12/index read-only / allow delete (api)] - read only elasticsearch indices

Disk-based Shard Allocation