All data nodes of one ES Cluster have a read-only issue
Diagnosis
There is one ES Cluster built from VMs with Cinder volumes attached, laid out as below:
Node   Role
es5    data
es6    data
es7    data
es8    master
es9    master
es10   master
Each of the 3 data nodes has a Cinder volume attached and mounted at "/data1", which is used as one of its Elasticsearch data directories.
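For reference, such a mount would normally appear as one of the entries in path.data in elasticsearch.yml. The snippet below is only an assumed layout based on the log path shown further down (cluster name stackstash-elasticsearch, data path /data1/instance00); the first entry /data0/instance00 is a hypothetical local path, not confirmed from these nodes:

# elasticsearch.yml (assumed layout, not copied from these nodes)
cluster.name: stackstash-elasticsearch
path.data: /data0/instance00,/data1/instance00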
However, due to a network issue last Friday, all 3 volumes were remounted as read-only file systems, so new documents could no longer be written to them.
This prevented the whole ES Cluster from performing shard operations (allocation and replication), and the cluster has been in red status since then because some indices have unassigned shards.
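The red status can be confirmed from the cluster health API on any node. The sketch below is a minimal example assuming the HTTP API is reachable on a master node at es8:9200 (hostname and port are placeholders for this cluster):

# Minimal health check; "red" plus a non-zero unassigned_shards count
# matches the symptom described above.
import json
import urllib.request

with urllib.request.urlopen("http://es8:9200/_cluster/health") as resp:
    health = json.load(resp)

print(health["status"])             # "red" while primary shards are missing
print(health["unassigned_shards"])  # shards the cluster cannot place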
Restarting the Elasticsearch service on es5 produced exceptions like the following:
Caused by: java.io.FileNotFoundException: /data1/instance00/stackstash-elasticsearch/nodes/0/indices/alert_08052014/3/index/_8h5j.fdx (Read-only file system)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:212)
at java.io.FileOutputStream.<init>(FileOutputStream.java:165)
at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:384)
at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:277)
at org.apache.lucene.store.FileSwitchDirectory.createOutput(FileSwit
The attached volume was still in read-only mode.
Rebooting the data node resolved the read-only issue.
However, the Elasticsearch service could not start on es5 because of a "failed to read local state" error; the read-only period likely left some local state files broken.
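To see which local state files might be affected, one can list everything under the node's _state directories. This is only a sketch, assuming the directory layout visible in the stack trace above; treating empty files as suspicious is an assumption, not a definitive check, so inspect files before moving or deleting anything:

# List state files under the node data directory seen in the log above.
import os

DATA_ROOT = "/data1/instance00/stackstash-elasticsearch/nodes/0"  # from the exception path

for dirpath, dirnames, filenames in os.walk(DATA_ROOT):
    if os.path.basename(dirpath) != "_state":
        continue
    for name in filenames:
        path = os.path.join(dirpath, name)
        size = os.path.getsize(path)
        marker = "  <-- suspicious (empty)" if size == 0 else ""
        print("%10d  %s%s" % (size, path, marker))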
The same thing happened on es6 and es7.
So at this point, only the 3 master nodes in the ES Cluster are working normally, while none of the 3 data nodes can start the Elasticsearch service.
How to recover
The replica count for each index was set to 2, which means a full copy of the data still exists on the remaining 2 data nodes anyway.
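The replica count can be double-checked through the index settings API before taking a node out. A minimal sketch, reusing the placeholder master host es8:9200 and the index name alert_08052014 from the stack trace above:

import json
import urllib.request

with urllib.request.urlopen("http://es8:9200/alert_08052014/_settings") as resp:
    settings = json.load(resp)

# Look for index.number_of_replicas == "2" in the returned settings.
print(json.dumps(settings, indent=2))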
Shut down the data node with the corrupted Lucene shards, which is es5.
Start up the Elasticsearch service on es6 and es7 with shard allocation disabled (this can be done from any master node):
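A minimal sketch of what that could look like against the cluster settings API, assuming an Elasticsearch 1.x-era cluster and the placeholder master host es8:9200 (on 0.90.x the equivalent transient setting is "cluster.routing.allocation.disable_allocation": true):

# Disable shard allocation via a transient cluster setting.
import json
import urllib.request

body = json.dumps({
    "transient": {"cluster.routing.allocation.enable": "none"}
}).encode("utf-8")

req = urllib.request.Request(
    "http://es8:9200/_cluster/settings",
    data=body,
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # response should acknowledge the transient setting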