So we were in a similar boat: our storage network blinked and 2 of 3 Elasticsearch nodes got hit with read-only disks. Indices went red, with errors that wouldn’t clear, retries exhausted, and _cluster/reroute seemingly not working even after reboots, fscks, and service restarts.
[2019-11-14T11:13:11,166][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [clusty-mcclusterface] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true, includeYesDecisions?=false], found shard [prod_syslog_linux_0][0], node[null], [P], recovery_source[existing recovery], [UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-11-14T08:05:20.187Z], failed_attempts[5], delayed=false, details[failed to create shard, failure FileSystemException[/data/elasticsearch/graylogesearch1-graylog2/nodes/0/indices/jpJkzjabSKaf_CnWgyughA/0/_state/state-2.st.tmp: Read-only file system]], allocation_status[deciders_no]]]
We found your article before lunch and were going to try it after, but I had a few minutes to kill…
“Couldn’t hurt to hit reroute again”
I happened to have a previous count of red/yellow/green indices up on my screen, and only then did I notice that I had a handful fewer reds, and a handful more yellows/greens.
“This is too stupid to work”
But, inexplicably, repeatedly hitting _cluster/reroute?retry_failed=true seems to have brought all of our indices back to green with no apparent loss of data. We have exactly zero ideas as to why this worked, but it did, and it might for others.
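If you’d rather script it than keep mashing the same request by hand, something like this should do the equivalent in a loop (a rough sketch, assuming the cluster answers on localhost:9200 with no auth, and that “everything green” is the stop condition you care about):

while [ "$(curl -s 'localhost:9200/_cat/health?h=status' | tr -d '[:space:]')" != "green" ]; do
  # retry_failed asks ES to retry shards that already burned through their allocation attempts
  curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true' > /dev/null
  sleep 30
done

It just re-sends the same reroute every 30 seconds, so kill it whenever things look sane.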
Crack open a couple of terminals. In one, run:
while true; do curl -s localhost:9200/_cat/indices | awk '{print $1}' | sort | uniq -c | tr "\n" " "; echo ''; sleep 5; done
Which gets you a running count of index statuses. Then in the other, run your curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'
and if you’re the right kind of lucky you might see an index or two change state.
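If a few reds refuse to budge, the allocation explain API (the same thing that produced the debug line above) will at least tell you which shard is stuck and which decider is blocking it. On 5.x an empty request explains the first unassigned shard it finds, e.g.:

curl -s 'localhost:9200/_cluster/allocation/explain?pretty'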
No guarantee, no explanation, but no reason not to give it a try if you’re up this particular creek.
Elasticsearch Version: 5.6.8