- Background
We recently optimized the cluster-update path in production, and after the release, rolling updates began failing on some nodes with a `failed to obtain node locks` error. Everything online says this happens when an old ES process is still alive as the new one starts. But our rolling-update script kills the ES process before updating each node, so in theory the process is gone, ES is shut down, and two ES processes should never coexist.

- Puzzled by this, I read the source code and found the answer. Let's work through it together.
In Elasticsearch's `Bootstrap` startup class there is the following shutdown-hook code:
```java
if (addShutdownHook) {
    Runtime.getRuntime().addShutdownHook(new Thread() {
        @Override
        public void run() {
            try {
                IOUtils.close(node, spawner);
                LoggerContext context = (LoggerContext) LogManager.getContext(false);
                Configurator.shutdown(context);
                if (node != null && node.awaitClose(10, TimeUnit.SECONDS) == false) {
                    throw new IllegalStateException("Node didn't stop within 10 seconds. " +
                            "Any outstanding requests or tasks might get killed.");
                }
            } catch (IOException ex) {
                throw new ElasticsearchException("failed to stop node", ex);
            } catch (InterruptedException e) {
                LogManager.getLogger(Bootstrap.class)
                        .warn("Thread got interrupted while waiting for the node to shutdown.");
                Thread.currentThread().interrupt();
            }
        }
    });
}
```
Testing shows that stopping the process with `kill -15` (SIGTERM) triggers this hook: the JVM catches the signal and runs all registered shutdown hooks. `kill -9` (SIGKILL), by contrast, force-kills the process and can never be caught, so the hook does not run at all. Since our rolling-update script stops ES with a plain `kill`, i.e. SIGTERM, the hook fires. Its job is to clean up resources: shutting down threads, closing the node, and so on.
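The signal behavior is easy to verify with a tiny standalone program. The class name `HookDemo` below is made up for illustration; the point is that hooks registered via `Runtime.addShutdownHook` run on normal JVM exit and on SIGTERM, but never on SIGKILL:

```java
public class HookDemo {
    public static void main(String[] args) {
        Thread hook = new Thread(() -> System.out.println("cleanup: releasing resources"));
        Runtime.getRuntime().addShutdownHook(hook);
        // removeShutdownHook returns true only if the hook was previously
        // registered, which confirms the registration above succeeded
        boolean wasRegistered = Runtime.getRuntime().removeShutdownHook(hook);
        System.out.println("hook registered: " + wasRegistered);
        // Re-register and let main return: on this normal exit (as on SIGTERM)
        // the JVM runs the hook and prints its cleanup message.
        Runtime.getRuntime().addShutdownHook(hook);
    }
}
```

Run it, then try `kill -15` versus `kill -9` while it sleeps, and you will see the cleanup line only in the SIGTERM case.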
`IOUtils.close(node, spawner)` closes the `Node` object's resources. `Node`'s `close` method looks like this:
```java
@Override
public synchronized void close() throws IOException {
    synchronized (lifecycle) {
        if (lifecycle.started()) {
            stop();
        }
        if (!lifecycle.moveToClosed()) {
            return;
        }
    }

    logger.info("closing ...");
    List<Closeable> toClose = new ArrayList<>();
    StopWatch stopWatch = new StopWatch("node_close");
    toClose.add(() -> stopWatch.start("node_service"));
    toClose.add(nodeService);
    toClose.add(() -> stopWatch.stop().start("http"));
    toClose.add(injector.getInstance(HttpServerTransport.class));
    toClose.add(() -> stopWatch.stop().start("snapshot_service"));
    toClose.add(injector.getInstance(SnapshotsService.class));
    toClose.add(injector.getInstance(SnapshotShardsService.class));
    toClose.add(() -> stopWatch.stop().start("client"));
    Releasables.close(injector.getInstance(Client.class));
    toClose.add(() -> stopWatch.stop().start("indices_cluster"));
    toClose.add(injector.getInstance(IndicesClusterStateService.class));
    toClose.add(() -> stopWatch.stop().start("indices"));
    toClose.add(injector.getInstance(IndicesService.class));
    // close filter/fielddata caches after indices
    toClose.add(injector.getInstance(IndicesStore.class));
    toClose.add(() -> stopWatch.stop().start("routing"));
    toClose.add(injector.getInstance(RoutingService.class));
    toClose.add(() -> stopWatch.stop().start("cluster"));
    toClose.add(injector.getInstance(ClusterService.class));
    toClose.add(() -> stopWatch.stop().start("node_connections_service"));
    toClose.add(injector.getInstance(NodeConnectionsService.class));
    toClose.add(() -> stopWatch.stop().start("discovery"));
    toClose.add(injector.getInstance(Discovery.class));
    toClose.add(() -> stopWatch.stop().start("monitor"));
    toClose.add(nodeService.getMonitorService());
    toClose.add(() -> stopWatch.stop().start("gateway"));
    toClose.add(injector.getInstance(GatewayService.class));
    toClose.add(() -> stopWatch.stop().start("search"));
    toClose.add(injector.getInstance(SearchService.class));
    toClose.add(() -> stopWatch.stop().start("transport"));
    toClose.add(injector.getInstance(TransportService.class));

    for (LifecycleComponent plugin : pluginLifecycleComponents) {
        toClose.add(() -> stopWatch.stop().start("plugin(" + plugin.getClass().getName() + ")"));
        toClose.add(plugin);
    }
    toClose.addAll(pluginsService.filterPlugins(Plugin.class));

    toClose.add(() -> stopWatch.stop().start("script"));
    toClose.add(injector.getInstance(ScriptService.class));

    toClose.add(() -> stopWatch.stop().start("thread_pool"));
    toClose.add(() -> injector.getInstance(ThreadPool.class).shutdown());
    // Don't call shutdownNow here, it might break ongoing operations on Lucene indices.
    // See https://issues.apache.org/jira/browse/LUCENE-7248. We call shutdownNow in
    // awaitClose if the node doesn't finish closing within the specified time.

    toClose.add(() -> stopWatch.stop().start("node_environment"));
    toClose.add(injector.getInstance(NodeEnvironment.class));
    toClose.add(stopWatch::stop);

    if (logger.isTraceEnabled()) {
        toClose.add(() -> logger.trace("Close times for each service:\n{}", stopWatch.prettyPrint()));
    }
    IOUtils.close(toClose);
    logger.info("closed");
}
```
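Note the pattern here: the method builds one ordered `List<Closeable>`, interleaving `StopWatch` timing lambdas with real services, and hands the whole list to `IOUtils.close`, which closes entries strictly in insertion order. A minimal sketch of that pattern (service names here are illustrative, not ES's actual wiring):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class OrderedCloseDemo {
    public static void main(String[] args) throws IOException {
        List<String> order = new ArrayList<>();
        List<Closeable> toClose = new ArrayList<>();
        // Closeable is a functional interface, so bookkeeping steps and real
        // resources can share one list and run in a well-defined order.
        toClose.add(() -> order.add("http"));
        toClose.add(() -> order.add("indices"));
        toClose.add(() -> order.add("thread_pool"));
        // Close everything in order, remembering the first failure,
        // roughly what IOUtils.close does.
        IOException first = null;
        for (Closeable c : toClose) {
            try {
                c.close();
            } catch (IOException e) {
                if (first == null) first = e;
            }
        }
        if (first != null) throw first;
        System.out.println(order); // [http, indices, thread_pool]
    }
}
```

The design keeps the shutdown order explicit and in one place: `NodeEnvironment` (which holds the node lock) is near the end of the list, so the lock is released only after every service above it has closed.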
So during shutdown the following log lines are printed:

```
stopping ...
stopped
closing ...
closed
```
Checking our production ES logs, the node-lock exception was thrown before `closed` was ever printed, which means the node-lock resource had not yet been released.
Our scenario is a rolling update: stop a node, update it, then start it again. So before starting the updated node, we need to verify that the node lock has actually been released, and only then start it.
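The node lock itself is an OS-level file lock on the data directory: a process that is still running its shutdown hook keeps holding it, and any second acquisition attempt fails. A minimal illustration with `java.nio` (the temp-file name below is a stand-in, not ES's real `node.lock` layout):

```java
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NodeLockDemo {
    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("node", ".lock");
        try (FileChannel c1 = FileChannel.open(lockFile, StandardOpenOption.WRITE);
             FileLock l1 = c1.tryLock()) {
            System.out.println("first holder locked: " + (l1 != null));
            // A second acquisition attempt while the first is held fails
            // (within one JVM it surfaces as OverlappingFileLockException).
            try (FileChannel c2 = FileChannel.open(lockFile, StandardOpenOption.WRITE)) {
                FileLock l2 = c2.tryLock();
                System.out.println("second attempt: " + (l2 != null ? "locked" : "blocked"));
            } catch (OverlappingFileLockException e) {
                System.out.println("second attempt: blocked");
            }
        }
        Files.deleteIfExists(lockFile);
    }
}
```

This is exactly what a restarting ES node runs into: if the old process has not finished closing `NodeEnvironment`, the new process's lock attempt fails with `failed to obtain node lock`.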
I then wrote a small node-lock checker; the shell script calls it to decide whether the lock has been released, instead of merely checking whether the process is gone.
The node-lock check code:
```java
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.env.Environment;
import org.elasticsearch.env.NodeEnvironment;

import java.io.IOException;

public class CheckEsNodeLock {
    public static void main(String[] args) throws IOException {
        Settings settings = Settings.builder()
                .put("node.max_local_storage_nodes", 1)
                .put(Environment.PATH_HOME_SETTING.getKey(), "/home/es/elasticsearch")
                .put(Environment.PATH_DATA_SETTING.getKey(), "/home/es/es_data")
                .build();
        try {
            // NodeEnvironment's constructor tries to acquire the node lock;
            // if the old ES process still holds it, this throws IllegalStateException.
            NodeEnvironment env = new NodeEnvironment(settings,
                    new Environment(settings, null), nodeId -> {});
            env.close(); // release the lock we just acquired for the check
        } catch (IllegalStateException ex) {
            if (ex.getMessage().contains("failed to obtain node lock")) {
                System.exit(1); // lock still held: old node has not finished closing
            }
        }
        System.exit(0); // lock is free: safe to start the new node
    }
}
```
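In the update script, the checker is polled until it exits 0 or a deadline passes; in shell this is just a loop around `java CheckEsNodeLock`. The sketch below shows the same wait loop generically in Java (class and method names `WaitForLock`/`waitUntil` are made up for illustration):

```java
import java.util.function.BooleanSupplier;

public class WaitForLock {
    // Retry a check until it succeeds or the deadline passes;
    // returns true if the check succeeded within timeoutMillis.
    static boolean waitUntil(BooleanSupplier check, long timeoutMillis, long intervalMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (check.getAsBoolean()) {
                return true;
            }
            Thread.sleep(intervalMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for "CheckEsNodeLock exited 0": a condition that becomes
        // true after ~150 ms, well inside the 2 s deadline.
        long start = System.currentTimeMillis();
        boolean ok = waitUntil(() -> System.currentTimeMillis() - start > 150, 2000, 50);
        System.out.println(ok); // true
    }
}
```

Only once the loop reports the lock is free does the script start the updated node, which eliminates the `failed to obtain node locks` failures during rolling updates.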