序
本文主要分析一个频繁产生对象造成gc时间过长的case。
症状及分析
gc时间过长,平均gc pause的时间要将近4秒,有13%的gc超过10秒,太可怕了,部分gc日志如下:
[PSYoungGen: 457878K->126656K(489472K)] 1746043K->1453131K(1887744K), 12.1965757 secs] [Times: user=5.59 sys=0.52, real=12.19 secs]
154415.774: [GC (Allocation Failure)
Desired survivor size 212860928 bytes, new threshold 1 (max 15)
[PSYoungGen: 376192K->65968K(484864K)] 1702667K->1392499K(1883136K), 0.1665513 secs] [Times: user=0.10 sys=0.00, real=0.17 secs]
154416.838: [GC (Allocation Failure)
Desired survivor size 235929600 bytes, new threshold 1 (max 15)
[PSYoungGen: 341424K->196182K(445952K)] 1667955K->1523034K(1844224K), 1.7996294 secs] [Times: user=0.89 sys=0.03, real=1.80 secs]
154419.456: [GC (Allocation Failure)
Desired survivor size 225968128 bytes, new threshold 1 (max 15)
[PSYoungGen: 434262K->121776K(468480K)] 1761114K->1486938K(1866752K), 23.0844304 secs] [Times: user=10.75 sys=0.81, real=23.09 secs]
154442.541: [Full GC (Ergonomics) [PSYoungGen: 121776K->0K(468480K)] [ParOldGen: 1365162K->208108K(1398272K)] 1486938K->208108K(1866752K), [Metaspace: 93615K->93615K(1132544K)], 23.5955214 secs] [Times: user=3.76 sys=5.30, real=23.59 secs]
154504.670: [GC (Allocation Failure)
Desired survivor size 217579520 bytes, new threshold 1 (max 15)
[PSYoungGen: 200553K->106368K(486400K)] 408662K->314476K(1884672K), 1.0664613 secs] [Times: user=0.39 sys=0.13, real=1.06 secs]
154507.542: [GC (Allocation Failure)
Desired survivor size 218103808 bytes, new threshold 1 (max 15)
[PSYoungGen: 372096K->144182K(478208K)] 580204K->425927K(1876480K), 5.7999561 secs] [Times: user=1.12 sys=1.55, real=5.80 secs]
154514.037: [GC (Allocation Failure)
Desired survivor size 213909504 bytes, new threshold 1 (max 15)
[PSYoungGen: 409910K->87920K(489984K)] 691655K->407999K(1888256K), 10.1020217 secs] [Times: user=4.46 sys=0.61, real=10.11 secs]
154563.240: [GC (Allocation Failure)
Desired survivor size 213385216 bytes, new threshold 1 (max 15)
[PSYoungGen: 328380K->65952K(485888K)] 648460K->386087K(1884160K), 0.0918412 secs] [Times: user=0.04 sys=0.01, real=0.09 secs]
154564.037: [GC (Allocation Failure)
Desired survivor size 219676672 bytes, new threshold 1 (max 15)
[PSYoungGen: 342944K->153558K(478208K)] 663079K->474022K(1876480K), 3.1948641 secs] [Times: user=0.72 sys=0.69, real=3.19 secs]
154568.135: [GC (Allocation Failure)
Desired survivor size 212336640 bytes, new threshold 1 (max 15)
[PSYoungGen: 423382K->98528K(484352K)] 743846K->457302K(1882624K), 13.4085860 secs] [Times: user=6.04 sys=0.69, real=13.41 secs]
通过jmap dump下内存之后,使用mat分享,查看thread_overview
可以看到ElasticsearchJestHealthIndicator.doHealthCheck持有了很多对象没释放
调试及复现
本地复现
spring-context-4.3.7.RELEASE-sources.jar!/org/springframework/jmx/export/SpringModelMBean.java
/**
* Switches the {@link Thread#getContextClassLoader() context ClassLoader} for the
* managed resources {@link ClassLoader} before allowing the invocation to occur.
* @see javax.management.modelmbean.ModelMBean#invoke
*/
@Override
public Object invoke(String opName, Object[] opArgs, String[] sig)
throws MBeanException, ReflectionException {
ClassLoader currentClassLoader = Thread.currentThread().getContextClassLoader();
try {
Thread.currentThread().setContextClassLoader(this.managedResourceClassLoader);
return super.invoke(opName, opArgs, sig);
}
finally {
Thread.currentThread().setContextClassLoader(currentClassLoader);
}
}
spring-boot-actuator-1.4.5.RELEASE-sources.jar!/org/springframework/boot/actuate/endpoint/jmx/DataEndpointMBean.java
@ManagedResource
public class DataEndpointMBean extends EndpointMBean {
/**
* Create a new {@link DataEndpointMBean} instance.
* @param beanName the bean name
* @param endpoint the endpoint to wrap
* @param objectMapper the {@link ObjectMapper} used to convert the payload
*/
public DataEndpointMBean(String beanName, Endpoint<?> endpoint,
ObjectMapper objectMapper) {
super(beanName, endpoint, objectMapper);
}
@ManagedAttribute(description = "Invoke the underlying endpoint")
public Object getData() {
return convert(getEndpoint().invoke());
}
}
系本地idea开启了Enable JMX Agent才可以复现
healthEndpoint
spring-boot-admin-server-1.4.6-sources.jar!/de/codecentric/boot/admin/registry/StatusUpdateApplicationListener.java
private long updatePeriod = 10_000L;
public void startStatusUpdate() {
if (scheduledTask != null && !scheduledTask.isDone()) {
return;
}
scheduledTask = taskScheduler.scheduleAtFixedRate(new Runnable() {
@Override
public void run() {
statusUpdater.updateStatusForAllApplications();
}
}, updatePeriod);
LOGGER.debug("Scheduled status-updater task for every {}ms", updatePeriod);
}
spring-boot-admin-server-1.4.6-sources.jar!/de/codecentric/boot/admin/registry/StatusUpdater.java
public void updateStatusForAllApplications() {
long now = System.currentTimeMillis();
for (Application application : store.findAll()) {
if (now - statusLifetime > application.getStatusInfo().getTimestamp()) {
updateStatus(application);
}
}
}
public void updateStatus(Application application) {
StatusInfo oldStatus = application.getStatusInfo();
StatusInfo newStatus = queryStatus(application);
Application newState = Application.create(application).withStatusInfo(newStatus).build();
store.save(newState);
if (!newStatus.equals(oldStatus)) {
publisher.publishEvent(
new ClientApplicationStatusChangedEvent(newState, oldStatus, newStatus));
}
}
protected StatusInfo queryStatus(Application application) {
LOGGER.trace("Updating status for {}", application);
try {
@SuppressWarnings("unchecked")
ResponseEntity<Map<String, Object>> response = restTemplate.getForEntity(
application.getHealthUrl(), (Class<Map<String, Object>>) (Class<?>) Map.class);
LOGGER.debug("/health for {} responded with {}", application, response);
if (response.hasBody() && response.getBody().get("status") instanceof String) {
return StatusInfo.valueOf((String) response.getBody().get("status"));
} else if (response.getStatusCode().is2xxSuccessful()) {
return StatusInfo.ofUp();
} else {
return StatusInfo.ofDown();
}
} catch (Exception ex) {
if ("OFFLINE".equals(application.getStatusInfo().getStatus())) {
LOGGER.debug("Couldn't retrieve status for {}", application, ex);
} else {
LOGGER.warn("Couldn't retrieve status for {}", application, ex);
}
return StatusInfo.ofOffline();
}
}
可以看到这个admin-server注册了个定时任务,定时调用/health
问题分析
ElasticsearchHealthIndicatorConfiguration
spring-boot-actuator-1.4.5.RELEASE-sources.jar!/org/springframework/boot/actuate/autoconfigure/ElasticsearchHealthIndicatorConfiguration.java
class ElasticsearchHealthIndicatorConfiguration {
@Configuration
@ConditionalOnBean(Client.class)
@ConditionalOnEnabledHealthIndicator("elasticsearch")
@EnableConfigurationProperties(ElasticsearchHealthIndicatorProperties.class)
static class ElasticsearchClientHealthIndicatorConfiguration extends
CompositeHealthIndicatorConfiguration<ElasticsearchHealthIndicator, Client> {
private final Map<String, Client> clients;
private final ElasticsearchHealthIndicatorProperties properties;
ElasticsearchClientHealthIndicatorConfiguration(Map<String, Client> clients,
ElasticsearchHealthIndicatorProperties properties) {
this.clients = clients;
this.properties = properties;
}
@Bean
@ConditionalOnMissingBean(name = "elasticsearchHealthIndicator")
public HealthIndicator elasticsearchHealthIndicator() {
return createHealthIndicator(this.clients);
}
@Override
protected ElasticsearchHealthIndicator createHealthIndicator(Client client) {
return new ElasticsearchHealthIndicator(client, this.properties);
}
}
@Configuration
@ConditionalOnBean(JestClient.class)
@ConditionalOnEnabledHealthIndicator("elasticsearch")
static class ElasticsearchJestHealthIndicatorConfiguration extends
CompositeHealthIndicatorConfiguration<ElasticsearchJestHealthIndicator, JestClient> {
private final Map<String, JestClient> clients;
ElasticsearchJestHealthIndicatorConfiguration(Map<String, JestClient> clients) {
this.clients = clients;
}
@Bean
@ConditionalOnMissingBean(name = "elasticsearchHealthIndicator")
public HealthIndicator elasticsearchHealthIndicator() {
return createHealthIndicator(this.clients);
}
@Override
protected ElasticsearchJestHealthIndicator createHealthIndicator(
JestClient client) {
return new ElasticsearchJestHealthIndicator(client);
}
}
}
ElasticsearchHealthIndicator
spring-boot-actuator-1.4.5.RELEASE-sources.jar!/org/springframework/boot/actuate/health/ElasticsearchHealthIndicator.java
@Override
protected void doHealthCheck(Health.Builder builder) throws Exception {
List<String> indices = this.properties.getIndices();
ClusterHealthResponse response = this.client.admin().cluster()
.health(Requests.clusterHealthRequest(indices.isEmpty() ? allIndices
: indices.toArray(new String[indices.size()])))
.actionGet(this.properties.getResponseTimeout());
switch (response.getStatus()) {
case GREEN:
case YELLOW:
builder.up();
break;
case RED:
default:
builder.down();
break;
}
builder.withDetail("clusterName", response.getClusterName());
builder.withDetail("numberOfNodes", response.getNumberOfNodes());
builder.withDetail("numberOfDataNodes", response.getNumberOfDataNodes());
builder.withDetail("activePrimaryShards", response.getActivePrimaryShards());
builder.withDetail("activeShards", response.getActiveShards());
builder.withDetail("relocatingShards", response.getRelocatingShards());
builder.withDetail("initializingShards", response.getInitializingShards());
builder.withDetail("unassignedShards", response.getUnassignedShards());
}
spring-boot-actuator-1.4.5.RELEASE-sources.jar!/org/springframework/boot/actuate/health/ElasticsearchHealthIndicatorProperties.java
@ConfigurationProperties(prefix = "management.health.elasticsearch", ignoreUnknownFields = false)
public class ElasticsearchHealthIndicatorProperties {
/**
* Comma-separated index names.
*/
private List<String> indices = new ArrayList<String>();
/**
* Time, in milliseconds, to wait for a response from the cluster.
*/
private long responseTimeout = 100L;
//......
}
如果有指定indices,则会查询他们的健康情况,比如/index1,index2/_stats,如果没有则查询所有的indices,这个就是个潜在的坑。
ElasticsearchJestHealthIndicator
spring-boot-actuator-1.4.5.RELEASE-sources.jar!/org/springframework/boot/actuate/health/ElasticsearchJestHealthIndicator.java
@Override
protected void doHealthCheck(Health.Builder builder) throws Exception {
JestResult aliases = this.jestClient.execute(new Stats.Builder().build());
JsonElement root = this.jsonParser.parse(aliases.getJsonString());
JsonObject shards = root.getAsJsonObject().get("_shards").getAsJsonObject();
int failedShards = shards.get("failed").getAsInt();
if (failedShards != 0) {
builder.outOfService();
}
else {
builder.up();
}
}
jest最后发出的http请求是/_all/_stats,这个就是问题所在,查询所有_all的统计数据,对于一个大的elasticsearch平台来说,返回的数据是巨大的,将近5000条数据,返回的json纯文件都要20多M,这个再加上定时任务/health查询,导致新生对象不断产生,ygc非常频繁,造成内存泄露的现象。
小结
定时监控应用health是个好东东,但是得注意频率,另外还得关注具体实现,像elasticsearch的这个稍不注意就被坑了,相当于定时产生一定量的垃圾,频率超过垃圾回收的速度,类似内存泄露,给应用gc带来很大的负担。