Monitoring System
Data visualization: Grafana
Data storage: InfluxDB/Prometheus
Data collection: Telegraf/Node Exporter
Grafana
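Grafana's official site offers many dashboards that display the runtime status of operating systems, databases, and applications.
I chose the following dashboards:
System dashboard: https://grafana.com/grafana/dashboards/928
Database dashboard: https://grafana.com/grafana/dashboards/1177
Java application dashboard: https://grafana.com/grafana/dashboards/4701
The system and database dashboards chosen here use InfluxDB as their data source; InfluxDB is usually fed by Telegraf.
The Java application dashboard uses Prometheus as its data source. Prometheus usually scrapes metrics from exporters such as Node Exporter; for Java applications, Micrometer can expose the metrics.
References:
Installing Grafana:
https://grafana.com/docs/grafana/latest/installation/rpm/#install-manually-with-yum
Grafana basics, including creating data sources and dashboards:
https://grafana.com/tutorials/grafana-fundamentals/#1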
InfluxDB
InfluxDB Concepts
Concept | Database | Table | Record | How long data is kept and how many copies | Indexed field | Regular field | Record timestamp |
---|---|---|---|---|---|---|---|
InfluxDB | database | measurement | point | retention policy | tag | field | timestamp |
MySQL | database | table | row | - | indexed column | column | - |
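To make the mapping in the table concrete, here is a minimal sketch that renders one point in InfluxDB line protocol (`measurement,tag=value field=value timestamp`). The class and method names are hypothetical, only numeric fields are handled, and escaping of spaces and commas is left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LineProtocolSketch {

    /** Renders one point as: measurement,tag=value field=value timestamp */
    static String toLineProtocol(String measurement, Map<String, String> tags,
                                 Map<String, Double> fields, long timestamp) {
        // Tags are the indexed key/value pairs; each one is appended after a comma.
        String tagPart = tags.entrySet().stream()
                .map(e -> "," + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining());
        // Fields are the regular (non-indexed) values, separated by commas.
        String fieldPart = fields.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return measurement + tagPart + " " + fieldPart + " " + timestamp;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("location", "santa_monica");
        Map<String, Double> fields = new LinkedHashMap<>();
        fields.put("water_level", 2.064);
        // 1439856000 is 2015-08-18T00:00:00Z expressed in seconds.
        System.out.println(toLineProtocol("h2o_feet", tags, fields, 1439856000L));
        // prints: h2o_feet,location=santa_monica water_level=2.064 1439856000
    }
}
```

The timestamp here is in seconds, so a line like this would have to be written with second precision; line protocol assumes nanoseconds otherwise.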
References:
https://docs.influxdata.com/influxdb/v1.8/concepts/key_concepts/
Sample Data
- Create the database
CREATE DATABASE NOAA_water_database
- Download and import the data
curl https://s3.amazonaws.com/noaa.water-database/NOAA_data.txt -o NOAA_data.txt
influx -import -path=NOAA_data.txt -precision=s -database=NOAA_water_database
- Run test queries
> SHOW measurements
name: measurements
------------------
name
average_temperature
h2o_feet
h2o_pH
h2o_quality
h2o_temperature
> SELECT COUNT("water_level") FROM h2o_feet
name: h2o_feet
--------------
time count
1970-01-01T00:00:00Z 15258
> SELECT * FROM h2o_feet LIMIT 2
name: h2o_feet
--------------
time level description location water_level
2015-08-18T00:00:00Z below 3 feet santa_monica 2.064
2015-08-18T00:00:00Z between 6 and 9 feet coyote_creek 8.12
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/sample-data/
Explore Schema
SHOW DATABASES
SHOW MEASUREMENTS
SHOW TAG KEYS
SHOW FIELD KEYS
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/explore-schema/
Explore Data
- The SELECT statement
SELECT <field_key>[,<field_key>,<tag_key>] FROM <measurement_name>[,<measurement_name>]
- The WHERE clause
SELECT_clause FROM_clause WHERE <conditional_expression> [(AND|OR) <conditional_expression> [...]]
- The GROUP BY clause
SELECT_clause FROM_clause [WHERE_clause] GROUP BY [* | <tag_key>[,<tag_key>]]
- ORDER BY time DESC
- The LIMIT and SLIMIT clauses
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/explore-data/
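As an illustration of how the clauses above combine, the following sketch assembles an InfluxQL statement in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY, LIMIT and builds the URL for the InfluxDB 1.x HTTP `/query` endpoint. The class, the method names, and the base URL `http://localhost:8086` are assumptions made for this example; no request is actually sent.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class InfluxQuerySketch {

    /** Assembles SELECT ... FROM ... [WHERE] [GROUP BY] [ORDER BY time DESC] [LIMIT n]. */
    static String buildQuery(String fieldKey, String measurement, String where,
                             String groupByTag, boolean newestFirst, int limit) {
        StringBuilder q = new StringBuilder();
        q.append("SELECT \"").append(fieldKey).append("\"");
        q.append(" FROM \"").append(measurement).append("\"");
        if (where != null) {
            q.append(" WHERE ").append(where);
        }
        if (groupByTag != null) {
            q.append(" GROUP BY \"").append(groupByTag).append("\"");
        }
        if (newestFirst) {
            q.append(" ORDER BY time DESC");
        }
        if (limit > 0) {
            q.append(" LIMIT ").append(limit);
        }
        return q.toString();
    }

    /** Builds the URL for the /query endpoint; db and q are URL-encoded. */
    static String buildQueryUrl(String baseUrl, String database, String query) {
        return baseUrl + "/query?db=" + URLEncoder.encode(database, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = buildQuery("water_level", "h2o_feet",
                "\"location\" = 'santa_monica'", null, true, 2);
        System.out.println(query);
        // The base URL is a placeholder; adjust it to your own server.
        System.out.println(buildQueryUrl("http://localhost:8086", "NOAA_water_database", query));
    }
}
```

Identifiers are double-quoted and string values single-quoted, which is the quoting InfluxQL expects in a WHERE clause.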
Functions
Aggregations
Selectors
Transformations
References:
https://docs.influxdata.com/influxdb/v1.8/query_language/functions/
Telegraf
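Telegraf collects data and writes it to InfluxDB.
Telegraf can collect system and database metrics; all it needs is a little configuration in /etc/telegraf/telegraf.conf.
When Telegraf writes data, it adds a host tag to every point so that you can tell which source reported it. The host value can be set in telegraf.conf (the hostname option in the [agent] section) or changed by modifying the Linux hostname.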
### OUTPUT
# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
urls = ["http://localhost:8089"]
database = "telegraf_metrics"
## Retention policy to write to. Empty string writes to the default rp.
retention_policy = ""
## Write consistency (clusters only), can be: "any", "one", "quorum", "all"
write_consistency = "any"
## Write timeout (for the InfluxDB client), formatted as a string.
## If not provided, will default to 5s. 0s means no timeout (not recommended).
timeout = "5s"
# Read metrics about cpu usage
[[inputs.cpu]]
## Whether to report per-cpu stats or not
percpu = true
## Whether to report total system cpu stats or not
totalcpu = true
## Comment this line if you want the raw CPU time metrics
fielddrop = ["time_*"]
# Read metrics about disk usage by mount point
[[inputs.disk]]
## By default, telegraf gather stats for all mountpoints.
## Setting mountpoints will restrict the stats to the specified mountpoints.
# mount_points = ["/"]
## Ignore some mountpoints by filesystem type. For example (dev)tmpfs (usually
## present on /run, /var/run, /dev/shm or /dev).
ignore_fs = ["tmpfs", "devtmpfs"]
# Read metrics about disk IO by device
[[inputs.diskio]]
## By default, telegraf will gather stats for all devices including
## disk partitions.
## Setting devices will restrict the stats to the specified devices.
# devices = ["sda", "sdb"]
## Uncomment the following line if you need disk serial numbers.
# skip_serial_number = false
# Get kernel statistics from /proc/stat
[[inputs.kernel]]
# no configuration
# Read metrics about memory usage
[[inputs.mem]]
# no configuration
# Get the number of processes and group them by status
[[inputs.processes]]
# no configuration
# Read metrics about swap memory usage
[[inputs.swap]]
# no configuration
# Read metrics about system load & uptime
[[inputs.system]]
# no configuration
# Read metrics about network interface usage
[[inputs.net]]
# collect data only about specific interfaces
# interfaces = ["eth0"]
[[inputs.netstat]]
# no configuration
[[inputs.mysql]]
servers = ["root:root@tcp(127.0.0.1:3306)/"]
Prometheus
Architecture
Concepts
Concept | Database | Table | Record | How long data is kept and how many copies | Indexed field | Regular field | Record timestamp |
---|---|---|---|---|---|---|---|
Prometheus | - | metric | sample | - | label | value | timestamp |
InfluxDB | database | measurement | point | retention policy | tag | field | timestamp |
MySQL | database | table | row | - | indexed column | column | - |
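Differences between Prometheus and InfluxDB:
A record of a Prometheus metric consists of several labels plus one value, and metrics come in the types Counter, Gauge, Histogram, and Summary. InfluxDB measurements do not distinguish these types.
Prometheus pulls (scrapes) data from its targets, whereas data is pushed to InfluxDB.
A Prometheus record usually carries a single value. Taking CPU metrics as an example, an InfluxDB measurement has 3 fields [usage_idle, usage_system, usage_user] and 1 record [97, 2, 1], while the Prometheus metric has 1 label [mode] and 3 records ['idle', 97], ['system', 2], ['user', 1].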
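The CPU comparison above can be sketched in a few lines. The code renders the same readings once as a single InfluxDB point with three fields and once as three Prometheus samples that differ only in the mode label. The class name and the metric name `cpu_usage` are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CpuModelSketch {

    /** InfluxDB style: every CPU mode becomes a field of one point. */
    static String asInfluxPoint(Map<String, Integer> usageByMode) {
        List<String> fields = new ArrayList<>();
        usageByMode.forEach((mode, value) -> fields.add("usage_" + mode + "=" + value));
        return "cpu " + String.join(",", fields);
    }

    /** Prometheus style: one sample per mode, told apart by the "mode" label. */
    static List<String> asPrometheusSamples(Map<String, Integer> usageByMode) {
        List<String> samples = new ArrayList<>();
        usageByMode.forEach((mode, value) ->
                samples.add("cpu_usage{mode=\"" + mode + "\"} " + value));
        return samples;
    }

    public static void main(String[] args) {
        Map<String, Integer> usage = new LinkedHashMap<>();
        usage.put("idle", 97);
        usage.put("system", 2);
        usage.put("user", 1);
        System.out.println(asInfluxPoint(usage));
        // prints: cpu usage_idle=97,usage_system=2,usage_user=1
        asPrometheusSamples(usage).forEach(System.out::println);
        // prints: cpu_usage{mode="idle"} 97
        //         cpu_usage{mode="system"} 2
        //         cpu_usage{mode="user"} 1
    }
}
```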
References:
https://prometheus.io/docs/concepts/metric_types/
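Querying Data
Prometheus is queried through its web UI; the default address is http://your_host:9090.
Instances to scrape are added in ${Prometheus_home}/prometheus.yml. The up metric shows the state of every instance (1 when the last scrape succeeded, 0 when it failed).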
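Besides the web UI, the same check can be made through the Prometheus HTTP API. The sketch below only builds the instant-query URL for `/api/v1/query` and interprets an `up` value; the class and method names are hypothetical, and `http://your_host:9090` is the same placeholder used above. No request is sent.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PrometheusUpSketch {

    /** Builds the instant-query URL; the PromQL expression is URL-encoded. */
    static String instantQueryUrl(String baseUrl, String promql) {
        return baseUrl + "/api/v1/query?query="
                + URLEncoder.encode(promql, StandardCharsets.UTF_8);
    }

    /** The up metric is 1 when the last scrape succeeded and 0 when it failed. */
    static String describeUp(double value) {
        return value == 1.0 ? "up" : "down";
    }

    public static void main(String[] args) {
        // All instances:
        System.out.println(instantQueryUrl("http://your_host:9090", "up"));
        // Only the instances of one job (the job name is an example):
        System.out.println(instantQueryUrl("http://your_host:9090", "up{job=\"prometheus\"}"));
        System.out.println(describeUp(1.0));
    }
}
```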
References:
https://prometheus.io/docs/prometheus/latest/querying/examples/
Micrometer
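Micrometer collects metrics from Java applications and can adapt to most mainstream monitoring systems, such as Prometheus and InfluxDB. It is somewhat like SLF4J, which adapts to many logging systems, except that Micrometer targets application metrics.
Using Spring to expose metrics to Prometheus: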
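// Package names assume Micrometer before 1.13, the javax servlet API, and Lombok.
import java.io.IOException;
import javax.annotation.PostConstruct;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import io.micrometer.core.instrument.binder.jvm.ClassLoaderMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
import io.micrometer.core.instrument.binder.system.ProcessorMetrics;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import lombok.Getter;

@Controller
@RequestMapping(value = "/prometheus")
public class PrometheusController {
    @Getter
    private PrometheusMeterRegistry registry;

    @PostConstruct
    private void init() {
        // PrometheusConfig.DEFAULT returns null for every key, so the built-in defaults apply.
        this.registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        this.registry.config().commonTags("application", "myAppName");
        new ClassLoaderMetrics().bindTo(this.registry);
        new JvmMemoryMetrics().bindTo(this.registry);
        new JvmGcMetrics().bindTo(this.registry);
        new ProcessorMetrics().bindTo(this.registry);
        new JvmThreadMetrics().bindTo(this.registry);
    }

    @RequestMapping(method = { RequestMethod.GET, RequestMethod.POST })
    public void index(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Write the metrics in the Prometheus text exposition format.
        resp.getWriter().write(registry.scrape());
        resp.getWriter().flush();
    }
}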
References: