Spark 应用监控告警和自动重启

Spark on yarn 执行流计算时,如果流挂了,没有提醒会导致实时指标计算停滞,为了保证流的7/24运行,需要有一个能监控Spark on yarn上的应用,实现失败重启、失败告警,同时能展示Spark应用的相关指标,实现数据质量的监控和管理。

一、指标监控模块

使用Prometheus、graphite_exporter、Grafana实现Spark应用指标监控。

1.Spark配置Graphite metrics

spark 是自带 Graphite Sink 的,只需要配置一下metrics.properties:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink  
*.sink.graphite.protocol=tcp 
*.sink.graphite.host=127.0.0.1 
*.sink.graphite.port=9109 
*.sink.graphite.period=5 
*.sink.graphite.unit=seconds 
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource 
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource

提交时记得使用 --files /path/to/spark/conf/metrics.properties 参数将配置文件分发到所有的 Executor。

2.安装启动graphite_exporter

prometheus 提供了一个插件(graphite_exporter),可以将 Graphite metrics 进行转化并写入 Prometheus (本文的方式),

image.png
  • 启动 graphite_exporter 时加载配置文件

./graphite_exporter --graphite.mapping-config=graphite_exporter_mapping

graphite_exporter_mapping:

mappings:
- match: '*.*.executor.filesystem.*.*'
  name: filesystem_usage
  labels:
    application: $1
    executor_id: $2
    fs_type: $3
    qty: $4

- match: '*.*.jvm.*.*'
  name: jvm_memory_usage
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4

- match: '*.*.executor.jvmGCTime.count'
  name: jvm_gcTime_count
  labels:
    application: $1
    executor_id: $2

- match: '*.*.jvm.pools.*.*'
  name: jvm_memory_pools
  labels:
    application: $1
    executor_id: $2
    mem_type: $3
    qty: $4

- match: '*.*.executor.threadpool.*'
  name: executor_tasks
  labels:
    application: $1
    executor_id: $2
    qty: $3

- match: '*.*.BlockManager.*.*'
  name: block_manager
  labels:
    application: $1
    executor_id: $2
    type: $3
    qty: $4

- match: DAGScheduler.*.*
  name: DAG_scheduler
  labels:
    type: $1
    qty: $2

3.配置 Prometheus和Grafana

修改/path/to/prometheus/prometheus.yml,增加

scrape_configs: - job_name: 'spark' static_configs: - targets: ['localhost:9108']

配置Grafana并加载Json配置:

spark_prometheus.json

{ "id": 1, "title": "Spark Prometheus", "originalTitle": "Spark Prometheus", "tags": [], "style": "dark", "timezone": "browser", "editable": true, "hideControls": false, "sharedCrosshair": false, "rows": [ { "collapse": false, "editable": true, "height": "250px", "panels": [ { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 0, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": 0, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 1, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "total", "color": "#F2C96D", "linewidth": 7, "yaxis": 2, "zindex": 3 } ], "span": 4, "stack": false, "steppedLine": false, "targets": [ { "expr": "filesystem_usage{exported_job=\"read_bytes\", fs_type=\"hdfs\", application=\"$application_ID\"}", "intervalFactor": 2, "legendFormat": "{{executor_id}}", "metric": "filesystem_usage", "refId": "A", "step": 2, "target": "" }, { "expr": "sum(filesystem_usage{exported_job=\"read_bytes\", fs_type=\"hdfs\", application=\"$application_ID\"})", "intervalFactor": 2, "legendFormat": "total", "metric": "filesystem_usage", "refId": "B", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Executor HDFS reads", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] }, { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 1, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": null, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 2, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": false, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "span": 4, "stack": true, "steppedLine": false, "targets": [ { "expr": "filesystem_usage{exported_job=\"write_bytes\", fs_type=\"hdfs\", application=\"$application_ID\"}", "intervalFactor": 2, "legendFormat": "{{executor_id}}", "metric": "filesystem_usage", "refId": "A", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Executor HDFS writes", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] }, { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 1, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": null, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 3, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "total", "yaxis": 2 } ], "span": 4, "stack": true, "steppedLine": false, "targets": [ { "expr": "avg(rate(filesystem_usage{exported_job=\"read_bytes\", fs_type=\"hdfs\", application=\"$application_ID\"}[1m]))", "intervalFactor": 1, "legendFormat": "average", "metric": "filesystem_usage", "refId": "A", "step": 1, "target": "" }, { "expr": "sum(rate(filesystem_usage{exported_job=\"read_bytes\", fs_type=\"hdfs\", application=\"$application_ID\"}[1m]))", "intervalFactor": 1, "legendFormat": "total", "metric": "filesystem_usage", "refId": "B", "step": 1, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "HDFS Read Rate / s", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] } ], "title": "HDFS stats" }, { "collapse": false, "editable": true, "height": "250px", "panels": [ { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 0, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": 0, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 4, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "span": 6, "stack": false, "steppedLine": false, "targets": [ { "expr": "jvm_memory_usage{mem_type=\"heap\", qty=\"usage\", application=\"$application_ID\", executor_id=\"driver\"}", "intervalFactor": 2, "legendFormat": "{{executor_id}}", "metric": "jvm_memory_usage", "refId": "A", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Driver Heap Usage", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] }, { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 1, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": null, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 5, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "span": 6, "stack": false, "steppedLine": false, "targets": [ { "expr": "jvm_memory_pools{executor_id=\"driver\", application=\"$application_ID\",qty=\"used\"}", "intervalFactor": 2, "legendFormat": "{{mem_type}}", "metric": "jvm_memory_pools", "refId": "A", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Driver JVM Memory Pools", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] }, { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 0, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": null, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 6, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": false, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "span": 4, "stack": false, "steppedLine": false, "targets": [ { "expr": "jvm_memory_usage{mem_type=\"heap\", qty=\"usage\", application=\"$application_ID\", executor_id!~\"driver\"} ", "intervalFactor": 2, "legendFormat": "{{executor_id}}", "metric": "jvm_memory_usage", "refId": "A", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Executor Heap Usage", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] }, { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 0, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": null, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 7, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": false, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "span": 4, "stack": false, "steppedLine": false, "targets": [ { "expr": "jvm_memory_pools{executor_id!~\"driver\", application=\"$application_ID\",qty=\"usage\", mem_type=\"PS-Eden-Space\"}", "intervalFactor": 2, "legendFormat": "{{executor_id}}", "metric": "jvm_memory_pools", "refId": "A", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Executor Eden-Space", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] }, { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 0, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": null, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 9, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [], "span": 4, "stack": false, "steppedLine": false, "targets": [ { "expr": "jvm_memory_pools{executor_id!~\"driver\", application=\"$application_ID\",qty=\"usage\", mem_type=\"PS-Old-Gen\"}", "intervalFactor": 2, "legendFormat": "{{executor_id}}", "metric": "jvm_memory_pools", "refId": "A", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Executor Old-Gen", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] }, { "aliasColors": {}, "bars": false, "datasource": "Prometheus", "editable": true, "error": false, "fill": 1, "grid": { "leftLogBase": 1, "leftMax": null, "leftMin": null, "rightLogBase": 1, "rightMax": null, "rightMin": null, "threshold1": null, "threshold1Color": "rgba(216, 200, 27, 0.27)", "threshold2": null, "threshold2Color": "rgba(234, 112, 112, 0.22)" }, "id": 10, "legend": { "avg": false, "current": false, "max": false, "min": false, "show": true, "total": false, "values": false }, "lines": true, "linewidth": 2, "links": [], "nullPointMode": "connected", "percentage": false, "pointradius": 5, "points": false, "renderer": "flot", "seriesOverrides": [ { "alias": "completeTasks", "yaxis": 2 } ], "span": 12, "stack": false, "steppedLine": false, "targets": [ { "expr": "sum(executor_tasks{application=\"$application_ID\", qty!~\"maxPool_size\"}) by (qty)", "intervalFactor": 2, "legendFormat": "{{qty}}", "refId": "B", "step": 2, "target": "" } ], "timeFrom": null, "timeShift": null, "title": "Tasks", "tooltip": { "shared": true, "value_type": "cumulative" }, "type": "graph", "x-axis": true, "y-axis": true, "y_formats": [ "short", "short" ] } ], "title": "New row" } ], "time": { "from": "now-5m", "to": "now" }, "timepicker": { "now": true, "refresh_intervals": [ "5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d" ], "time_options": [ "5m", "15m", "1h", "6h", "12h", "24h", "2d", "7d", "30d" ] }, "templating": { "list": [ { "allFormat": "glob", "current": { "text": "application_1564363240533_7587", "value": "application_1564363240533_7587" }, "datasource": null, "includeAll": false, "multi": false, "multiFormat": "glob", "name": "application_ID", "options": [ { "text": "application_1564363240533_7587", "value": "application_1564363240533_7587", "selected": false }, { "text": "application_1564363240533_7592", "value": "application_1564363240533_7592", "selected": false }, { "text": "application_1564363240533_7615", "value": "application_1564363240533_7615", "selected": false } ], "query": "label_values(application)", "refresh": true, "refresh_on_load": false, "type": "query" } ] }, "annotations": { "list": [] }, "refresh": "5s", "schemaVersion": 8, "version": 0, "links": [] }

二、进程监控失败重启和告警模块

监控yarn上指定的Spark应用是否存在,不存在则发出告警。

使用Python脚本查看yarn状态,指定监控应用,应用中断则通过webhook发送报警信息到钉钉群,并且自动重启。

#!/usr/bin/python3.5
# -*- coding: utf-8 -*-
import os
import json
import requests

'''
Yarn应用监控:当配置的应用名不在yarn applicaition -list时,钉钉告警
'''


def yarn_list(applicatin_list):
    yarn_application_list = os.popen('yarn application -list').read()
    result = ""
    for appName in applicatin_list:
        if appName in yarn_application_list:
            print("应用:%s 正常!" % appName)
        else:
            result += ("告警--应用:%s 中断!" % appName)
            if "应用名1" == appName:
                os.system('重启命令')

    return result


def dingding_robot(data):
    # 机器人的webhooK 获取地址参考:https://open-doc.dingtalk.com/microapp/serverapi2/qf2nxq
    webhook = "https://oapi.dingtalk.com/robot/send?access_token" \
              "=你的token "
    headers = {'content-type': 'application/json'}  # 请求头
    r = requests.post(webhook, headers=headers, data=json.dumps(data))
    r.encoding = 'utf-8'
    return r.text


if __name__ == '__main__':
    applicatin_list = ["应用名1", "应用名2", "应用名3"]
    output = yarn_list(applicatin_list)
    print(output)

    if len(output) > 0:
        # 请求参数 可以写入配置文件中
        data = {
            "msgtype": "text",
            "text": {
                "content": output
            },
            "at": {
                "atMobiles": [
                    "xxxxxxx"
                ],
                "isAtAll": False
            }
        }
        res = dingding_robot(data)
        print(res)  # 打印请求结果
    else:
        print("一切正常!")

三、指标监控告警

TBC:使用Prometheus的AlertManager模块实现指标的监控。

最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 217,734评论 6 505
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,931评论 3 394
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 164,133评论 0 354
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,532评论 1 293
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,585评论 6 392
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,462评论 1 302
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 40,262评论 3 418
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 39,153评论 0 276
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,587评论 1 314
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,792评论 3 336
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,919评论 1 348
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,635评论 5 345
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 41,237评论 3 329
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,855评论 0 22
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,983评论 1 269
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 48,048评论 3 370
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,864评论 2 354