1. Purpose
Build the stack end to end to get familiar with a currently popular monitoring system.
2. Environment
A virtual machine.
3. Setup, configuration, and debugging
1. Prometheus package downloads: https://prometheus.io/download/
2. Grafana downloads: https://grafana.com/grafana/download
3. Installation
(1) Download, unpack, and install Prometheus (tutorials are easy to find online; omitted in these notes); then configure and start it.
prometheus.yml:
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

# Alertmanager configuration
# - job_name: 'Alertmanager'
#   static_configs:
#     - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# - "first_rules.yml"
# - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
*Note: targets uses localhost instead of the server's IP on purpose: a hard-coded IP breaks the config whenever the server's IP changes.*
Start Prometheus: from the install directory, run ./prometheus --config.file=prometheus.yml &
netstat -tpln shows port 9090 listening; Prometheus is now reachable at ip:9090.
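Besides checking the port, Prometheus exposes a plain HTTP health endpoint on its listen port. A small polling helper (a sketch; the URL and retry counts are assumptions to adjust) can gate follow-up steps in a setup script:

```shell
#!/bin/sh
# wait_for <url> [tries] -- poll an HTTP endpoint once a second until it
# answers; prints "up" or "down". Prometheus serves /-/healthy by default.
wait_for() {
  url=$1; tries=${2:-30}; i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -sf --max-time 2 "$url" >/dev/null 2>&1; then
      echo "up"; return 0
    fi
    i=$((i+1)); sleep 1
  done
  echo "down"; return 1
}

status=$(wait_for "http://localhost:9090/-/healthy" 5 || true)
echo "prometheus: $status"
```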
(2) Install and start node_exporter (tutorials are easy to find online; omitted in these notes) and register it with Prometheus.
Start node_exporter: from the install directory, run ./node_exporter &
netstat -tpln shows port 9100 listening.
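Before wiring it into Prometheus, a quick smoke test confirms that node_exporter is actually serving metrics; it returns plain text over HTTP (a sketch; the default port 9100 is assumed):

```shell
#!/bin/sh
# check_exporter <url> -- count the node_* metric lines served,
# or report the endpoint as unreachable.
check_exporter() {
  metrics=$(curl -sf --max-time 2 "$1" 2>/dev/null || true)
  if [ -n "$metrics" ]; then
    echo "$metrics" | grep -c '^node_'
  else
    echo "unreachable: $1"
  fi
}

check_exporter "http://localhost:9100/metrics"
```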
Update the Prometheus config, restart Prometheus, and check at ip:9090 that the node_exporter target is registered and UP.
prometheus.yml:
# my global config
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
# - job_name: 'Alertmanager'
#   static_configs:
#     - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# - "first_rules.yml"
# - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_self'
    scheme: http
    #tls_config:
    #  ca_file: node_exporter.crt
    static_configs:
      - targets: ['localhost:9100']
Restart Prometheus; on ip:9090 the Targets page should now show the node_self job as UP.
(3) Install and configure alertmanager plus prometheus-webhook-dingtalk to collect alerts and push them to DingTalk; then update the Prometheus config to connect to Alertmanager and add the alert rule file rules.yml.
Install and start prometheus-webhook-dingtalk.
Start prometheus-webhook-dingtalk from its install directory:
nohup ./prometheus-webhook-dingtalk --ding.profile="ops_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx" --ding.profile="dev_dingding=https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx2" >dingding.log 2>&1 &
Note: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx and https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxx2 are robot endpoints you create in DingTalk yourself. The webhook can be started with several robots at once; each --ding.profile name must be unique (one is ops_dingding, the other dev_dingding).
Once started, it listens on port 8060 by default.
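The robot token itself can be verified independently of the webhook bridge by POSTing a text message straight to the DingTalk robot API (a sketch; the token is a placeholder, and depending on the robot's security setting the content may need to include the configured keyword):

```shell
#!/bin/sh
# Build the standard DingTalk "text" message body; POSTing it to the robot URL
# makes the robot speak in the group, which proves the token works.
payload='{"msgtype": "text", "text": {"content": "monitor: connectivity test"}}'
echo "$payload"
# curl -s -H 'Content-Type: application/json' -d "$payload" \
#   "https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx"
```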
(4) Configure alertmanager.yml and start the alertmanager service.
alertmanager.yml:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - receiver: 'test.yaya'
      match:
        priority: P0
      continue: true
    - receiver: 'web.hook'
      match:
        priority: P0
      continue: true
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:8060/dingtalk/ops_dingding/send'
  - name: 'test.yaya'
    webhook_configs:
      - url: 'http://127.0.0.1:8060/dingtalk/dev_dingding/send'
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']
*Note: if the test.yaya and web.hook routes lacked continue: true, matching would stop at the first route that matches and the later routes would never fire. Each url points at a prometheus-webhook-dingtalk endpoint, one per robot: ops_dingding and dev_dingding.*
From the install directory, run ./alertmanager --config.file=alertmanager.yml &; port 9093 listening confirms the service started normally.
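To exercise the route → webhook → DingTalk chain without waiting for a real rule to fire, an alert can be injected by hand through Alertmanager's v2 API (a sketch; the alert name is illustrative, and the priority label matches the routes defined above):

```shell
#!/bin/sh
# A minimal alert payload for POST /api/v2/alerts; priority=P0 matches
# both routes in alertmanager.yml, so both DingTalk robots should fire.
alert='[{"labels": {"alertname": "ManualTest", "priority": "P0"},
        "annotations": {"summary": "routing test"}}]'
echo "$alert"
# curl -s -H 'Content-Type: application/json' -d "$alert" \
#   http://localhost:9093/api/v2/alerts
```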
Point prometheus.yml at Alertmanager:
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_self'
    scheme: http
    #tls_config:
    #  ca_file: node_exporter.crt
    static_configs:
      - targets: ['localhost:9100']
*Note: rule_files names the alert-rule file; a relative path resolves against the Prometheus install directory.*
rules.yml:
groups:
  - name: "service alert test"
    rules:
      - alert: "memory usage alert"
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 40
        for: 1m
        labels:
          #token: {{ .Values.prometheus.prometheusSpec.externalLabels.env }}-bigdata
          priority: P0
          status: alerting
        annotations:
          description: "bigdata alert: IP address {{$labels.instance}} memory usage above 40% (currently {{$value}}%)"
          summary: "bigdata alert: memory usage above 40% (currently {{$value}}%)"
*Note: node_memory_MemAvailable_bytes and the other series in rules.yml are metrics collected by node_exporter; search online or see the node_exporter docs for the full list.*
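The expression can be sanity-checked on the node itself, since node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes come straight from /proc/meminfo (Linux only); promtool, shipped alongside Prometheus, also validates the rule file before a restart:

```shell
#!/bin/sh
# Mirror the alert expression 100 - (MemAvailable * 100 / MemTotal) locally,
# using the same kernel counters node_exporter exports.
used_pct=$(awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} \
  END {printf "%.1f", 100 - a*100/t}' /proc/meminfo)
echo "memory used: ${used_pct}%"
# ./promtool check rules rules.yml   # run from the Prometheus install directory
```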
Restart Prometheus.
Open ip:9090 in a browser.
When the alert goes from pending to firing and the alert message arrives in the DingTalk group, everything is working.
(5) Install Grafana and graph the node_exporter and pushgateway metrics.
Install Grafana (guides are easy to find online); start it with systemctl start grafana-server.
The initial login is admin/admin.
After installation, add Prometheus as a data source, then import a node_exporter dashboard JSON into Grafana; this yields host-level graphs from the node_exporter data out of the box.
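Adding the data source can also be scripted through Grafana's HTTP API instead of the UI (a sketch; the host, admin credentials, and Prometheus URL are assumptions matching the defaults above):

```shell
#!/bin/sh
# Data-source definition for Grafana's POST /api/datasources endpoint;
# basic auth with the initial admin account shown above.
ds='{"name": "prometheus", "type": "prometheus", "url": "http://localhost:9090", "access": "proxy"}'
echo "$ds"
# curl -s -u admin:admin -H 'Content-Type: application/json' \
#   -d "$ds" http://localhost:3000/api/datasources
```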
Next, feed custom metrics through the pushgateway and graph them.
1. Install pushgateway and start the service; it listens on port 9091.
Custom collection scripts gather the data and write it to the pushgateway:
#!/bin/bash
# Compute the memory *used* percentage and push it to the Pushgateway.
avl=$(free -m | grep Mem | awk '{print $NF}')    # available, MB (last column of the Mem line)
total=$(free -m | grep Mem | awk '{print $2}')   # total, MB
sum=$(printf "%.3f" "$(echo "scale=5;${avl}/${total}" | bc)")
res=$(echo "(1 - ${sum}) * 100" | bc)
echo "Mem_use_persent ${res}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/wx_job
jk_disk.sh:

#!/bin/bash
# Push the root-filesystem usage percentage to the Pushgateway.
res=$(df -h | grep -E "/$" | awk '{print $5}' | awk -F"%" '{print $1}')
echo "disk_jk_use_persent ${res}" | curl --data-binary @- http://127.0.0.1:9091/metrics/job/jk_disk_use
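The Pushgateway groups what it stores by the URL path: segments after the job name become labels. A small helper (a sketch; the job and instance names are illustrative) makes the grouping explicit and doubles as the handle for deleting stale groups later:

```shell
#!/bin/sh
# push_url <job> [instance] -- build the Pushgateway target URL;
# each extra path segment pair becomes a label on the pushed series.
push_url() {
  url="http://127.0.0.1:9091/metrics/job/$1"
  [ -n "$2" ] && url="$url/instance/$2"
  echo "$url"
}

push_url wx_job test04
# echo "Mem_use_persent 37.5" | curl --data-binary @- "$(push_url wx_job test04)"
# curl -X DELETE "$(push_url wx_job test04)"   # drop the group when the host is retired
```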
Point prometheus.yml at the pushgateway and restart Prometheus:
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_self'
    scheme: http
    #tls_config:
    #  ca_file: node_exporter.crt
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['localhost:9091']
        labels:
          instance: pushgateway
Log in to Grafana and build the panel.
*Note: pick the right data source and enter the metric names exactly as pushed.*
Save the configuration and you are done.