Alertmanager configuration

大番茄, December 18, 2019

1. Prometheus rules

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Alerting rule configuration.

The official example:

groups:
- name: example
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"

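groups:

Within one group, no two rules may have both the same alert name and the same labels.
Putting such rules in different groups avoids the conflict, but when the resulting alerts have identical labels (including the alert name), only one rule's alert is actually delivered.
Groups are mostly a way to categorize rules: for example, one group for hosts and one for containers, or one group per department. They have little to do with how an individual rule is configured.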
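The deduplication described above can be pictured with a small Python sketch. It assumes, as a simplification, that an alert's identity is exactly its label set (including alertname); the helper names and the sample alerts are hypothetical and are not taken from Prometheus or Alertmanager code.

```python
def alert_identity(labels):
    """Identity of an alert: its full label set, independent of ordering."""
    return frozenset(labels.items())


def dedupe(alerts):
    """Keep one alert per distinct label set; a later duplicate replaces an earlier one."""
    unique = {}
    for alert in alerts:
        unique[alert_identity(alert["labels"])] = alert
    return list(unique.values())


# Two rules in different groups produce alerts with identical labels,
# a third differs only in severity.
alerts = [
    {"labels": {"alertname": "InstanceDown", "instance": "a:9100", "severity": "crit"}, "group": "duty"},
    {"labels": {"alertname": "InstanceDown", "instance": "a:9100", "severity": "crit"}, "group": "dept"},
    {"labels": {"alertname": "InstanceDown", "instance": "a:9100", "severity": "warning"}, "group": "dept"},
]

print(len(dedupe(alerts)))  # 2
```

Giving each rule at least one distinguishing label, such as a different severity or author, keeps the alerts distinct.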

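rules:

The list of alerting rules.

alert:

The name of the alerting rule.

expr:

A PromQL expression that defines the alert condition (the threshold).
PromQL itself is simple to write, but it has many functions worth learning, more than can be covered here.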

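for:

The wait time. Once the expression first evaluates to true, the alert enters the Pending state and nothing is sent to Alertmanager. Only when the expression has stayed true for at least the for duration does the alert switch to Firing and get sent to Alertmanager. for: 0s means the alert fires as soon as the expression is true, with no waiting.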
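The Pending and Firing transition can be sketched in Python as follows. This is only a model of the behavior described above, not Prometheus code; the function name and its arguments are hypothetical, and times are plain seconds.

```python
def alert_state(active_since, now, for_seconds):
    """Return the state of one alert at evaluation time `now`.

    active_since: time at which expr became true and has stayed true since,
                  or None if expr is currently false.
    for_seconds:  the rule's `for` duration in seconds.
    """
    if active_since is None:
        return "inactive"
    if now - active_since >= for_seconds:
        return "firing"   # sent to Alertmanager
    return "pending"      # still waiting, nothing is sent


print(alert_state(None, 100, 300))  # inactive
print(alert_state(0, 60, 300))      # pending
print(alert_state(0, 300, 300))     # firing
print(alert_state(0, 0, 0))         # firing (for: 0s)
```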

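labels

Extra labels attached to the alert sent to Alertmanager. Custom labels can be added here and used for decisions on the Alertmanager side, such as routing.

annotations

The body of the alert. The names summary and description have no special meaning; their contents simply appear in the notification.
$labels and $value are built-in variables: $labels.instance is the value of the instance label, and $value is the sample value computed by the PromQL in expr. The official example does not show this clearly, so consider this one: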

- name: container memory is too high
  rules:
  - alert: container memory is too high
    expr: container_memory_usage_bytes{namespace="qfpay"} / 1024 / 1024 > 500
    for: 1m

Here $value is the value of container_memory_usage_bytes{namespace="qfpay"} / 1024 / 1024.
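As a quick check of the arithmetic, the Python snippet below converts a made-up sample of 600 MiB in bytes the same way the expression does. The function name and the sample value are illustrative only.

```python
def memory_mib(usage_bytes):
    """Convert bytes to MiB the same way the rule does: / 1024 / 1024."""
    return usage_bytes / 1024 / 1024


usage = 600 * 1024 * 1024      # hypothetical sample: 600 MiB expressed in bytes
value = memory_mib(usage)      # this is what $value would hold
print(value, value > 500)      # 600.0 True
```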

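There is also $externalLabels, which holds the labels set by the external_labels parameter in the main configuration file.

The alert text is rendered with Go templates. The examples above use only the simplest form; I don't know Go templates well, so I stick to what is sufficient.
The same applies to the email subject and body in Alertmanager: I tried customizing them several times and failed each time.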


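Configuration files

These are some alerting rules I have added. They are incomplete and mainly serve as a record. The thresholds are arbitrary and need tuning against real conditions.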

prometheus.yml:

rule_files:
   - "rules/common.yml"
   - "rules/physicals.yml"
   - "rules/containers.yml"
   - "rules/kubernetes.yml"
   - "rules/coredns.yml"
   - "rules/prometheus.yml"
common.yml
[root@k8s-op prometheus]# cat rules/common.yml
groups:
- name: send to on-duty staff
  rules:
  - alert: instance is down
    expr: up == 0
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.job }} is unreachable"
      description: "{{ $labels.instance }} {{ $labels.job }} is unreachable"

  - alert: scrape duration > 1s
    expr: scrape_duration_seconds{job!~"mariadb.*"} > 1
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.job }} scraping the metrics endpoint took more than 1s"
      description: "{{ $labels.instance }} {{ $labels.job }} scraping the metrics endpoint took more than 1s. value: {{ $value }}"


- name: send to department
  rules:
  - alert: instance is down
    expr: up == 0
    for: 1m
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.job }} is unreachable"
      description: "{{ $labels.instance }} {{ $labels.job }} is unreachable"

physicals.yml
[root@k8s-op prometheus]# cat rules/physicals.yml
groups:
- name: for on-duty staff
  rules:
  - alert: filesystem is readonly
    expr: node_filesystem_readonly{fstype=~"ext4|xfs"} != 0
    for: 10s
    labels:
      severity: error
      author: duty
    annotations:
      summary: "Instance {{ $labels.instance }} filesystem is readonly"
      description: "{{ $labels.instance }} {{ $labels.mountpoint }} filesystem is readonly"

  - alert: free disk space < 25%
    expr: (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) < 25
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} free disk space below 25%."
      description: "{{ $labels.instance }} partition {{ $labels.mountpoint }} has less than 25% free space"

# Averaged over all CPUs, so the total is 100%.
  - alert: cpu usage > 70%
    expr: (100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) * 100)) > 70
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} cpu usage > 70%"
      description: "{{ $labels.instance }} cpu usage > 70%, current usage {{ $value }}"

  - alert: available memory < 512M
    expr: node_memory_MemAvailable_bytes / 1024 / 1024 < 512
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} available memory below 512Mi"
      description: "{{ $labels.instance }} available memory below 512Mi, current value {{ $value }}"



- name: for department
  rules:
  - alert: filesystem is readonly
    expr: node_filesystem_readonly{fstype=~"ext4|xfs"} != 0
    for: 10m
    labels:
      severity: crit
      author: op
    annotations:
      summary: "Instance {{ $labels.instance }} filesystem is readonly"
      description: "{{ $labels.instance }} {{ $labels.mountpoint }} filesystem is readonly"
containers.yml
[root@k8s-op prometheus]# cat rules/containers.yml
groups:
- name: 发送给值班人员
  rules:
  - alert: container memory is too high
    expr: container_memory_usage_bytes{namespace="test"} / 1024 / 1024 > 500
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "Service {{ $labels.container }} memory usage > 500Mi, current value: {{ $value }} Mi"
kubernetes.yml
[root@k8s-op prometheus]# cat rules/kubernetes.yml
groups:
- name: for on-duty staff
  rules:
  - alert: k8s-apiserver memory usage > 5Gi
    expr: process_resident_memory_bytes{job="k8s-apiserver"} / 1024 / 1024 > 5120
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-apiserver memory usage above 5Gi"
      description: "{{ $labels.instance }} k8s-apiserver memory usage above 5Gi, value: {{ $value }}"

  - alert: k8s-kubelet memory usage > 1Gi
    expr: process_resident_memory_bytes{job="k8s-node-kubelet"} / 1024 / 1024 > 1024
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-kubelet memory usage above 1Gi"
      description: "{{ $labels.instance }} k8s-kubelet memory usage above 1Gi, value: {{ $value }}"

  - alert: k8s-proxy memory usage > 1Gi
    expr: process_resident_memory_bytes{job="k8s-proxy"} / 1024 / 1024 > 1024
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-proxy memory usage above 1Gi"
      description: "{{ $labels.instance }} k8s-proxy memory usage above 1Gi, value: {{ $value }}"

# CPU here is measured out of N00%, where N is the number of CPU cores.
  - alert: k8s-apiserver CPU usage > 80%
    expr: rate(process_cpu_seconds_total{job="k8s-apiserver"}[2m]) * 100 > 80
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-apiserver cpu usage above 80%"
      description: "{{ $labels.instance }} k8s-apiserver cpu usage above 80%, value: {{ $value }}"

  - alert: k8s-kubelet CPU usage > 35%
    expr: rate(process_cpu_seconds_total{job="k8s-node-kubelet"}[2m]) * 100 > 35
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-kubelet cpu usage above 35%"
      description: "{{ $labels.instance }} k8s-kubelet cpu usage above 35%, value: {{ $value }}"

  - alert: k8s-proxy CPU usage > 15%
    expr: rate(process_cpu_seconds_total{job="k8s-proxy"}[2m]) * 100 > 15
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} kube-proxy cpu usage above 15%"
      description: "{{ $labels.instance }} kube-proxy cpu usage above 15%, value: {{ $value }}"

  - alert: client certificate expires within two days
    expr: apiserver_client_certificate_expiration_seconds_bucket{job="k8s-apiserver",le="172800"} != 0
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "a client certificate is about to expire"
      description: "{{ $value }} client certificates expire within two days"

  - alert: client certificate expires within seven days
    expr: apiserver_client_certificate_expiration_seconds_bucket{job="k8s-apiserver",le="604800"} != 0
    for: 0s
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "a client certificate is about to expire"
      description: "{{ $value }} client certificates expire within seven days"

  - alert: apiserver requests with abnormal status codes
    expr: sum(delta(apiserver_request_count{code!='0',code!~'2..'}[1m])) by (client, code) > 50
    for: 10m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "abnormal apiserver access"
      description: "Abnormal apiserver access detected. value: {{ $value }}; client: {{ $labels.client }}; code: {{ $labels.code }}"
coredns.yml
[root@k8s-op prometheus]# cat rules/coredns.yml
groups:
- name: send to on-duty staff
  rules:
  - alert: CoreDNS requests per second > 50
    expr: rate(coredns_dns_request_duration_seconds_count[2m]) > 50
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} requests per second > 50"
      description: "CoreDNS on {{ $labels.instance }} receives more than 50 requests per second, value: {{ $value }}"

  - alert: CoreDNS CPU usage > 10%
    expr: rate(process_cpu_seconds_total{job="CoreDNS"}[2m]) * 100 > 10
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} coredns cpu usage above 10%"
      description: "{{ $labels.instance }} coredns cpu usage above 10%, value: {{ $value }}"

  - alert: CoreDNS memory usage > 200Mi
    expr: process_resident_memory_bytes{job="CoreDNS"} / 1024 / 1024 > 200
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} coredns memory usage above 200Mi"
      description: "{{ $labels.instance }} coredns memory usage above 200Mi, value: {{ $value }}"

prometheus.yml
[root@k8s-op prometheus]# cat rules/prometheus.yml
groups:
- name: send to on-duty staff
  rules:
  - alert: rule group evaluation takes longer than 5s
    expr: prometheus_rule_group_last_duration_seconds > 5
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "rule_group duration > 5s"
      description: "rule_group is {{ $labels.rule_group }}, last duration is {{ $value }}"

2. Alertmanager configuration

Alertmanager's configuration is much simpler than Prometheus's.
The placeholders used in the official configuration reference:

Generic placeholders are defined as follows:

* `<duration>`: a duration matching the regular expression `[0-9]+(ms|[smhdwy])`
* `<labelname>`: a string matching the regular expression `[a-zA-Z_][a-zA-Z0-9_]*`
* `<labelvalue>`: a string of unicode characters
* `<filepath>`: a valid path in the current working directory
* `<boolean>`: a boolean that can take the values `true` or `false`
* `<string>`: a regular string
* `<secret>`: a regular string that is a secret, such as a password
* `<tmpl_string>`: a string which is template-expanded before usage
* `<tmpl_secret>`: a string which is template-expanded before usage that is a secret

Global configuration:

global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5. 
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement. 
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]

  # The default HTTP client configuration
  [ http_config: <http_config> ]

  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

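global:

Most of global consists of default settings for notification channels, such as email and WeChat Work, plus several services used mainly outside China that I am not familiar with.
Channel parameters can also be set per receiver under receivers; the values in global act as defaults.

Only the email-related parameters are covered here.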

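smtp_from:

The sender address of the email.

smtp_smarthost:

The SMTP server, including the port.

smtp_hello:

The hostname sent in the first SMTP exchange to identify the client. It usually does not need to be set; SMTP servers can generally work it out themselves.

smtp_auth_username

The username for logging in to the SMTP server.

smtp_auth_password:

The password for logging in to the SMTP server.

smtp_require_tls:

Whether TLS is required; the default is true. This parameter behaved oddly in my tests:
connecting to Tencent's SSL port 465 required false, while connecting to the non-SSL port 25 required true. It feels as if the parameter should have been named smtp_skip_tls. A likely explanation is that the parameter controls STARTTLS rather than implicit TLS: port 465 is encrypted from the start and does not offer STARTTLS, whereas a connection on port 25 starts in plain text and is upgraded through STARTTLS.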
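The difference between the two ports can be illustrated with Python's standard smtplib. This is an analogy, not Alertmanager's implementation: it assumes port 465 means implicit TLS and that require_tls stands for "upgrade with STARTTLS". The helper names choose_smtp_mode and connect are hypothetical.

```python
import smtplib


def choose_smtp_mode(port, require_tls):
    """Pick how the connection gets encrypted (a simplified model)."""
    if port == 465:
        return "implicit-tls"   # TLS from the first byte, STARTTLS is not involved
    return "starttls" if require_tls else "plain"


def connect(host, port, require_tls, timeout=10):
    """Open an SMTP connection according to choose_smtp_mode (needs network access)."""
    mode = choose_smtp_mode(port, require_tls)
    if mode == "implicit-tls":
        return smtplib.SMTP_SSL(host, port, timeout=timeout)
    client = smtplib.SMTP(host, port, timeout=timeout)
    if mode == "starttls":
        client.starttls()       # upgrade the plain-text connection to TLS
    return client


print(choose_smtp_mode(465, False))  # implicit-tls
print(choose_smtp_mode(25, True))    # starttls
```

Under this reading, setting false for port 465 does not mean the traffic is unencrypted; it only means no STARTTLS upgrade is requested.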

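resolve_timeout:

If an alert receives no update within this time, it is considered resolved. As the official comment above notes, this does not affect alerts from Prometheus, which always include EndsAt.

templates

The notification template files.

route:

Decides who receives an alert. By matching labels, it picks one of the receivers defined under receivers.

receivers

The recipients of alert notifications.

inhibit_rules:

Alert inhibition: labels decide which alerts take priority. For example, while a high-severity alert is active, lower-severity alerts are held back until no high-severity alert remains. This is the same idea as alert dependencies in Zabbix.

Example: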

global:
  smtp_from: noreply@***.com
  smtp_smarthost: smtp.exmail.qq.com:25
  smtp_auth_username: noreply@****.com
  smtp_auth_password: *******
  smtp_require_tls: true

receivers:

Configures the notification channels.

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
hipchat_configs:
  [ - <hipchat_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]

name

A name for the receiver, referenced later in the route section.

email_configs:

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>

# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]

# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]

# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
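send_resolved:

Whether to send a notification when the alert is resolved.

to:

The recipient address.

from:

The sender address, same meaning as global.smtp_from.

smarthost:

The SMTP server, same meaning as global.smtp_smarthost.

hello:

The hostname sent in the first SMTP exchange, same meaning as global.smtp_hello.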

tls_config:
# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]

# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false]

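These options relate to mutual TLS authentication and to trusting a private CA.

html text headers

Templates for the email body and subject. The default values can be seen in the Alertmanager web UI: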

    headers:
      From: notify@atest.pub
      Subject: '{{ template "email.default.subject" . }}'
      To: yxingxing@atest.pub
    html: '{{ template "email.default.html" . }}'
    require_tls: true

Example:

receivers:
- name: 'email.one'
  email_configs:
  - to: one@atest.pub
    send_resolved: true

- name: 'email.two'
  email_configs:
  - to: two@atest.pub
    send_resolved: true

- name: 'email.op'
  email_configs:
  - to: op@atest.pub
    send_resolved: true


route

Routes alerts to their destination and also handles grouping.

[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...'] 
# This effectively disables aggregation entirely, passing through all 
# alerts as-is. This is unlikely to be what you want, unless you have 
# a very low alert volume or your upstream notification system performs 
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]

# Zero or more child routes.
routes:
  [ - <route> ... ]

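A route consists of its own settings plus routes. The settings at the top level are the defaults, and this level can be thought of as the main route; routes holds the child routes. Each child route takes the same parameters as route.

An alert first enters the main route, also called the root node, which must match every alert and therefore must not have match. It is then checked against the child routes (child nodes) in order. If continue is false on a matching child route, matching stops at that child. If continue is true on a matching child route, the alert goes on to be matched against the following siblings. If the alert matches none of the child routes, or there are none, it is handled with the settings of the current node.

In other words, an alert walks the tree from top to bottom, and a setting given on a child node overrides the same setting inherited from its parent. Settings never cross between sibling nodes: every node that ends up handling the alert sends its own notification using its own settings.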

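receiver:

Who to send to. The value is the name of one of the entries under the top-level receivers parameter.

group_by:

Enables grouping. Alerts in the same group are sent together in one email.
Without grouping, each alert is sent as its own email. If dozens of machines alert for the same reason, that means dozens of emails, or hundreds with more machines.

Alerts are placed in the same group only when the values of all the listed label names are equal.

There is one special value, '...', which groups by every label. Since alerts then share a group only when all of their labels are equal, this effectively disables grouping.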
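A rough Python model of grouping is shown below. It ignores routes and timing and only shows how group_by partitions alerts; the function names and sample alerts are hypothetical.

```python
from collections import defaultdict


def group_key(labels, group_by):
    """Build the grouping key for one alert from the group_by label names."""
    if group_by == ["..."]:
        # Special value: group by every label, so only identical alerts share a group.
        return tuple(sorted(labels.items()))
    return tuple((name, labels.get(name, "")) for name in group_by)


def group_alerts(alerts, group_by):
    """Partition alerts (given as label dicts) into groups."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[group_key(labels, group_by)].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "instance": "a"},
    {"alertname": "InstanceDown", "instance": "b"},
    {"alertname": "DiskFull", "instance": "a"},
]

by_name = group_alerts(alerts, ["alertname"])
by_all = group_alerts(alerts, ["..."])
print(len(by_name), len(by_all))  # 2 3
```

With group_by: ['alertname'] the two InstanceDown alerts share one notification, while '...' leaves each alert in a group of its own.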

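group_wait

How long to wait before sending the first notification for a new group.
At any given instant only a few alerts arrive, so grouping would bring little benefit; related alerts tend to trickle in one after another.
Waiting a short time lets more alerts join the group before it is sent. The default is 30s; set it to suit your situation, for example 10s.

group_interval

How long to wait before notifying about new alerts that join a group for which a notification has already been sent. It probably exists to avoid being flooded by notifications in a short time and missing some of them.

repeat_interval

How long to wait before sending a notification again for alerts that have already been notified successfully. I am not sure how "the same alert" is determined;
presumably by whether all label values are identical.

continue

Mentioned in the route overview above: whether to keep matching sibling nodes after a child node has matched.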

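match:

The main setting inside routes. It is normally not set, and should not be set, on the top-level route.
This is the parameter behind the matching described above:
when an alert carries the given labelname with the given labelvalue, the child node's settings apply.

match_re:

Same as match, but with regular-expression support.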
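The two matcher types can be compared with a short Python sketch. It assumes that regular expressions must match the whole label value (to my understanding Alertmanager anchors them), and it uses Python's re module, whose syntax differs from Go's in some details. The matches helper is hypothetical.

```python
import re


def matches(labels, match=None, match_re=None):
    """Return True when the alert labels satisfy every equality and regex matcher."""
    for name, want in (match or {}).items():
        if labels.get(name, "") != want:        # match: exact equality
            return False
    for name, pattern in (match_re or {}).items():
        # match_re: the pattern must cover the whole value
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


print(matches({"author": "duty"}, match={"author": "duty"}))                      # True
print(matches({"job": "k8s-apiserver"}, match_re={"job": "k8s.*|kubernetes.*"}))  # True
print(matches({"job": "my-k8s"}, match_re={"job": "k8s.*|kubernetes.*"}))         # False
```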

Example:

I only validated the syntax of this configuration and did not test its behavior; it is here just to show the format.

route:
  receiver: 'email.op'
  group_by: ['alertname']
  group_wait: 10s 
  group_interval: 2m
  repeat_interval: 30m 
  routes:
  - receiver: 'email.k8s'
    match_re:
      job: k8s.*|kubernetes.*

  - receiver: 'email.op'
    match_re:
      severity: crit|alert|emerg 

  - receiver: 'email.duty'
    continue: true
    match:
      author: duty

  - receiver: 'email.bigdata'
    match:
      job: bigdata

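How matching proceeds:
If job starts with k8s or kubernetes, the alert goes to the email.k8s receiver. Otherwise matching moves on to the next route.
email.duty has continue: true, so after it matches, the alert is still checked against bigdata. If bigdata matches as well, the alert is sent to both email.duty and email.bigdata.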
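The walk through the routing tree can be sketched in Python using the routes from the example above. This is a simplified model that only resolves receivers (no grouping or timing); it assumes regexes must match the whole label value, and the helpers node_matches and resolve are hypothetical names.

```python
import re


def node_matches(route, labels):
    """Check the match and match_re settings of one route node."""
    for name, want in route.get("match", {}).items():
        if labels.get(name, "") != want:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels):
    """Return the receivers of every node that ends up handling the alert."""
    if not node_matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        matched = resolve(child, labels)
        receivers.extend(matched)
        if matched and not child.get("continue", False):
            break                       # stop at the first match unless continue is true
    return receivers or [route["receiver"]]  # no child matched: this node handles it


ROOT = {
    "receiver": "email.op",
    "routes": [
        {"receiver": "email.k8s", "match_re": {"job": "k8s.*|kubernetes.*"}},
        {"receiver": "email.op", "match_re": {"severity": "crit|alert|emerg"}},
        {"receiver": "email.duty", "continue": True, "match": {"author": "duty"}},
        {"receiver": "email.bigdata", "match": {"job": "bigdata"}},
    ],
}

print(resolve(ROOT, {"job": "k8s-apiserver", "author": "duty", "severity": "warning"}))
# ['email.k8s']
print(resolve(ROOT, {"job": "bigdata", "author": "duty", "severity": "warning"}))
# ['email.duty', 'email.bigdata']
print(resolve(ROOT, {"job": "node", "severity": "info"}))
# ['email.op']
```

Note that an alert with severity crit and author duty stops at the second route in this model, because that route matches first and does not set continue.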

inhibit_rules

# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]

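Alert inhibition.

target_match target_match_re

The alerts that can be inhibited. A matching alert is only a candidate; whether it is actually inhibited depends on the parameters below.
match compares for exact equality and does not support regular expressions; match_re does.

source_match source_match_re

The alerts that inhibit those matched by target_match.
In short, as long as an alert matching source exists, alerts matching target are not sent. For example, a high-severity alert suppresses lower-severity ones.
That alone is not enough: there is no point inhibiting an alert because of one from a different machine, hence the next parameter.

equal:

The label names listed here must have the same value in the source and the target alert.
With instance, for example, inhibition only applies within the same instance.

Example: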

inhibit_rules:
  - target_match_re:
      severity: warning|info|error
    source_match_re:
      severity: crit|alert|emerg
    equal:
    - instance

When an alert whose severity label matches crit|alert|emerg exists,
alerts whose severity label matches warning|info|error and whose instance label has the same value are not sent.
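The rule above can be modeled in Python roughly as follows. This is a sketch of the described behavior only, with alerts represented as label dicts; the helper names are hypothetical and regexes are assumed to match the whole label value.

```python
import re


def re_match(labels, matchers):
    """True when every regex matcher fully matches the corresponding label value."""
    return all(re.fullmatch(pattern, labels.get(name, "")) for name, pattern in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Decide whether `target` is muted by some other firing alert under `rule`."""
    if not re_match(target, rule["target_match_re"]):
        return False                    # not a candidate for inhibition
    for source in firing_alerts:
        if source is target:
            continue                    # an alert does not inhibit itself
        if not re_match(source, rule["source_match_re"]):
            continue
        # Every label listed in equal must agree between source and target.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


RULE = {
    "target_match_re": {"severity": "warning|info|error"},
    "source_match_re": {"severity": "crit|alert|emerg"},
    "equal": ["instance"],
}

crit_a = {"alertname": "InstanceDown", "instance": "a", "severity": "crit"}
warn_a = {"alertname": "HighCPU", "instance": "a", "severity": "warning"}
warn_b = {"alertname": "HighCPU", "instance": "b", "severity": "warning"}
firing = [crit_a, warn_a, warn_b]

print(is_inhibited(warn_a, firing, RULE))  # True
print(is_inhibited(warn_b, firing, RULE))  # False
```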


The complete configuration file

global:
  smtp_from: noreply@***.com
  smtp_smarthost: smtp.exmail.qq.com:25
  smtp_auth_username: noreply@***.com
  smtp_auth_password: ******
  smtp_require_tls: true
route:
  receiver: 'email.op'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 30m
  routes:
  - receiver: 'email.k8s'
    match_re:
      container: k8s.*|kubernetes.*

  - receiver: 'email.op'
    match_re:
      severity: crit|alert|emerg 

  - receiver: 'email.duty'
    continue: true
    match:
      author: duty

  - receiver: 'email.bigdata'
    match:
      job: bigdata

  
receivers:
- name: 'email.k8s'
  email_configs:
  - to: one@test.pub
    send_resolved: true

- name: 'email.duty'
  email_configs:
  - to: two@atest.pub
    send_resolved: true

- name: 'email.bigdata'
  email_configs:
  - to: bigdata@atest.pub
    send_resolved: true

- name: 'email.op'
  email_configs:
  - to: op@atest.pub
    send_resolved: true

inhibit_rules:
  - target_match_re:
      severity: warning|info|error
    source_match_re:
      severity: crit|alert|emerg
    equal: 
    - instance

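3. Silences

Silences are not part of the configuration file, so they get their own section.
They are configured in the Alertmanager web UI, which listens on port 9093 by default.

Times are in UTC.

The page also shows which alerts a silence covers.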