1. Prometheus rules
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
Configuring alerting rules.
The official example:
groups:
- name: example
  rules:
  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
groups:
Within the same group, no two rules may have an identical alert name and identical labels. Rules in different groups can, but when Prometheus sees alerts whose labels (including the alert name) are all the same, it only sends one of them.
Generally, groups are just for classification: for example, put device rules in one group and container rules in another, or give each department its own group. They have no bearing on the rest of the configuration.
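As a sketch, groups used purely for classification (the group names, job labels, and rules below are made up for illustration):

```yaml
groups:
- name: physical-hosts        # one group per category, nothing more
  rules:
  - alert: host instance is down
    expr: up{job="node"} == 0
    for: 5m
- name: containers
  rules:
  - alert: container instance is down
    expr: up{job="cadvisor"} == 0
    for: 5m
```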
rules:
The list of alerting rules.
alert:
The name of the alerting rule.
expr:
The PromQL expression that defines the alert threshold.
PromQL itself is simple to write, but it has a wide range of functions to learn, too many to cover here.
for:
The wait time. Once the threshold in expr is crossed, Prometheus checks whether it has stayed crossed for longer than for. If not, the alert state is set to Pending and nothing is sent to alertmanager. If so, the alert is sent to alertmanager and its state becomes Firing. for: 0s means send as soon as there is anything to send, without waiting.
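A sketch of the difference, using the up metric from the official example (the alert names are made up):

```yaml
# Same expr, different `for`.
- alert: InstanceDownImmediate
  expr: up == 0
  for: 0s   # sent to alertmanager on the first evaluation that matches
- alert: InstanceDownConfirmed
  expr: up == 0
  for: 5m   # stays Pending for 5 minutes first, then turns Firing
```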
labels:
Adds labels to the alerts sent to alertmanager. Custom labels can be added and then used for routing decisions on the alertmanager side.
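For instance, a custom label attached on the rule side can be matched by a route on the alertmanager side (the team label and receiver name here are made up):

```yaml
# Rule side: attach a custom label.
labels:
  severity: crit
  team: storage
# Alertmanager side: route on it.
# route:
#   routes:
#   - receiver: 'email.storage'
#     match:
#       team: storage
```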
annotations:
The body of the alert. summary and description have no special meaning; they are simply included when the alert fires.
$labels and $value are built-in variables: $labels.instance refers to the instance label, and $value is the sample value computed by the PromQL in expr. The example above doesn't illustrate this well, so consider:
- name: container memory is too high
  rules:
  - alert: container memory is too high
    expr: container_memory_usage_bytes{namespace="qfpay"} / 1024 / 1024 > 500
    for: 1m
Here $value is the result of container_memory_usage_bytes{namespace="qfpay"} / 1024 / 1024.
There is also $externalLabels, which holds the labels set by the external_labels parameter in the main configuration file.
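A sketch, assuming a cluster label in external_labels:

```yaml
# prometheus.yml
global:
  external_labels:
    cluster: prod
# In a rule annotation, {{ $externalLabels.cluster }} then expands to "prod":
#   description: "[{{ $externalLabels.cluster }}] {{ $labels.instance }} is down"
```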
The alert content is controlled by templates; everything above uses the simplest form. The templates are Go templates, which I don't know well, so I settle for what works. The same goes for the email subject and body on the alertmanager side: I tried customizing them a few times and failed each time.
Configuration files
The alerting rules I've added so far. They aren't complete and mainly serve as a record. The thresholds are rough guesses that need to be refined against the actual environment.
prometheus.yml:
rule_files:
- "rules/common.yml"
- "rules/physicals.yml"
- "rules/containers.yml"
- "rules/kubernetes.yml"
- "rules/coredns.yml"
- "rules/prometheus.yml"
common.yml
[root@k8s-op prometheus]# cat rules/common.yml
groups:
- name: send to on-call
  rules:
  - alert: instance is down
    expr: up == 0
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.job }} is unreachable"
      description: "{{ $labels.instance }} {{ $labels.job }} is unreachable"
  - alert: metrics scrape took longer than 1s
    expr: scrape_duration_seconds{job!~"mariadb.*"} > 1
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.job }} metrics scrape took longer than 1s"
      description: "{{ $labels.instance }} {{ $labels.job }} metrics scrape took longer than 1s. value: {{ $value }}"
- name: send to department
  rules:
  - alert: instance is down
    expr: up == 0
    for: 1m
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} {{ $labels.job }} is unreachable"
      description: "{{ $labels.instance }} {{ $labels.job }} is unreachable"
physicals.yml
[root@k8s-op prometheus]# cat rules/physicals.yml
groups:
- name: send to on-call
  rules:
  - alert: filesystem is readonly
    expr: node_filesystem_readonly{fstype=~"ext4|xfs"} != 0
    for: 10s
    labels:
      severity: error
      author: duty
    annotations:
      summary: "Instance {{ $labels.instance }} filesystem is readonly"
      description: "{{ $labels.instance }} {{ $labels.mountpoint }} filesystem is readonly"
  - alert: free disk space < 25%
    expr: (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) < 25
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} free disk space below 25%."
      description: "free space on partition {{ $labels.mountpoint }} of {{ $labels.instance }} is below 25%"
  # Averaged over all CPUs, so the total is 100%.
  - alert: cpu usage > 70%
    expr: (100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) * 100)) > 70
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} cpu usage > 70%"
      description: "{{ $labels.instance }} cpu usage > 70%, current value {{ $value }}"
  - alert: available memory < 512M
    expr: node_memory_MemAvailable_bytes / 1024 / 1024 < 512
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} available memory below 512Mi"
      description: "{{ $labels.instance }} available memory below 512Mi, currently {{ $value }}"
- name: send to department
  rules:
  - alert: filesystem is readonly
    expr: node_filesystem_readonly{fstype=~"ext4|xfs"} != 0
    for: 10m
    labels:
      severity: crit
      author: op
    annotations:
      summary: "Instance {{ $labels.instance }} filesystem is readonly"
      description: "{{ $labels.instance }} {{ $labels.mountpoint }} filesystem is readonly"
containers.yml
[root@k8s-op prometheus]# cat rules/containers.yml
groups:
- name: send to on-call
  rules:
  - alert: container memory is too high
    expr: container_memory_usage_bytes{namespace="test"} / 1024 / 1024 > 500
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "Service {{ $labels.container }} memory usage > 500Mi, current value: {{ $value }} Mi"
kubernetes.yml
[root@k8s-op prometheus]# cat rules/kubernetes.yml
groups:
- name: send to on-call
  rules:
  - alert: k8s-apiserver memory usage > 5Gi
    expr: process_resident_memory_bytes{job="k8s-apiserver"} / 1024 / 1024 > 5120
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-apiserver memory usage above 5Gi"
      description: "{{ $labels.instance }} k8s-apiserver memory usage above 5Gi, value: {{ $value }}"
  - alert: k8s-kubelet memory usage > 1Gi
    expr: process_resident_memory_bytes{job="k8s-node-kubelet"} / 1024 / 1024 > 1024
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-kubelet memory usage above 1Gi"
      description: "{{ $labels.instance }} k8s-kubelet memory usage above 1Gi, value: {{ $value }}"
  - alert: k8s-proxy memory usage > 1Gi
    expr: process_resident_memory_bytes{job="k8s-proxy"} / 1024 / 1024 > 1024
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-proxy memory usage above 1Gi"
      description: "{{ $labels.instance }} k8s-proxy memory usage above 1Gi, value: {{ $value }}"
  # CPU here can reach N*100%, where N is the number of cores.
  - alert: k8s-apiserver CPU usage > 80%
    expr: rate(process_cpu_seconds_total{job="k8s-apiserver"}[2m]) * 100 > 80
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-apiserver cpu usage above 80%"
      description: "{{ $labels.instance }} k8s-apiserver cpu usage above 80%, value: {{ $value }}"
  - alert: k8s-kubelet CPU usage > 35%
    expr: rate(process_cpu_seconds_total{job="k8s-node-kubelet"}[2m]) * 100 > 35
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} k8s-kubelet cpu usage above 35%"
      description: "{{ $labels.instance }} k8s-kubelet cpu usage above 35%, value: {{ $value }}"
  - alert: k8s-proxy CPU usage > 15%
    expr: rate(process_cpu_seconds_total{job="k8s-proxy"}[2m]) * 100 > 15
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "{{ $labels.instance }} kube-proxy cpu usage above 15%"
      description: "{{ $labels.instance }} kube-proxy cpu usage above 15%, value: {{ $value }}"
  - alert: client certificates expire within two days
    expr: apiserver_client_certificate_expiration_seconds_bucket{job="k8s-apiserver",le="172800"} != 0
    for: 5m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "client certificates are about to expire"
      description: "{{ $value }} client certificates expire within two days"
  - alert: client certificates expire within seven days
    expr: apiserver_client_certificate_expiration_seconds_bucket{job="k8s-apiserver",le="604800"} != 0
    for: 0s
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "client certificates are about to expire"
      description: "{{ $value }} client certificates expire within seven days"
  - alert: abnormal status codes on apiserver requests
    expr: sum(delta(apiserver_request_count{code!='0',code!~'2..'}[1m])) by (client, code) > 50
    for: 10m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "abnormal apiserver requests"
      description: "abnormal apiserver requests detected. value: {{ $value }}; client: {{ $labels.client }}; code: {{ $labels.code }}"
coredns.yml
[root@k8s-op prometheus]# cat rules/coredns.yml
groups:
- name: send to on-call
  rules:
  - alert: CoreDNS requests per second > 50
    expr: rate(coredns_dns_request_duration_seconds_count[2m]) > 50
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} requests per second > 50"
      description: "CoreDNS on {{ $labels.instance }} is receiving more than 50 requests per second, value: {{ $value }}"
  - alert: CoreDNS CPU usage > 10%
    expr: rate(process_cpu_seconds_total{job="CoreDNS"}[2m]) * 100 > 10
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} coredns cpu usage above 10%"
      description: "{{ $labels.instance }} coredns cpu usage above 10%, value: {{ $value }}"
  - alert: CoreDNS memory usage > 200Mi
    expr: process_resident_memory_bytes{job="CoreDNS"} / 1024 / 1024 > 200
    for: 0s
    labels:
      severity: crit
      author: duty
    annotations:
      summary: "{{ $labels.instance }} coredns memory usage above 200Mi"
      description: "{{ $labels.instance }} coredns memory usage above 200Mi, value: {{ $value }}"
prometheus.yml
[root@k8s-op prometheus]# cat rules/prometheus.yml
groups:
- name: send to on-call
  rules:
  - alert: rule group evaluation took longer than 5s
    expr: prometheus_rule_group_last_duration_seconds > 5
    for: 1m
    labels:
      severity: warning
      author: duty
    annotations:
      summary: "rule_group duration > 5s"
      description: "rule_group is {{ $labels.rule_group }}, last duration is {{ $value }}"
2. Alertmanager configuration
Alertmanager's configuration is much simpler than Prometheus'.
Placeholders from the official documentation:
Generic placeholders are defined as follows:
* `<duration>`: a duration matching the regular expression `[0-9]+(ms|[smhdwy])`
* `<labelname>`: a string matching the regular expression `[a-zA-Z_][a-zA-Z0-9_]*`
* `<labelvalue>`: a string of unicode characters
* `<filepath>`: a valid path in the current working directory
* `<boolean>`: a boolean that can take the values `true` or `false`
* `<string>`: a regular string
* `<secret>`: a regular string that is a secret, such as a password
* `<tmpl_string>`: a string which is template-expanded before usage
* `<tmpl_secret>`: a string which is template-expanded before usage that is a secret
Global configuration:
global:
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # Port number usually is 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]
  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]
  # The default HTTP client configuration
  [ http_config: <http_config> ]
  # ResolveTimeout is the default value used by alertmanager if the alert does
  # not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include EndsAt.
  [ resolve_timeout: <duration> | default = 5m ]
# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]
# The root node of the routing tree.
route: <route>
# A list of notification receivers.
receivers:
  - <receiver> ...
# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]
global:
Most of global is default settings for the notification integrations: email, WeChat Work, plus several services used abroad that I don't recognize. These parameters can also be set individually under receivers; the values in global act as the defaults.
Only the email parameters are described here.
smtp_from:
Who the notification emails are sent from.
smtp_smarthost:
The SMTP server.
smtp_hello:
The hostname sent in the first exchange of the SMTP protocol to identify the client. It usually doesn't need to be set; SMTP servers detect it automatically.
smtp_auth_username:
The username for logging in to the SMTP server.
smtp_auth_password:
The password for logging in to the SMTP server.
smtp_require_tls:
Whether to require TLS; defaults to true. This parameter behaved oddly in my tests: connecting to Tencent's SSL port 465 required setting it to false, while the plain port 25 required true. It feels like it should have been named smtp_skip_tls. (Most likely this is because smtp_require_tls governs STARTTLS: port 25 starts as a plain connection and upgrades via STARTTLS, while port 465 expects TLS from the very start.)
resolve_timeout:
How long to go without receiving updates for an alert before treating it as resolved.
templates:
Specifies the notification template files.
route:
Decides who alerts are sent to: labels are matched to pick one of the entries configured in receivers.
receivers:
Defines the recipients of alert notifications.
inhibit_rules:
Alert inhibition: labels decide which alerts take priority. For example, as long as a high-severity alert is firing, lower-severity alerts are held back until the high-severity one clears. Same idea as alert dependencies in zabbix.
Example:
global:
  smtp_from: noreply@***.com
  smtp_smarthost: smtp.exmail.qq.com:25
  smtp_auth_username: noreply@****.com
  smtp_auth_password: *******
  smtp_require_tls: true
receivers:
Configures the notification integrations.
# The unique name of the receiver.
name: <string>
# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
hipchat_configs:
  [ - <hipchat_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]
name:
A name for the receiver, referenced later by the route section.
email_configs:
# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]
# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]
# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]
# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]
# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]
# TLS configuration.
tls_config:
  [ <tls_config> ]
# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]
# The text body of the email notification.
[ text: <tmpl_string> ]
# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]
send_resolved:
Whether to send a notification when the alert is resolved.
to:
The recipient address.
from:
The sender of the notification, same meaning as global.smtp_from.
smarthost:
The SMTP server, same meaning as global.smtp_smarthost.
hello:
The hostname for the first SMTP exchange, same meaning as global.smtp_hello.
tls_config:
# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]
# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]
# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]
# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false]
These relate to mutual TLS authentication, or to trusting a private CA.
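For example, a sketch of trusting a private CA for the SMTP connection (the file path is made up):

```yaml
email_configs:
- to: op@atest.pub
  tls_config:
    ca_file: /etc/alertmanager/private-ca.crt
    # cert_file/key_file would go here for mutual TLS,
    # or insecure_skip_verify: true to skip verification entirely (not recommended)
```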
html, text, headers:
Templates for the email body and headers. The default values can be seen in the alertmanager web UI:
headers:
  From: notify@atest.pub
  Subject: '{{ template "email.default.subject" . }}'
  To: yxingxing@atest.pub
html: '{{ template "email.default.html" . }}'
require_tls: true
Example:
receivers:
- name: 'email.one'
  email_configs:
  - to: one@atest.pub
    send_resolved: true
- name: 'email.two'
  email_configs:
  - to: two@atest.pub
    send_resolved: true
- name: 'email.op'
  email_configs:
  - to: op@atest.pub
    send_resolved: true
route:
Routes notifications to their destinations, and also provides grouping.
[ receiver: <string> ]
# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]
# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]
# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]
# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> | default = 30s ]
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]
# Zero or more child routes.
routes:
  [ - <route> ... ]
A route consists of its own settings plus routes: the top-level settings are the defaults, and this level can be thought of as the root route (the root node), while routes are the child routes (child nodes), configured with exactly the same options.
Alerts enter at the root route, which must match every alert and therefore must not have a match. They are then matched against the child routes. If continue is false, matching stops after the first matching child route; if continue is true on a matching child, the alert keeps matching against the following siblings. If an alert matches no child route, or there are no child routes at all, it is handled with the current node's settings.
In other words, the alert walks the tree from top to bottom; settings in a child node override the root's, but settings never mix across sibling nodes. The effective configuration is settled only once matching has finished, and the notification is sent according to that final configuration.
receiver:
Who to send to; the value is the name of an entry in the top-level receivers configuration.
group_by:
Enables grouping. Grouping means alerts in the same group are sent together in a single email.
Without grouping, each alert gets its own email; if dozens of machines alert for the same reason, that's dozens of emails, or hundreds.
Alerts only end up in the same group when the values of every listed labelname are identical.
There is one special value, '...', which groups by every possible labelname; since every label value would then have to match, it effectively disables grouping.
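For example, grouping by alertname and job (a sketch; the label values are made up):

```yaml
route:
  group_by: ['alertname', 'job']
# {alertname="instance is down", job="node",  instance="a"}   combined into
# {alertname="instance is down", job="node",  instance="b"}   one email
# {alertname="instance is down", job="mysql", instance="c"}   a separate email
```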
group_wait:
How long to wait before sending a group's first notification.
At any single moment there are only a few alerts, so grouping alone doesn't help much; alerts always trickle in over time. Waiting a while lets as many as possible be grouped before sending. The default is 30s; tune it to the situation, e.g. 10s.
group_interval:
Once a group's notification has been sent, how long to wait before sending a notification about new alerts added to that group. Presumably this avoids being flooded the moment alerts arrive and missing some of them.
repeat_interval:
How long to wait before re-sending the same notification. I'm not sure exactly how "the same" is determined; most likely all the label values have to match.
continue:
Mentioned when introducing route: after a child node matches, whether to continue matching the sibling nodes that follow.
match:
The main setting inside routes; it is generally not configured on the top-level route, and shouldn't be.
This is the matching described above: when the alert carries the specified labelname with the specified labelvalue, the child node's configuration is applied.
match_re:
Works like match, but supports regular expressions.
Example:
I've only checked that the configuration file parses; the behavior itself hasn't been tested. This just illustrates the format.
route:
  receiver: 'email.op'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 30m
  routes:
  - receiver: 'email.k8s'
    match_re:
      job: k8s.*|kubernetes.*
  - receiver: 'email.op'
    match_re:
      severity: crit|alert|emerg
  - receiver: 'email.duty'
    continue: true
    match:
      author: duty
  - receiver: 'email.bigdata'
    match:
      job: bigdata
Matching walkthrough:
If job starts with k8s or kubernetes, the alert goes to the email.k8s receiver; otherwise matching continues down the list.
The email.duty route has continue: true, so matching also proceeds to the bigdata route; if that matches as well, the alert is dispatched under both routes' configurations.
inhibit_rules:
# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]
# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]
# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
Alert inhibition.
target_match, target_match_re:
The alerts that can be inhibited. Matching here only makes an alert eligible; whether it actually gets inhibited depends on the parameters below.
target_match compares values directly, without regular expressions; target_match_re supports them.
source_match, source_match_re:
The alerts that inhibit those matched by target_match.
Overall: as long as an alert matched by source is firing, alerts matched by target are not sent. For example, when a high-severity alert appears, suppress the low-severity ones.
That alone isn't enough, though: if two alerts aren't even from the same machine, there's no point in one inhibiting the other. Hence the parameter below.
equal:
The labelnames listed here must have identical values in both the source and the target alerts.
For example instance: inhibition only triggers between alerts from the same instance.
Example:
inhibit_rules:
- target_match_re:
    severity: warning|info|error
  source_match_re:
    severity: crit|alert|emerg
  equal:
  - instance
When an alert whose severity label matches crit|alert|emerg is firing, alerts whose severity label matches warning|info|error and whose instance label has the same value are no longer sent.
The complete configuration file
global:
  smtp_from: noreply@***.com
  smtp_smarthost: smtp.exmail.qq.com:25
  smtp_auth_username: noreply@***.com
  smtp_auth_password: ******
  smtp_require_tls: true
route:
  receiver: 'email.op'
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 30m
  routes:
  - receiver: 'email.k8s'
    match_re:
      container: k8s.*|kubernetes.*
  - receiver: 'email.op'
    match_re:
      severity: crit|alert|emerg
  - receiver: 'email.duty'
    continue: true
    match:
      author: duty
  - receiver: 'email.bigdata'
    match:
      job: bigdata
receivers:
- name: 'email.k8s'
  email_configs:
  - to: one@test.pub
    send_resolved: true
- name: 'email.duty'
  email_configs:
  - to: two@atest.pub
    send_resolved: true
- name: 'email.bigdata'
  email_configs:
  - to: bigdata@atest.pub
    send_resolved: true
- name: 'email.op'
  email_configs:
  - to: op@atest.pub
    send_resolved: true
inhibit_rules:
- target_match_re:
    severity: warning|info|error
  source_match_re:
    severity: crit|alert|emerg
  equal:
  - instance
3. Silences
Silences are covered separately because they don't live in the configuration file.
They are configured in the alertmanager web UI, which listens on port 9093 by default.
Times there are in UTC.
The UI also shows which alerts a silence covers.
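Silences can also be created through the HTTP API by POSTing to /api/v2/silences. A sketch of the request body (sent as JSON, shown here in YAML form; all values are made up):

```yaml
matchers:
- name: alertname
  value: instance is down
  isRegex: false
startsAt: "2019-06-01T00:00:00Z"   # UTC, like the web UI
endsAt: "2019-06-01T02:00:00Z"
createdBy: op
comment: planned maintenance
```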