Suppressing Informational Alerts with Prometheus and AlertManager
I’ve recently started devops-ing at a new company, and one of the tasks set before me was to bring the monitoring and alerting systems up to snuff. The team already had a solid config, but with everyone overworked, some of the less-pressing tasks like keeping the monitors in line had fallen by the wayside.
One of the first things that became apparent was that somewhere along the line the version of node_exporter they were deploying had changed from 0.14.0 to 0.16.0, which renamed a fair number of metrics and left holes in various graphs. My naive fix was to throw together a Prometheus rule for node_exporter_build_info{version!="0.16.0"} (something like the sketch below), and our Slack channel promptly lit up with dozens of alerts. I quickly learned that setting a label of severity: info didn’t actually do anything, and set about looking for something that actually worked.
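For reference, the naive version looked roughly like the rule below. The group name, alert name, for duration, and annotation text are placeholders of mine rather than the exact thing that went into our repo:

groups:
  - name: node_exporter_version        # hypothetical group name
    rules:
      - alert: NodeExporterOldVersion  # hypothetical alert name
        # Fires for every target whose node_exporter is not yet on 0.16.0.
        expr: node_exporter_build_info{version!="0.16.0"}
        for: 15m
        labels:
          severity: info               # the label I assumed would keep it out of Slack
        annotations:
          summary: '{{ $labels.instance }} is running node_exporter {{ $labels.version }}'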
Now it’s important to note that while we don’t want to be notified about these alerts, we do want them to exist and be viewable in various dashboards. Otherwise I might have actually agreed with the AlertManager devs’ glib “don’t generate alerts for things you don’t want to see” responses.
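As an aside, the “viewable in dashboards” part comes almost for free: Prometheus exposes its pending and firing alerts through the built-in ALERTS metric, so a Grafana panel can query them directly whether or not AlertManager ever notifies anyone. Something like:

# Every informational alert currently firing, independent of AlertManager routing.
ALERTS{severity="info", alertstate="firing"}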
1. The “Bad” Idea
The first thing I did was to Google “alertmanager suppress alert” which led me straight to Inhibitions where I immediately did the thing they tell you not to: make an alert inhibit itself.
inhibit_rules:
  - source_match:
      severity: 'info'
    target_match:
      severity: 'info'
    equal: ['alertname']
Just because you shouldn’t doesn’t mean that you can’t, right?
Well, in this case it does. Apparently the AlertManager devs have decreed that it’s unreasonable to allow people to write bad config and shoot themselves in the foot, and consequently hard-coded the product to ignore this config.
IMHO this is not so big a footgun as to require a change like this at all, let alone a hard-coded one. I would much rather have seen this implemented as a config option that defaults to disallowing the problematic behaviour, but can be changed to accommodate people who want it.
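For contrast, the kind of thing inhibitions are actually meant for looks more like the snippet below (loosely following the example in the AlertManager docs; the cluster and service labels are assumptions about your labelling scheme):

inhibit_rules:
  # If a critical alert is already firing for a service, suppress the
  # warning-level alert of the same name for that same cluster/service.
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']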
2. The Yet-Unforbidden Idea
After the requisite period of loudly complaining on IRC, someone named tzz suggested that I “route them to an empty receiver”, a point on which the AlertManager docs are thoroughly unhelpful. If it weren’t for beorn7’s comment on the denied PR for a proper blackhole receiver, I’d have had no idea that you could simply omit the receiver config to accomplish the same goal.
route:
  receiver: slack_general
  routes:
    - receiver: blackhole
      match:
        severity: info
    - receiver: slack_general

receivers:
  - name: slack_general
    slack_configs:
      # ...
  - name: blackhole
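If, like me, you don’t entirely trust yourself to get the routing tree right on the first try, amtool (which ships alongside AlertManager) can validate the config before you reload; the path below is just wherever your config happens to live:

# Check that the routing tree and receivers parse cleanly before reloading.
amtool check-config /etc/alertmanager/alertmanager.yml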
3. Success!
Info alerts stopped generating Slack messages, and everything else continued as normal.
Contrary to the warnings of the AlertManager devs, I’ve yet to observe either mass hysteria or dogs and cats living together.
Hope I’ve been of help to you!