Alertmanager: Routing That Doesn’t Wake Your Team at 3am


Alertmanager, the notification piece of the Prometheus ecosystem, is where a malformed alert becomes a three-in-the-morning PagerDuty cascade — or where, well-handled, it preserves the team’s sanity. The difference isn’t in the alerting engine, it’s in the configuration that surrounds it. And, after years of looking at real setups, the uncomfortable conclusion is that almost nobody has Alertmanager properly tuned. This article covers the patterns that actually work in production on version 0.27 with Prometheus 2.54.

The Starting Problem

The naive deployment is a single Slack receiver that eats every alert, with no grouping, no severity classification and no inhibition. The outcome shows up within a week: the channel gets ignored by inertia, real alerts drown in the noise, and when a genuine incident lands nobody notices until a customer calls. Alert fatigue isn’t an academic concept; it’s a concrete operational failure that shows up in mean time to detect.

The conceptual mistake is treating alerts as independent events. In practice, a node going down generates dozens of correlated alerts and a regional incident can trigger hundreds in seconds. Without a structure that classifies, groups and prioritises them, the Alertmanager console becomes an unreadable stream.

Correct Anatomy

A healthy configuration rests on six elements that work together. The routing tree decides which receiver handles each alert based on its labels. Grouping combines related alerts into one notification. Inhibition rules silence effects when the cause is already known. Silences carve out noise during maintenance windows. Severity-based channels separate what interrupts sleep from what waits until business hours. And on top of all that, a well-defined on-call rotation guarantees the notification reaches the right person.

None of these elements solves the problem on its own. What’s interesting is how they interact: grouping reduces volume, inhibition removes redundancy, routing directs the filtered flow, and silences are the escape valve for planned work.

The Routing Tree as Mental Map

The routing tree is Alertmanager’s heart. Conceptually it’s a recursive decision tree where each alert descends from the root testing label matches, and the first matching node wins — unless it’s explicitly marked to continue evaluating. The rule of thumb is to design the tree from most specific to most general, keeping the default route to catch whatever doesn’t fit any pattern.
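That first-match-wins walk is easy to get wrong when reading a config, so here is a minimal sketch of the logic, flattened to a single level of routes (real Alertmanager trees nest sub-routes and support regex matchers; receiver and label names are illustrative):

```python
# Minimal sketch of Alertmanager's first-match-wins routing walk.
# Real routing is a recursive tree with regex matchers; this flattens it
# to one level of equality matches for clarity.

def route_alert(labels, routes, default_receiver):
    """Return the list of receivers an alert is delivered to."""
    receivers = []
    for route in routes:
        if all(labels.get(k) == v for k, v in route["match"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                return receivers  # first match wins unless continue: true
    return receivers or [default_receiver]  # fall through to the default route

routes = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty-oncall", "continue": True},
    {"match": {"service": "postgres"}, "receiver": "dba-slack"},
    {"match": {"severity": "warning"}, "receiver": "jira-tickets"},
]

# A critical postgres alert pages *and* leaves a trail in the DBA channel,
# because the critical route sets continue: true.
print(route_alert({"severity": "critical", "service": "postgres"}, routes, "default-slack"))
# → ['pagerduty-oncall', 'dba-slack']
```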

In a typical configuration, the critical branch fires towards PagerDuty with a short group_wait of ten seconds and also continues to Slack to leave a visible trail. Database alerts divert to a dedicated DBA receiver via regex on the service label. Intermediate-severity notices generate Jira tickets with a larger group interval, on the order of thirty minutes, because nobody needs a fresh ticket every five. And informational telemetry only emits during business hours, leaning on active time intervals.

route:
  receiver: default-slack
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # matchers supersedes the deprecated match/match_re keys (since v0.22)
    - matchers: [severity = critical]
      receiver: pagerduty-oncall
      group_wait: 10s
      continue: true
    - matchers: ['service =~ "postgres|mysql|redis"']  # regex matchers are fully anchored
      receiver: dba-slack
    - matchers: [severity = warning]
      receiver: jira-tickets
      group_interval: 30m

The temptation to keep adding routes is real. Each new branch looks justified in isolation, but a tree with thirty arms becomes impossible to reason about. Reviewing it quarterly and pruning what no longer earns its keep is a more valuable exercise than any new rule.

Grouping: The Fundamental Trade-off

Grouping is controlled by three parameters that deserve to be understood as deliberate tension. group_wait is how long Alertmanager holds the first notification for a new group; low values speed up detection but fragment the message. group_interval is how long it waits before sending a follow-up notification when new alerts join a group it has already notified about. And repeat_interval dictates how often an unchanged group gets resent while it stays active.

Here lies the core design trade-off. Aggressive grouping reduces volume and fatigue but can delay detection of symptoms that would warrant immediate attention. Fine-grained grouping stays closer to each alert’s real origin but turns a large incident into an unmanageable torrent. In practice, grouping by alertname, cluster and service works well for most fleets: it shares enough context to be readable and enough granularity not to hide distinct problems inside the same message.
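To make the two timers concrete, here is a toy timeline for a single group, ignoring repeat_interval and Alertmanager's real scheduler entirely; times are in seconds and the batching model is deliberately simplified:

```python
# Toy model of group_wait / group_interval for one alert group.
# Not Alertmanager's actual scheduler: repeat_interval, resolved alerts
# and per-route overrides are all ignored.

def flush_times(arrivals, group_wait=30, group_interval=300):
    """Return the times at which notifications go out for one group."""
    arrivals = sorted(arrivals)
    flushes = [arrivals[0] + group_wait]  # first notification waits group_wait
    for t in arrivals[1:]:
        if t > flushes[-1]:
            # Alert landed after the last flush: batched into a follow-up
            flushes.append(flushes[-1] + group_interval)
    return flushes

# First alert at t=0 notifies at t=30; stragglers at t=40 and t=200
# ride together in a single follow-up at t=330.
print(flush_times([0, 40, 200]))  # → [30, 330]
```

The takeaway is visible in the numbers: a larger group_interval merges more stragglers into one message at the cost of telling you about them later.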

Inhibition: Say the Obvious Once

When a node goes down, alerts for the pods living on it add no new information — they’re direct consequences of the already-known cause. Inhibition rules express exactly that: if alert A is active, silence alerts B that share certain labels. It’s one of the most underused tools and the one with the largest impact during regional incidents.

The useful mental model is distinguishing cause alerts from effect alerts. Cause alerts describe the root failure (the node fell, the network link dropped, the database stopped accepting connections). Effect alerts describe derived symptoms. During a big incident, whoever is on call needs to see causes, not a fifty-item list of effects. A solid rule: if an alert can be deduced from another active one, it probably ought to be inhibited.
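The node-down example can be expressed as a sketch like the following, where the alert name and labels are illustrative rather than taken from any real rule set:

```yaml
inhibit_rules:
  # While NodeDown fires, mute effect-level alerts from the same node
  - source_matchers: [alertname = NodeDown]
    target_matchers: [severity =~ "warning|info"]
    equal: [cluster, node]   # only inhibit alerts sharing these labels
```

The equal clause is the safety catch: without it, one downed node would silence warnings fleet-wide.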

Silences, Time Intervals and Human Rhythm

Temporary silences, managed from the UI or with amtool, are the mechanism for one-off maintenance windows. Time intervals, now mature in recent versions, let the configuration itself express that certain alerts only fire in business hours or that informational ones stay muted on weekends. Distinguishing the two is useful: silences document exceptions, intervals encode stable policy.
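As a sketch of the stable-policy side, a business-hours interval might be declared like this (the name and hours are placeholders to adapt):

```yaml
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'
```

The exception side stays imperative: something like `amtool silence add alertname=PostgresHighConnections --duration=2h --comment="failover drill"` documents a one-off window without touching the config (the alert name here is invented for illustration).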

A policy that works in small teams: critical always pages, warning only generates a ticket during business hours, informational never interrupts. This isn’t rigidity, it’s respect for other people’s sleep. And, above all, it forces the criterion for labelling something critical to be explicit: a critical alert is one that justifies waking someone up. If it doesn’t justify that, it isn’t critical.
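Encoded as routes, and assuming a business-hours time interval is already defined elsewhere in the config, that policy might look like this sketch (receiver names are illustrative):

```yaml
routes:
  - matchers: [severity = critical]
    receiver: pagerduty-oncall           # always pages, day or night
  - matchers: [severity = warning]
    receiver: jira-tickets
    active_time_intervals: [business-hours]
  - matchers: [severity = info]
    receiver: 'null'                     # a receiver defined with no integrations
```

Routing info-level alerts to an empty receiver is a common convention for "never interrupts" while keeping the alerts visible in the Alertmanager UI.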

Rotations, Escalation and the Antidote to Fatigue

Alertmanager doesn’t handle rotations; that responsibility falls to PagerDuty or OpsGenie, which know who’s on call, apply escalation policies when the primary doesn’t acknowledge within X minutes and maintain the calendar. Alertmanager delivers the alert to the team; the external tool delivers it to the person. This separation of concerns avoids reinventing the wheel and lets the schedule live where HR already manages it.
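The Alertmanager side of that handoff is deliberately thin: a receiver pointing at the paging tool, nothing more. A sketch, with the routing key as a placeholder for an Events API v2 integration key:

```yaml
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<events-api-v2-key>'  # placeholder; keep the real key in a secret
        severity: critical
```

Who actually gets paged, and who gets escalated to after the timeout, is entirely the paging tool's schedule, not anything in this file.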

The real antidote to fatigue, though, isn’t more tooling — it’s periodic review. Monthly, it pays to look at how many alerts fired, how many were acknowledged without action and how many were manually silenced. A high manual-silence rate signals miscalibrated alerts. A low acknowledgement rate signals a channel the team has already tuned out. Both signals are tractable once they’re being measured.
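Prometheus itself can feed part of that review through the synthetic ALERTS series it maintains for every firing rule. For instance, a rough ranking of the noisiest alerts (the 30-day window is an arbitrary choice, and sample count is only a proxy for time spent firing):

```promql
topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[30d])))
```

Acknowledgement and manual-silence rates live in the paging tool and the Alertmanager API respectively, so the full picture still takes three data sources.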

Conclusion

A well-configured Alertmanager is the difference between an on-call team that sleeps and one that quits. None of the patterns described solves the problem alone — grouping, inhibition, severity routing, time intervals — but combined they build a sustainable experience. The investment is worth it: every hour saved from alert fatigue turns into productivity and, more importantly, into people who still want to be on call next year. To start from scratch, kube-prometheus-stack gives a reasonable base to iterate from. For established teams, the quarterly signal-versus-noise review is probably the best hour they’ll spend this month.
