Prometheus: Writing Alerts That Won’t Get Ignored
Table of contents
- Key takeaways
- Symptoms vs. causes: alert on what matters to the user
- Anatomy of a well-written alert
- SLOs and multi-window burn rate
- The watchdog: an alert that’s always on
- What to polish quarterly
- Conclusion
- Frequently asked questions
- What is the difference between alerting rules and recording rules in Prometheus?
- Why are my Prometheus alerts being ignored?
- How do I avoid alert fatigue in Prometheus?
Updated: 2026-05-03
Any team that has used Prometheus[1] long enough has lived the same cycle: alerts are added enthusiastically, six months later the on-call channel is flooded with noise, nobody reads it, and when something serious happens the signal is lost among false positives. The problem is rarely Prometheus — it’s the rule design.
Key takeaways
- Alert on customer-observable symptoms (latency, error rate, saturation), not on internal causes (high CPU, low memory).
- A well-written alert includes a non-trivial `for`, routing labels, and annotations with summary, description, runbook, and dashboard.
- SLOs with multi-window burn rate drastically reduce unnecessary pages and align alerts with real customer promises.
- The watchdog — an alert that always fires by design — detects when the alerting pipeline has gone silent unexpectedly.
- Quarterly review of signal/noise ratio is as important as writing the initial rules.
Symptoms vs. causes: alert on what matters to the user
The most important alert-design rule, defended by Google’s SRE team in the original SRE book[2], is: alert on symptoms, not causes.
- Symptom: “The 5xx error rate on /api/payments exceeds 1% for 5 minutes.”
- Cause: “Pod payments-service-3 is at 95% CPU.”
The difference matters because a user doesn’t experience high CPU — they experience slow responses or errors. Alerting on causes produces two simultaneous pathologies:
- False positives: a cause can fire without the user noticing (the service auto-scales and absorbs the spike).
- False negatives: an unforeseen cause can produce a failure with no cause-level alert firing.
A good ruleset starts from symptoms observable from the customer’s perspective (latency, error rate, saturation) and keeps causes as diagnostic dashboards, not as paging alerts.
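To make the contrast concrete, here is a minimal sketch of a symptom-first latency alert. The metric and handler names (`http_request_duration_seconds_bucket`, `/api/payments`) are placeholders for whatever your instrumentation exposes; the cause-level signals stay on the diagnostic dashboard.

```yaml
# Symptom: p99 latency the customer actually feels on /api/payments.
# Cause-level signals (CPU, memory, restarts) belong on the linked
# diagnostic dashboard, not in the paging rule.
- alert: PaymentsHighLatency
  expr: |
    histogram_quantile(
      0.99,
      sum by (le) (
        rate(http_request_duration_seconds_bucket{handler="/api/payments"}[5m])
      )
    ) > 0.5
  for: 10m
  labels:
    severity: page
```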
Anatomy of a well-written alert
A Prometheus alerting rule with a realistic threshold, a non-trivial `for` duration, routing labels, and complete annotations:
```yaml
- alert: ApiHighErrorRate
  expr: |
    sum by (service) (
      rate(http_requests_total{status=~"5.."}[5m])
    )
    /
    sum by (service) (
      rate(http_requests_total[5m])
    )
    > 0.01
  for: 10m
  labels:
    severity: page
    team: platform
  annotations:
    summary: "API {{ $labels.service }} error rate above 1%"
    description: |
      Service {{ $labels.service }} has had >1% 5xx error rate for the
      last 10 minutes (current: {{ $value | humanizePercentage }}).
    runbook_url: "https://runbooks.example.com/api-error-rate"
    dashboard_url: "https://grafana.example.com/d/abc/api-overview"
```

Four elements that must not be missing:
- Non-trivial `for`. Between 5 and 15 minutes usually absorbs transients without excessively delaying the response to real incidents.
- Clear routing labels. `severity` (page / ticket / info) plus `team` let Alertmanager route to different channels so each team receives only what is theirs.
- Complete annotations. `summary` (one line), `description` (context with interpolated values), `runbook_url` (what to do), `dashboard_url` (where to look). An alert without a runbook is an invitation to panic.
- Correct PromQL. Use `rate()` on counters, not `increase()` directly. Group by the dimensions that matter for routing.
SLOs and multi-window burn rate
The pattern that has gained fastest adoption is SLO-based alerts with multi-window, multi-threshold burn rate, popularised by Google SRE and detailed in chapter 5 of the SRE Workbook[3].
The idea: define an SLO (say, 99.9% success over 30 days, allowing 0.1% error budget). Instead of alerting on absolute error rate, alert when you’re burning the error budget faster than sustainable:
- Burn rate > 14.4x for 1h → critical alarm (you’d consume the month’s budget in 2 days).
- Burn rate > 6x for 6h → serious alarm (you’d consume the budget in 5 days).
- Burn rate > 1x for 24h → trend alarm (you’re on track to spend the budget).
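A hand-written sketch of the first (critical) tier, assuming the same `http_requests_total` counter used earlier and a 99.9% / 30-day SLO. The short 5-minute window is paired with the 1-hour window so the alert clears quickly once the incident is over; the generators mentioned in the next paragraph produce the full ladder of tiers for you.

```yaml
# Critical tier: error budget burning at >14.4x the sustainable rate
# (14.4 * 0.001 = 1.44% errors), confirmed over both a long and a short window.
- alert: PaymentsErrorBudgetBurnCritical
  expr: |
    (
      sum(rate(http_requests_total{service="payments",status=~"5.."}[1h]))
        /
      sum(rate(http_requests_total{service="payments"}[1h]))
      > (14.4 * 0.001)
    )
    and
    (
      sum(rate(http_requests_total{service="payments",status=~"5.."}[5m]))
        /
      sum(rate(http_requests_total{service="payments"}[5m]))
      > (14.4 * 0.001)
    )
  labels:
    severity: page
  annotations:
    summary: "payments is burning its 30-day error budget at >14.4x"
```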
This aligns alerts with real customer promises (the SLO) and drastically reduces unnecessary pages. Sloth[4] and Pyrra[5] generate these rules automatically from a declarative SLO definition.
This monitoring infrastructure is especially valuable when combined with the kernel-level observability that eBPF provides: high-level metrics in Prometheus and kernel granularity in eBPF complement each other rather than competing.
The watchdog: an alert that’s always on
A common mistake: silent alerts. Prometheus stops scraping, Alertmanager crashes, or a config error means rules don’t evaluate. No alert fires — but no ping arrives either. Two weeks later you discover your observability has been dead.
The canonical solution: a watchdog alert that is always firing by design:
```yaml
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Prometheus is alive"
```

It is sent to a receiver that expects it every X minutes. If it doesn't arrive within a threshold, the receiver — typically Dead Man's Snitch[6] or Healthchecks.io[7] — fires its own alert. This turns silence into signal rather than ambiguity.
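On the Alertmanager side, the routing can look roughly like this. A sketch, assuming Alertmanager 0.22+ matcher syntax; the webhook URL is a placeholder for the ping URL your dead-man's-switch service gives you.

```yaml
route:
  routes:
    # Send the always-firing Watchdog to the dead-man's-switch receiver,
    # re-notifying every few minutes so silence is detected quickly.
    - matchers:
        - alertname = "Watchdog"
      receiver: deadmansswitch
      repeat_interval: 5m
receivers:
  - name: deadmansswitch
    webhook_configs:
      - url: "https://deadmansswitch.example.com/ping/your-token"
        send_resolved: false
```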
What to polish quarterly
Alerts aren’t “configure and forget”. A useful ritual for on-call teams:
- Quarterly review of top-N pages. Which alerts fired the most? How many led to real action? Ones always acknowledged without action should be removed or tuned.
- Post-mortems with an “alerts” item. Each incident teaches: did the right alert arrive in time? Did irrelevant alerts fire in parallel?
- Testing new alerts in staging. Simulate the symptom before promoting the rule to production.
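For the top-N review in the first item above, the `ALERTS` series that Prometheus synthesises for every alerting rule is usually enough, assuming your retention covers the window you care about:

```promql
# Which alerts spent the most evaluation cycles firing in the last 90 days?
sort_desc(
  sum by (alertname) (
    count_over_time(ALERTS{alertstate="firing"}[90d])
  )
)
```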
This continuous-improvement discipline connects with design thinking applied to operations: the best runbooks and alerts are designed from the perspective of the operator under pressure, not the author with full context. And as our guide to installing Traefik with Docker Compose shows, observability starts at the infrastructure layer before reaching business metrics.
Conclusion
Alert on symptoms, base severity on SLOs with burn rate, monitor the alerting system itself with a watchdog, and review the signal/noise ratio quarterly: these four principles reduce on-call fatigue and improve response to real incidents. The difference between a useful on-call channel and an ignored one lies in rule design — not in the quantity of metrics collected.
Frequently asked questions
What is the difference between alerting rules and recording rules in Prometheus?
Alerting rules evaluate PromQL and fire alerts when conditions are met. Recording rules pre-compute expensive expressions and store them as new metrics, improving performance for dashboards and complex alerts.
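For illustration, a sketch that pre-computes the error ratio used throughout this article so dashboards and alert expressions can reuse it; the group and metric names here are just naming conventions.

```yaml
groups:
  - name: api-slis
    rules:
      # Computed once per evaluation interval, then queried cheaply elsewhere.
      - record: service:http_requests_errors:ratio_rate5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
```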
Why are my Prometheus alerts being ignored?
Alerts are typically ignored for three reasons: too noisy, lacking actionable context in annotations, or firing when no action is possible. Good alerts are actionable, have clear severity, and only page someone when it truly matters.
How do I avoid alert fatigue in Prometheus?
Apply `for` durations to filter transient spikes, use inhibition to suppress symptom alerts when a root-cause alert is active, group alerts in Alertmanager, and periodically review which alerts haven't fired in 90 days.
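As a sketch of the inhibition part, assuming a cluster label shared by your alerts and a hypothetical ClusterDown cause-level alert (Alertmanager 0.22+ matcher syntax):

```yaml
inhibit_rules:
  # While ClusterDown is firing, mute paging alerts from the same cluster
  # so on-call sees one cause instead of dozens of downstream symptoms.
  - source_matchers:
      - alertname = "ClusterDown"
    target_matchers:
      - severity = "page"
    equal: ["cluster"]
```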