Alerting Best Practices
debt(d8/e6/b6/t6)
Closest to 'silent in production until users hit it' (d9), -1. Detection tools listed (prometheus, alertmanager, pagerduty) are the alerting systems themselves, not tools that detect *misuse* of alerting practices. There is no automated detection (automated: no). Poor alerting practices are only revealed when alert fatigue causes a missed real incident or post-incident review identifies gaps. However, alert volume metrics and on-call retrospectives can surface problems before full production failures, so d8 rather than d9.
Closest to 'cross-cutting refactor across the codebase' (e7), -1. The quick_fix mentions linking runbooks, adjusting severity, adding multi-window burn rates, and adding inhibition rules — these span multiple alert definitions across multiple services and require coordination with on-call teams. It's not architectural rework (not e9), but it's more than a single-component refactor since alerting rules, runbooks, PagerDuty routing, and inhibition configs all need updating across the system. Scoring e6 as it touches many files/configs but doesn't require a full architectural rewrite.
Closest to 'persistent productivity tax' (b5), +1. Alerting practices apply across all contexts (web, cli, queue-worker) and require ongoing maintenance: quarterly reviews, weekly volume checks, runbook upkeep. Poor alerting continuously degrades on-call effectiveness and incident response quality. It's not quite 'defines the system's shape' (b9) but it's a strong ongoing tax that shapes operational workflows and on-call culture, pulling it above b5 to b6.
Closest to 'notable trap' (t5), +1. The misconception — 'more alerts means better monitoring' — is a significant cognitive trap. Many competent developers intuitively believe comprehensive alerting (alerting on every metric threshold) is safer, when in reality it causes alert fatigue and worse outcomes. The common mistakes (alerting on causes vs symptoms, paging for non-urgent issues, no inhibition) show multiple ways developers consistently guess wrong. This goes beyond a single documented gotcha but doesn't quite contradict a similar concept from elsewhere (t7), so t6.
TL;DR
Explanation
Principles:
(1) Alert on symptoms (high error rate), not causes (disk full). A cause that produces no symptom is not urgent.
(2) Every alert should link to a runbook.
(3) A page should require immediate human action. If it can wait, send a ticket or email instead.
(4) Avoid alert fatigue: too many false positives teach on-call engineers to ignore real alerts.
(5) Use multi-window burn-rate alerts: requiring both a long and a short window to exceed the threshold reduces false positives, and separate fast-burn and slow-burn alerts cover sudden and gradual budget exhaustion.
(6) Use inhibition: suppress downstream alerts while the root-cause alert is firing.
(7) Add a deadman switch: alert when an expected metric or heartbeat stops arriving, which detects failures of the monitoring itself.
Practical target: fewer than 5 pages per week for the on-call rotation. High-value signals: latency SLO, error rate, saturation.
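The deadman-switch principle can be sketched as two Prometheus alerting rules. This is a minimal illustration, not a definitive setup: the `checkout` job name and the `metrics-missing` runbook URL are assumptions made for the example, and the always-firing `Watchdog` alert only helps if some external receiver is configured to page when it stops arriving.

```yaml
groups:
  - name: deadman
    rules:
      # Always-firing heartbeat. An external receiver should page when this
      # alert STOPS arriving, because that means the alerting pipeline is down.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Heartbeat alert proving the alerting pipeline works end to end

      # Fires when an expected series disappears (job name is hypothetical).
      - alert: CheckoutMetricsMissing
        expr: absent(up{job="checkout"})
        for: 10m
        labels:
          severity: page
        annotations:
          summary: No 'up' series for the checkout job for 10 minutes
          runbook: https://wiki/runbooks/metrics-missing  # hypothetical URL
```

The first rule guards against the monitoring stack itself failing silently. The second guards against one service's metrics vanishing while everything else looks healthy.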
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Alerting on causes (disk 80%) rather than symptoms (requests failing).
- No runbook linked from alert — on-call doesn't know what to do.
- Paging for non-urgent issues — should be ticket/email.
- No inhibition — one outage fires 50 dependent alerts.
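The missing-inhibition mistake can be addressed with an Alertmanager `inhibit_rules` entry. The sketch below assumes a root-cause alert named `DatabaseDown` and that dependent alerts carry hypothetical `depends_on` and `cluster` labels; none of these names come from the source.

```yaml
inhibit_rules:
  # While DatabaseDown is firing, mute pages from alerts that declare a
  # dependency on the database in the same cluster.
  - source_matchers:
      - 'alertname = "DatabaseDown"'
    target_matchers:
      - 'severity = "page"'
      - 'depends_on = "database"'   # hypothetical label on dependent alerts
    equal: ['cluster']              # only inhibit within the same cluster
```

The `equal` list matters: without it, a database outage in one cluster would also silence dependent alerts in every other cluster.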
Avoid When
- Avoid alerting on thresholds without a defined response — if there is no runbook action, the alert is noise.
- Do not create alerts that auto-resolve before an engineer can investigate — they train teams to ignore firing alerts.
- Avoid identical alerts in multiple channels for the same incident; deduplication should happen before notification.
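The deduplication point above can be sketched as an Alertmanager routing tree that groups related alerts and sends each one to exactly one channel. The receiver names and the `service` and `cluster` labels are assumptions for illustration, and the integration settings (PagerDuty, email, and so on) are omitted.

```yaml
route:
  receiver: tickets                 # default channel for anything not matched below
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s                   # wait briefly so related alerts arrive as one notification
  group_interval: 5m                # minimum gap between updates for an existing group
  repeat_interval: 4h               # re-notify only if the group is still firing
  routes:
    # Pages go to the pager only. 'continue' defaults to false, so a matched
    # alert is not also delivered to the default receiver.
    - matchers:
        - 'severity = "page"'
      receiver: oncall-pager

receivers:
  - name: tickets                   # integration config omitted
  - name: oncall-pager              # integration config omitted
```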
When To Use
- Alert on symptoms (error rate, latency percentiles, availability) rather than causes (CPU, memory, queue depth).
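- Set multi-window burn-rate alerts: require the burn rate to exceed the threshold over both a long window (e.g. 1 h) and a short window (e.g. 5 min), so the alert fires on a sustained fast burn and resets quickly once it stops. Add a second pair with longer windows to catch slow SLO exhaustion.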
- Review and prune alerts quarterly — remove any that have not led to a human action in the past 30 days.
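A multi-window burn-rate alert might look like the following Prometheus rules. This is a sketch under stated assumptions: a 99.9% availability SLO over 30 days (error budget 0.001), a hypothetical `http_requests_total` counter with a `code` label, and burn-rate factors of 14.4 and 6, which correspond to spending roughly 2% of the budget in 1 h and 5% in 6 h.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: both the 1h and the 5m error ratio exceed 14.4x the budget.
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: Error budget burning at more than 14.4x the sustainable rate
          runbook: https://wiki/runbooks/error-budget

      # Slow burn: both the 6h and the 30m error ratio exceed 6x the budget.
      - alert: ErrorBudgetSlowBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[30m]))
              / sum(rate(http_requests_total[30m]))
          ) > (6 * 0.001)
        labels:
          severity: page
        annotations:
          summary: Error budget burning at more than 6x the sustainable rate
          runbook: https://wiki/runbooks/error-budget
```

The long window keeps a brief spike from paging anyone, while the short window lets the alert clear soon after the error rate recovers.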
Code Examples
# Bad: cause-based, no runbook, no severity
- alert: DiskUsageHigh
  expr: disk_used_percent > 80
  # Pages on-call at 3am, but what should they do?

# Better: symptom-based, with a runbook and an explicit severity
- alert: ErrorBudgetBurning
  expr: error_budget_remaining < 0.5
  labels:
    severity: page
  annotations:
    summary: More than 50% of the error budget has been consumed
    runbook: https://wiki/runbooks/error-budget

# Disk usage is a cause: route it to a ticket, not a page
- alert: DiskUsageHigh
  expr: disk_used_percent > 80
  labels:
    severity: ticket  # Not a page