Alerting Best Practices
TL;DR
Good alerts are actionable, symptom-based, and rare — page on user impact, not causes. Alert fatigue from noisy alerts is as dangerous as no alerts.
Explanation
Principles:
- Alert on symptoms (high error rate), not causes (disk full); a cause that produces no symptom is not urgent.
- Every alert should link to a runbook.
- Pages should require immediate human action; if it can wait, send a ticket or email instead.
- Alert fatigue: too many false positives and on-call stops trusting real alerts.
- Multi-window burn rates: requiring both a short and a long window to exceed the threshold cuts false positives while still catching fast and slow budget exhaustion.
- Inhibition: suppress downstream alerts while the root-cause alert is firing.
- Deadman switch: alert when an expected metric stops reporting, which catches failures in the monitoring pipeline itself.
Practical target: fewer than 5 pages per week per on-call rotation. Highest-value alerts: latency SLO, error rate, saturation.
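A minimal deadman-switch sketch; the job name, window, and runbook URL are assumptions for illustration, not values from this page:
alert: MonitoringPipelineDead
# Fires when the exporter's "up" series has produced no samples for 10 minutes,
# i.e. the thing that is broken may be monitoring itself.
expr: absent_over_time(up{job="node-exporter"}[10m])
labels:
  severity: page
annotations:
  summary: No samples from node-exporter for 10 minutes
  runbook: https://wiki/runbooks/monitoring-pipeline  # hypothetical runbook link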
Watch Out
⚠ Flapping alerts — ones that fire and resolve repeatedly without human action — are more damaging than no alert at all; they condition on-call engineers to ignore recurring patterns that eventually indicate real incidents.
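One way to damp flapping, sketched here with an assumed metric name and threshold: smooth the expression over a rate window and require the condition to hold before firing.
alert: HighErrorRate
# rate() over 5m smooths momentary spikes; "for: 10m" means a brief blip
# neither pages nor auto-resolves seconds later.
expr: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 10m
labels:
  severity: page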
Common Misconception
✗ More alerts means better monitoring — alert volume is a metric too. High alert volume causes fatigue and missed real incidents.
Why It Matters
Alert fatigue kills incident response — on-call engineers who receive 50 alerts/night start ignoring them, including real ones.
Common Mistakes
- Alerting on causes (disk 80%) rather than symptoms (requests failing).
- No runbook linked from alert — on-call doesn't know what to do.
- Paging for non-urgent issues — should be ticket/email.
- No inhibition — one outage fires 50 dependent alerts (see the inhibition sketch below).
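A minimal Alertmanager inhibition sketch, using the newer matcher syntax; the alert name and the cluster label are assumptions for illustration:
inhibit_rules:
  - source_matchers:
      - alertname = "DatabaseDown"   # the root-cause alert
    target_matchers:
      - severity = "page"            # downstream pages to silence while it fires
    equal: ['cluster']               # only suppress alerts from the same cluster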
Avoid When
- Avoid alerting on thresholds without a defined response — if there is no runbook action, the alert is noise.
- Do not create alerts that auto-resolve before an engineer can investigate — they train teams to ignore firing alerts.
- Avoid identical alerts in multiple channels for the same incident; deduplication and grouping should happen before notification (see the grouping sketch below).
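A sketch of grouping in the Alertmanager route so one incident produces one notification per channel; the receiver name and timings are assumptions:
route:
  receiver: oncall-pagerduty
  group_by: ['alertname', 'cluster']   # batch related firings into a single notification
  group_wait: 30s                      # short delay so alerts from the same incident arrive together
  group_interval: 5m
  repeat_interval: 4h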
When To Use
- Alert on symptoms (error rate, latency percentiles, availability) rather than causes (CPU, memory, queue depth).
- Set multi-window burn-rate alerts: pairing a long and a short window (for example 1 h with 5 min for fast burn, 6 h with 30 min for slow burn) catches both rapid and gradual SLO exhaustion without flapping (see the burn-rate sketch after this list).
- Review and prune alerts quarterly — remove any that have not led to a human action in the past 30 days.
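A fast-burn sketch for an assumed 99.9% availability SLO; the metric name and the 14.4x factor follow the common multi-window, multi-burn-rate pattern and are not values from this page:
alert: ErrorBudgetFastBurn
# Fire only when the error ratio over BOTH the 1h and 5m windows exceeds
# 14.4x the allowed rate (0.1%), i.e. the budget is actively burning right now.
expr: |
  (
    sum(rate(http_requests_total{code=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
  )
  and
  (
    sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
  )
labels:
  severity: page
A companion slow-burn rule typically pairs 6 h and 30 min windows at a lower factor and routes to a ticket rather than a page.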
Code Examples
💡 Note
The bad config fires on raw disk usage at 80% with no duration window and no runbook; the good version alerts on error-budget consumption (a symptom of user impact), carries a severity label so routing can page or ticket appropriately, and links a runbook.
✗ Vulnerable
# Cause-based, no runbook, too many:
alert: DiskUsageHigh
expr: disk_used_percent > 80
# Pages on-call at 3am — what should they do?
✓ Fixed
alert: ErrorBudgetBurning
expr: error_budget_remaining < 0.5   # e.g. a recording rule tracking remaining budget
labels:
  severity: page                     # severity as a label so routing can act on it
annotations:
  summary: Error budget 50% consumed in 1h
  runbook: https://wiki/runbooks/error-budget
# Disk: ticket, not page:
alert: DiskUsageHigh
expr: disk_used_percent > 80
labels:
  severity: ticket                   # routed to a ticket queue, never pages
Tools & Severity
🟠 Severity: High
⚙ Fix effort: Medium
⚡ Quick Fix
Link every alert to a runbook. Page only for immediate action required. Use slow+fast burn multi-window alerts. Inhibit downstream alerts when root cause fires. Review alert volume weekly.
📦 Applies To
web
cli
queue-worker
🔗 Prerequisites
🔍 Detection Hints
Auto-detectable: ✗ No
Tools: prometheus, alertmanager, pagerduty
⚠ Related Problems
🤖 AI Agent
Confidence: Low
False Positives: High
✗ Manual fix
Fix: Medium
Context: File