Alerting Best Practices

observability Intermediate

TL;DR

Good alerts are actionable, symptom-based, and rare — page on user impact, not causes. Alert fatigue from noisy alerts is as dangerous as no alerts.

Explanation

Principles:

  (1) Alert on symptoms (high error rate), not causes (disk full). A cause that produces no symptom is rarely urgent.
  (2) Every alert should link to a runbook.
  (3) A page should require immediate human action. If it can wait, send an email or open a ticket instead.
  (4) Alert fatigue: too many false positives teach on-call engineers to ignore alerts, including real ones.
  (5) Multi-window burn-rate alerts: require both a long and a short window to exceed the threshold, with separate fast-burn and slow-burn rules, to reduce false positives.
  (6) Inhibition: suppress downstream alerts while the root-cause alert is firing.
  (7) Deadman switch: alert when an expected metric stops arriving, which catches failures of the monitoring pipeline itself.

In practice, aim for fewer than 5 pages per week for on-call. The highest-value alerts cover latency SLOs, error rate, and saturation.
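The deadman switch idea can be sketched as two Prometheus rules. This is a minimal sketch, assuming Prometheus-style rule files like the other examples on this page; the job name `checkout-api`, the runbook URL, and the `severity` values are hypothetical.

```yaml
groups:
  - name: deadman
    rules:
      # Fires when no `up` series exists for the job at all, for example
      # because the targets vanished from service discovery.
      - alert: MetricsMissing
        expr: absent(up{job="checkout-api"})   # hypothetical job name
        for: 10m
        labels:
          severity: page
        annotations:
          summary: No scrape targets found for checkout-api for 10 minutes
          runbook: https://wiki/runbooks/metrics-missing   # hypothetical URL

      # Always-firing heartbeat. Route it to an external service that
      # notifies when the heartbeat STOPS arriving, which covers the case
      # where Prometheus or Alertmanager itself is down.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
```

The first rule catches a missing metric from inside the monitoring system. The second only helps if something outside that system is watching for the heartbeat.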

Watch Out

Flapping alerts — ones that fire and resolve repeatedly without human action — are more damaging than no alert at all; they condition on-call engineers to ignore recurring patterns that eventually indicate real incidents.
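One way to damp flapping is to add hysteresis to the rule itself. The sketch below assumes Prometheus 2.42 or later (where `keep_firing_for` is available) and a counter named `http_requests_total` with a `code` label; both the metric name and the 5% threshold are illustrative assumptions, not values from this page.

```yaml
groups:
  - name: flap-damping
    rules:
      - alert: HighErrorRate
        # http_requests_total and its `code` label are assumed metric names.
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 10m              # condition must hold 10 minutes before firing
        keep_firing_for: 15m  # stay firing 15 minutes after the condition clears
        labels:
          severity: page
        annotations:
          runbook: https://wiki/runbooks/high-error-rate   # hypothetical URL
```

`for` filters out short spikes before the alert fires, and `keep_firing_for` stops it from resolving and re-firing while the signal hovers around the threshold.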

Common Misconception

Misconception: more alerts mean better monitoring. In reality, alert volume is itself a metric to track: high volume causes fatigue and missed real incidents.

Why It Matters

Alert fatigue kills incident response — on-call engineers who receive 50 alerts/night start ignoring them, including real ones.

Common Mistakes

  • Alerting on causes (disk 80%) rather than symptoms (requests failing).
  • No runbook linked from alert — on-call doesn't know what to do.
  • Paging for non-urgent issues — should be ticket/email.
  • No inhibition — one outage fires 50 dependent alerts.
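Inhibition, the fix for the last mistake above, might look like this in an Alertmanager configuration. It is a sketch under assumptions: the alert name `DatabaseDown` and the `layer` and `cluster` labels are hypothetical and would need to match the labels your own rules attach.

```yaml
# alertmanager.yml fragment
inhibit_rules:
  - source_matchers:
      - alertname = "DatabaseDown"      # hypothetical root-cause alert
    target_matchers:
      - severity = "page"
      - layer = "application"           # hypothetical label on dependent alerts
    # Inhibit only when source and target share the same cluster label value.
    equal: ['cluster']
```

While `DatabaseDown` is firing for a cluster, application-level pages from that same cluster are suppressed, so on-call receives one page for the root cause instead of one per dependent service.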

Avoid When

  • Avoid alerting on thresholds without a defined response — if there is no runbook action, the alert is noise.
  • Do not create alerts that auto-resolve before an engineer can investigate — they train teams to ignore firing alerts.
  • Avoid identical alerts in multiple channels for the same incident; deduplication should happen before notification.
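Deduplication before notification is typically handled by grouping in Alertmanager. The fragment below is a minimal sketch; the receiver name and the `cluster` and `service` labels are assumptions, and the timings are starting points rather than recommendations from this page.

```yaml
# alertmanager.yml fragment
route:
  receiver: oncall-pager            # hypothetical receiver name
  # Alerts sharing these label values are batched into one notification.
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s        # wait for related alerts before the first notification
  group_interval: 5m     # minimum gap before notifying about new alerts in a group
  repeat_interval: 4h    # re-notify about a still-firing group at most this often

receivers:
  - name: oncall-pager
    # Integration settings (pagerduty_configs, email_configs, ...) go here.
```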

When To Use

  • Alert on symptoms (error rate, latency percentiles, availability) rather than causes (CPU, memory, queue depth).
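  • Set multi-window burn-rate alerts: pair a long window with a short one (for example 1 h with 5 min for fast burn, 6 h with 30 min for slow burn) so an alert fires only while the budget is still burning, covering both fast and slow SLO exhaustion.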
  • Review and prune alerts quarterly — remove any that have not led to a human action in the past 30 days.
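The burn-rate bullet can be made concrete. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1 spends exactly the whole budget over the SLO period. Over a 30-day (720 h) period, a burn rate of 14.4 sustained for 1 h spends 14.4 / 720 = 2% of the budget, and a burn rate of 6 sustained for 6 h spends 36 / 720 = 5%. The sketch below assumes a 99.9% SLO (budget 0.001), a 30-day period, a hypothetical job `checkout-api`, and pre-existing recording rules named `job:slo_errors_per_request:ratio_rate<window>`; those names follow the naming used in the Google SRE Workbook and are not defined on this page.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Fast burn: 14.4x the sustainable rate, confirmed on 1h and 5m windows.
      - alert: ErrorBudgetFastBurn
        expr: |
          job:slo_errors_per_request:ratio_rate1h{job="checkout-api"} > (14.4 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate5m{job="checkout-api"} > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          summary: Error budget burning at more than 14.4x the sustainable rate
          runbook: https://wiki/runbooks/error-budget

      # Slow burn: 6x the sustainable rate, confirmed on 6h and 30m windows.
      - alert: ErrorBudgetSlowBurn
        expr: |
          job:slo_errors_per_request:ratio_rate6h{job="checkout-api"} > (6 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate30m{job="checkout-api"} > (6 * 0.001)
        labels:
          severity: page
        annotations:
          summary: Error budget burning at more than 6x the sustainable rate
          runbook: https://wiki/runbooks/error-budget
```

The long window shows that the burn is significant, and the short window shows that it is still happening, so the alert resolves soon after the problem stops.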

Code Examples

💡 Note
The bad config pages on raw disk usage at 80% with no duration window and no runbook. The good version pages on error-budget consumption, which reflects user impact, links a runbook, and downgrades the disk alert to a ticket.
✗ Vulnerable
# Cause-based, no runbook, too many:
alert: DiskUsageHigh
expr: disk_used_percent > 80
# Pages on-call at 3am — what should they do?
✓ Fixed
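- alert: ErrorBudgetBurning
  # error_budget_remaining is assumed to be a 0-1 gauge from your SLO tooling
  expr: error_budget_remaining < 0.5
  for: 5m  # condition must hold for 5 minutes before paging
  labels:
    severity: page  # a label, so Alertmanager can route on it
  annotations:
    summary: Less than 50% of the error budget remains
    runbook: https://wiki/runbooks/error-budget

# Disk: ticket, not page:
- alert: DiskUsageHigh
  expr: disk_used_percent > 80
  for: 30m
  labels:
    severity: ticket  # Not a page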
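The page-versus-ticket split only takes effect if notifications are routed on it. A minimal Alertmanager sketch follows, assuming `severity` is attached as a label (Alertmanager routes on labels, not annotations). The receiver names, the email address, and the integration key placeholder are hypothetical, and `email_configs` also needs SMTP settings in the `global` section, which are omitted here.

```yaml
# alertmanager.yml fragment
route:
  receiver: ticket-queue          # default: anything unmatched becomes a ticket
  routes:
    - matchers:
        - severity = "page"
      receiver: oncall-pager
    - matchers:
        - severity = "ticket"
      receiver: ticket-queue

receivers:
  - name: oncall-pager
    pagerduty_configs:
      - routing_key: <your-integration-key>   # placeholder, not a real key
  - name: ticket-queue
    email_configs:
      - to: team-tickets@example.com          # hypothetical address
```

Defaulting to the ticket receiver means a rule with a missing or misspelled severity label does not page anyone by accident.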

Added 23 Mar 2026
Edited 31 Mar 2026
DEV INTEL Tools & Severity
🟠 Severity: High · ⚙ Fix effort: Medium
⚡ Quick Fix
Link every alert to a runbook. Page only when immediate human action is required. Use multi-window alerts that combine fast-burn and slow-burn rules. Inhibit downstream alerts while the root-cause alert fires. Review alert volume weekly.
📦 Applies To
web cli queue-worker
🔍 Detection Hints
Auto-detectable: ✗ No · Tools: prometheus, alertmanager, pagerduty
🤖 AI Agent
Confidence: Low · False Positives: High · ✗ Manual fix · Fix: Medium · Context: File
