Alerting Best Practices

Q: How do I apply Alerting Best Practices?

Link every alert to a runbook. Page only when immediate action is required. Use multi-window alerts that cover both fast and slow budget burn. Inhibit downstream alerts while the root-cause alert is firing. Review alert volume weekly.

observability Intermediate

TL;DR

Good alerts are actionable, symptom-based, and rare — page on user impact, not causes. Alert fatigue from noisy alerts is as dangerous as no alerts.

Explanation

Principles:
(1) Alert on symptoms (high error rate), not causes (disk full); a cause without a symptom is rarely urgent.
(2) Every alert should link to a runbook.
(3) Pages should require immediate human action; if it can wait, send an email or open a ticket instead.
(4) Avoid alert fatigue: too many false positives teach on-call engineers to ignore real alerts.
(5) Multi-window: combining fast-burn and slow-burn alerts catches both sudden and gradual error budget loss while reducing false positives.
(6) Inhibition: suppress downstream alerts while the root-cause alert is firing.
(7) Deadman switch: alert when an expected metric stops arriving, which detects failures of the monitoring pipeline itself.

Practical target: aim for fewer than 5 pages per week for on-call. High-value alerts: latency SLO, error rate, saturation.
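
The deadman switch in principle (7) could look roughly like the Prometheus rule below. This is a minimal sketch, not a definitive setup: the `http_requests_total` series, the `checkout` job, the 10-minute duration, and the runbook URL are all assumptions chosen for illustration.

```yaml
groups:
  - name: deadman
    rules:
      - alert: CheckoutMetricsMissing
        # absent() returns a value only when no matching series exists,
        # so this fires once the expected metric stops arriving.
        expr: absent(http_requests_total{job="checkout"})  # hypothetical metric and job
        for: 10m  # example duration, tune to your scrape interval
        labels:
          severity: page
        annotations:
          summary: No http_requests_total samples from the checkout job
          runbook: https://wiki/runbooks/missing-metrics  # hypothetical URL
```

A rule like this only helps while Prometheus itself is running, so it covers a missing exporter or broken scrape rather than a failure of the whole monitoring stack.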

Watch Out

⚠ Flapping alerts — ones that fire and resolve repeatedly without human action — are more damaging than no alert at all; they condition on-call engineers to ignore recurring patterns that eventually indicate real incidents.
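
One way to damp flapping, sketched here under assumptions, is to require the condition to hold before firing and to keep the alert firing briefly after it clears. The recording rule name `job:http_errors:ratio_rate5m`, the `checkout` job, the 5% threshold, and both durations are hypothetical; `keep_firing_for` needs a reasonably recent Prometheus release.

```yaml
groups:
  - name: anti-flap
    rules:
      - alert: HighErrorRate
        # Hypothetical recording rule: ratio of failed requests over 5 minutes.
        expr: job:http_errors:ratio_rate5m{job="checkout"} > 0.05
        for: 10m              # condition must hold this long before the alert fires
        keep_firing_for: 15m  # stay firing this long after the condition clears
        labels:
          severity: page
        annotations:
          summary: Checkout error ratio above 5%
```

Longer durations trade detection speed for stability, so the values are a judgment call per alert.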

Common Misconception

✗ More alerts means better monitoring — alert volume is a metric too. High alert volume causes fatigue and missed real incidents.

Why It Matters

Alert fatigue kills incident response — on-call engineers who receive 50 alerts/night start ignoring them, including real ones.

Common Mistakes

Alerting on causes (disk 80%) rather than symptoms (requests failing).
No runbook linked from alert — on-call doesn't know what to do.
Paging for non-urgent issues — should be ticket/email.
No inhibition — one outage fires 50 dependent alerts.
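
The last mistake above can be addressed with an Alertmanager inhibition rule. The fragment below is a sketch under stated assumptions: the `DatabaseDown` alert name, the `depends_on` label, and the `cluster` label are hypothetical and would need to match labels your own alerts actually carry.

```yaml
# alertmanager.yml fragment
inhibit_rules:
  - source_matchers:
      # The root-cause alert that, while firing, mutes its dependents.
      - alertname = "DatabaseDown"
    target_matchers:
      # Hypothetical label set on alerts for services that need the database.
      - depends_on = "database"
    # Only inhibit targets in the same cluster as the firing source alert.
    equal: ["cluster"]
```

With a rule like this, one database outage produces one page instead of one page per dependent service.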

Avoid When

Avoid alerting on thresholds without a defined response — if there is no runbook action, the alert is noise.
Do not create alerts that auto-resolve before an engineer can investigate — they train teams to ignore firing alerts.
Avoid identical alerts in multiple channels for the same incident; deduplication should happen before notification.
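
For the deduplication point above, Alertmanager can group related alerts into a single notification before anything reaches a channel. This is an illustrative sketch: the receiver name, grouping labels, and timing values are assumptions, and the receiver's notification settings are omitted.

```yaml
# alertmanager.yml fragment
route:
  receiver: on-call                     # hypothetical receiver name
  group_by: ["alertname", "cluster"]    # alerts sharing these labels become one notification
  group_wait: 30s                       # wait briefly so related alerts arrive together
  group_interval: 5m                    # minimum gap before sending updates for a group
  repeat_interval: 4h                   # re-notify for still-firing groups at this interval
receivers:
  - name: on-call
```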

When To Use

Alert on symptoms (error rate, latency percentiles, availability) rather than causes (CPU, memory, queue depth).
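Set multi-window burn rate alerts: pair a long window with a short one (for example 1 h and 5 min) at a high burn rate to catch fast SLO exhaustion, and add a longer pair (for example 6 h and 30 min) at a lower burn rate to catch slow exhaustion.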
Review and prune alerts quarterly — remove any that have not led to a human action in the past 30 days.
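
The multi-window burn rate item above can be sketched as a pair of Prometheus rules. This is an illustration rather than a definitive configuration: it assumes a 99.9% SLO (error budget 0.001) over 30 days, and hypothetical recording rules named `job:slo_errors_per_request:ratio_rate<window>` for a hypothetical `checkout` job. The thresholds follow the approach described in the referenced SRE workbook chapter.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # A 14.4x burn rate spends about 2% of a 30-day budget in 1 hour.
        # The 5m window makes the alert stop soon after the burn stops.
        expr: |
          job:slo_errors_per_request:ratio_rate1h{job="checkout"} > (14.4 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate5m{job="checkout"} > (14.4 * 0.001)
        labels:
          severity: page
        annotations:
          runbook: https://wiki/runbooks/error-budget
      - alert: ErrorBudgetSlowBurn
        # A 6x burn rate spends about 5% of a 30-day budget in 6 hours.
        expr: |
          job:slo_errors_per_request:ratio_rate6h{job="checkout"} > (6 * 0.001)
          and
          job:slo_errors_per_request:ratio_rate30m{job="checkout"} > (6 * 0.001)
        labels:
          severity: page
        annotations:
          runbook: https://wiki/runbooks/error-budget
```

Requiring both windows in each rule means a brief spike alone does not page, and a past burn that has already stopped does not keep paging.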

Code Examples

💡 Note: The bad config pages on raw disk usage at 80% with no window or runbook; the good version pages on error budget consumption with a linked runbook, and routes disk usage to a ticket instead of a page.

✗ Vulnerable

# Cause-based, no runbook, and fires too often:
- alert: DiskUsageHigh
  expr: disk_used_percent > 80
  # Pages on-call at 3am with no guidance on what to do.

✓ Fixed

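- alert: ErrorBudgetBurning
  # error_budget_remaining is assumed to be the fraction of budget left (0 to 1).
  expr: error_budget_remaining < 0.5
  for: 5m  # example duration: condition must hold before paging
  labels:
    severity: page  # routing label, so it belongs under labels
  annotations:
    summary: More than 50% of the error budget is consumed
    runbook: https://wiki/runbooks/error-budget

# Disk: ticket, not page:
- alert: DiskUsageHigh
  expr: disk_used_percent > 80
  for: 30m  # example duration
  labels:
    severity: ticket  # Not a page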

References

↗ https://sre.google/workbook/alerting-on-slos/