Alerting & On-Call
debt(d7/e5/b7/t7)
Closest to 'only careful code review or runtime testing' (d7). While tools like Prometheus, Grafana, Datadog, and PagerDuty exist, they don't automatically detect *bad* alerting practices — they're the platforms where alerts are configured, not analyzers of alert quality. Detecting problems like alert fatigue, missing SLO-based alerts, or alerting on causes instead of symptoms requires careful operational review or waiting until production incidents are missed. There's no automated linter that catches 'you're alerting on CPU instead of error rate.' The detection_hints confirm automated=no.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix summarizes it as 'alert on symptoms not causes,' which sounds simple but in practice requires reviewing and restructuring all existing alert rules, establishing SLO thresholds, creating runbooks, and tuning thresholds based on historical baselines. This touches alerting configurations across multiple services and monitoring platforms. Not quite architectural rework (e7), but definitely more than a simple parameterized fix.
Closest to 'strong gravitational pull' (b7). Alerting strategy applies across all contexts (web, cli) and shapes how the entire team operates. Poor alerting decisions create ongoing operational tax: alert fatigue degrades on-call effectiveness over time (decay), and alerting configuration touches every service in the system (reach). Every new service or feature must have its alerting strategy considered. The choice of alerting philosophy (symptoms vs causes, SLO-based burn rates) shapes incident response culture and on-call practices system-wide.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states the trap: 'More alerts mean better monitoring coverage.' This is the intuitive but wrong belief — a competent developer new to operations would naturally assume comprehensive resource-level alerting (CPU, memory, disk) provides better coverage. The correct approach (alert on symptoms, fewer high-signal alerts) contradicts the 'more is better' instinct. Alerting on CPU>80% feels responsible but misses the point — resources can spike without user impact. This trap is well-documented but consistently catches teams.
Also Known As
TL;DR
Explanation
Good alerting fires on symptoms, not causes — alert when users are impacted (elevated error rate, high latency, failed jobs), not on every infrastructure metric. Alert fatigue from too many low-quality alerts leads to on-call engineers ignoring pages. Principles: every alert must be actionable (a human can do something about it), must link to a runbook, and must carry a clear severity level. Tools: Prometheus Alertmanager, Datadog monitors, PagerDuty routing. For PHP applications, typical critical alerts include: 5xx rate > 1% for 5 minutes, p99 latency > 2s, queue depth > 10,000, failed cron jobs, and certificate expiry within 14 days. Track mean time to acknowledge (MTTA) and mean time to resolve (MTTR) as on-call health metrics.
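As a rough illustration of the MTTA/MTTR health metrics mentioned above, here is a minimal sketch; the incident records and their timestamp fields are illustrative assumptions, not a real incident-tracker API.
<?php
// Minimal MTTA/MTTR sketch. Each incident carries created_at, acknowledged_at
// and resolved_at as Unix timestamps; the values below are made up.
$incidents = [
    ['created_at' => 1700000000, 'acknowledged_at' => 1700000120, 'resolved_at' => 1700001800],
    ['created_at' => 1700050000, 'acknowledged_at' => 1700050420, 'resolved_at' => 1700052600],
];

$tta = array_map(fn ($i) => $i['acknowledged_at'] - $i['created_at'], $incidents);
$ttr = array_map(fn ($i) => $i['resolved_at'] - $i['created_at'], $incidents);

$mtta = array_sum($tta) / count($tta);   // mean time to acknowledge, seconds
$mttr = array_sum($ttr) / count($ttr);   // mean time to resolve, seconds

printf("MTTA: %.0fs  MTTR: %.0fs\n", $mtta, $mttr);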
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Alerting on every spike — low signal-to-noise ratio causes alert fatigue and ignored alerts.
- Not setting alert thresholds based on historical baselines — arbitrary numbers produce false positives (see the baseline sketch after this list).
- Alerting on resource usage (CPU 80%) rather than symptoms (error rate >1%) — resources spike without user impact.
- No runbook linked from the alert — on-call engineers must guess what to do at 3am.
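A minimal sketch of deriving a threshold from a historical baseline, as the second mistake above recommends; the sample daily p99 latencies and the 20% headroom are illustrative assumptions.
<?php
// Derive an alert threshold from a historical baseline instead of picking an arbitrary number.
// $dailyP99Ms would come from your metrics store; values here are made-up milliseconds.
$dailyP99Ms = [480, 510, 495, 620, 530, 505, 540, 700, 515, 560];

sort($dailyP99Ms);
$idx      = (int) ceil(0.95 * count($dailyP99Ms)) - 1;   // 95th percentile of recent daily p99s
$baseline = $dailyP99Ms[$idx];

$thresholdMs = (int) round($baseline * 1.2);             // 20% headroom above normal worst days

echo "Alert when p99 latency > {$thresholdMs}ms for 5 minutes\n";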
Avoid When
- Avoid alerting on raw infrastructure metrics (CPU, memory) in isolation — they are symptoms, not the user-facing problem.
- Do not set alert thresholds without historical data — too sensitive creates noise, too loose misses real incidents.
- Avoid alerts with no clear owner or escalation path — unowned alerts are routinely silenced and never acted on.
When To Use
- Alert when a user-facing SLI (error rate, latency, availability) breaches its SLO threshold for a sustained window (see the burn-rate sketch after this list).
- Use alerting for conditions that require a human decision — if the correct response is fully automated, self-heal instead of page.
- Tie every alert to a runbook so on-call engineers know exactly what to check and what actions are authorised.
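A minimal sketch of a sustained-window, SLO-based burn-rate check; the 14.4x/6x pairing is the commonly cited multi-window rule, and the observed error ratios below are illustrative values you would pull from a metrics store.
<?php
// Multi-window burn-rate sketch for a 99.9% availability SLO.
$slo         = 0.999;
$errorBudget = 1 - $slo;          // 0.1% of requests may fail over the SLO window

$errorRate1h = 0.015;             // failed / total requests over the last hour (illustrative)
$errorRate6h = 0.011;             // failed / total requests over the last 6 hours (illustrative)

$burnRate1h = $errorRate1h / $errorBudget;   // ~15x budget burn
$burnRate6h = $errorRate6h / $errorBudget;   // ~11x budget burn

// Page only when both the short and the long window burn the budget fast.
if ($burnRate1h > 14.4 && $burnRate6h > 6) {
    printf("PAGE: error budget burning %.1fx too fast\n", $burnRate1h);
} else {
    echo "No page: ticket or keep watching\n";
}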
Code Examples
// Alert on a metric with no context:
$badAlert = [
    'alert' => 'HighCPU',
    'condition' => 'cpu > 80%',
    'severity' => 'critical',
    // No runbook, no context, no historical baseline
    // Fires 20 times a day — team ignores it
];
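For contrast, a sketch of what a symptom-based version of the same alert could look like; the field names mirror the snippet above, and the runbook URL, owner, and thresholds are placeholders rather than a real alerting platform's schema.
// Better: alert on the symptom, link a runbook, require a sustained window
$goodAlert = [
    'alert' => 'HighErrorRate',
    'condition' => '5xx_rate > 1% for 5 minutes',   // threshold tuned from historical baseline
    'severity' => 'critical',
    'runbook' => 'https://wiki.example.com/runbooks/high-error-rate',   // placeholder URL
    'owner' => 'team-payments',                                          // placeholder owner
];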
# Alerting principles — page on symptoms, not causes
# Symptom-based alerts (user-visible, always page):
# - Error rate > 1% for 5 minutes
# - p99 latency > 2000ms for 5 minutes
# - Health endpoint returning non-200
# Cause-based alerts (ticket, don't page at 3am):
# - Disk > 80% used
# - Queue depth > 10,000
# - Redis memory > 70%
# PHP health endpoint (Laravel; assumes the Route, DB, Redis and Queue facades are in scope):
Route::get('/health', function () {
    // Any exception thrown here (e.g. database down) returns a non-200, tripping the alert above
    return response()->json([
        'db' => DB::connection()->getPdo() ? 'ok' : 'error',
        'cache' => Redis::connection()->ping() ? 'ok' : 'error', // Cache facade has no connection(); use Redis
        'queue' => Queue::size() < 50000 ? 'ok' : 'degraded',
        'php' => PHP_VERSION,
    ]);
});
# Alert routing: critical → PagerDuty → on-call engineer
# warning → Slack #alerts
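One possible shape for that routing rule, kept in the same PHP-array style as the snippets above; the keys, channel names, and integrations are illustrative, not a real PagerDuty or Slack API.
$alertRouting = [
    'critical' => ['route' => 'pagerduty', 'escalation' => 'on-call engineer'],
    'warning' => ['route' => 'slack', 'channel' => '#alerts'],
];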