Alerting & On-Call
debt(d7/e5/b7/t7)
Closest to 'only careful code review or runtime testing' (d7). While tools like Prometheus, Grafana, Datadog, and PagerDuty exist, they don't automatically detect *bad* alerting practices — they're the platforms where alerts are configured, not analyzers of alert quality. Detecting problems like alert fatigue, missing SLO-based alerts, or alerting on causes instead of symptoms requires careful operational review or waiting until production incidents are missed. There's no automated linter that catches 'you're alerting on CPU instead of error rate.' The detection_hints confirm automated=no.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix summarizes it as 'alert on symptoms not causes,' which sounds simple but in practice requires reviewing and restructuring all existing alert rules, establishing SLO thresholds, creating runbooks, and tuning thresholds based on historical baselines. This touches alerting configurations across multiple services and monitoring platforms. Not quite architectural rework (e7), but definitely more than a simple parameterized fix.
Closest to 'strong gravitational pull' (b7). Alerting strategy applies across all contexts (web, cli) and shapes how the entire team operates. Poor alerting decisions create ongoing operational tax: alert fatigue degrades on-call effectiveness over time (decay), and alerting configuration touches every service in the system (reach). Every new service or feature must have its alerting strategy considered. The choice of alerting philosophy (symptoms vs causes, SLO-based burn rates) shapes incident response culture and on-call practices system-wide.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states the trap: 'More alerts mean better monitoring coverage.' This is the intuitive but wrong belief — a competent developer new to operations would naturally assume comprehensive resource-level alerting (CPU, memory, disk) provides better coverage. The correct approach (alert on symptoms, fewer high-signal alerts) contradicts the 'more is better' instinct. Alerting on CPU>80% feels responsible but misses the point — resources can spike without user impact. This trap is well-documented but consistently catches teams.
Also Known As
TL;DR
Explanation
Good alerting fires on symptoms, not causes — alert when users are impacted (elevated error rate, high latency, failed jobs), not on every infrastructure metric. Alert fatigue from too many low-quality alerts leads to on-call engineers ignoring pages. Principles: every alert must be actionable (a human can do something about it), must link to a runbook, and must carry a clear severity level. Tools: Prometheus Alertmanager, Datadog monitors, PagerDuty routing. For PHP applications, typical critical alerts include: 5xx rate > 1% for 5 minutes, p99 latency > 2s, queue depth > 10,000, failed cron jobs, and certificate expiry within 14 days. Track mean time to acknowledge (MTTA) and mean time to resolve (MTTR) as on-call health metrics.
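As a rough illustration of the MTTA/MTTR health metrics mentioned above, here is a minimal sketch; the incident records and their timestamp fields are illustrative assumptions, not a real incident-tracker API.
<?php
// Minimal MTTA/MTTR sketch. Each incident carries created_at, acknowledged_at
// and resolved_at as Unix timestamps; the values below are made up.
$incidents = [
    ['created_at' => 1700000000, 'acknowledged_at' => 1700000120, 'resolved_at' => 1700001800],
    ['created_at' => 1700050000, 'acknowledged_at' => 1700050420, 'resolved_at' => 1700052600],
];

$tta = array_map(fn ($i) => $i['acknowledged_at'] - $i['created_at'], $incidents);
$ttr = array_map(fn ($i) => $i['resolved_at'] - $i['created_at'], $incidents);

$mtta = array_sum($tta) / count($tta);   // mean time to acknowledge, seconds
$mttr = array_sum($ttr) / count($ttr);   // mean time to resolve, seconds

printf("MTTA: %.0fs  MTTR: %.0fs\n", $mtta, $mttr);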
Watch Out
Common Misconception
Why It Matters
Common Mistakes
- Alerting on every spike — low signal-to-noise ratio causes alert fatigue and ignored alerts.
- Not setting alert thresholds based on historical baselines — arbitrary numbers produce false positives (see the baseline sketch after this list).
- Alerting on resource usage (CPU 80%) rather than symptoms (error rate >1%) — resources spike without user impact.
- No runbook linked from the alert — on-call engineers must guess what to do at 3am.
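A minimal sketch of deriving a threshold from a historical baseline, as the second mistake above recommends; the sample daily p99 latencies and the 20% headroom are illustrative assumptions.
<?php
// Derive an alert threshold from a historical baseline instead of picking an arbitrary number.
// $dailyP99Ms would come from your metrics store; values here are made-up milliseconds.
$dailyP99Ms = [480, 510, 495, 620, 530, 505, 540, 700, 515, 560];

sort($dailyP99Ms);
$idx      = (int) ceil(0.95 * count($dailyP99Ms)) - 1;   // 95th percentile of recent daily p99s
$baseline = $dailyP99Ms[$idx];

$thresholdMs = (int) round($baseline * 1.2);             // 20% headroom above normal worst days

echo "Alert when p99 latency > {$thresholdMs}ms for 5 minutes\n";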
Avoid When
- Avoid alerting on raw infrastructure metrics (CPU, memory) in isolation — they are symptoms, not the user-facing problem.
- Do not set alert thresholds without historical data — too sensitive creates noise, too loose misses real incidents.
- Avoid alerts with no clear owner or escalation path — unowned alerts are routinely silenced and never acted on.
When To Use
- Alert when a user-facing SLI (error rate, latency, availability) breaches its SLO threshold for a sustained window (see the burn-rate sketch after this list).
- Use alerting for conditions that require a human decision — if the correct response is fully automated, self-heal instead of page.
- Tie every alert to a runbook so on-call engineers know exactly what to check and what actions are authorised.
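A minimal sketch of a sustained-window, SLO-based burn-rate check; the 14.4x/6x pairing is the commonly cited multi-window rule, and the observed error ratios below are illustrative values you would pull from a metrics store.
<?php
// Multi-window burn-rate sketch for a 99.9% availability SLO.
$slo         = 0.999;
$errorBudget = 1 - $slo;          // 0.1% of requests may fail over the SLO window

$errorRate1h = 0.015;             // failed / total requests over the last hour (illustrative)
$errorRate6h = 0.011;             // failed / total requests over the last 6 hours (illustrative)

$burnRate1h = $errorRate1h / $errorBudget;   // ~15x budget burn
$burnRate6h = $errorRate6h / $errorBudget;   // ~11x budget burn

// Page only when both the short and the long window burn the budget fast.
if ($burnRate1h > 14.4 && $burnRate6h > 6) {
    printf("PAGE: error budget burning %.1fx too fast\n", $burnRate1h);
} else {
    echo "No page: ticket or keep watching\n";
}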
Code Examples
// Alert on a metric with no context:
$badAlert = [
    'alert' => 'HighCPU',
    'condition' => 'cpu > 80%',
    'severity' => 'critical',
    // No runbook, no context, no historical baseline
    // Fires 20 times a day — team ignores it
];
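For contrast, a sketch of what a symptom-based version of the same alert could look like; the field names mirror the snippet above, and the runbook URL, owner, and thresholds are placeholders rather than a real alerting platform's schema.
// Better: alert on the symptom, link a runbook, require a sustained window
$goodAlert = [
    'alert' => 'HighErrorRate',
    'condition' => '5xx_rate > 1% for 5 minutes',   // threshold tuned from historical baseline
    'severity' => 'critical',
    'runbook' => 'https://wiki.example.com/runbooks/high-error-rate',   // placeholder URL
    'owner' => 'team-payments',                                          // placeholder owner
];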
# Alerting principles — page on symptoms, not causes
# Symptom-based alerts (user-visible, always page):
# - Error rate > 1% for 5 minutes
# - p99 latency > 2000ms for 5 minutes
# - Health endpoint returning non-200
# Cause-based alerts (ticket, don't page at 3am):
# - Disk > 80% used
# - Queue depth > 10,000
# - Redis memory > 70%
# PHP health endpoint (Laravel; assumes the Route, DB, Redis and Queue facades are in scope):
Route::get('/health', function () {
    // Any exception thrown here (e.g. database down) returns a non-200, tripping the alert above
    return response()->json([
        'db' => DB::connection()->getPdo() ? 'ok' : 'error',
        'cache' => Redis::connection()->ping() ? 'ok' : 'error', // Cache facade has no connection(); use Redis
        'queue' => Queue::size() < 50000 ? 'ok' : 'degraded',
        'php' => PHP_VERSION,
    ]);
});
# Alert routing: critical → PagerDuty → on-call engineer
# warning → Slack #alerts
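One possible shape for that routing rule, kept in the same PHP-array style as the snippets above; the keys, channel names, and integrations are illustrative, not a real PagerDuty or Slack API.
$alertRouting = [
    'critical' => ['route' => 'pagerduty', 'escalation' => 'on-call engineer'],
    'warning' => ['route' => 'slack', 'channel' => '#alerts'],
];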