SLO / SLI / SLA
debt(d7/e5/b7/t5)
Closest to 'only careful code review or runtime testing' (d7). The metadata states automated detection is 'no', and while tools like Prometheus, Datadog, and Grafana can surface SLO metrics, they only help if SLOs have already been defined and wired up. Misconfigurations — like SLOs set without baselines, or conflating SLO with SLA — are invisible operationally until month-end reviews or customer complaints surface the breach. No tool automatically flags that your SLO is misconfigured relative to your SLA.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix describes defining SLIs for availability and latency, setting the SLO 0.5 percentage points stricter than the SLA, and tracking a 28-day rolling window with alerting. This spans instrumentation code, alerting configuration, dashboards, and organizational process. It is more than a one-line fix but doesn't require full architectural rework. It touches multiple systems (metrics pipeline, alerting rules, runbooks).
Closest to 'strong gravitational pull' (b7). SLOs apply across web, CLI, and queue-worker contexts per applies_to. Once adopted, every deployment decision, on-call policy, and feature prioritization is shaped by error budget consumption. The choice propagates into monitoring config, incident response, team processes, and product roadmaps, so nearly every engineering workflow is influenced by how SLOs are defined.
Closest to 'notable trap — a documented gotcha most devs eventually learn' (t5). The misconception field explicitly states the canonical trap: developers conflate SLO and SLA, treating them as synonyms. The common_mistakes reinforce this — confusing SLO with SLA is listed as a named mistake. This is a well-known gotcha in SRE circles but not immediately obvious to developers new to observability, making it a solid t5.
TL;DR
Explanation
- SLI (Service Level Indicator): a measured metric, such as request success rate, p99 latency, or availability.
- SLO (Service Level Objective): your target for an SLI, for example 'p99 latency < 200ms' or '99.9% of requests succeed'. It is an internal goal: what you aim for.
- SLA (Service Level Agreement): a contractual commitment with consequences (refunds, penalties), for example '99.9% uptime per month'. It is usually looser than the SLO, so the gap acts as a buffer.
- Error budget: (1 - SLO) × time period. A 99.9% SLO allows about 43.8 minutes of failure in an average month (43.2 minutes in a 30-day month).
SLOs guide engineering priorities: if the error budget is burning too fast, freeze releases and investigate. The framework was popularized by the Google SRE book.
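The error budget formula, (1 - SLO) × time period, can be checked with a few lines of Python. This is a minimal sketch; the function name `error_budget_minutes` and the choice of minutes as the unit are illustrative assumptions, not part of any SRE tooling.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed failure time, in minutes, for an SLO over a window.

    `slo` is a fraction (0.999 for 99.9%), `window_days` the window length.
    Both the name and the units are illustrative choices.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo) * minutes_in_window


# 99.9% over a 30-day month: about 43.2 minutes
budget_30_day_month = error_budget_minutes(0.999, 30)

# 99.9% over an average calendar month (365.25 / 12 days): about 43.8 minutes
budget_average_month = error_budget_minutes(0.999, 365.25 / 12)

# 99.5% over a 28-day rolling window: about 201.6 minutes (3.36 hours)
budget_28_day_window = error_budget_minutes(0.995, 28)
```

The window length matters as much as the percentage: the same 99.9% target yields slightly different budgets depending on whether "a month" means 30 days or an average calendar month.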
Common Misconception
Why It Matters
Common Mistakes
- Setting SLOs without measuring the current baseline — targets must be achievable.
- Confusing SLO with SLA — SLO should be stricter than SLA.
- Not tracking SLO compliance continuously — only noticing at month end.
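One way to avoid noticing a breach only at month end is to compute a burn rate continuously: the observed error rate divided by the error rate the SLO allows. The sketch below assumes you already have failed and total request counts for some recent window; the function names and the paging threshold are hypothetical and would need tuning against your own baseline.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Ratio of observed error rate to the error rate the SLO allows.

    1.0 means the budget would be spent exactly at the end of the SLO
    window; values above 1.0 mean it will run out early.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    observed_error_rate = failed / total
    allowed_error_rate = 1.0 - slo
    return observed_error_rate / allowed_error_rate


def should_page(failed: int, total: int, slo: float, threshold: float) -> bool:
    """Hypothetical alert rule: page when the burn rate reaches the threshold."""
    return burn_rate(failed, total, slo) >= threshold


# Hypothetical counts for the last hour: 120 failures out of 10,000 requests.
# Against a 99.5% SLO the error rate is 1.2% versus 0.5% allowed,
# so the budget is burning at roughly 2.4 times the sustainable pace.
last_hour = burn_rate(failed=120, total=10_000, slo=0.995)
page_now = should_page(failed=120, total=10_000, slo=0.995, threshold=2.0)
```

Evaluating this over both a short and a long window reduces noise from brief spikes, at the cost of slower detection.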
Code Examples
// Vague commitment:
// 'We aim for high availability'
// No measurement, no target, no accountability
// SLI: request success rate (non-5xx / total)
// SLO: 99.5% over 28-day rolling window
// SLA: 99.0% (contractual, with refund below)
// Prometheus SLO compliance (returns a sample only while the SLO is met):
// sum(rate(http_requests_total{code!~'5..'}[28d])) /
// sum(rate(http_requests_total[28d])) > 0.995
// Total error budget for the window:
// (1 - 0.995) * 28d = 0.14d ≈ 3.36h per 28-day window
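For a request-based SLI like the one above, the remaining error budget can also be expressed as a fraction of allowed failures rather than as time. The following is a sketch under that assumption; `budget_remaining` and the request counts are hypothetical, and in practice the counts would come from your metrics backend.

```python
def budget_remaining(failed: int, total: int, slo: float) -> float:
    """Fraction of the error budget still unspent over the window.

    1.0 means untouched, 0.0 means fully spent, and a negative value
    means the SLO has been breached for this window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no requests yet, so no budget has been consumed
    allowed_failures = (1.0 - slo) * total
    return 1.0 - failed / allowed_failures


# Hypothetical 28-day counts: 1,000,000 requests against a 99.5% SLO
# allow about 5,000 failures.
healthy = budget_remaining(failed=2_000, total=1_000_000, slo=0.995)   # about 0.6
breached = budget_remaining(failed=6_000, total=1_000_000, slo=0.995)  # about -0.2
```

A value trending toward zero well before the window ends is the signal to slow releases and investigate.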