
SLO / SLI / SLA

Q: What is a common misconception about SLO / SLI / SLA?

That SLO and SLA are the same. The SLO is an internal, aspirational target; the SLA is an external, contractual commitment. The SLO is set stricter than the SLA so you catch issues before breaching the SLA.

Q: Why does SLO / SLI / SLA matter?

SLOs replace vague reliability goals with measurable targets — making on-call decisions data-driven: 'should we deploy?' becomes 'how much error budget remains?'

Q: How do I fix SLO / SLI / SLA?

Define SLIs for availability and latency. Set the SLO about 0.5 percentage points stricter than the SLA. Track compliance over a 28-day rolling window. Alert when more than 50% of the error budget has been consumed.

Observability Intermediate

debt(d7/e5/b7/t5)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The metadata states automated detection is 'no', and while tools like Prometheus, Datadog, and Grafana can surface SLO metrics, they only help if SLOs have already been defined and wired up. Misconfigurations — like SLOs set without baselines, or conflating SLO with SLA — are invisible operationally until month-end reviews or customer complaints surface the breach. No tool automatically flags that your SLO is misconfigured relative to your SLA.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix describes defining SLIs for availability and latency, setting SLO 0.5% stricter than SLA, and tracking a 28-day rolling window with alerting — this spans instrumentation code, alerting configuration, dashboards, and organizational process. It is more than a one-line fix but doesn't require full architectural rework. It touches multiple systems (metrics pipeline, alerting rules, runbooks).

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). SLOs apply across web, CLI, and queue-worker contexts per applies_to. Once adopted, every deployment decision, on-call policy, and feature prioritization is shaped by error budget consumption. The choice propagates into monitoring config, incident response, team processes, and product roadmaps, so nearly every engineering workflow is influenced by how SLOs are defined.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap — a documented gotcha most devs eventually learn' (t5). The misconception field explicitly states the canonical trap: developers conflate SLO and SLA, treating them as synonyms. The common_mistakes reinforce this — confusing SLO with SLA is listed as a named mistake. This is a well-known gotcha in SRE circles but not immediately obvious to developers new to observability, making it a solid t5.

About DEBT scoring → scored by claude-sonnet-4-6 · 2026-05-11 · reviewed by human

TL;DR

SLI (what you measure), SLO (your internal target), SLA (your customer commitment) — the hierarchy that turns vague 'uptime' promises into measurable operational objectives.

Explanation

SLI (Service Level Indicator): a measured metric, such as request success rate, p99 latency, or availability.
SLO (Service Level Objective): your internal target for an SLI, for example 'p99 latency < 200ms' or '99.9% of requests succeed'.
SLA (Service Level Agreement): a contractual commitment with consequences (refunds, penalties), for example '99.9% uptime per month'. It is usually less strict than the SLO, which leaves a buffer.
Error budget: (1 - SLO) × time period. A 99.9% SLO allows about 43.8 minutes of failure in an average month (43.2 minutes over 30 days).
SLOs guide engineering priorities: if the error budget is burning too fast, freeze releases and investigate. The Google SRE book popularized this framework.
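The error budget formula, (1 - SLO) × time period, can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function name `error_budget` is hypothetical, and the SLO is assumed to be given as a fraction (0.999 for 99.9%).

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the allowed 'bad' time in the window: (1 - SLO) x window.

    `slo` is a fraction, e.g. 0.999 for a 99.9% objective (assumed convention).
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


if __name__ == "__main__":
    # An average month is 365.25 / 12 days.
    average_month = timedelta(days=365.25 / 12)
    budget = error_budget(0.999, average_month)
    print(round(budget.total_seconds() / 60, 1))  # prints 43.8
```

Expressing the window as a `timedelta` keeps the result unit-free until it is displayed, so the same function works for 28-day, 30-day, or calendar-month windows.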

Common Misconception

✗ SLO and SLA are the same. In fact, the SLO is an internal, aspirational target; the SLA is an external, contractual commitment. The SLO is set stricter than the SLA so you catch issues before breaching the SLA.

Why It Matters

SLOs replace vague reliability goals with measurable targets — making on-call decisions data-driven: 'should we deploy?' becomes 'how much error budget remains?'

Common Mistakes

Setting SLOs without measuring the current baseline — targets must be achievable.
Confusing SLO with SLA — SLO should be stricter than SLA.
Not tracking SLO compliance continuously — only noticing at month end.
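
The first two mistakes can be caught with a simple sanity check before targets are published. The sketch below is illustrative only: `check_targets` is a hypothetical helper, and all three values are assumed to be fractions measured on the same SLI over the same window.

```python
def check_targets(baseline: float, slo: float, sla: float) -> list[str]:
    """Return a list of problems with a proposed SLO/SLA pair.

    baseline: currently measured SLI (e.g. 0.997 success rate)
    slo:      proposed internal objective
    sla:      contractual commitment
    """
    problems = []
    if slo <= sla:
        problems.append(
            "SLO is not stricter than SLA: no buffer before a contractual breach"
        )
    if baseline < slo:
        problems.append(
            "measured baseline is below the SLO: target is not currently achievable"
        )
    return problems


if __name__ == "__main__":
    for problem in check_targets(baseline=0.993, slo=0.995, sla=0.990):
        print(problem)
```

An empty list means the pair passes both checks; it does not mean the targets are right for the product, only that they are internally consistent.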

Code Examples

✗ Vulnerable

// Vague commitment:
// 'We aim for high availability'
// No measurement, no target, no accountability

✓ Fixed

// SLI: request success rate (non-5xx / total)
// SLO: 99.5% over 28-day rolling window
// SLA: 99.0% (contractual, with refund below)

// Prometheus SLO:
// sum(rate(http_requests_total{code!~'5..'}[28d])) /
// sum(rate(http_requests_total[28d])) > 0.995

// Total error budget for the window:
// (1 - 0.995) * 28d = 3.36h (about 3h 22m) per 28-day window
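
For a request-based SLI like the one above, budget consumption can also be tracked from raw counts rather than time. The following Python sketch assumes you can obtain total and failed request counts for the current window (for example from your metrics backend); the function names and the 50% alert threshold are illustrative assumptions, not part of any tool's API.

```python
def budget_consumed(total: int, failed: int, slo: float) -> float:
    """Fraction of the request-based error budget used so far in the window.

    1.0 means the budget is exactly exhausted; values above 1.0 mean the
    SLO has been missed for this window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total <= 0:
        return 0.0  # no traffic yet, so nothing has been consumed
    allowed_failures = (1.0 - slo) * total
    return failed / allowed_failures


def should_alert(total: int, failed: int, slo: float, threshold: float = 0.5) -> bool:
    """Alert once at least `threshold` of the error budget is consumed."""
    return budget_consumed(total, failed, slo) >= threshold


if __name__ == "__main__":
    # 1,000,000 requests in the window, 3,000 of them failed, SLO 99.5%.
    print(should_alert(total=1_000_000, failed=3_000, slo=0.995))
```

Evaluating this continuously over the rolling window, rather than at month end, is what lets the remaining budget inform deploy decisions.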

References

https://sre.google/sre-book/service-level-objectives/