DEBT — the formal spec.
Format, axes, anchor examples, scoring procedure, JSON schema, FAQ. This page is the canonical reference. Implementers should pin to it.
Format
A DEBT score is a single string, structured as:
debt(d<0-9>/e<0-9>/b<0-9>/t<0-9>)
Each axis is an integer in [0, 9]:
- 1–9 — valid score, anchored to the rubric in the axis section below.
- 0 — the axis genuinely doesn't apply to this concept (e.g. a pure syntax term has no Burden). Score 0 lives outside the rubric grid — see the N/A band for when to use it. Score 0 honestly. Don't force a 1.
Letter prefixes must be lowercase: d, e, b, t. Slash-separated. No spaces inside the parens. Order is fixed: d/e/b/t — the four axis letters spell the standard's name, so anyone seeing debt(d_/e_/b_/t_) can decode it without a legend.
Examples
debt(d4/e3/b2/t5)— a known security trap, partly caught by tooling, mechanically fixable, no lasting costdebt(d7/e4/b2/t9)— a catastrophic trap, mostly invisible to tools, localised fix, no lasting costdebt(d8/e7/b9/t3)— not a trap technically, but invisible operationally, expensive to undo, project-definingdebt(d1/e1/b0/t1)— benign concept; no surprise, fully detected, trivial to fix, no architectural commitment
The four axes
Each axis measures one independent kind of debt. They're independent on purpose — if you can predict one from another, the score is wasting a slot. Each axis carries its own identity colour throughout this rubric.
Can the compiler, linter, static analyser, test suite, or peer review catch a misuse before it ships?
Once you know the problem exists in your codebase, how much work to fix it?
If you make the wrong call here at the start of a project, how much pain does that inflict on every future maintainer?
Will a competent developer who's never used this concept guess wrong about how it behaves?
Both flavours score on the same b axis — you're not asked to separate them in the score itself. But naming the flavour in your rationale helps reviewers understand which kind of structural cost you're calling out:
The choice slowly poisons the codebase over months and years. Time is the carrier. Examples: an ORM that fights your DB; microservices adopted before product-market fit; a sync-only framework when you'll need real concurrency. The pain compounds gradually as the codebase grows around the wrong foundation.
The choice is load-bearing across a wide surface, so any bug or change has system-wide blast radius. Surface area is the carrier. Examples: a custom auth helper used by every controller; shared mutable state across a fan-out of services; a load-bearing data model the whole product depends on. The pain is acute the moment something needs to change.
A concept can have either flavour, both, or neither. Score them on the b-axis using the same anchor grid; flag the flavour in the rationale string (e.g. "Closest to b5 'persistent productivity tax' — reach flavour: a load-bearing helper used by every controller").
Anchors — the rubric
The reference grid below is the heart of the rubric for scores 1–9. Pick an axis and a score in the picker; the matching cell highlights. For scores at even numbers (2, 4, 6, 8), the row is intentionally blank — those scores are valid but interpolated between the surrounding anchors. Picking an even score highlights it together with the bracketing anchors above and below.
array_keys() returns the keys of an array. Behaves as named.array_merge() renumbers integer keys but preserves string keys. Trips you up once, then you remember.== vs === in PHP/JavaScript. Most experienced devs know. Still produces bugs in auth and validation under deadline pressure."0e1234..." == "0" evaluates to true. Has caused real CVEs (magic-hash bypass). Contradicts how == works in nearly every other language.__proto__ mutates every object. So far from the "objects are isolated" mental model that even seniors ship code with this footgun for years.Quick tests — one per axis
Ask "what would catch this?" Compiler → 1–2. Any linter → 3. A specialist tool → 5. An expert reviewer → 7. Nothing pre-production → 8–9.
"How long would a solo competent dev take to fix this?" One hour → 1–2. One day → 3. One week → 5. One month → 7. One quarter or more → 9.
"Five years from now, how much pain has this choice inflicted on devs who joined after it was made?" None → 1–2. Constant minor friction → 3–5. Major refactoring inevitable → 6–7. Project-existential → 8–9.
If you can imagine a competent senior dev getting this wrong on a Friday afternoon, score it ≥ 6. If only a junior would miss it, 3–5. If even a careful junior wouldn't miss it, 1–2.
Worked examples
Three concepts profiled end-to-end, so you can see how the four axes combine into a recognisable shape.
- d4
- Specialist linters catch the obvious cases. Subtle ones (dynamic table names, ORDER BY) slip through.
- e3
- Per-query fix is mechanical: parameterise. Audit across the codebase is the real cost.
- b2
- No long-term structural pain — once parameterised, it stays fixed.
- t5
- Most experienced devs know about it but still write vulnerable code under deadline. Notable trap.
- d8
- Almost no tooling tells you "this team is too small for this architecture." You learn it from velocity collapse.
- e7
- Going back to a monolith is months of disentangling, with hybrid bugs along the way.
- b9
- Project-existential. Many small teams die at this rock before realising it.
- t3
- Not surprising at the technical level — APIs work as documented.
- d7
- Static analysis catches some patterns. Subtle propagation through user input usually misses.
- e4
- Localised: validate input, freeze prototypes. A few hours to a couple of days per affected library.
- b2
- No long-term structural cost. Once patched, it stays patched.
- t9
- Catastrophic trap. Mutating one object's prototype changes every object — behaviour is far from any reasonable mental model.
Notice how different the shapes are. SQL injection is mid-trap, mid-effort, low-burden. Microservices-too-early is low-trap but huge burden. Prototype pollution is high-trap but low-burden. Three concepts, three different debt profiles, three different remediation strategies. That's what a single severity score cannot tell you.
Scoring with rationale
The rubric only delivers value if the scorer actually uses it. The procedure below is intentionally restrictive — it forces every score to cite which rubric anchor it's calibrating against, so a score isn't just plausible, it's auditable.
Every score must be accompanied by a rationale string that names the anchor it's calibrated against. Without rationale, a score is just a number; with rationale, future-you (or another reviewer) can verify whether the rubric was actually consulted.
The procedure
- Read the concept's definition first. Ground yourself in what the concept actually is, not what you remember.
- For each axis, locate the closest anchor. Open the anchor grid above. For axis D, find the row whose anchor name and example most closely match this concept's behaviour. Note the anchor's name and its score (1, 3, 5, 7, or 9).
- Assign the anchor's score ±1. If the concept matches the anchor exactly, use the anchor's score. If it's slightly worse than the anchor, score one level higher (an even number between this anchor and the next). If it's slightly better, score one level lower. You may only deviate by 1 from the closest anchor. When a concept genuinely sits between two anchors (e.g. somewhere between d3 and d5), cite whichever anchor your reasoning departs from — if your argument is "easier than specialist tool but harder than default linter," cite d5 with −1 if the concept is closer to specialist territory, or d3 with +1 if closer to default-linter territory. Pick the anchor that grounds your specific argument.
- Write a rationale string that names the anchor and explains the ±1 deviation if any. Example for d4: "Closest to specialist tool catches (d5), but most generic linters also catch the obvious case — one step easier."
- Use 0 only when the axis question genuinely cannot be asked. Pure syntax (
semicolon) and pure factual reference (http_status_codes) often score 0 on multiple axes — no architectural commitment to regret (b0), no fix because there's no bug (e0), and so on. Don't use 0 just because the answer is "barely any" — that's a 1. See the N/A band for the full explanation. - Sanity-check the shape. If all four scores cluster within 1 of each other (e.g. all 2–3, all 4–5), you probably didn't score them independently. Re-read each axis question and try again. See detecting bad scores below for specific patterns.
Worked rationale — SQL injection at d4
To make rationale-quality concrete, here's how a well-formed rationale reads for one axis:
Why −1: Many default linter configurations now ship with basic injection rules out of the box, putting common cases one step easier than "specialist tool" implies. Subtle cases (dynamic ORDER BY, table-name interpolation) still need specialist tools.
That rationale is auditable — another reviewer can read it, look up d3 ("default linter catches") and d5 ("specialist tool catches") in the grid, and decide whether the +/-1 deviation argument holds. A rationale of "hard to detect" would not be auditable; it cites no anchor and could justify any score from 4 to 9.
AI-assisted scoring
DEBT is designed to be applied at scale. Scoring 1,000+ entries by hand isn't realistic; the intended workflow uses an LLM to propose scores, with humans reviewing.
The rationale requirement is what makes this tractable. Without anchor names in the output, LLMs drift wildly between runs because nothing pins the score to the rubric. With required rationale that must cite an anchor, the model is forced to consult the grid for each axis — and a human reviewer can spot in seconds whether the cited anchor actually matches the concept.
- An LLM proposes scores using a prompt that includes the full anchor grid + a required output format (see the LLM scoring prompt appendix).
- The model returns score + rationale per axis. Rationale must name the anchor it calibrated against. Outputs without anchor names are rejected automatically.
- A human reviews and accepts, edits, or rejects. The rationale strings make this fast — reviewers verify the cited anchor matches the concept, rather than re-doing the rubric lookup themselves.
- Audit periodically. Sample 5–10% of scored entries quarterly. Look specifically for the failure patterns described below. Drift in any axis means the rubric needs sharpening or the LLM prompt needs tightening.
JSON schema
Recommended canonical storage shape, for tooling that wants to query and filter scores:
{
"debt_score": {
"detectability": 3,
"effort": 2,
"burden": 5,
"trap": 7,
"rationale": {
"detectability": "Closest to 'default linter catches' (d3). Standard ESLint config flags == vs === out of the box.",
"effort": "Closest to 'one-line patch' (e1), but the fix touches every comparison site. Score +1 to e2 for the multi-occurrence overhead.",
"burden": "Closest to 'persistent productivity tax' (b5). The choice between == and === influences every comparison written from this point forward.",
"trap": "Closest to 'serious trap' (t7). Type juggling contradicts how == works in nearly every other language."
},
"scored_at": "2026-05-03",
"scored_by": "claude-haiku-4-5",
"reviewed_by": "human",
"version": "1.0"
}
}
The rationale field is required in DEBT 1.0. Each axis's rationale must name the anchor it's calibrated against (format: "Closest to 'anchor name' (axis-letter+score), [why this score / why ±1 from the anchor]"). Tools that strip rationale or accept scores without anchor names are non-conformant. Without rationale, scores cannot be audited against the rubric, and the system collapses into "four numbers an LLM made up."
For inline display, render as the compact format: debt(d3/e2/b5/t7). The compact form is the citation; the full object — including rationale — is the data of record.
Common mistakes
Treating debt as moral failure
Debt isn't shame, it's accounting. Every codebase has debt. Some is unavoidable, some is well-chosen tradeoff. The point of profiling debt isn't to feel bad about high-debt concepts — it's to know you have them, so you can budget around them.
Conflating Trap with Difficulty
Difficulty is how hard a concept is to learn. Trap is how much the behaviour contradicts a reasonable assumption. RegExp is hard but predictable. == in PHP is easy but full of traps. Score them independently.
Conflating Detectability with Effort
A bug can be invisible to tooling (D=8) but a one-line fix once found (E=1). They're independent dimensions. The hard-to-detect, easy-to-fix bug is operationally different from the easy-to-detect, hard-to-fix one.
Inflation drift
Don't score every term ≥ 5 because "all bugs are bad." If most of your terms come out 7–9, the system has lost meaning. Force yourself to use the full range — some concepts really are 1s.
Recency bias
A concept that recently bit you isn't necessarily a 9. Score against the rubric anchors, not against your last incident. The rubric is the calibration; your memory is not.
Burden on terms where it doesn't apply
Pure syntax (semicolon, php_comment), pure factual reference (http_status_codes), and historical/educational terms have no architectural decision to regret. Score Burden as 0, not 1.
Trying to make DEBT a single number
Resist. debt(d7/e4/b2/t9) is information. debt = 22 / 36 is information loss. Different debt shapes deserve different remediation strategies; collapsing to one number erases the distinction. If you must rank, sort by the highest single-axis value, not the sum.
Detecting bad scores
The "Common mistakes" above are conceptual. The patterns below are mechanical — testable rules a tool or human auditor can apply to a finished score to flag it for re-review. Each pattern names a specific failure mode, the test that detects it, and the fix.
If you're building a scoring tool or auditing scored content, run these five checks against every score before persisting it.
d4/e3/b2/t5, microservices-too-early is d8/e7/b9/t3, prototype pollution is d7/e4/b2/t9. None of those bunch.semicolon, http_status_codes, php_comment, or iso_8601_date involves no architectural decision; the choice was made by the language designer, not by the developer using the language. Score 0, not "barely any." The same logic generalises to other axes — a pure factual reference probably has e=0 (no fix because no bug), and a static declaration has t=0 (no behaviour to surprise you). The flag is named after Burden because that's where the misjudgment most often shows up, but the principle applies wherever an axis genuinely doesn't apply.These five checks won't catch every bad score — a scorer who reads the rubric and still misjudges will pass all five. But they catch the failure modes that don't require judgment to detect: the lazy scorer, the under-prompted LLM, the confident-but-skipped reviewer. That's where most drift comes from at scale.
Appendix: LLM scoring prompt
For tools implementing AI-assisted scoring, the prompt below is the recommended starting point. It packages the rubric, the rationale requirement, the failure patterns, and the output format into one self-contained system prompt. Adapt as needed but preserve the four required behaviors:
- Full rubric in context — not summarized; the model needs the actual anchor names and examples.
- Required JSON output format — with rationale per axis, validated against the failure patterns.
- Explicit anchor-naming instruction — rationale must quote an anchor by name.
- Explicit 0-handling rule — pure syntax and pure reference get b=0, not b=1.
You are scoring a programming concept against the DEBT rubric, an open standard
for measuring what code concepts cost developers. The rubric has four axes:
d (Detectability) — operational debt. How invisible is misuse to your safety net?
e (Effort) — remediation debt. How much work to fix once spotted?
b (Burden) — structural debt. How much pain does the wrong choice inflict
on every future maintainer?
t (Trap) — cognitive debt. Will a competent dev guess wrong about how
this concept behaves?
Each axis is scored 0–9. Each axis has anchored values at 1, 3, 5, 7, 9. The rest
(2, 4, 6, 8) are interpolated between the surrounding anchors. Score 0 means the
axis genuinely doesn't apply (e.g. pure syntax has b=0).
ANCHORS:
d1: compiler catches (calling an undefined function)
d3: default linter catches (unused variable)
d5: specialist tool catches (SQL injection via concatenation)
d7: only review or runtime catches (race conditions)
d9: effectively undetectable pre-prod (floating-point drift)
e1: one-line patch (add CSRF token to a form)
e3: localised refactor (replace deprecated function in one module)
e5: multi-file refactor (parameterise SQL across a service, 20–50 files)
e7: cross-cutting refactor (sync → async I/O, months)
e9: architectural rewrite (monolith → service-per-database, years)
b1: no future cost (variable naming convention)
b3: mild ongoing friction (test framework choice)
b5: persistent productivity tax (ORM that fights your DB; load-bearing helper)
b7: strategic ceiling (sync-only framework when you'll need concurrency)
b9: project-killing (microservices for 5-person team; load-bearing wrong data model)
t1: no trap (array_keys() returns the keys)
t3: mild gotcha (array_merge() renumbers integer keys)
t5: notable trap (== vs === in PHP/JS)
t7: serious trap (PHP type juggling: "0e1234..." == "0" is true)
t9: catastrophic trap (JS prototype pollution)
PROCEDURE for each axis:
1. Locate the anchor whose example most closely matches the concept's behaviour.
2. Assign that anchor's score, OR ±1 if the concept is slightly different.
3. Write rationale that NAMES the anchor (e.g. "Closest to 'specialist tool catches' (d5),
but most generic linters now catch the obvious case — one step easier").
OUTPUT FORMAT (JSON, no other text):
{
"debt_score": {
"detectability": <int 0-9>,
"effort": <int 0-9>,
"burden": <int 0-9>,
"trap": <int 0-9>,
"rationale": {
"detectability": "<cites a d-anchor by name>",
"effort": "<cites an e-anchor by name>",
"burden": "<cites a b-anchor by name OR explains why b=0>",
"trap": "<cites a t-anchor by name>"
}
}
}
REJECT YOUR OWN OUTPUT IF:
- Rationale doesn't quote or paraphrase an anchor name.
- All four scores fall within a span of 1 (uniformity flag).
- Trap and Detectability move together AND rationale for both reads as "this is bad"
instead of answering the per-axis question (severity-only flag).
- Burden ≥ 3 for pure syntax or pure factual reference (phantom-burden flag).
The concept to score is: <CONCEPT_NAME>
The concept's definition is: <CONCEPT_DEFINITION>
Two notes on adapting this prompt:
- The anchors block is non-negotiable. Removing or summarizing it loses the rubric pin and the LLM will drift. The whole grid (anchor names + examples) must be in the prompt verbatim.
- The "reject your own output" instructions add real value — they make the model do its own first-pass failure-pattern check before returning. In testing, this catches roughly half of uniformity-flag and severity-only failures before the human reviewer sees them.
Versioning
This is DEBT version 1.0. The format and four axes are considered stable. Future revisions may sharpen anchor examples or add an FAQ entry; they will not change the format or remove an axis without a major version bump.
Tooling should treat the version field in the JSON schema as authoritative. Implementations claiming DEBT 1.0 conformance must use exactly the four axes documented here, in the canonical d/e/b/t order, with the score range [0, 9].
See also: why this design · design & badges