This page covers the format, the four axes, anchor examples, the scoring procedure, the JSON schema, and the FAQ. It is the canonical reference; implementers should pin to it.
A DEBT score is a single string, structured as:
debt(d<0-9>/e<0-9>/b<0-9>/t<0-9>)
Each axis is an integer in [0, 9].
Letter prefixes must be lowercase: d, e, b, t. Slash-separated. No spaces inside the parens. Order is fixed: d/e/b/t — the four axis letters spell the standard's name, so anyone seeing debt(d_/e_/b_/t_) can decode it without a legend.
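To make the format concrete for tooling, here is a minimal validation sketch in TypeScript. It is illustrative, not normative: the regular expression simply encodes the rules above (lowercase letters, fixed d/e/b/t order, single digits, no spaces).

// Parses the compact DEBT citation form, e.g. debt(d4/e3/b2/t5).
// Anything that breaks the rules above (uppercase, spaces, wrong order) returns null.
type CompactScore = { d: number; e: number; b: number; t: number };

const DEBT_PATTERN = /^debt\(d([0-9])\/e([0-9])\/b([0-9])\/t([0-9])\)$/;

function parseDebt(citation: string): CompactScore | null {
  const match = DEBT_PATTERN.exec(citation);
  if (match === null) return null;
  return { d: Number(match[1]), e: Number(match[2]), b: Number(match[3]), t: Number(match[4]) };
}

parseDebt("debt(d4/e3/b2/t5)");  // { d: 4, e: 3, b: 2, t: 5 }
parseDebt("debt(D4/E3/B2/T5)");  // null: prefixes must be lowercase
parseDebt("debt(d4 /e3/b2/t5)"); // null: no spaces inside the parens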
debt(d4/e3/b2/t5) — a known security trap, partly caught by tooling, mechanically fixable, no lasting cost
debt(d7/e4/b2/t9) — a catastrophic trap, mostly invisible to tools, localised fix, no lasting cost
debt(d8/e7/b9/t3) — not a trap technically, but invisible operationally, expensive to undo, project-defining
debt(d1/e1/b0/t1) — benign concept; no surprise, fully detected, trivial to fix, no architectural commitment
Each axis measures one independent kind of debt. They're independent on purpose — if you can predict one from another, the score is wasting a slot. Each axis carries its own identity colour throughout this rubric.
Detectability (d): Can the compiler, linter, static analyser, test suite, or peer review catch a misuse before it ships?
Effort (e): Once you know the problem exists in your codebase, how much work is it to fix?
Burden (b): If you make the wrong call here at the start of a project, how much pain does that inflict on every future maintainer?
Trap (t): Will a competent developer who's never used this concept guess wrong about how it behaves?
Both flavours score on the same b axis — you're not asked to separate them in the score itself. But naming the flavour in your rationale helps reviewers understand which kind of structural cost you're calling out:
Time flavour: the choice slowly poisons the codebase over months and years. Time is the carrier. Examples: an ORM that fights your DB; microservices adopted before product-market fit; a sync-only framework when you'll need real concurrency. The pain compounds gradually as the codebase grows around the wrong foundation.
Reach flavour: the choice is load-bearing across a wide surface, so any bug or change has system-wide blast radius. Surface area is the carrier. Examples: a custom auth helper used by every controller; shared mutable state across a fan-out of services; a load-bearing data model the whole product depends on. The pain is acute the moment something needs to change.
A concept can have either flavour, both, or neither. Score them on the b-axis using the same anchor grid; flag the flavour in the rationale string (e.g. "Closest to b5 'persistent productivity tax' — reach flavour: a load-bearing helper used by every controller").
The reference grid below is the heart of the rubric for scores 1–9. Each axis has anchored values at 1, 3, 5, 7 and 9. The even scores (2, 4, 6, 8) are intentionally left without anchors: they are valid, but interpolated, so read an even score as sitting between the anchor above and the anchor below it.
t1: array_keys() returns the keys of an array. Behaves as named.
t3: array_merge() renumbers integer keys but preserves string keys. Trips you up once, then you remember.
t5: == vs === in PHP/JavaScript. Most experienced devs know. Still produces bugs in auth and validation under deadline pressure.
t7: "0e1234..." == "0" evaluates to true. Has caused real CVEs (magic-hash bypass). Contradicts how == works in nearly every other language.
t9: __proto__ mutates every object. So far from the "objects are isolated" mental model that even seniors ship code with this footgun for years.
Detectability: ask "what would catch this?" Compiler → 1–2. Any linter → 3. A specialist tool → 5. An expert reviewer → 7. Nothing pre-production → 8–9.
"How long would a solo competent dev take to fix this?" One hour → 1–2. One day → 3. One week → 5. One month → 7. One quarter or more → 9.
"Five years from now, how much pain has this choice inflicted on devs who joined after it was made?" None → 1–2. Constant minor friction → 3–5. Major refactoring inevitable → 6–7. Project-existential → 8–9.
If you can imagine a competent senior dev getting this wrong on a Friday afternoon, score it ≥ 6. If only a junior would miss it, 3–5. If even a careful junior wouldn't miss it, 1–2.
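For tooling that wants the anchor grid in machine-readable form, a sketch like the following can work. The anchor names are taken from the reference prompt at the end of this page; the data structure itself is an illustrative choice, not part of the standard.

// Anchor names per axis at the anchored scores 1, 3, 5, 7, 9.
// Even scores (2, 4, 6, 8) are interpolated and deliberately have no entry;
// 0 means the axis does not apply.
type Axis = "d" | "e" | "b" | "t";

const ANCHORS: Record<Axis, Record<number, string>> = {
  d: { 1: "compiler catches", 3: "default linter catches", 5: "specialist tool catches",
       7: "only review or runtime catches", 9: "effectively undetectable pre-prod" },
  e: { 1: "one-line patch", 3: "localised refactor", 5: "multi-file refactor",
       7: "cross-cutting refactor", 9: "architectural rewrite" },
  b: { 1: "no future cost", 3: "mild ongoing friction", 5: "persistent productivity tax",
       7: "strategic ceiling", 9: "project-killing" },
  t: { 1: "no trap", 3: "mild gotcha", 5: "notable trap",
       7: "serious trap", 9: "catastrophic trap" },
};

ANCHORS.d[5]; // "specialist tool catches"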
Three concepts profiled end-to-end, so you can see how the four axes combine into a recognisable shape.
Notice how different the shapes are. SQL injection, debt(d4/e3/b2/t5), is mid-trap, mid-effort, low-burden. Microservices-too-early, debt(d8/e7/b9/t3), is low-trap but huge burden. Prototype pollution, debt(d7/e4/b2/t9), is high-trap but low-burden. Three concepts, three different debt profiles, three different remediation strategies. That's what a single severity score cannot tell you.
The rubric only delivers value if the scorer actually uses it. The procedure below is intentionally restrictive — it forces every score to cite which rubric anchor it's calibrating against, so a score isn't just plausible, it's auditable.
Every score must be accompanied by a rationale string that names the anchor it's calibrated against. Without rationale, a score is just a number; with rationale, future-you (or another reviewer) can verify whether the rubric was actually consulted.
Pure syntax (semicolon) and pure factual reference (http_status_codes) often score 0 on multiple axes — no architectural commitment to regret (b0), no fix because there's no bug (e0), and so on. Don't use 0 just because the answer is "barely any" — that's a 1. See the N/A band for the full explanation.
To make rationale-quality concrete, here's how a well-formed rationale reads for one axis:
"Closest to 'specialist tool catches' (d5), but most generic linters now catch the obvious case — one step easier" → d4.
That rationale is auditable — another reviewer can read it, look up d3 ("default linter catches") and d5 ("specialist tool catches") in the grid, and decide whether the ±1 deviation argument holds. A rationale of "hard to detect" would not be auditable; it cites no anchor and could justify any score from 4 to 9.
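A tool can make that audit mechanical to a first approximation. A minimal sketch, assuming the rationale format defined below, where the anchor always appears as an axis letter plus score in parentheses:

// Rough auditability check: does the rationale name an anchor at all?
// It cannot judge whether the cited anchor fits; that part stays with the reviewer.
function citesAnchor(rationale: string): boolean {
  return /\((d|e|b|t)[0-9]\)/.test(rationale);
}

citesAnchor("Closest to 'default linter catches' (d3). ESLint flags it out of the box."); // true
citesAnchor("hard to detect");                                                            // false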
DEBT is designed to be applied at scale. Scoring 1,000+ entries by hand isn't realistic; the intended workflow uses an LLM to propose scores, with humans reviewing.
The rationale requirement is what makes this tractable. Without anchor names in the output, LLMs drift wildly between runs because nothing pins the score to the rubric. With required rationale that must cite an anchor, the model is forced to consult the grid for each axis — and a human reviewer can spot in seconds whether the cited anchor actually matches the concept.
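A minimal sketch of that workflow follows. Every function here is a placeholder for your own tooling (the LLM client, the mechanical checks further down, the review queue); none of them is defined by the standard.

interface Concept { name: string; definition: string; }
interface ScoredEntry {
  debt_score: {
    detectability: number; effort: number; burden: number; trap: number;
    rationale: Record<"detectability" | "effort" | "burden" | "trap", string>;
  };
}

// Placeholders: wire these to your LLM client, validator, and review UI.
declare function scoreWithLLM(concept: Concept): Promise<ScoredEntry>;
declare function runMechanicalChecks(entry: ScoredEntry): string[];
declare function queueForHumanReview(name: string, entry: ScoredEntry, flags: string[]): void;

async function scoreBatch(concepts: Concept[]): Promise<void> {
  for (const concept of concepts) {
    const proposed = await scoreWithLLM(concept);        // LLM proposes a score with rationale
    const flags = runMechanicalChecks(proposed);         // mechanical audit, see checks below
    queueForHumanReview(concept.name, proposed, flags);  // human accepts, edits, or rejects
  }
}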
Recommended canonical storage shape, for tooling that wants to query and filter scores:
{
  "debt_score": {
    "detectability": 3,
    "effort": 2,
    "burden": 5,
    "trap": 7,
    "rationale": {
      "detectability": "Closest to 'default linter catches' (d3). Standard ESLint config flags == vs === out of the box.",
      "effort": "Closest to 'one-line patch' (e1), but the fix touches every comparison site. Score +1 to e2 for the multi-occurrence overhead.",
      "burden": "Closest to 'persistent productivity tax' (b5). The choice between == and === influences every comparison written from this point forward.",
      "trap": "Closest to 'serious trap' (t7). Type juggling contradicts how == works in nearly every other language."
    },
    "scored_at": "2026-05-03",
    "scored_by": "claude-haiku-4-5",
    "reviewed_by": "human",
    "version": "1.0"
  }
}
The rationale field is required in DEBT 1.0. Each axis's rationale must name the anchor it's calibrated against (format: "Closest to 'anchor name' (axis-letter+score), [why this score / why ±1 from the anchor]"). Tools that strip rationale or accept scores without anchor names are non-conformant. Without rationale, scores cannot be audited against the rubric, and the system collapses into "four numbers an LLM made up."
For inline display, render as the compact format: debt(d3/e2/b5/t7). The compact form is the citation; the full object — including rationale — is the data of record.
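A sketch of going from the stored object to the inline citation. The DebtEntry type here mirrors the shape above, trimmed to the numeric fields for brevity; the full stored object keeps rationale and metadata.

// Mirrors the numeric part of the canonical storage shape above.
interface DebtEntry {
  debt_score: { detectability: number; effort: number; burden: number; trap: number };
}

// Renders the compact citation used for inline display.
function toCitation(entry: DebtEntry): string {
  const s = entry.debt_score;
  return `debt(d${s.detectability}/e${s.effort}/b${s.burden}/t${s.trap})`;
}

toCitation({ debt_score: { detectability: 3, effort: 2, burden: 5, trap: 7 } });
// => "debt(d3/e2/b5/t7)"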
Debt isn't shame, it's accounting. Every codebase has debt. Some is unavoidable, some is well-chosen tradeoff. The point of profiling debt isn't to feel bad about high-debt concepts — it's to know you have them, so you can budget around them.
Difficulty is how hard a concept is to learn. Trap is how much the behaviour contradicts a reasonable assumption. RegExp is hard but predictable. == in PHP is easy but full of traps. Score them independently.
A bug can be invisible to tooling (D=8) but a one-line fix once found (E=1). They're independent dimensions. The hard-to-detect, easy-to-fix bug is operationally different from the easy-to-detect, hard-to-fix one.
Don't score every term ≥ 5 because "all bugs are bad." If most of your terms come out 7–9, the system has lost meaning. Force yourself to use the full range — some concepts really are 1s.
A concept that recently bit you isn't necessarily a 9. Score against the rubric anchors, not against your last incident. The rubric is the calibration; your memory is not.
Pure syntax (semicolon, php_comment), pure factual reference (http_status_codes), and historical/educational terms have no architectural decision to regret. Score Burden as 0, not 1.
Resist the urge to collapse the four axes into a single number. debt(d7/e4/b2/t9) is information; debt = 22 / 36 is information loss. Different debt shapes deserve different remediation strategies; collapsing to one number erases the distinction. If you must rank, sort by the highest single-axis value, not the sum.
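If you do need an ordering, for example to triage which concepts to document or fix first, the max-axis sort suggested above is easy to implement. A sketch (the example entries reuse the three profiles from earlier):

type AxisScores = { name: string; d: number; e: number; b: number; t: number };

// Rank by the single worst axis, as suggested above; the sum is deliberately not used.
function byWorstAxis(entries: AxisScores[]): AxisScores[] {
  const worst = (s: AxisScores) => Math.max(s.d, s.e, s.b, s.t);
  return [...entries].sort((a, b) => worst(b) - worst(a));
}

byWorstAxis([
  { name: "sql_injection",           d: 4, e: 3, b: 2, t: 5 },
  { name: "microservices_too_early", d: 8, e: 7, b: 9, t: 3 },
  { name: "prototype_pollution",     d: 7, e: 4, b: 2, t: 9 },
]);
// Both 9-max concepts outrank sql_injection (max 5); their relative order is a tie.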
The "Common mistakes" above are conceptual. The patterns below are mechanical — testable rules a tool or human auditor can apply to a finished score to flag it for re-review. Each pattern names a specific failure mode, the test that detects it, and the fix.
If you're building a scoring tool or auditing scored content, run these five checks against every score before persisting it.
Uniformity flag: SQL injection is d4/e3/b2/t5, microservices-too-early is d8/e7/b9/t3, prototype pollution is d7/e4/b2/t9. None of those bunch.
Phantom-burden flag: a term like semicolon, http_status_codes, php_comment, or iso_8601_date involves no architectural decision; the choice was made by the language designer, not by the developer using the language. Score 0, not "barely any." The same logic generalises to other axes — a pure factual reference probably has e=0 (no fix because no bug), and a static declaration has t=0 (no behaviour to surprise you). The flag is named after Burden because that's where the misjudgment most often shows up, but the principle applies wherever an axis genuinely doesn't apply.
These five checks won't catch every bad score — a scorer who reads the rubric and still misjudges will pass all five. But they catch the failure modes that don't require judgment to detect: the lazy scorer, the under-prompted LLM, the confident-but-skipped reviewer. That's where most drift comes from at scale.
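A sketch of how the mechanical audit might look in code. It covers the flags named in the self-rejection list of the prompt below plus a basic range check; the pure-syntax list and the severity-only proxy are illustrative assumptions, so substitute your own.

interface ScoredConcept {
  concept: string;
  d: number; e: number; b: number; t: number;
  rationale: { detectability: string; effort: string; burden: string; trap: string };
}

// Illustrative: your own taxonomy decides what counts as pure syntax or pure factual reference.
const PURE_SYNTAX_OR_REFERENCE = new Set(["semicolon", "php_comment", "http_status_codes", "iso_8601_date"]);

const citesAnchor = (r: string) => /\((d|e|b|t)[0-9]\)/.test(r);

function auditScore(s: ScoredConcept): string[] {
  const flags: string[] = [];
  const axes = [s.d, s.e, s.b, s.t];

  // Range: every axis is an integer in [0, 9].
  if (axes.some(v => !Number.isInteger(v) || v < 0 || v > 9)) flags.push("out-of-range");

  // Anchor citation: every rationale must name an anchor, e.g. "(d5)".
  if (!Object.values(s.rationale).every(citesAnchor)) flags.push("missing-anchor-citation");

  // Uniformity: all four scores within a span of 1 suggests the grid was never consulted.
  if (Math.max(...axes) - Math.min(...axes) <= 1) flags.push("uniformity");

  // Severity-only (rough proxy): trap and detectability identical while neither rationale
  // cites an anchor usually means both were scored as "this is bad" rather than per-axis.
  if (s.t === s.d && !citesAnchor(s.rationale.trap) && !citesAnchor(s.rationale.detectability)) {
    flags.push("severity-only");
  }

  // Phantom burden: burden >= 3 for pure syntax or pure factual reference.
  if (s.b >= 3 && PURE_SYNTAX_OR_REFERENCE.has(s.concept)) flags.push("phantom-burden");

  return flags; // empty array = passes the mechanical checks; human judgment still applies
}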
For tools implementing AI-assisted scoring, the prompt below is the recommended starting point. It packages the rubric, the rationale requirement, the failure patterns, and the output format into one self-contained system prompt. Adapt as needed but preserve the four required behaviours:
You are scoring a programming concept against the DEBT rubric, an open standard
for measuring what code concepts cost developers. The rubric has four axes:
d (Detectability) — operational debt. How invisible is misuse to your safety net?
e (Effort) — remediation debt. How much work to fix once spotted?
b (Burden) — structural debt. How much pain does the wrong choice inflict
on every future maintainer?
t (Trap) — cognitive debt. Will a competent dev guess wrong about how
this concept behaves?
Each axis is scored 0–9. Each axis has anchored values at 1, 3, 5, 7, 9. The rest
(2, 4, 6, 8) are interpolated between the surrounding anchors. Score 0 means the
axis genuinely doesn't apply (e.g. pure syntax has b=0).
ANCHORS:
d1: compiler catches (calling an undefined function)
d3: default linter catches (unused variable)
d5: specialist tool catches (SQL injection via concatenation)
d7: only review or runtime catches (race conditions)
d9: effectively undetectable pre-prod (floating-point drift)
e1: one-line patch (add CSRF token to a form)
e3: localised refactor (replace deprecated function in one module)
e5: multi-file refactor (parameterise SQL across a service, 20–50 files)
e7: cross-cutting refactor (sync → async I/O, months)
e9: architectural rewrite (monolith → service-per-database, years)
b1: no future cost (variable naming convention)
b3: mild ongoing friction (test framework choice)
b5: persistent productivity tax (ORM that fights your DB; load-bearing helper)
b7: strategic ceiling (sync-only framework when you'll need concurrency)
b9: project-killing (microservices for 5-person team; load-bearing wrong data model)
t1: no trap (array_keys() returns the keys)
t3: mild gotcha (array_merge() renumbers integer keys)
t5: notable trap (== vs === in PHP/JS)
t7: serious trap (PHP type juggling: "0e1234..." == "0" is true)
t9: catastrophic trap (JS prototype pollution)
PROCEDURE for each axis:
1. Locate the anchor whose example most closely matches the concept's behaviour.
2. Assign that anchor's score, OR ±1 if the concept is slightly different.
3. Write rationale that NAMES the anchor (e.g. "Closest to 'specialist tool catches' (d5),
but most generic linters now catch the obvious case — one step easier").
OUTPUT FORMAT (JSON, no other text):
{
  "debt_score": {
    "detectability": <int 0-9>,
    "effort": <int 0-9>,
    "burden": <int 0-9>,
    "trap": <int 0-9>,
    "rationale": {
      "detectability": "<cites a d-anchor by name>",
      "effort": "<cites an e-anchor by name>",
      "burden": "<cites a b-anchor by name OR explains why b=0>",
      "trap": "<cites a t-anchor by name>"
    }
  }
}
REJECT YOUR OWN OUTPUT IF:
- Rationale doesn't quote or paraphrase an anchor name.
- All four scores fall within a span of 1 (uniformity flag).
- Trap and Detectability move together AND rationale for both reads as "this is bad"
instead of answering the per-axis question (severity-only flag).
- Burden ≥ 3 for pure syntax or pure factual reference (phantom-burden flag).
The concept to score is: <CONCEPT_NAME>
The concept's definition is: <CONCEPT_DEFINITION>
Two notes on adapting this prompt:
This is DEBT version 1.0. The format and four axes are considered stable. Future revisions may sharpen anchor examples or add an FAQ entry; they will not change the format or remove an axis without a major version bump.
Tooling should treat the version field in the JSON schema as authoritative. Implementations claiming DEBT 1.0 conformance must use exactly the four axes documented here, in the canonical d/e/b/t order, with the score range [0, 9].
See also: why this design · design & badges