Specification · v1.0

DEBT — the formal spec.

Format, axes, anchor examples, scoring procedure, JSON schema, FAQ. This page is the canonical reference. Implementers should pin to it.

Format

A DEBT score is a single string, structured as:

debt(d<0-9>/e<0-9>/b<0-9>/t<0-9>)

Each axis is an integer in [0, 9].

Letter prefixes must be lowercase: d, e, b, t. Values are slash-separated, with no spaces inside the parens. Order is fixed: d/e/b/t — the four axis letters spell the standard's name, so anyone seeing debt(d_/e_/b_/t_) can decode it without a legend.
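
For implementers, a minimal parsing sketch in TypeScript. The names (DEBT_RE, parseDebt, DebtScore) are illustrative rather than normative; the regex just encodes the format rules above.

// Compact-format parser: lowercase prefixes, fixed d/e/b/t order,
// slash-separated, one digit per axis, no spaces inside the parens.
const DEBT_RE = /^debt\(d(\d)\/e(\d)\/b(\d)\/t(\d)\)$/;

interface DebtScore {
  detectability: number;
  effort: number;
  burden: number;
  trap: number;
}

function parseDebt(s: string): DebtScore | null {
  const m = DEBT_RE.exec(s);
  if (!m) return null; // uppercase prefixes, wrong order, or stray spaces all fail
  const [, d, e, b, t] = m;
  return { detectability: Number(d), effort: Number(e), burden: Number(b), trap: Number(t) };
}

parseDebt("debt(d4/e3/b2/t5)"); // { detectability: 4, effort: 3, burden: 2, trap: 5 }
parseDebt("debt(D4/E3/B2/T5)"); // null: prefixes must be lowercase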

Examples

debt(d4/e3/b2/t5)   (sql_injection, profiled under "Worked examples" below)
debt(d8/e7/b9/t3)   (microservices_for_small_teams)
debt(d7/e4/b2/t9)   (javascript_prototype_pollution)
The four axes

Each axis measures one independent kind of debt. They're independent on purpose — if you can predict one from another, the score is wasting a slot.

d (Detectability) — operational debt
Can the compiler, linter, static analyser, test suite, or peer review catch a misuse before it ships?

e (Effort) — remediation debt
Once you know the problem exists in your codebase, how much work to fix it?

b (Burden) — structural debt
If you make the wrong call here at the start of a project, how much pain does that inflict on every future maintainer?

t (Trap) — cognitive debt
Will a competent developer who's never used this concept guess wrong about how it behaves?

Burden measures two related but distinct phenomena

Both flavours score on the same b axis — you're not asked to separate them in the score itself. But naming the flavour in your rationale helps reviewers understand which kind of structural cost you're calling out:

Decay burden

The choice slowly poisons the codebase over months and years. Time is the carrier. Examples: an ORM that fights your DB; microservices adopted before product-market fit; a sync-only framework when you'll need real concurrency. The pain compounds gradually as the codebase grows around the wrong foundation.

Reach burden

The choice is load-bearing across a wide surface, so any bug or change has system-wide blast radius. Surface area is the carrier. Examples: a custom auth helper used by every controller; shared mutable state across a fan-out of services; a load-bearing data model the whole product depends on. The pain is acute the moment something needs to change.

A concept can have either flavour, both, or neither. Score them on the b-axis using the same anchor grid; flag the flavour in the rationale string (e.g. "Closest to b5 'persistent productivity tax' — reach flavour: a load-bearing helper used by every controller").

Anchors — the rubric

The reference grid below is the heart of the rubric for scores 1–9. Anchors are defined at 1, 3, 5, 7 and 9 on each axis. The even scores (2, 4, 6, 8) deliberately have no anchor of their own — they are valid scores, interpolated between the bracketing anchors above and below.

Score 1
d1: Compiler catches. Calling an undefined function. Fails before code runs. Loud, immediate.
e1: One-line patch. Add a missing CSRF token to a form. Hour or less.
b1: No future cost. Variable naming convention. Renaming is cheap. Future devs adjust quickly.
t1: No trap. array_keys() returns the keys of an array. Behaves as named.

Score 2 — interpolated between the anchors at 1 and 3.

Score 3
d3: Default linter catches. Unused variable, missing return type. Standard ESLint / PHPStan config flags it.
e3: Localised refactor. Replace a deprecated function across one module. Mechanical. Half a day.
b3: Mild ongoing friction. Choice of test framework (Jest vs Mocha). Future devs grumble; switching is doable but rarely worth it.
t3: Mild gotcha. array_merge() renumbers integer keys but preserves string keys. Trips you up once, then you remember.

Score 4 — interpolated between the anchors at 3 and 5.

Score 5
d5: Specialist tool catches. SQL injection via string concatenation. Semgrep or SonarQube finds it — if configured. Generic linter misses it.
e5: Multi-file refactor. Switch from raw SQL to parameterised queries across a service. Touches 20–50 files. A week or two.
b5: Persistent productivity tax. An ORM that doesn't support your DB's advanced features (decay). Or: a custom auth helper used by every controller (reach — one bug, system-wide).
t5: Notable trap. == vs === in PHP/JavaScript. Most experienced devs know. Still produces bugs in auth and validation under deadline pressure.

Score 6 — interpolated between the anchors at 5 and 7.

Score 7
d7: Only review or runtime catches. Race condition in a 2-thread test that passes 95% of the time. No static tool reliably catches it. Code review by a concurrency expert might.
e7: Cross-cutting refactor. Migrate from synchronous to async I/O across a service. Touches every call site. Test suite needs rewriting. Months.
b7: Strategic ceiling. Synchronous-only framework when you'll need real concurrency at scale (decay). Or: shared mutable state across a fan-out of services (reach — debugging requires understanding the whole system).
t7: Serious trap. PHP type juggling: "0e1234..." == "0" evaluates to true. Has caused real CVEs (magic-hash bypass). Contradicts how == works in nearly every other language.

Score 8 — interpolated between the anchors at 7 and 9.

Score 9
d9: Effectively undetectable pre-prod. Floating-point accumulation drift in a long-running financial calc. No tool, no review, no test will catch it. Only telemetry over time, after damage is done.
e9: Architectural rewrite. Migrate from monolith with shared DB to service-per-database. Multi-team, multi-quarter. Hybrid bugs mid-flight. Real cost: years.
b9: Project-killing. Microservices for a 5-person team before product-market fit (decay — team velocity collapses). Or: a load-bearing data model the whole product depends on, and it's wrong (reach — can't fix without rewriting everything).
t9: Catastrophic trap. JavaScript prototype pollution: assigning to __proto__ mutates every object. So far from the "objects are isolated" mental model that even seniors ship code with this footgun for years.
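
Tools that enforce the anchor-naming requirement (see the anchor-skip flag under "Detecting bad scores") need these anchor names in machine-readable form. One possible TypeScript encoding; the structure is illustrative, the names are copied verbatim from the grid:

const ANCHORS: Record<"d" | "e" | "b" | "t", Record<number, string>> = {
  d: {
    1: "compiler catches",
    3: "default linter catches",
    5: "specialist tool catches",
    7: "only review or runtime catches",
    9: "effectively undetectable pre-prod",
  },
  e: {
    1: "one-line patch",
    3: "localised refactor",
    5: "multi-file refactor",
    7: "cross-cutting refactor",
    9: "architectural rewrite",
  },
  b: {
    1: "no future cost",
    3: "mild ongoing friction",
    5: "persistent productivity tax",
    7: "strategic ceiling",
    9: "project-killing",
  },
  t: {
    1: "no trap",
    3: "mild gotcha",
    5: "notable trap",
    7: "serious trap",
    9: "catastrophic trap",
  },
};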

Quick tests — one per axis

d — Ask "what would catch this?" Compiler → 1–2. Any linter → 3. A specialist tool → 5. An expert reviewer → 7. Nothing pre-production → 8–9.

e — Ask "how long would a solo competent dev take to fix this?" One hour → 1–2. One day → 3. One week → 5. One month → 7. One quarter or more → 9.

b — Ask "five years from now, how much pain has this choice inflicted on devs who joined after it was made?" None → 1–2. Constant minor friction → 3–5. Major refactoring inevitable → 6–7. Project-existential → 8–9.

t — If you can imagine a competent senior dev getting this wrong on a Friday afternoon, score it ≥ 6. If only a junior would miss it, 3–5. If even a careful junior wouldn't miss it, 1–2.

Worked examples

Three concepts profiled end-to-end, so you can see how the four axes combine into a recognisable shape.

sql_injection
debt(d4/e3/b2/t5)

d4: Specialist linters catch the obvious cases. Subtle ones (dynamic table names, ORDER BY) slip through.
e3: Per-query fix is mechanical: parameterise. Audit across the codebase is the real cost.
b2: No long-term structural pain — once parameterised, it stays fixed.
t5: Most experienced devs know about it but still write vulnerable code under deadline. Notable trap.

microservices_for_small_teams
debt(d8/e7/b9/t3)

d8: Almost no tooling tells you "this team is too small for this architecture." You learn it from velocity collapse.
e7: Going back to a monolith is months of disentangling, with hybrid bugs along the way.
b9: Project-existential. Many small teams die at this rock before realising it.
t3: Not surprising at the technical level — APIs work as documented.

javascript_prototype_pollution
debt(d7/e4/b2/t9)

d7: Static analysis catches some patterns. Subtle propagation through user input usually slips through.
e4: Localised: validate input, freeze prototypes. A few hours to a couple of days per affected library.
b2: No long-term structural cost. Once patched, it stays patched.
t9: Catastrophic trap. Mutating one object's prototype changes every object — behaviour is far from any reasonable mental model.

Notice how different the shapes are. SQL injection is mid-trap, mid-effort, low-burden. Microservices-too-early is low-trap but huge burden. Prototype pollution is high-trap but low-burden. Three concepts, three different debt profiles, three different remediation strategies. That's what a single severity score cannot tell you.

Scoring with rationale

The rubric only delivers value if the scorer actually uses it. The procedure below is intentionally restrictive — it forces every score to cite which rubric anchor it's calibrating against, so a score isn't just plausible, it's auditable.

Every score must be accompanied by a rationale string that names the anchor it's calibrated against. Without rationale, a score is just a number; with rationale, future-you (or another reviewer) can verify whether the rubric was actually consulted.

The procedure

  1. Read the concept's definition first. Ground yourself in what the concept actually is, not what you remember.
  2. For each axis, locate the closest anchor. Open the anchor grid above and find the anchor whose name and example most closely match this concept's behaviour. Note the anchor's name and its score (1, 3, 5, 7, or 9).
  3. Assign the anchor's score ±1. If the concept matches the anchor exactly, use the anchor's score. If it's slightly worse than the anchor, score one level higher (an even number between this anchor and the next). If it's slightly better, score one level lower. You may only deviate by 1 from the closest anchor. When a concept genuinely sits between two anchors (e.g. somewhere between d3 and d5), cite whichever anchor your reasoning departs from — if your argument is "easier than specialist tool but harder than default linter," cite d5 with −1 if the concept is closer to specialist territory, or d3 with +1 if closer to default-linter territory. Pick the anchor that grounds your specific argument.
  4. Write a rationale string that names the anchor and explains the ±1 deviation if any. Example for d4: "Closest to specialist tool catches (d5), but most generic linters also catch the obvious case — one step easier."
  5. Use 0 only when the axis question genuinely cannot be asked. Pure syntax (semicolon) and pure factual reference (http_status_codes) often score 0 on multiple axes — no architectural commitment to regret (b0), no fix because there's no bug (e0), and so on. Don't use 0 just because the answer is "barely any" — that's a 1. See the N/A band for the full explanation.
  6. Sanity-check the shape. If all four scores cluster within 1 of each other (e.g. all 2–3, all 4–5), you probably didn't score them independently. Re-read each axis question and try again. See detecting bad scores below for specific patterns.
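
Step 3's ±1 rule is mechanical enough to check in code. A sketch in TypeScript; isValidDeviation is an illustrative name, not a spec requirement:

// A cited anchor sits at 1, 3, 5, 7, or 9; a valid score is that anchor's
// value or exactly one step away. 0 is reserved for N/A and is never
// reached by deviation (step 5 covers it separately).
type Anchor = 1 | 3 | 5 | 7 | 9;

function isValidDeviation(citedAnchor: Anchor, score: number): boolean {
  return score >= 1 && score <= 9 && Math.abs(score - citedAnchor) <= 1;
}

isValidDeviation(5, 4); // true:  "closest to d5, one step easier"
isValidDeviation(5, 7); // false: two steps away; cite d7 instead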

Worked rationale — SQL injection at d4

To make rationale-quality concrete, here's how a well-formed rationale reads for one axis:

d4
Closest anchor: specialist tool catches (d5). Semgrep and SonarQube find concatenation-based SQL injection when configured.
Why −1: Many default linter configurations now ship with basic injection rules out of the box, putting common cases one step easier than "specialist tool" implies. Subtle cases (dynamic ORDER BY, table-name interpolation) still need specialist tools.

That rationale is auditable — another reviewer can read it, look up d3 ("default linter catches") and d5 ("specialist tool catches") in the grid, and decide whether the ±1 deviation argument holds. A rationale of "hard to detect" would not be auditable; it cites no anchor and could justify any score from 4 to 9.

AI-assisted scoring

DEBT is designed to be applied at scale. Scoring 1,000+ entries by hand isn't realistic; the intended workflow uses an LLM to propose scores, with humans reviewing.

The rationale requirement is what makes this tractable. Without anchor names in the output, LLMs drift wildly between runs because nothing pins the score to the rubric. With required rationale that must cite an anchor, the model is forced to consult the grid for each axis — and a human reviewer can spot in seconds whether the cited anchor actually matches the concept.

  1. An LLM proposes scores using a prompt that includes the full anchor grid + a required output format (see the LLM scoring prompt appendix).
  2. The model returns score + rationale per axis. Rationale must name the anchor it calibrated against. Outputs without anchor names are rejected automatically.
  3. A human reviews and accepts, edits, or rejects. The rationale strings make this fast — reviewers verify the cited anchor matches the concept, rather than re-doing the rubric lookup themselves.
  4. Audit periodically. Sample 5–10% of scored entries quarterly. Look specifically for the failure patterns described below. Drift in any axis means the rubric needs sharpening or the LLM prompt needs tightening.
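
Step 2's automatic rejection reduces to a substring check against the anchor names. A sketch with the d-axis names inlined for brevity; a real tool would check each rationale against its own axis's anchors, and exact-name matching is deliberately naive — close paraphrases would need fuzzier matching or a human pass:

const D_ANCHOR_NAMES = [
  "compiler catches",
  "default linter catches",
  "specialist tool catches",
  "only review or runtime catches",
  "effectively undetectable pre-prod",
];

// Accept a rationale only if it names (or closely quotes) a rubric anchor.
function citesAnchor(rationale: string, anchorNames: string[]): boolean {
  const lower = rationale.toLowerCase();
  return anchorNames.some((name) => lower.includes(name));
}

citesAnchor("Closest to 'specialist tool catches' (d5), but ...", D_ANCHOR_NAMES); // true
citesAnchor("hard to detect", D_ANCHOR_NAMES); // false: rejected automatically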

JSON schema

Recommended canonical storage shape, for tooling that wants to query and filter scores:

{
  "debt_score": {
    "detectability": 3,
    "effort":        2,
    "burden":        5,
    "trap":          7,
    "rationale": {
      "detectability": "Closest to 'default linter catches' (d3). Standard ESLint config flags == vs === out of the box.",
      "effort":        "Closest to 'one-line patch' (e1), but the fix touches every comparison site. Score +1 to e2 for the multi-occurrence overhead.",
      "burden":        "Closest to 'persistent productivity tax' (b5). The choice between == and === influences every comparison written from this point forward.",
      "trap":          "Closest to 'serious trap' (t7). Type juggling contradicts how == works in nearly every other language."
    },
    "scored_at": "2026-05-03",
    "scored_by": "claude-haiku-4-5",
    "reviewed_by": "human",
    "version": "1.0"
  }
}

The rationale field is required in DEBT 1.0. Each axis's rationale must name the anchor it's calibrated against (format: "Closest to 'anchor name' (axis-letter+score), [why this score / why ±1 from the anchor]"). Tools that strip rationale or accept scores without anchor names are non-conformant. Without rationale, scores cannot be audited against the rubric, and the system collapses into "four numbers an LLM made up."

For inline display, render as the compact format: debt(d3/e2/b5/t7). The compact form is the citation; the full object — including rationale — is the data of record.
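
For tooling, the storage shape translates directly into types, and the compact form can be derived rather than stored twice. A TypeScript sketch; the field names follow the schema, the type names are illustrative:

interface DebtRecord {
  debt_score: {
    detectability: number;
    effort: number;
    burden: number;
    trap: number;
    rationale: {
      detectability: string;
      effort: string;
      burden: string;
      trap: string;
    };
    scored_at: string;   // ISO date
    scored_by: string;   // model or human identifier
    reviewed_by: string;
    version: string;     // "1.0"
  };
}

// Derive the inline citation form from the data of record.
function toCompact(r: DebtRecord): string {
  const s = r.debt_score;
  return `debt(d${s.detectability}/e${s.effort}/b${s.burden}/t${s.trap})`;
}
// toCompact(record) yields "debt(d3/e2/b5/t7)" for the example above.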

Common mistakes

Treating debt as moral failure

Debt isn't shame, it's accounting. Every codebase has debt. Some is unavoidable, some is well-chosen tradeoff. The point of profiling debt isn't to feel bad about high-debt concepts — it's to know you have them, so you can budget around them.

Conflating Trap with Difficulty

Difficulty is how hard a concept is to learn. Trap is how much the behaviour contradicts a reasonable assumption. RegExp is hard but predictable. == in PHP is easy but full of traps. Score them independently.

Conflating Detectability with Effort

A bug can be invisible to tooling (d8) but a one-line fix once found (e1). They're independent dimensions. The hard-to-detect, easy-to-fix bug is operationally different from the easy-to-detect, hard-to-fix one.

Inflation drift

Don't score every term ≥ 5 because "all bugs are bad." If most of your terms come out 7–9, the system has lost meaning. Force yourself to use the full range — some concepts really are 1s.

Recency bias

A concept that recently bit you isn't necessarily a 9. Score against the rubric anchors, not against your last incident. The rubric is the calibration; your memory is not.

Burden on terms where it doesn't apply

Pure syntax (semicolon, php_comment), pure factual reference (http_status_codes), and historical/educational terms have no architectural decision to regret. Score Burden as 0, not 1.

Trying to make DEBT a single number

Resist. debt(d7/e4/b2/t9) is information. debt = 22 / 36 is information loss. Different debt shapes deserve different remediation strategies; collapsing to one number erases the distinction. If you must rank, sort by the highest single-axis value, not the sum.
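
In code, the max-not-sum rule is a one-line comparator. A sketch using the three worked-example profiles:

// Rank by worst single axis, never by the sum.
const profiles: Record<string, number[]> = {
  sql_injection: [4, 3, 2, 5],                  // d, e, b, t
  microservices_for_small_teams: [8, 7, 9, 3],
  javascript_prototype_pollution: [7, 4, 2, 9],
};

const ranked = Object.entries(profiles)
  .sort(([, a], [, b]) => Math.max(...b) - Math.max(...a))
  .map(([name]) => name);
// Both max-9 concepts rank ahead of sql_injection (max 5); the tie at 9
// is deliberate: a b9 and a t9 deserve the same priority, for different reasons.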

Detecting bad scores

The "Common mistakes" above are conceptual. The patterns below are mechanical — testable rules a tool or human auditor can apply to a finished score to flag it for re-review. Each pattern names a specific failure mode, the test that detects it, and the fix.

If you're building a scoring tool or auditing scored content, run these five checks against every score before persisting it.

1. Uniformity flag
Test: all four scores fall within a span of 1 (e.g. all 2–3, all 4–5, all 7–8).
Why it's wrong: the four axes are independent by design. If they all bunch up, the scorer almost certainly collapsed them into a single severity feeling instead of evaluating each axis question separately. Real concepts produce distinctive shapes — SQL injection is d4/e3/b2/t5, microservices-too-early is d8/e7/b9/t3, prototype pollution is d7/e4/b2/t9. None of those bunch.
Fix: re-score each axis with the rubric grid open. Read the axis question first, then locate the closest anchor, then write rationale. If the scores still bunch, the concept may legitimately be uniform — but the rationale strings should make that clear.
2. Default-3 flag
Test: any score is exactly 3 with rationale that doesn't cite the d3/e3/b3/t3 anchor by name.
Why it's wrong: 3 is the "I don't know, somewhere mild" default. Scorers (and LLMs) reach for it when they haven't actually consulted the rubric. The rationale-naming requirement catches this: a real 3-score rationale will quote the anchor ("default linter catches" / "localised refactor" / "mild ongoing friction" / "mild gotcha"). A fake 3-score rationale waves at "this seems mild."
Fix: if the rationale doesn't name the anchor, reject and re-score. The score may legitimately end up at 3 once the anchor is consulted — but it might also become 1, 5, or 7. The rubric work has to happen.
3. Severity-only flag
Test: Trap (t) and Detectability (d) values track each other within 1, AND the rationale for both reads as "this is bad" rather than answering the per-axis question.
Why it's wrong: Trap and Detectability are independent. A high-Trap concept can be either fully detectable (high t, low d) or invisible to tools (high t, high d). If they always move together, the scorer is treating DEBT like CVSS — collapsing everything into "how dangerous." Re-read the axis questions: Trap is "will a competent dev guess wrong?", Detectability is "can the toolchain catch it?" Different questions, different answers.
Fix: for each axis, write rationale that answers that axis's specific question. If you can't answer Detectability without invoking how-bad-it-is, you haven't separated the axes yet.
4. Phantom-burden flag
Test: a pure-syntax or pure-reference concept has Burden ≥ 3 — or, more generally, any axis at ≥ 3 when the axis question can't really be asked of this kind of concept.
Why it's wrong: Burden measures architectural commitment. A concept like semicolon, http_status_codes, php_comment, or iso_8601_date involves no architectural decision; the choice was made by the language designer, not by the developer using the language. Score 0, not "barely any." The same logic generalises to other axes — a pure factual reference probably has e=0 (no fix because no bug), and a static declaration has t=0 (no behaviour to surprise you). The flag is named after Burden because that's where the misjudgment most often shows up, but the principle applies wherever an axis genuinely doesn't apply.
Fix: ask "if a developer chooses to use / not-use this concept, does that choice meaningfully shape their codebase for years?" If the answer is no, b = 0. Reserve b = 1 for choices that genuinely shape future maintenance even slightly (like a naming convention). See the N/A band for when to use 0 across all axes.
5. Anchor-skip flag
Test: any rationale string fails to name an anchor from the rubric ("default linter catches", "specialist tool catches", "one-line patch", etc.).
Why it's wrong: the whole point of the rationale requirement is to force the scorer to consult the rubric. If rationale doesn't reference an anchor, the rubric was bypassed. The score might happen to be correct, but it's unverifiable — and unverifiable scores in a system meant to scale across thousands of entries are a debt the scorers are accumulating against future maintainers (a fittingly DEBT-shaped problem).
Fix: reject any score whose rationale doesn't quote or paraphrase a named anchor from the grid above. This is the strongest single check. Tools should enforce it automatically.
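
Two of the five checks are pure arithmetic and fit in a few lines. A TypeScript sketch; the isPureSyntaxOrReference flag is an assumed input, since the spec doesn't define how concepts are tagged:

interface Scores {
  detectability: number;
  effort: number;
  burden: number;
  trap: number;
}

// Check 1 — uniformity flag: all four scores within a span of 1.
function uniformityFlag(s: Scores): boolean {
  const v = [s.detectability, s.effort, s.burden, s.trap];
  return Math.max(...v) - Math.min(...v) <= 1;
}

// Check 4 — phantom-burden flag: burden >= 3 on a concept with no
// architectural decision to regret.
function phantomBurdenFlag(s: Scores, isPureSyntaxOrReference: boolean): boolean {
  return isPureSyntaxOrReference && s.burden >= 3;
}

uniformityFlag({ detectability: 4, effort: 3, burden: 2, trap: 5 }); // false: healthy shape
uniformityFlag({ detectability: 4, effort: 4, burden: 5, trap: 5 }); // true:  flag for re-review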

These five checks won't catch every bad score — a scorer who reads the rubric and still misjudges will pass all five. But they catch the failure modes that don't require judgment to detect: the lazy scorer, the under-prompted LLM, the confident-but-skipped reviewer. That's where most drift comes from at scale.

Appendix: LLM scoring prompt

For tools implementing AI-assisted scoring, the prompt below is the recommended starting point. It packages the rubric, the rationale requirement, the failure patterns, and the output format into one self-contained system prompt. Adapt as needed but preserve the four required behaviours:

  1. Full rubric in context — not summarised; the model needs the actual anchor names and examples.
  2. Required JSON output format — with rationale per axis, validated against the failure patterns.
  3. Explicit anchor-naming instruction — rationale must quote an anchor by name.
  4. Explicit 0-handling rule — pure syntax and pure reference get b=0, not b=1.
You are scoring a programming concept against the DEBT rubric, an open standard
for measuring what code concepts cost developers. The rubric has four axes:

  d (Detectability) — operational debt. How invisible is misuse to your safety net?
  e (Effort)        — remediation debt. How much work to fix once spotted?
  b (Burden)        — structural debt. How much pain does the wrong choice inflict
                      on every future maintainer?
  t (Trap)          — cognitive debt. Will a competent dev guess wrong about how
                      this concept behaves?

Each axis is scored 0–9. Each axis has anchored values at 1, 3, 5, 7, 9. The rest
(2, 4, 6, 8) are interpolated between the surrounding anchors. Score 0 means the
axis genuinely doesn't apply (e.g. pure syntax has b=0).

ANCHORS:

  d1: compiler catches (calling an undefined function)
  d3: default linter catches (unused variable)
  d5: specialist tool catches (SQL injection via concatenation)
  d7: only review or runtime catches (race conditions)
  d9: effectively undetectable pre-prod (floating-point drift)

  e1: one-line patch (add CSRF token to a form)
  e3: localised refactor (replace deprecated function in one module)
  e5: multi-file refactor (parameterise SQL across a service, 20–50 files)
  e7: cross-cutting refactor (sync → async I/O, months)
  e9: architectural rewrite (monolith → service-per-database, years)

  b1: no future cost (variable naming convention)
  b3: mild ongoing friction (test framework choice)
  b5: persistent productivity tax (ORM that fights your DB; load-bearing helper)
  b7: strategic ceiling (sync-only framework when you'll need concurrency)
  b9: project-killing (microservices for 5-person team; load-bearing wrong data model)

  t1: no trap (array_keys() returns the keys)
  t3: mild gotcha (array_merge() renumbers integer keys)
  t5: notable trap (== vs === in PHP/JS)
  t7: serious trap (PHP type juggling: "0e1234..." == "0" is true)
  t9: catastrophic trap (JS prototype pollution)

PROCEDURE for each axis:
  1. Locate the anchor whose example most closely matches the concept's behaviour.
  2. Assign that anchor's score, OR ±1 if the concept is slightly different.
  3. Write rationale that NAMES the anchor (e.g. "Closest to 'specialist tool catches' (d5),
     but most generic linters now catch the obvious case — one step easier").

OUTPUT FORMAT (JSON, no other text):

  {
    "debt_score": {
      "detectability": <int 0-9>,
      "effort":        <int 0-9>,
      "burden":        <int 0-9>,
      "trap":          <int 0-9>,
      "rationale": {
        "detectability": "<cites a d-anchor by name>",
        "effort":        "<cites an e-anchor by name>",
        "burden":        "<cites a b-anchor by name OR explains why b=0>",
        "trap":          "<cites a t-anchor by name>"
      }
    }
  }

REJECT YOUR OWN OUTPUT IF:
  - Rationale doesn't quote or paraphrase an anchor name.
  - All four scores fall within a span of 1 (uniformity flag).
  - Trap and Detectability move together AND rationale for both reads as "this is bad"
    instead of answering the per-axis question (severity-only flag).
  - Burden ≥ 3 for pure syntax or pure factual reference (phantom-burden flag).

The concept to score is: <CONCEPT_NAME>
The concept's definition is: <CONCEPT_DEFINITION>


Versioning

This is DEBT version 1.0. The format and four axes are considered stable. Future revisions may sharpen anchor examples or add an FAQ entry; they will not change the format or remove an axis without a major version bump.

Tooling should treat the version field in the JSON schema as authoritative. Implementations claiming DEBT 1.0 conformance must use exactly the four axes documented here, in the canonical d/e/b/t order, with the score range [0, 9].
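
A conformance gate over stored records is correspondingly small. A sketch; the function name is illustrative:

// Reject records that claim DEBT 1.0 but violate the stable format:
// exactly four axes, integer values in [0, 9], version field "1.0".
function conformsToDebt10(score: {
  version: string;
  detectability: number;
  effort: number;
  burden: number;
  trap: number;
}): boolean {
  const axes = [score.detectability, score.effort, score.burden, score.trap];
  return (
    score.version === "1.0" &&
    axes.every((v) => Number.isInteger(v) && v >= 0 && v <= 9)
  );
}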

See also: why this design · design & badges