
AI Observability

ai_ml Intermediate
DEBT score: d8 / e7 / b7 / t7
d8 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9), adjusted to d8. The detection_hints confirm automated detection is 'no' — there are no tools that automatically flag missing AI observability. The code pattern (LLM API calls with no trace span, no token/cost logging, no quality metric emission) is not caught by any standard linter, SAST, or APM tool. The absence of observability is invisible until quality degrades and users complain. Scored d8 rather than d9 because a careful code reviewer could notice the absence of instrumentation around LLM calls.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix sounds simple ('log every prompt-response pair...add an automated quality scorer'), but implementing this properly requires touching every LLM call site across the application, setting up trace spans, adding quality evaluation infrastructure, building prompt versioning, PII scrubbing, and alerting pipelines. This is inherently cross-cutting — it spans multiple files, services, and requires new infrastructure (quality scorers, dashboards, alerting). Not quite e9 architectural rework since the core application architecture doesn't change, but it's a substantial cross-cutting effort.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). Once you establish (or fail to establish) AI observability patterns, every new LLM feature must conform to the instrumentation conventions. The applies_to scope covers web, cli, and queue-worker contexts. Prompt versioning, quality scoring, and cost tracking become load-bearing infrastructure that shapes how every team member writes and deploys LLM-related code. Every prompt change, model swap, or new AI feature is affected. Not quite b9 since it doesn't define the entire system's shape — it's an operational concern layered on top.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception is precisely this: developers experienced with traditional APM (Datadog, New Relic, etc.) assume their existing monitoring covers LLM applications. Standard APM captures latency and errors perfectly well, so the 'obvious' approach — treating LLM calls like any other HTTP dependency — feels complete but misses the dominant failure modes: quality degradation, prompt drift, hallucination rate. The trap is serious because the similar concept (traditional observability) actively misleads developers into thinking they're covered.

About DEBT scoring →

Also Known As

LLM monitoring · LLM observability · AI monitoring · prompt observability

TL;DR

The practice of monitoring, tracing, and evaluating LLM-powered systems in production — covering latency, token costs, prompt drift, output quality, and failure modes.

Explanation

AI observability extends traditional application observability (metrics, logs, traces) to cover the unique failure modes of LLM systems. A standard web request either succeeds or fails with a structured error; an LLM response can succeed at the HTTP level but hallucinate, drift from expected tone, violate content policy, or degrade in quality as prompts evolve.

Key observability signals include token usage and cost per request, latency at the prompt and completion stages, guardrail trigger rate, output quality scores from automated evaluators (LLM-as-judge, embedding similarity to golden answers), prompt version tracking, and user feedback signals (thumbs up/down, edit rate). Tools in this space include LangSmith, Langfuse, Helicone, Arize AI, and Weights & Biases Prompts.

Production AI observability requires structured logging of every prompt-response pair (with PII scrubbing), tracing that correlates a user action to its chain of LLM calls, and alerting on quality regressions when a prompt is changed.
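The structured-logging requirement above can be sketched as follows. The page's own examples are PHP against a hypothetical client, so this sketch is in Python; the field names and PII patterns are illustrative only, and a real deployment would use a dedicated PII-detection library.

```python
import hashlib
import json
import re

# Illustrative patterns only; production scrubbing needs locale-aware
# detection, not two regexes.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b"), "<SSN>"),
]


def scrub(text: str) -> str:
    """Replace detected PII before the text leaves the process."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def llm_log_record(prompt: str, response: str, prompt_version: str,
                   input_tokens: int, output_tokens: int) -> dict:
    """One structured, PII-scrubbed record per prompt-response pair."""
    return {
        "prompt_version": prompt_version,
        "prompt": scrub(prompt),
        "response": scrub(response),
        "tokens": {"input": input_tokens, "output": output_tokens},
        # A hash of the raw prompt allows deduplication and correlation
        # without storing the unscrubbed text.
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:12],
    }


record = llm_log_record(
    "Summarise the ticket from jane@example.com", "Summary: ...",
    "summarise-v3", input_tokens=42, output_tokens=17,
)
print(json.dumps(record, indent=2))
```

Scrubbing before serialisation matters because once a raw prompt is indexed by an observability platform, deleting it later is much harder than never shipping it.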

Diagram

flowchart TD
    USER[User Request] --> APP[Application]
    APP --> GUARD[Input Guardrail]
    GUARD --> LLM[LLM API]
    LLM --> RESP[Response]
    RESP --> APP
    APP --> USER
    subgraph Observability
        TRACE[Trace span per request]
        METRICS[Metrics: cost tokens latency]
        QUALITY[Quality score evaluator]
        LOGS[Structured logs PII-scrubbed]
        ALERT[Alerts on regression]
    end
    APP -->|emit| TRACE & METRICS & QUALITY & LOGS
    METRICS & QUALITY --> ALERT
style ALERT fill:#f85149,color:#fff
style QUALITY fill:#238636,color:#fff

Common Misconception

That standard APM tools are sufficient for LLM applications. They capture latency and errors, but cannot detect quality degradation, prompt drift, or hallucination rate, which are the dominant failure modes.

Why It Matters

Prompt changes and model updates silently degrade response quality — without AI-specific observability you discover the regression from angry users, not from metrics.

Common Mistakes

  • Logging only errors and not successful prompt-response pairs — quality drift is invisible without a baseline of normal outputs.
  • Not tracking token usage per feature — costs grow silently until they cause a budget alert months later.
  • Changing prompts without versioning — you cannot correlate a quality drop to a specific prompt change.
  • Logging full prompts without PII scrubbing — user data ends up in your observability platform's index.
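The versioning mistake above has a lightweight fix: derive the version tag from the template text itself, so any edit automatically produces a new tag. This is a Python sketch with a hypothetical in-code prompt registry; in practice templates might live in files or a prompt-management tool.

```python
import hashlib

# Hypothetical prompt registry for illustration.
PROMPTS = {
    "summarise": "Summarise the following text in three sentences:\n{input}",
}


def prompt_version(name: str) -> str:
    """Derive a stable version tag from the template text, so every
    edit to the prompt yields a tag you can correlate with quality."""
    digest = hashlib.sha256(PROMPTS[name].encode()).hexdigest()[:8]
    return f"{name}-{digest}"


print(prompt_version("summarise"))
```

Attaching this tag to every trace span and quality metric is what lets a dashboard answer "did the score drop after this specific prompt change?".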

Avoid When

  • Logging raw prompt-response pairs without PII scrubbing — user data will end up in your observability platform.
  • Running expensive LLM-as-judge evaluators synchronously on every request — use async evaluation for non-latency-critical quality signals.

When To Use

  • Emit a trace span for every LLM call capturing prompt version, model, token usage, cost, and latency.
  • Run an automated quality evaluator (embedding similarity or LLM-as-judge) on sampled outputs and track the score over time.
  • Version every prompt and correlate quality changes to prompt diffs — treat prompts as code.
  • Alert when the rolling quality score or error/block rate exceeds a threshold, just as you would for p99 latency.
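The last point, alerting on a rolling quality score, can be sketched like this; the window size and threshold are illustrative values, not recommendations.

```python
from collections import deque


class RollingQualityAlert:
    """Fire an alert when the rolling mean quality score over the last
    `window` samples drops below `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.8):
        self.scores: deque = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Record one evaluator score; return True if the alert fires."""
        self.scores.append(score)
        # Only alert on a full window, to avoid noise during warm-up.
        if len(self.scores) < self.scores.maxlen:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold
```

This is the same shape as a p99-latency alert: a sliding aggregate compared against a baseline, which is exactly why the quality signal deserves the same treatment.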

Code Examples

💡 Note
Every LLM call emits a trace span with token usage, cost, latency, and an automated quality score — enabling dashboards that detect prompt-level regressions.
✗ Vulnerable
// No observability — LLM call is a black box
$response = $llm->complete($prompt);
return $response->text;
// No logging, no cost tracking, no quality measurement
✓ Fixed
// Structured AI observability with tracing
$span = $tracer->start('llm.complete', [
    'prompt.version' => 'summarise-v3',
    'model'          => 'claude-sonnet-4-20250514',
]);

$response = $llm->complete($prompt);

$span->setAttributes([
    'tokens.input'    => $response->usage->inputTokens,
    'tokens.output'   => $response->usage->outputTokens,
    'latency.ms'      => $span->elapsed(),
    'cost.usd'        => $response->estimatedCost(),
]);

// Quality signal from automated evaluator
$score = $evaluator->score($prompt, $response->text);
$span->setAttribute('quality.score', $score);

$this->metrics->histogram('llm.quality', $score, ['prompt' => 'summarise-v3']);
$span->end();

return $response->text;

Added 29 Mar 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
Log every prompt-response pair with token counts, latency, and prompt version; add an automated quality scorer and alert when the rolling average drops below your baseline
📦 Applies To
any · web · cli · queue-worker
🔍 Detection Hints
LLM API calls with no surrounding trace span, no token/cost logging, and no quality metric emission
Auto-detectable: ✗ No
🤖 AI Agent
Confidence: Medium · False Positives: Low · ✗ Manual fix · Fix effort: Medium · Context: Function · Tests: Update
