AI Observability
debt(d8/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9), adjusted to d8. The detection_hints confirm automated detection is 'no' — there are no tools that automatically flag missing AI observability. The code pattern (LLM API calls with no trace span, no token/cost logging, no quality metric emission) is not caught by any standard linter, SAST, or APM tool. The absence of observability is invisible until quality degrades and users complain. Scored d8 rather than d9 because a careful code reviewer could notice the absence of instrumentation around LLM calls.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix sounds simple ('log every prompt-response pair...add an automated quality scorer'), but implementing this properly requires touching every LLM call site across the application, setting up trace spans, adding quality evaluation infrastructure, and building prompt versioning, PII scrubbing, and alerting pipelines. This is inherently cross-cutting — it spans multiple files and services and requires new infrastructure (quality scorers, dashboards, alerting). Not quite e9 architectural rework since the core application architecture doesn't change, but it's a substantial cross-cutting effort.
Closest to 'strong gravitational pull' (b7). Once you establish (or fail to establish) AI observability patterns, every new LLM feature must conform to the instrumentation conventions. The applies_to scope covers web, cli, and queue-worker contexts. Prompt versioning, quality scoring, and cost tracking become load-bearing infrastructure that shapes how every team member writes and deploys LLM-related code. Every prompt change, model swap, or new AI feature is affected. Not quite b9 since it doesn't define the entire system's shape — it's an operational concern layered on top.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception is precisely this: developers experienced with traditional APM (Datadog, New Relic, etc.) assume their existing monitoring covers LLM applications. Standard APM captures latency and errors perfectly well, so the 'obvious' approach — treating LLM calls like any other HTTP dependency — feels complete but misses the dominant failure modes: quality degradation, prompt drift, hallucination rate. The trap is serious because the similar concept (traditional observability) actively misleads developers into thinking they're covered.
Also Known As
LLM observability, LLM monitoring
TL;DR
An LLM call can return HTTP 200 and still be wrong. Instrument every LLM call with trace spans, token and cost metrics, prompt versions, and automated quality scores, and alert on quality regressions the way you would on latency.
Explanation
AI observability extends traditional application observability (metrics, logs, traces) to cover the unique failure modes of LLM systems. A standard web request either succeeds or fails with a structured error; an LLM response can succeed at the HTTP level yet hallucinate, drift from the expected tone, violate content policy, or degrade in quality as prompts evolve.

Key observability signals include token usage and cost per request, latency at the prompt and completion stages, guardrail trigger rate, output quality scores from automated evaluators (LLM-as-judge, embedding similarity to golden answers), prompt version tracking, and user feedback signals (thumbs up/down, edit rate). Tools in this space include LangSmith, Langfuse, Helicone, Arize AI, and Weights & Biases Prompts.

Production AI observability requires three things: structured logging of every prompt-response pair (with PII scrubbing), tracing that correlates a user action to its chain of LLM calls, and alerting on quality regressions when a prompt changes.
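A minimal sketch of the first requirement, assuming hypothetical $scrubber and $logger helpers (any PII-redaction utility and structured logger the application already has will do):

// Log every prompt-response pair, scrubbed before it reaches the index.
// $scrubber and $logger are assumed helpers, not a specific library API.
$logger->info('llm.completion', [
    'prompt.version' => 'summarise-v3',
    'model' => 'claude-sonnet-4-20250514',
    'prompt' => $scrubber->redact($prompt),
    'response' => $scrubber->redact($response->text),
    'tokens.input' => $response->usage->inputTokens,
    'tokens.output' => $response->usage->outputTokens,
]);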
Diagram
flowchart TD
USER[User Request] --> APP[Application]
APP --> GUARD[Input Guardrail]
GUARD --> LLM[LLM API]
LLM --> RESP[Response]
RESP --> APP
APP --> USER
subgraph Observability
TRACE[Trace span per request]
METRICS[Metrics: cost tokens latency]
QUALITY[Quality score evaluator]
LOGS[Structured logs PII-scrubbed]
ALERT[Alerts on regression]
end
APP -->|emit| TRACE & METRICS & QUALITY & LOGS
METRICS & QUALITY --> ALERT
style ALERT fill:#f85149,color:#fff
style QUALITY fill:#238636,color:#fff
Common Misconception
That traditional APM (Datadog, New Relic, etc.) already covers LLM applications. Standard APM captures latency and errors, but it misses the dominant failure modes of LLM systems: quality degradation, prompt drift, and hallucination rate.
Why It Matters
Missing AI observability is silent in production: no linter, SAST tool, or APM dashboard flags an uninstrumented LLM call, so the first signal of a quality regression is user complaints. By then there is no baseline of normal outputs and no prompt history to debug against.
Common Mistakes
- Logging only errors and not successful prompt-response pairs — quality drift is invisible without a baseline of normal outputs.
- Not tracking token usage per feature — costs grow silently until they cause a budget alert months later.
- Changing prompts without versioning — you cannot correlate a quality drop to a specific prompt change (see the versioning sketch after this list).
- Logging full prompts without PII scrubbing — user data ends up in your observability platform's index.
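A minimal sketch of the versioning fix, assuming a hypothetical PromptRegistry (a git-tracked template per version achieves the same thing):

// Treat prompts as code: versions are immutable and content-hashed,
// so a quality drop can be correlated to a specific prompt diff.
final class PromptRegistry
{
    /** @var array<string, string> map of version id => prompt template */
    private array $prompts = [];

    public function register(string $version, string $template): void
    {
        $this->prompts[$version] = $template;
    }

    public function get(string $version): string
    {
        return $this->prompts[$version];
    }

    // Stable content hash to attach to every span and metric label
    public function hash(string $version): string
    {
        return substr(hash('sha256', $this->prompts[$version]), 0, 12);
    }
}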
Avoid When
- Logging raw prompt-response pairs without PII scrubbing — user data will end up in your observability platform.
- Running expensive LLM-as-judge evaluators synchronously on every request — use async evaluation for non-latency-critical quality signals (see the sketch after this list).
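A minimal sketch of the async approach, assuming a hypothetical $queue job queue plus the $scrubber helper from the logging sketch above and a traceId() accessor on the span:

// Defer the expensive LLM-as-judge call off the request path; a worker
// scores the output later and attaches the result via the trace id.
$queue->push('llm.evaluate', [
    'trace.id' => $span->traceId(),
    'prompt.version' => 'summarise-v3',
    'prompt' => $scrubber->redact($prompt),
    'response' => $scrubber->redact($response->text),
]);
return $response->text; // the user gets the answer immediately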
When To Use
- Emit a trace span for every LLM call capturing prompt version, model, token usage, cost, and latency.
- Run an automated quality evaluator (embedding similarity or LLM-as-judge) on sampled outputs and track the score over time.
- Version every prompt and correlate quality changes to prompt diffs — treat prompts as code.
- Alert when the rolling quality score or error/block rate crosses its threshold, just as you would for p99 latency (a sketch follows the code examples below).
Code Examples
// No observability — LLM call is a black box
$response = $llm->complete($prompt);
return $response->text;
// No logging, no cost tracking, no quality measurement
// Structured AI observability with tracing
$span = $tracer->start('llm.complete', [
    'prompt.version' => 'summarise-v3',
    'model' => 'claude-sonnet-4-20250514',
]);
try {
    $response = $llm->complete($prompt);
    $span->setAttributes([
        'tokens.input' => $response->usage->inputTokens,
        'tokens.output' => $response->usage->outputTokens,
        'latency.ms' => $span->elapsed(),
        'cost.usd' => $response->estimatedCost(),
    ]);

    // Quality signal from automated evaluator
    $score = $evaluator->score($prompt, $response->text);
    $span->setAttribute('quality.score', $score);
    $this->metrics->histogram('llm.quality', $score, ['prompt' => 'summarise-v3']);

    return $response->text;
} finally {
    // End the span even if the LLM call throws, so failed requests
    // still appear in traces instead of vanishing
    $span->end();
}
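Finally, a sketch of the alerting rule from 'When To Use'. In a real deployment the threshold check lives in the metrics backend (a monitor on the llm.quality histogram); rollingAverage() and page() below are hypothetical helpers:

// Alerting on quality regression. Illustrative only: in production this
// check normally runs in the metrics backend, not application code.
$rolling = $this->metrics->rollingAverage('llm.quality', ['prompt' => 'summarise-v3']);
if ($rolling < 0.80) { // per-prompt threshold, treated like a p99 latency SLO
    $this->alerts->page('llm-quality-regression', [
        'prompt.version' => 'summarise-v3',
        'rolling.score' => $rolling,
    ]);
}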