AI Observability
debt(d8/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9), adjusted to d8. The detection_hints confirm automated detection is 'no' — there are no tools that automatically flag missing AI observability. The code pattern (LLM API calls with no trace span, no token/cost logging, no quality metric emission) is not caught by any standard linter, SAST, or APM tool. The absence of observability is invisible until quality degrades and users complain. Scored d8 rather than d9 because a careful code reviewer could notice the absence of instrumentation around LLM calls.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix sounds simple ('log every prompt-response pair...add an automated quality scorer'), but implementing this properly requires touching every LLM call site across the application, setting up trace spans, adding quality evaluation infrastructure, and building prompt versioning, PII scrubbing, and alerting pipelines. This is inherently cross-cutting — it spans multiple files and services and requires new infrastructure (quality scorers, dashboards, alerting). Not quite e9 architectural rework since the core application architecture doesn't change, but it's a substantial cross-cutting effort.
Closest to 'strong gravitational pull' (b7). Once you establish (or fail to establish) AI observability patterns, every new LLM feature must conform to the instrumentation conventions. The applies_to scope covers web, cli, and queue-worker contexts. Prompt versioning, quality scoring, and cost tracking become load-bearing infrastructure that shapes how every team member writes and deploys LLM-related code. Every prompt change, model swap, or new AI feature is affected. Not quite b9 since it doesn't define the entire system's shape — it's an operational concern layered on top.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception is precisely this: developers experienced with traditional APM (Datadog, New Relic, etc.) assume their existing monitoring covers LLM applications. Standard APM captures latency and errors perfectly well, so the 'obvious' approach — treating LLM calls like any other HTTP dependency — feels complete but misses the dominant failure modes: quality degradation, prompt drift, hallucination rate. The trap is serious because the similar concept (traditional observability) actively misleads developers into thinking they're covered.
Also Known As
LLM observability, LLM monitoring
TL;DR
An LLM call can return HTTP 200 and still be wrong. Instrument every LLM call with trace spans, token and cost metrics, prompt versions, and automated quality scores, and alert on quality regressions the way you would on latency.
Explanation
AI observability extends traditional application observability (metrics, logs, traces) to cover the unique failure modes of LLM systems. A standard web request either succeeds or fails with a structured error; an LLM response can succeed at the HTTP level yet hallucinate, drift from the expected tone, violate content policy, or degrade in quality as prompts evolve.

Key observability signals include token usage and cost per request, latency at the prompt and completion stages, guardrail trigger rate, output quality scores from automated evaluators (LLM-as-judge, embedding similarity to golden answers), prompt version tracking, and user feedback signals (thumbs up/down, edit rate). Tools in this space include LangSmith, Langfuse, Helicone, Arize AI, and Weights & Biases Prompts.

Production AI observability requires three things: structured logging of every prompt-response pair (with PII scrubbing), tracing that correlates a user action to its chain of LLM calls, and alerting on quality regressions when a prompt changes.
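A minimal sketch of the first requirement, assuming hypothetical $scrubber and $logger helpers (any PII-redaction utility and structured logger the application already has will do):

// Log every prompt-response pair, scrubbed before it reaches the index.
// $scrubber and $logger are assumed helpers, not a specific library API.
$logger->info('llm.completion', [
    'prompt.version' => 'summarise-v3',
    'model' => 'claude-sonnet-4-20250514',
    'prompt' => $scrubber->redact($prompt),
    'response' => $scrubber->redact($response->text),
    'tokens.input' => $response->usage->inputTokens,
    'tokens.output' => $response->usage->outputTokens,
]);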
Diagram
flowchart TD
USER[User Request] --> APP[Application]
APP --> GUARD[Input Guardrail]
GUARD --> LLM[LLM API]
LLM --> RESP[Response]
RESP --> APP
APP --> USER
subgraph Observability
TRACE[Trace span per request]
METRICS[Metrics: cost tokens latency]
QUALITY[Quality score evaluator]
LOGS[Structured logs PII-scrubbed]
ALERT[Alerts on regression]
end
APP -->|emit| TRACE & METRICS & QUALITY & LOGS
METRICS & QUALITY --> ALERT
style ALERT fill:#f85149,color:#fff
style QUALITY fill:#238636,color:#fff
Common Misconception
That traditional APM (Datadog, New Relic, etc.) already covers LLM applications. Standard APM captures latency and errors, but it misses the dominant failure modes of LLM systems: quality degradation, prompt drift, and hallucination rate.
Why It Matters
Missing AI observability is silent in production: no linter, SAST tool, or APM dashboard flags an uninstrumented LLM call, so the first signal of a quality regression is user complaints. By then there is no baseline of normal outputs and no prompt history to debug against.
Common Mistakes
- Logging only errors and not successful prompt-response pairs — quality drift is invisible without a baseline of normal outputs.
- Not tracking token usage per feature — costs grow silently until they cause a budget alert months later.
- Changing prompts without versioning — you cannot correlate a quality drop to a specific prompt change (see the versioning sketch after this list).
- Logging full prompts without PII scrubbing — user data ends up in your observability platform's index.
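A minimal sketch of the versioning fix, assuming a hypothetical PromptRegistry (a git-tracked template per version achieves the same thing):

// Treat prompts as code: versions are immutable and content-hashed,
// so a quality drop can be correlated to a specific prompt diff.
final class PromptRegistry
{
    /** @var array<string, string> map of version id => prompt template */
    private array $prompts = [];

    public function register(string $version, string $template): void
    {
        $this->prompts[$version] = $template;
    }

    public function get(string $version): string
    {
        return $this->prompts[$version];
    }

    // Stable content hash to attach to every span and metric label
    public function hash(string $version): string
    {
        return substr(hash('sha256', $this->prompts[$version]), 0, 12);
    }
}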
Avoid When
- Logging raw prompt-response pairs without PII scrubbing — user data will end up in your observability platform.
- Running expensive LLM-as-judge evaluators synchronously on every request — use async evaluation for non-latency-critical quality signals (see the sketch after this list).
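A minimal sketch of the async approach, assuming a hypothetical $queue job queue plus the $scrubber helper from the logging sketch above and a traceId() accessor on the span:

// Defer the expensive LLM-as-judge call off the request path; a worker
// scores the output later and attaches the result via the trace id.
$queue->push('llm.evaluate', [
    'trace.id' => $span->traceId(),
    'prompt.version' => 'summarise-v3',
    'prompt' => $scrubber->redact($prompt),
    'response' => $scrubber->redact($response->text),
]);
return $response->text; // the user gets the answer immediately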
When To Use
- Emit a trace span for every LLM call capturing prompt version, model, token usage, cost, and latency.
- Run an automated quality evaluator (embedding similarity or LLM-as-judge) on sampled outputs and track the score over time.
- Version every prompt and correlate quality changes to prompt diffs — treat prompts as code.
- Alert when the rolling quality score or error/block rate crosses its threshold, just as you would for p99 latency (a sketch follows the code examples below).
Code Examples
// No observability — LLM call is a black box
$response = $llm->complete($prompt);
return $response->text;
// No logging, no cost tracking, no quality measurement
// Structured AI observability with tracing
$span = $tracer->start('llm.complete', [
    'prompt.version' => 'summarise-v3',
    'model' => 'claude-sonnet-4-20250514',
]);
try {
    $response = $llm->complete($prompt);
    $span->setAttributes([
        'tokens.input' => $response->usage->inputTokens,
        'tokens.output' => $response->usage->outputTokens,
        'latency.ms' => $span->elapsed(),
        'cost.usd' => $response->estimatedCost(),
    ]);

    // Quality signal from automated evaluator
    $score = $evaluator->score($prompt, $response->text);
    $span->setAttribute('quality.score', $score);
    $this->metrics->histogram('llm.quality', $score, ['prompt' => 'summarise-v3']);

    return $response->text;
} finally {
    // End the span even if the LLM call throws, so failed requests
    // still appear in traces instead of vanishing
    $span->end();
}
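Finally, a sketch of the alerting rule from 'When To Use'. In a real deployment the threshold check lives in the metrics backend (a monitor on the llm.quality histogram); rollingAverage() and page() below are hypothetical helpers:

// Alerting on quality regression. Illustrative only: in production this
// check normally runs in the metrics backend, not application code.
$rolling = $this->metrics->rollingAverage('llm.quality', ['prompt' => 'summarise-v3']);
if ($rolling < 0.80) { // per-prompt threshold, treated like a p99 latency SLO
    $this->alerts->page('llm-quality-regression', [
        'prompt.version' => 'summarise-v3',
        'rolling.score' => $rolling,
    ]);
}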