
RAG — Retrieval-Augmented Generation

AI / ML · Intermediate
debt(d9/e7/b7/t7)
d9 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d9). Bad RAG retrieval (poor chunking, missing overlap, weak ranking) doesn't surface in any linter or type checker; it shows up as subtly wrong or incomplete LLM answers in production. Catching it requires deliberate recall@K measurement, which most teams skip.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick fix sounds simple (re-chunk, re-embed, adjust retrieval), but changing the chunk size or embedding model means re-ingesting the entire corpus, rebuilding the vector index, and revalidating retrieval quality. Adding a re-ranker touches the whole pipeline. Scored slightly below architectural rework because the overall RAG shape stays intact.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). RAG is load-bearing for LLM apps: chunking strategy, embedding model choice, vector store, and retrieval pipeline shape every feature that touches knowledge. Swapping embedding models forces full re-indexing; the architecture constrains latency, cost, and answer quality across the product.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). Developers often conflate RAG with fine-tuning and choose the wrong tool for the problem. In addition, the 'obvious' defaults (big chunks, no overlap, cosine-only ranking) all degrade quality in non-obvious ways, contradicting the naive intuition that bigger context means better retrieval.


Also Known As

retrieval augmented generation, RAG pipeline, retrieval-augmented LLM

TL;DR

An LLM architecture that fetches relevant documents from an external knowledge base before generating a response, grounding answers in retrieved facts rather than training data alone.

Explanation

Retrieval-Augmented Generation combines a retrieval step, which searches a vector database or document store for semantically similar content, with a generation step in which an LLM synthesises a response using the retrieved context. The retrieved documents are injected into the prompt as grounding material. This addresses two core LLM limitations: knowledge cutoff (the pipeline can draw on up-to-date sources) and hallucination (the model is steered to answer from retrieved text rather than from interpolated training patterns, which reduces fabricated answers but does not eliminate them). In PHP applications, RAG typically means embedding documents into a vector store such as Pinecone or pgvector, then at query time embedding the question, retrieving the top-K most similar chunks, and passing them to a hosted LLM API.
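The retrieval step described above can be sketched without any external service. This is a minimal illustration, not how Pinecone or pgvector work internally: the function names `cosineSimilarity` and `retrieveTopK`, the chunk array shape, and the toy 3-dimensional vectors are all assumptions made for the example. A real system would get vectors from an embedding model and use an indexed vector store rather than a linear scan.

```php
<?php
declare(strict_types=1);

/** Cosine similarity between two equal-length numeric vectors. */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must have the same dimension.');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction; treat it as unrelated
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Return the $topK chunks most similar to the query vector.
 * Each chunk is assumed to look like ['text' => string, 'vector' => float[]].
 */
function retrieveTopK(array $queryVector, array $chunks, int $topK = 5): array
{
    $scored = [];
    foreach ($chunks as $chunk) {
        $scored[] = [
            'text'  => $chunk['text'],
            'score' => cosineSimilarity($queryVector, $chunk['vector']),
        ];
    }
    // Highest similarity first
    usort($scored, fn(array $x, array $y): int => $y['score'] <=> $x['score']);
    return array_slice($scored, 0, $topK);
}

// Toy 3-dimensional vectors stand in for real embeddings (hypothetical data).
$chunks = [
    ['text' => 'Refunds are processed within 5 days.', 'vector' => [0.9, 0.1, 0.0]],
    ['text' => 'Our office is closed on Sundays.',     'vector' => [0.0, 0.2, 0.9]],
    ['text' => 'Refund requests need an order ID.',    'vector' => [0.8, 0.3, 0.1]],
];
$results = retrieveTopK([1.0, 0.2, 0.0], $chunks, topK: 2);
```

The two refund-related chunks rank above the unrelated one because their vectors point in nearly the same direction as the query vector. The `text` values of `$results` are what would be joined into the prompt's context block.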

Common Misconception

RAG replaces fine-tuning as the way to teach a model about your data. In fact, RAG and fine-tuning solve different problems: RAG gives the model access to external facts at inference time, while fine-tuning changes the model's weights to adjust style, format, or domain-specific reasoning patterns. Most production use cases that hinge on proprietary or changing facts need RAG, not fine-tuning.

Why It Matters

RAG is the dominant production architecture for LLM-powered applications because it avoids retraining while keeping answers current and grounded. Without RAG, LLMs answer from training data that has a knowledge cutoff and no access to your proprietary content. With RAG, the same base model can answer questions about your codebase, documentation, or database — and you can update the knowledge base without retraining anything.

Common Mistakes

  • Chunking documents too coarsely — large chunks reduce retrieval precision because a 2000-token chunk matching a query may contain only one relevant sentence.
  • Ignoring chunk overlap — without overlap between adjacent chunks, sentences that span a chunk boundary lose context in retrieval.
  • Using cosine similarity as the only ranking signal — re-ranking retrieved chunks with a cross-encoder before sending to the LLM significantly improves answer quality.
  • Skipping the retrieval evaluation step — measuring recall@K (did the right chunk get retrieved?) is essential before tuning generation quality.
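The recall@K check from the last item can be sketched as a small function over a hand-labelled test set. This is an illustrative sketch: the function name `recallAtK`, the case format, and the chunk IDs such as `doc1#3` are hypothetical, and it assumes each query has exactly one expected chunk.

```php
<?php
declare(strict_types=1);

/**
 * Fraction of queries whose expected chunk ID appears in the top-K results.
 *
 * @param array<int, array{expected: string, retrieved: string[]}> $cases
 */
function recallAtK(array $cases, int $k): float
{
    if ($cases === [] || $k < 1) {
        throw new InvalidArgumentException('Need at least one case and k >= 1.');
    }
    $hits = 0;
    foreach ($cases as $case) {
        // Only the first $k retrieved IDs count toward a hit
        $topK = array_slice($case['retrieved'], 0, $k);
        if (in_array($case['expected'], $topK, true)) {
            $hits++;
        }
    }
    return $hits / count($cases);
}

// Hypothetical labelled queries: the chunk that should be found,
// and the ranked chunk IDs the retriever actually returned.
$cases = [
    ['expected' => 'doc1#3', 'retrieved' => ['doc1#3', 'doc2#1', 'doc4#7']],
    ['expected' => 'doc2#5', 'retrieved' => ['doc9#1', 'doc2#5', 'doc3#2']],
    ['expected' => 'doc7#2', 'retrieved' => ['doc1#1', 'doc5#4', 'doc6#6']],
    ['expected' => 'doc3#8', 'retrieved' => ['doc4#4', 'doc8#8', 'doc3#8']],
];
$recallAt1 = recallAtK($cases, 1); // 1 of 4 queries hit
$recallAt3 = recallAtK($cases, 3); // 3 of 4 queries hit
```

Tracking this number before and after a change to chunk size, overlap, or ranking shows whether retrieval itself improved, independently of how the LLM phrases its answer.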

Code Examples

✗ Vulnerable
// ❌ Injecting entire documents into context instead of relevant chunks
function buildPrompt(string $question): string {
    $docs = file_get_contents('entire_knowledge_base.txt'); // 200k tokens
    return "Use this context:\n$docs\n\nQuestion: $question";
    // Blows context window, dilutes relevance, expensive per-call
}
✓ Fixed
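// ✅ RAG with chunked retrieval: only relevant sections in context
// Embedder and VectorStore are app-defined interfaces, not a specific library.
function buildRagPrompt(string $question, Embedder $embedder, VectorStore $store): string
{
    // Embed the query with the same model used at ingest
    $queryVector = $embedder->embed($question);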

    // Retrieve top-k relevant 256-512 token chunks (not entire documents)
    $chunks = $store->similaritySearch($queryVector, topK: 5);

    // Build grounded context block
    $context = implode("\n---\n", array_column($chunks, 'text'));

    return "Use ONLY the following context to answer.\n"
         . "Context:\n$context\n\n"
         . "Question: $question";
}

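// Chunking: split documents into ~400-token overlapping chunks at ingest.
// Whitespace-separated words stand in for tokens here; swap in the tokenizer
// that matches your embedding model for accurate sizes.
function chunkDocument(string $text, int $size = 400, int $overlap = 50): array
{
    if ($size < 1 || $overlap < 0 || $overlap >= $size) {
        throw new InvalidArgumentException('Require 0 <= overlap < size.');
    }
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $total  = count($tokens);
    $chunks = [];
    for ($i = 0; $i < $total; $i += $size - $overlap) {
        $chunks[] = implode(' ', array_slice($tokens, $i, $size));
        if ($i + $size >= $total) {
            break; // this chunk reached the end; skip a redundant tail chunk
        }
    }
    return $chunks;
}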
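The re-ranking step listed under Common Mistakes can be added between retrieval and prompt building. The sketch below is an assumption-laden illustration: `rerank` and `$wordOverlap` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs on its own. In a real pipeline the scorer would call a cross-encoder model that reads the question and chunk together.

```php
<?php
declare(strict_types=1);

/**
 * Re-order retrieved chunks by a second-stage relevance score.
 *
 * @param callable(string, string): float $scorer (question, chunkText) => score
 */
function rerank(string $question, array $chunks, callable $scorer, int $keep = 3): array
{
    $scored = [];
    foreach ($chunks as $chunk) {
        $chunk['rerank_score'] = $scorer($question, $chunk['text']);
        $scored[] = $chunk;
    }
    // Highest second-stage score first
    usort($scored, fn(array $a, array $b): int => $b['rerank_score'] <=> $a['rerank_score']);
    return array_slice($scored, 0, $keep);
}

// Stand-in scorer: share of distinct question words that appear in the chunk.
// A cross-encoder call would replace this closure in practice.
$wordOverlap = function (string $question, string $text): float {
    $questionWords = array_unique(str_word_count(strtolower($question), 1));
    $chunkWords    = array_flip(str_word_count(strtolower($text), 1));
    if ($questionWords === []) {
        return 0.0;
    }
    $shared = 0;
    foreach ($questionWords as $word) {
        if (isset($chunkWords[$word])) {
            $shared++;
        }
    }
    return $shared / count($questionWords);
};

// Chunks in the order the vector search returned them (hypothetical data).
$retrieved = [
    ['text' => 'Shipping takes three days.'],
    ['text' => 'How do I reset my password?'],
    ['text' => 'How refunds work: we credit the original card.'],
];
$top = rerank('how do refunds work', $retrieved, $wordOverlap, keep: 2);
```

Retrieving a wider candidate set (say top-20) and keeping only the best few after re-ranking is a common arrangement, since the second-stage scorer is usually too slow to run over the whole corpus.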

Added 23 Mar 2026
DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: High
⚡ Quick Fix
Split documents into 256–512 token overlapping chunks, embed with the same model used at query time, retrieve top-5, pass as context before the user question
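One way to enforce the "same model at ingest and query time" part of the quick fix is to record the model identifier alongside the index and check it on every query. The sketch below assumes an in-memory index; `ChunkIndex`, `ingest`, `$fakeEmbed`, and the model IDs are hypothetical names for illustration, and the fake embedder only exists so the example runs without an API call.

```php
<?php
declare(strict_types=1);

/** In-memory index that remembers which embedding model built it. */
final class ChunkIndex
{
    /** @var array<int, array{text: string, vector: float[]}> */
    private array $entries = [];

    public function __construct(private string $modelId) {}

    public function add(string $text, array $vector): void
    {
        $this->entries[] = ['text' => $text, 'vector' => $vector];
    }

    /** Reject queries embedded with a different model: the vector spaces are not comparable. */
    public function assertSameModel(string $queryModelId): void
    {
        if ($queryModelId !== $this->modelId) {
            throw new RuntimeException(
                "Index built with {$this->modelId}, query embedded with {$queryModelId}."
            );
        }
    }

    public function count(): int
    {
        return count($this->entries);
    }
}

/**
 * Ingest: embed every chunk with $embed and store it under $modelId.
 *
 * @param callable(string): float[] $embed
 */
function ingest(array $chunkTexts, callable $embed, string $modelId): ChunkIndex
{
    $index = new ChunkIndex($modelId);
    foreach ($chunkTexts as $text) {
        $index->add($text, $embed($text));
    }
    return $index;
}

// Fake embedder for demonstration only; a real one would call an embedding API.
$fakeEmbed = fn(string $text): array => [strlen($text) / 100.0, substr_count($text, ' ') / 10.0];

$index = ingest(['First chunk of text.', 'Second chunk of text.'], $fakeEmbed, 'model-a');
$index->assertSameModel('model-a'); // same model: passes silently
```

Failing loudly on a model mismatch turns a silent quality regression into an immediate error, which is the prompt to re-embed and rebuild the index.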

