RAG — Retrieval-Augmented Generation
debt(d9/e7/b7/t7)
Closest to 'silent in production until users hit it' (d9). Bad RAG retrieval (poor chunking, missing overlap, weak ranking) doesn't surface in any linter or type checker — it manifests as subtly wrong or incomplete LLM answers in production. No detection_hints provided; evaluation requires deliberate recall@K measurement which most teams skip.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix describes re-chunking, re-embedding, and adjusting retrieval — but changing chunk size or embedding model requires re-ingesting the entire corpus, rebuilding the vector index, and revalidating retrieval quality. Adding a re-ranker touches the whole pipeline. Slightly below architectural rework since the RAG shape stays intact.
Closest to 'strong gravitational pull' (b7). RAG is load-bearing for LLM apps: chunking strategy, embedding model choice, vector store, and retrieval pipeline shape every feature that touches knowledge. Swapping embedding models forces full re-indexing; the architecture constrains latency, cost, and answer quality across the product.
Closest to 'serious trap' (t7). The misconception explicitly says developers conflate RAG with fine-tuning, choosing the wrong tool for the problem. Additionally, common_mistakes show that 'obvious' defaults (big chunks, no overlap, cosine-only ranking) all degrade quality in non-obvious ways — contradicting the naive intuition that bigger context = better retrieval.
Also Known As
TL;DR
Explanation
Retrieval-Augmented Generation combines a retrieval step, which searches a vector database or document store for semantically similar content, with a generation step in which an LLM synthesises a response using the retrieved context. The retrieved documents are injected into the prompt as grounding material. This mitigates two core LLM limitations: knowledge cutoff (the model can draw on up-to-date sources) and hallucination (the model answers from retrieved text rather than from interpolated training patterns, although it can still misread or ignore that text). In PHP applications, RAG typically means embedding document chunks into a vector store such as Pinecone or pgvector at ingest time, then at query time embedding the question, retrieving the top-K most similar chunks, and passing them to a hosted LLM API.
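The retrieval step described above can be sketched without any external service. This is a minimal in-memory illustration, assuming each chunk is an array with a `text` string and a precomputed `vector`; the function names `cosineSimilarity` and `topKChunks` are hypothetical, and a real vector store such as pgvector or Pinecone would perform this ranking server-side with an index rather than a linear scan.

```php
<?php
/**
 * Cosine similarity between two equal-length numeric vectors.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and of equal length.');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction, so treat it as unrelated
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Score every chunk against the query vector and keep the K best.
 * Hypothetical chunk shape: ['text' => string, 'vector' => list<float>].
 *
 * @return list<array{text: string, score: float}>
 */
function topKChunks(array $queryVector, array $chunks, int $k): array
{
    $scored = [];
    foreach ($chunks as $chunk) {
        $scored[] = [
            'text'  => $chunk['text'],
            'score' => cosineSimilarity($queryVector, $chunk['vector']),
        ];
    }
    // Highest similarity first
    usort($scored, fn (array $x, array $y): int => $y['score'] <=> $x['score']);
    return array_slice($scored, 0, $k);
}
```

The linear scan is only practical for small corpora; it is shown here to make the ranking logic visible, not as a substitute for an indexed vector store.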
Common Misconception
Why It Matters
Common Mistakes
- Chunking documents too coarsely — large chunks reduce retrieval precision because a 2000-token chunk matching a query may contain only one relevant sentence.
- Ignoring chunk overlap — without overlap between adjacent chunks, sentences that span a chunk boundary lose context in retrieval.
- Using cosine similarity as the only ranking signal — re-ranking the retrieved chunks with a cross-encoder before sending them to the LLM often improves answer quality.
- Skipping the retrieval evaluation step — measuring recall@K (did the right chunk get retrieved?) is essential before tuning generation quality.
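The recall@K check from the last item can be sketched as follows. This assumes a small hand-labelled evaluation set in which each question has exactly one known relevant chunk ID and a ranked list of retrieved chunk IDs; the function name `recallAtK` and the array shape are hypothetical, and evaluation sets with several relevant chunks per question would need a different definition.

```php
<?php
/**
 * Fraction of evaluation cases whose relevant chunk appears in the top K results.
 * Hypothetical case shape: ['relevant' => string, 'retrieved' => list<string>],
 * where 'retrieved' is ordered best-first.
 */
function recallAtK(array $cases, int $k): float
{
    if ($cases === [] || $k < 1) {
        throw new InvalidArgumentException('Need at least one case and K >= 1.');
    }
    $hits = 0;
    foreach ($cases as $case) {
        $topK = array_slice($case['retrieved'], 0, $k);
        if (in_array($case['relevant'], $topK, true)) {
            $hits++;
        }
    }
    return $hits / count($cases);
}
```

Running the same evaluation set before and after a change to chunk size, overlap, or ranking gives a direct comparison of retrieval quality that does not depend on the LLM's output.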
Code Examples
// ❌ Injecting entire documents into context instead of relevant chunks
function buildPrompt(string $question): string {
    // Blows the context window, dilutes relevance, and is expensive per call
    $docs = file_get_contents('entire_knowledge_base.txt'); // 200k tokens
    return "Use this context:\n$docs\n\nQuestion: $question";
}
// ✅ RAG with chunked retrieval: only relevant sections in context
// Embedder and VectorStore stand in for whatever client classes the application uses
function buildRagPrompt(string $question, Embedder $embedder, VectorStore $store): string
{
    // Embed the query with the same model that embedded the chunks at ingest
    $queryVector = $embedder->embed($question);
    // Retrieve the top-K relevant 256-512 token chunks (not entire documents)
    $chunks = $store->similaritySearch($queryVector, topK: 5);
    // Build the grounded context block; each chunk is expected to carry a 'text' key
    $context = implode("\n---\n", array_column($chunks, 'text'));
    return "Use ONLY the following context to answer.\n"
        . "Context:\n$context\n\n"
        . "Question: $question";
}
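// Chunking: split documents into ~400-token overlapping chunks at ingest
function chunkDocument(string $text, int $size = 400, int $overlap = 50): array
{
    if ($size < 1 || $overlap < 0 || $overlap >= $size) {
        throw new InvalidArgumentException('Require size >= 1 and 0 <= overlap < size.');
    }
    // Whitespace split is a rough stand-in; use the embedding model's tokenizer for real token counts
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $total = count($tokens);
    $chunks = [];
    for ($i = 0; $i < $total; $i += $size - $overlap) {
        $chunks[] = implode(' ', array_slice($tokens, $i, $size));
        if ($i + $size >= $total) {
            break; // this window reached the end; skip a trailing chunk that only repeats the overlap
        }
    }
    return $chunks;
}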