
RAG — Retrieval-Augmented Generation

ai_ml Intermediate

Also Known As

retrieval augmented generation, RAG pipeline, retrieval-augmented LLM

TL;DR

An LLM architecture that fetches relevant documents from an external knowledge base before generating a response, grounding answers in retrieved facts rather than training data alone.

Explanation

Retrieval-Augmented Generation combines a retrieval step — searching a vector database or document store for semantically similar content — with a generation step in which an LLM synthesises a response from the retrieved context. The retrieved documents are injected into the prompt as grounding material. This mitigates two core LLM limitations: knowledge cutoff (the model can query up-to-date sources) and hallucination (the model answers from retrieved text rather than from interpolated training patterns). In PHP applications, RAG typically means embedding documents into a vector store such as Pinecone or pgvector at ingest time, then at query time embedding the question, retrieving the top-K most similar chunks, and passing them to a hosted LLM API.
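The ingest side of that pipeline is worth seeing end to end. Below is a minimal sketch, assuming hypothetical Embedder and VectorStore interfaces rather than any specific client library (Pinecone, pgvector, and the embedding providers each expose their own APIs); it chunks a document, embeds each chunk, and upserts the vectors with their source text as metadata. The chunkDocument() helper is defined in the Code Examples section below.

// Minimal interfaces assumed throughout this entry (adapt to your client library)
interface Embedder {
    /** @return float[] */
    public function embed(string $text): array;
}

interface VectorStore {
    /** @param float[] $vector */
    public function upsert(string $id, array $vector, array $metadata): void;

    /** @return array<array{id: string, text: string, score: float}> */
    public function similaritySearch(array $vector, int $topK): array;
}

// Ingest: chunk, embed, upsert. Run once per document; re-run when it changes.
function ingestDocument(string $docId, string $text, Embedder $embedder, VectorStore $store): void
{
    foreach (chunkDocument($text) as $i => $chunk) {
        // Must be the same embedding model used for queries later
        $vector = $embedder->embed($chunk);
        $store->upsert("$docId-$i", $vector, ['text' => $chunk, 'source' => $docId]);
    }
}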

Common Misconception

"RAG replaces fine-tuning as the way to teach a model about your data." In fact, RAG and fine-tuning solve different problems — RAG gives the model access to external facts at inference time, while fine-tuning changes the model's weights to adjust style, format, or domain-specific reasoning patterns. Most production use cases need RAG, not fine-tuning.

Why It Matters

RAG is the dominant production architecture for LLM-powered applications because it avoids retraining while keeping answers current and grounded. Without RAG, LLMs answer from training data that has a knowledge cutoff and no access to your proprietary content. With RAG, the same base model can answer questions about your codebase, documentation, or database — and you can update the knowledge base without retraining anything.

Common Mistakes

  • Chunking documents too coarsely — large chunks reduce retrieval precision because a 2000-token chunk matching a query may contain only one relevant sentence.
  • Ignoring chunk overlap — without overlap between adjacent chunks, sentences that span a chunk boundary lose context in retrieval.
  • Using cosine similarity as the only ranking signal — re-ranking retrieved chunks with a cross-encoder before sending to the LLM significantly improves answer quality.
  • Skipping the retrieval evaluation step — measuring recall@K (did the right chunk get retrieved?) is essential before tuning generation quality; see the sketch after this list.
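
To make that last point concrete, here is a minimal recall@K sketch. It assumes a small hand-labelled set of question / expected-chunk-id pairs and reuses the hypothetical Embedder and VectorStore interfaces from above; the ids come back in the store's search results.

// recall@K: the fraction of labelled questions whose expected chunk
// appears among the top-K retrieved results
function recallAtK(array $labelled, Embedder $embedder, VectorStore $store, int $k = 5): float
{
    $hits = 0;
    foreach ($labelled as ['question' => $q, 'expectedId' => $expected]) {
        $results = $store->similaritySearch($embedder->embed($q), topK: $k);
        if (in_array($expected, array_column($results, 'id'), true)) {
            $hits++;
        }
    }
    return count($labelled) > 0 ? $hits / count($labelled) : 0.0;
}

// Hypothetical usage:
// $labelled = [['question' => 'How do refunds work?', 'expectedId' => 'refund-policy-3']];
// echo recallAtK($labelled, $embedder, $store, k: 5); // get retrieval right before tuning prompts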

Code Examples

✗ Vulnerable
// ❌ Injecting entire documents into context instead of relevant chunks
function buildPrompt(string $question): string {
    $docs = file_get_contents('entire_knowledge_base.txt'); // 200k tokens
    return "Use this context:\n$docs\n\nQuestion: $question";
    // Blows context window, dilutes relevance, expensive per-call
}
✓ Fixed
// ✅ RAG with chunked retrieval — only relevant sections in context
function buildRagPrompt(string $question, Embedder $embedder, VectorStore $store): string
{
    // Embed the query with the same model used at ingest time
    $queryVector = $embedder->embed($question);

    // Retrieve top-k relevant 256-512 token chunks (not entire documents)
    $chunks = $store->similaritySearch($queryVector, topK: 5);

    // Build grounded context block
    $context = implode("\n---\n", array_column($chunks, 'text'));

    return "Use ONLY the following context to answer.\n"
         . "Context:\n$context\n\n"
         . "Question: $question";
}
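
// Usage sketch: $llm is a placeholder for whatever hosted LLM client you
// use (OpenAI, Anthropic, etc.); complete() is an assumed method name.
//   $prompt = buildRagPrompt($question, $embedder, $store);
//   $answer = $llm->complete($prompt);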

// Chunking: split documents into ~400-token overlapping chunks at ingest.
// Whitespace-delimited words stand in for tokens here; a real tokenizer
// matched to your embedding model gives more accurate chunk sizes.
function chunkDocument(string $text, int $size = 400, int $overlap = 50): array
{
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    // Advance by ($size - $overlap) so adjacent chunks share $overlap tokens
    for ($i = 0; $i < count($tokens); $i += ($size - $overlap)) {
        $chunks[] = implode(' ', array_slice($tokens, $i, $size));
    }
    return $chunks;
}

⚡ Quick Fix
Split documents into 256–512 token overlapping chunks, embed them at ingest with the same model you will use at query time, retrieve the top 5 most similar chunks per query, and pass them as context ahead of the user's question.
