RAG — Retrieval-Augmented Generation
Also Known As
retrieval augmented generation
RAG pipeline
retrieval-augmented LLM
TL;DR
An LLM architecture that fetches relevant documents from an external knowledge base before generating a response, grounding answers in retrieved facts rather than training data alone.
Explanation
Retrieval-Augmented Generation combines a retrieval step — searching a vector database or document store for semantically similar content — with a generation step where an LLM synthesises a response using the retrieved context. The retrieved documents are injected into the prompt as grounding material. This mitigates two core LLM limitations: knowledge cutoff (the model can query up-to-date sources) and hallucination (the model answers from retrieved text rather than from interpolated training patterns). In PHP applications, RAG typically means embedding documents into a vector store like Pinecone or pgvector, then at query time embedding the question, retrieving the top-K similar chunks, and passing them to a hosted LLM API.
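At ingest time the pipeline is: chunk each document, embed every chunk, and write the text plus its vector to the store. Below is a minimal sketch against a pgvector table via PDO; the chunks table schema, the Embedder class, and the chunkDocument() helper (defined under Code Examples below) are illustrative assumptions rather than a specific library's API.
// Ingest: chunk a document, embed each chunk, store text + vector in pgvector
// Assumed schema: CREATE TABLE chunks (id bigserial PRIMARY KEY, text text, embedding vector(1536));
function ingestDocument(string $text, PDO $pdo, Embedder $embedder): void
{
    $stmt = $pdo->prepare('INSERT INTO chunks (text, embedding) VALUES (:text, :embedding)');
    foreach (chunkDocument($text) as $chunk) {
        // Must be the same embedding model later used to embed queries
        $vector = $embedder->embed($chunk);
        $stmt->execute([
            ':text'      => $chunk,
            ':embedding' => '[' . implode(',', $vector) . ']', // pgvector accepts '[x,y,...]' literals
        ]);
    }
}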
Common Misconception
✗ RAG replaces fine-tuning as the way to teach a model about your data. RAG and fine-tuning solve different problems — RAG gives the model access to external facts at inference time, while fine-tuning changes the model's weights to adjust style, format, or domain-specific reasoning patterns. Most production use cases need RAG, not fine-tuning.
Why It Matters
RAG is the dominant production architecture for LLM-powered applications because it avoids retraining while keeping answers current and grounded. Without RAG, LLMs answer from training data that has a knowledge cutoff and no access to your proprietary content. With RAG, the same base model can answer questions about your codebase, documentation, or database — and you can update the knowledge base without retraining anything.
Common Mistakes
- Chunking documents too coarsely — large chunks reduce retrieval precision because a 2000-token chunk matching a query may contain only one relevant sentence.
- Ignoring chunk overlap — without overlap between adjacent chunks, sentences that span a chunk boundary lose context in retrieval.
- Using cosine similarity as the only ranking signal — re-ranking retrieved chunks with a cross-encoder before sending to the LLM significantly improves answer quality.
- Skipping the retrieval evaluation step — measuring recall@K (did the right chunk get retrieved?) is essential before tuning generation quality; see the recall@K sketch after this list.
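A minimal recall@K check, assuming a small hand-labelled set of (question, expected chunk id) pairs and the same VectorStore and Embedder interfaces used in the code examples below; the method and field names are illustrative, not a specific library's API.
// Recall@K: fraction of labelled questions whose expected chunk appears in the top-K results
function recallAtK(array $labelled, VectorStore $store, Embedder $embedder, int $k = 5): float
{
    $hits = 0;
    foreach ($labelled as [$question, $expectedChunkId]) {
        $queryVector = $embedder->embed($question);
        $retrieved = $store->similaritySearch($queryVector, topK: $k);
        // Assumes each retrieved chunk carries the id it was stored under
        if (in_array($expectedChunkId, array_column($retrieved, 'id'), true)) {
            $hits++;
        }
    }
    return $hits / count($labelled); // e.g. 0.80 means the right chunk was retrieved for 80% of questions
}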
Code Examples
✗ Vulnerable
// ❌ Injecting entire documents into context instead of relevant chunks
function buildPrompt(string $question): string {
    $docs = file_get_contents('entire_knowledge_base.txt'); // 200k tokens
    // Blows context window, dilutes relevance, expensive per-call
    return "Use this context:\n$docs\n\nQuestion: $question";
}
✓ Fixed
// ✅ RAG with chunked retrieval — only relevant sections in context
function buildRagPrompt(string $question, VectorStore $store, Embedder $embedder): string
{
    // Embed the query with the same model used to embed the chunks at ingest
    $queryVector = $embedder->embed($question);

    // Retrieve top-K relevant 256-512 token chunks (not entire documents)
    $chunks = $store->similaritySearch($queryVector, topK: 5);

    // Build grounded context block
    $context = implode("\n---\n", array_column($chunks, 'text'));

    return "Use ONLY the following context to answer.\n"
        . "Context:\n$context\n\n"
        . "Question: $question";
}
// Chunking: split documents into ~400 token overlapping chunks at ingest
function chunkDocument(string $text, int $size = 400, int $overlap = 50): array
{
    // Approximate tokens with whitespace-split words; for accurate sizing,
    // use the embedding model's own tokenizer so $size matches real token counts
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $step = $size - $overlap; // each chunk starts $overlap tokens before the previous one ends
    for ($i = 0; $i < count($tokens); $i += $step) {
        $chunks[] = implode(' ', array_slice($tokens, $i, $size));
    }
    return $chunks;
}
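Query time then ties the pieces together: build the grounded prompt and send it to a hosted model. The sketch below assumes an OpenAI-style chat completions endpoint; the model name is a placeholder and error handling is minimal, so adapt it to whichever provider you call.
// Query time: retrieve, build the grounded prompt, then ask the hosted LLM
function answerQuestion(string $question, VectorStore $store, Embedder $embedder, string $apiKey): string
{
    $prompt = buildRagPrompt($question, $store, $embedder);

    $ch = curl_init('https://api.openai.com/v1/chat/completions');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            "Authorization: Bearer $apiKey",
        ],
        CURLOPT_POSTFIELDS     => json_encode([
            'model'    => 'gpt-4o-mini', // placeholder model name
            'messages' => [['role' => 'user', 'content' => $prompt]],
        ]),
    ]);
    $raw = curl_exec($ch);
    curl_close($ch);
    if ($raw === false) {
        return '';
    }

    $response = json_decode($raw, true);
    return $response['choices'][0]['message']['content'] ?? '';
}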
Tools & Severity
🔵 Info
⚙ Fix effort: High
⚡ Quick Fix
Split documents into 256–512 token overlapping chunks, embed with the same model used at query time, retrieve top-5, pass as context before the user question