RAG — Retrieval-Augmented Generation
Also Known As
retrieval augmented generation
RAG pipeline
retrieval-augmented LLM
TL;DR
An LLM architecture that fetches relevant documents from an external knowledge base before generating a response, grounding answers in retrieved facts rather than training data alone.
Explanation
Retrieval-Augmented Generation combines a retrieval step — searching a vector database or document store for semantically similar content — with a generation step where an LLM synthesises a response using the retrieved context. The retrieved documents are injected into the prompt as grounding material. This mitigates two core LLM limitations: knowledge cutoff (the model can query up-to-date sources) and hallucination (the model answers from retrieved text rather than from interpolated training patterns). In PHP applications, RAG typically means embedding documents into a vector store like Pinecone or pgvector, then at query time embedding the question, retrieving the top-K similar chunks, and passing them to a hosted LLM API.
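At ingest time the pipeline is: chunk each document, embed every chunk, and write the text plus its vector to the store. Below is a minimal sketch against a pgvector table via PDO; the chunks table schema, the Embedder class, and the chunkDocument() helper (defined under Code Examples below) are illustrative assumptions rather than a specific library's API.
// Ingest: chunk a document, embed each chunk, store text + vector in pgvector
// Assumed schema: CREATE TABLE chunks (id bigserial PRIMARY KEY, text text, embedding vector(1536));
function ingestDocument(string $text, PDO $pdo, Embedder $embedder): void
{
    $stmt = $pdo->prepare('INSERT INTO chunks (text, embedding) VALUES (:text, :embedding)');
    foreach (chunkDocument($text) as $chunk) {
        // Must be the same embedding model later used to embed queries
        $vector = $embedder->embed($chunk);
        $stmt->execute([
            ':text'      => $chunk,
            ':embedding' => '[' . implode(',', $vector) . ']', // pgvector accepts '[x,y,...]' literals
        ]);
    }
}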
Common Misconception
✗ RAG replaces fine-tuning as the way to teach a model about your data. RAG and fine-tuning solve different problems — RAG gives the model access to external facts at inference time, while fine-tuning changes the model's weights to adjust style, format, or domain-specific reasoning patterns. Most production use cases need RAG, not fine-tuning.
Why It Matters
RAG is the dominant production architecture for LLM-powered applications because it avoids retraining while keeping answers current and grounded. Without RAG, LLMs answer from training data that has a knowledge cutoff and no access to your proprietary content. With RAG, the same base model can answer questions about your codebase, documentation, or database — and you can update the knowledge base without retraining anything.
Common Mistakes
- Chunking documents too coarsely — large chunks reduce retrieval precision because a 2000-token chunk matching a query may contain only one relevant sentence.
- Ignoring chunk overlap — without overlap between adjacent chunks, sentences that span a chunk boundary lose context in retrieval.
- Using cosine similarity as the only ranking signal — re-ranking retrieved chunks with a cross-encoder before sending to the LLM significantly improves answer quality.
- Skipping the retrieval evaluation step — measuring recall@K (did the right chunk get retrieved?) is essential before tuning generation quality; see the recall@K sketch after this list.
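A minimal recall@K check, assuming a small hand-labelled set of (question, expected chunk id) pairs and the same VectorStore and Embedder interfaces used in the code examples below; the method and field names are illustrative, not a specific library's API.
// Recall@K: fraction of labelled questions whose expected chunk appears in the top-K results
function recallAtK(array $labelled, VectorStore $store, Embedder $embedder, int $k = 5): float
{
    $hits = 0;
    foreach ($labelled as [$question, $expectedChunkId]) {
        $queryVector = $embedder->embed($question);
        $retrieved = $store->similaritySearch($queryVector, topK: $k);
        // Assumes each retrieved chunk carries the id it was stored under
        if (in_array($expectedChunkId, array_column($retrieved, 'id'), true)) {
            $hits++;
        }
    }
    return $hits / count($labelled); // e.g. 0.80 means the right chunk was retrieved for 80% of questions
}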
Code Examples
✗ Vulnerable
// ❌ Injecting entire documents into context instead of relevant chunks
function buildPrompt(string $question): string {
    $docs = file_get_contents('entire_knowledge_base.txt'); // 200k tokens
    // Blows context window, dilutes relevance, expensive per-call
    return "Use this context:\n$docs\n\nQuestion: $question";
}
✓ Fixed
// ✅ RAG with chunked retrieval — only relevant sections in context
function buildRagPrompt(string $question, VectorStore $store, Embedder $embedder): string
{
    // Embed the query with the same model used to embed the chunks at ingest
    $queryVector = $embedder->embed($question);

    // Retrieve top-K relevant 256-512 token chunks (not entire documents)
    $chunks = $store->similaritySearch($queryVector, topK: 5);

    // Build grounded context block
    $context = implode("\n---\n", array_column($chunks, 'text'));

    return "Use ONLY the following context to answer.\n"
        . "Context:\n$context\n\n"
        . "Question: $question";
}
// Chunking: split documents into ~400 token overlapping chunks at ingest
function chunkDocument(string $text, int $size = 400, int $overlap = 50): array
{
    // Approximate tokens with whitespace-split words; for accurate sizing,
    // use the embedding model's own tokenizer so $size matches real token counts
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $step = $size - $overlap; // each chunk starts $overlap tokens before the previous one ends
    for ($i = 0; $i < count($tokens); $i += $step) {
        $chunks[] = implode(' ', array_slice($tokens, $i, $size));
    }
    return $chunks;
}
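Query time then ties the pieces together: build the grounded prompt and send it to a hosted model. The sketch below assumes an OpenAI-style chat completions endpoint; the model name is a placeholder and error handling is minimal, so adapt it to whichever provider you call.
// Query time: retrieve, build the grounded prompt, then ask the hosted LLM
function answerQuestion(string $question, VectorStore $store, Embedder $embedder, string $apiKey): string
{
    $prompt = buildRagPrompt($question, $store, $embedder);

    $ch = curl_init('https://api.openai.com/v1/chat/completions');
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_HTTPHEADER     => [
            'Content-Type: application/json',
            "Authorization: Bearer $apiKey",
        ],
        CURLOPT_POSTFIELDS     => json_encode([
            'model'    => 'gpt-4o-mini', // placeholder model name
            'messages' => [['role' => 'user', 'content' => $prompt]],
        ]),
    ]);
    $raw = curl_exec($ch);
    curl_close($ch);
    if ($raw === false) {
        return '';
    }

    $response = json_decode($raw, true);
    return $response['choices'][0]['message']['content'] ?? '';
}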
Tools & Severity
🔵 Info
⚙ Fix effort: High
⚡ Quick Fix
Split documents into 256–512 token overlapping chunks, embed with the same model used at query time, retrieve top-5, pass as context before the user question