
Semantic Search

AI / ML Intermediate
debt(d8/e6/b6/t6)
d8 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'silent in production until users hit it' (d8). No detection hints are provided for this entry. Poor retrieval quality (wrong model, no chunking, missing hybrid search) is silent: queries still return results, just irrelevant ones. Only systematic recall@K/MRR evaluation catches it. This rates slightly better than d9 because tools for measuring retrieval metrics exist.
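The two metrics named above can be computed from a small hand-labelled query set. The following is a minimal sketch, assuming each query has a ranked list of returned document IDs and a list of IDs judged relevant; the function names and the array shape are hypothetical, not part of any library.

```php
<?php
declare(strict_types=1);

/**
 * Recall@K for one query: fraction of the relevant documents found in the top K results.
 * Assumes $relevantIds contains no duplicates.
 */
function recallAtK(array $rankedIds, array $relevantIds, int $k): float
{
    if ($relevantIds === []) {
        return 0.0;
    }
    $topK = array_slice($rankedIds, 0, $k);
    $hits = count(array_intersect($relevantIds, $topK));

    return $hits / count($relevantIds);
}

/** Reciprocal rank for one query: 1 / position of the first relevant result, or 0 if none. */
function reciprocalRank(array $rankedIds, array $relevantIds): float
{
    foreach (array_values($rankedIds) as $index => $id) {
        if (in_array($id, $relevantIds, true)) {
            return 1.0 / ($index + 1);
        }
    }

    return 0.0;
}

/**
 * Mean reciprocal rank (MRR) over a labelled query set.
 * Each run is ['ranked' => [...ids], 'relevant' => [...ids]] (hypothetical shape).
 */
function meanReciprocalRank(array $runs): float
{
    if ($runs === []) {
        return 0.0;
    }
    $sum = 0.0;
    foreach ($runs as $run) {
        $sum += reciprocalRank($run['ranked'], $run['relevant']);
    }

    return $sum / count($runs);
}
```

Tracking these numbers before and after a change (new model, new chunking, added keyword search) is what turns a silent regression into a visible one.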

e6 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e6). The quick fix suggests a simple pgvector query, but fixing real problems (switching embedding models, adding a chunking pipeline, introducing hybrid search with BM25 plus a reranker) requires re-embedding the entire corpus and changing the ingestion and query paths across multiple components.

b6 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b6). Semantic search becomes load-bearing for content discovery; embedding model choice, vector DB, and chunking strategy shape ingestion pipelines, storage, and query layers. Migrating models forces full re-indexing. Between persistent tax (b5) and gravitational pull (b7).

t6 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t6). The misconception that semantic search replaces keyword search is widespread and contradicts production reality where hybrid search wins. Additional traps (chunking, domain-specific embeddings) compound it, but the concept is named intuitively enough to avoid t7.


Also Known As

vector search neural search dense retrieval embedding search

TL;DR

Search that matches by meaning and intent rather than exact keywords — a query for 'how to prevent database attacks' finds SQL injection documentation even if those exact words never appear.

Explanation

Semantic search works by converting both queries and documents into embeddings — dense vector representations that encode meaning. Documents semantically similar to the query cluster nearby in vector space. This contrasts with keyword search (BM25, TF-IDF) which requires lexical overlap and fails on synonyms, paraphrases, and conceptual matches. In practice, production search often combines both: vector similarity for semantic recall and BM25 for lexical precision, a pattern called hybrid search. Building semantic search in PHP requires an embedding model (via API or local), a vector store, and a query pipeline — the query gets embedded at search time and the nearest vectors are returned.
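To make "nearest in vector space" concrete, here is a minimal in-memory sketch of the query step using cosine similarity. The three-dimensional vectors are invented toy values (real embedding models return hundreds or thousands of dimensions), and the function names are hypothetical; a vector store performs the same comparison with an index instead of a full scan.

```php
<?php
declare(strict_types=1);

/** Cosine similarity between two equal-length vectors; 1.0 means same direction. */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

/** Score every document against the query and return the best $limit, highest first. */
function nearestDocuments(array $queryVector, array $documentVectors, int $limit): array
{
    $scores = [];
    foreach ($documentVectors as $id => $vector) {
        $scores[$id] = cosineSimilarity($queryVector, $vector);
    }
    arsort($scores); // sort by similarity, descending, keeping document IDs as keys

    return array_slice($scores, 0, $limit, true);
}

// Toy vectors standing in for real embeddings (assumption for illustration only).
$docs = [
    'sql-injection'  => [0.9, 0.1, 0.0],
    'css-grid'       => [0.0, 0.2, 0.9],
    'password-reset' => [0.3, 0.9, 0.1],
];
$query = [0.8, 0.2, 0.1]; // stands in for the embedded user query
$top = nearestDocuments($query, $docs, 2);
```

A full scan like this is fine for a few thousand documents; beyond that, an indexed vector store is the usual choice.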

Common Misconception

Misconception: semantic search makes keyword search obsolete. In practice, hybrid search, which combines vector similarity with BM25 keyword matching, typically outperforms either approach alone. Semantic search excels at conceptual queries and handles synonyms well; keyword search excels at exact product codes, names, and rare terms. Production systems usually run both, merge the two result lists, and often apply a reranker to the merged candidates.
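The text does not say how the vector and keyword results are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as this entry's prescribed method. It works on rank positions only, so the two score scales never need to be normalised. The function name and the document IDs are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Merge several ranked ID lists with reciprocal rank fusion.
 * Each list adds 1 / ($k + rank) to every document it contains (rank starts at 1).
 *
 * $rankedLists: e.g. [vector results, BM25 results], each ordered best first.
 * $k: smoothing constant; 60 is a commonly used default.
 * Returns document ID => fused score, best first.
 */
function reciprocalRankFusion(array $rankedLists, int $k = 60): array
{
    $scores = [];
    foreach ($rankedLists as $list) {
        foreach (array_values($list) as $index => $id) {
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $index + 1);
        }
    }
    arsort($scores); // highest fused score first, keys preserved

    return $scores;
}

// Hypothetical result lists from the two retrievers.
$vectorHits  = ['doc-sqli', 'doc-xss', 'doc-csrf'];
$keywordHits = ['doc-sku-4711', 'doc-sqli'];

$fused = reciprocalRankFusion([$vectorHits, $keywordHits]);
```

Documents found by both retrievers rise to the top, while an exact-match hit that only keyword search found still outranks weaker vector-only results. A reranker, if used, would then reorder the first few fused candidates.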

Why It Matters

Semantic search transforms search from a string-matching problem into a meaning-matching problem. Users who type 'forgot my login' find password reset documentation even if the word 'forgot' never appears in the docs. For PHP applications serving content-heavy sites, knowledge bases, or e-commerce catalogues, semantic search can substantially reduce zero-result searches and improve relevance without requiring users to guess exact keywords.

Common Mistakes

  • Using a general-purpose embedding model for domain-specific search — a model fine-tuned on code search produces significantly better results for programming queries than a general text embedding model.
  • Not chunking documents before embedding — embedding a 10,000-word document produces a single vector that averages across all its topics, reducing precision.
  • Ignoring metadata filtering — most searches combine semantic similarity with structured filters (category, date, author) and vector databases support these efficiently.
  • Evaluating only by user satisfaction — measure retrieval quality with recall@K and MRR before optimising generation quality.
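The chunking mistake above is the easiest to address in code. Below is a minimal sketch of fixed-size chunking with overlap, splitting on whitespace. The function name and the default sizes are assumptions for illustration; production pipelines often split on headings, paragraphs, or model tokens instead of raw words.

```php
<?php
declare(strict_types=1);

/**
 * Split text into overlapping word windows so each chunk gets its own embedding.
 * Consecutive chunks share $overlap words, so a sentence cut at a boundary
 * still appears whole in at least one chunk (when it is shorter than the overlap).
 */
function chunkByWords(string $text, int $chunkSize = 200, int $overlap = 40): array
{
    if ($chunkSize < 1 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Require chunkSize >= 1 and 0 <= overlap < chunkSize.');
    }
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false || $words === []) {
        return [];
    }

    $chunks = [];
    $total = count($words);
    $step = $chunkSize - $overlap; // how far the window advances each time
    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $chunkSize));
        if ($start + $chunkSize >= $total) {
            break; // this window already reached the end of the text
        }
    }

    return $chunks;
}
```

Each chunk would then be embedded and stored as its own row, with a reference back to the parent document so search results can link to the full article.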

Code Examples

✗ Vulnerable
// ❌ Keyword matching instead of semantic search — misses synonyms/intent
function search(string $query, PDO $db): array {
    $stmt = $db->prepare(
        "SELECT * FROM articles WHERE content LIKE :q"
    );
    $stmt->execute([':q' => "%$query%"]);
    return $stmt->fetchAll();
    // "prevent data breach" won't match "SQL injection", "XSS", "auth bypass"
}
✓ Fixed
// ✅ Semantic search with pgvector: matches by meaning
// Assumes an `embedding` column of type vector and an $embedder returning a list of floats
// 1. At ingest: embed and store
$embedding = $embedder->embed($article['content']); // e.g. OpenAI, Cohere, Voyage
$pdo->prepare("
    INSERT INTO articles (title, content, embedding)
    VALUES (:title, :content, CAST(:embedding AS vector))
")->execute([
    ':title'     => $article['title'],
    ':content'   => $article['content'],
    ':embedding' => json_encode($embedding), // '[0.1,0.2,...]' matches pgvector's text format
]);

// 2. At search: embed the query with the same model and find nearest vectors
$queryEmbedding = $embedder->embed($userQuery);
$stmt = $pdo->prepare("
    SELECT title, content,
           1 - (embedding <=> CAST(:q1 AS vector)) AS similarity -- <=> is cosine distance
    FROM articles
    ORDER BY embedding <=> CAST(:q2 AS vector)
    LIMIT 10
");
// Distinct placeholder names: PDO only allows reusing a name when emulated prepares are on
$queryVector = json_encode($queryEmbedding);
$stmt->execute([':q1' => $queryVector, ':q2' => $queryVector]);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
// Now 'car' matches 'automobile', 'vehicle', 'motor transport'

Added 23 Mar 2026
DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: High
⚡ Quick Fix
Embed queries and documents with the same model, store in pgvector, query with SELECT ... ORDER BY embedding <=> $query_vector LIMIT 10

