
Semantic Search

ai_ml Intermediate

Also Known As

vector search, neural search, dense retrieval, embedding search

TL;DR

Search that matches by meaning and intent rather than exact keywords — a query for 'how to prevent database attacks' finds SQL injection documentation even if those exact words never appear.

Explanation

Semantic search converts both queries and documents into embeddings: dense vector representations that encode meaning. Documents whose meaning is close to the query sit near it in vector space, with closeness usually measured by cosine similarity or a related distance. This contrasts with keyword search (BM25, TF-IDF), which requires lexical overlap and fails on synonyms, paraphrases, and conceptual matches. In practice, production search often combines both: vector similarity for semantic recall and BM25 for lexical precision, a pattern called hybrid search. Building semantic search in PHP requires an embedding model (via API or local), a vector store, and a query pipeline. Documents are embedded once at ingest; at search time the query is embedded with the same model and the nearest vectors are returned.
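To make "nearest vectors" concrete, here is a minimal sketch of brute-force nearest-neighbour ranking by cosine similarity. The function names, document IDs, and three-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions, and a vector store would normally do this ranking for you.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two equal-length vectors.
 * Returns a value in [-1, 1]; higher means more similar.
 */
function cosineSimilarity(array $a, array $b): float
{
    if (count($a) !== count($b) || $a === []) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction, so treat it as unrelated
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Brute-force nearest-neighbour search: score every document, keep the top $k.
 * $documents maps a document ID to its embedding.
 */
function nearestDocuments(array $queryEmbedding, array $documents, int $k): array
{
    $scores = [];
    foreach ($documents as $id => $embedding) {
        $scores[$id] = cosineSimilarity($queryEmbedding, $embedding);
    }
    arsort($scores); // highest similarity first, keys preserved
    return array_slice($scores, 0, $k, true);
}

// Toy 3-dimensional "embeddings" (made up for illustration).
$documents = [
    'sql-injection'  => [0.9, 0.1, 0.0],
    'password-reset' => [0.1, 0.9, 0.1],
    'css-grid'       => [0.0, 0.1, 0.9],
];
$queryEmbedding = [0.8, 0.2, 0.1]; // stands in for "how to prevent database attacks"

$top = nearestDocuments($queryEmbedding, $documents, 2);
print_r(array_keys($top)); // IDs of the two closest documents, best first
```

Scanning every document like this is only workable for small collections. Vector stores answer the same question with approximate nearest-neighbour indexes instead.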

Common Misconception

Semantic search makes keyword search obsolete. In practice, hybrid search, which combines vector similarity with BM25 keyword matching, often outperforms either approach alone. Semantic search excels at conceptual queries and handles synonyms well; keyword search excels at exact product codes, names, and rare terms. Production systems typically run both retrievers and merge the two result lists, for example with rank fusion, optionally followed by a reranker.
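One common way to merge a vector result list with a BM25 result list is reciprocal rank fusion (RRF). The text above does not prescribe this method; the sketch below is one possible approach, and the document IDs and the constant `$k = 60` (a frequently used default) are assumptions for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Reciprocal rank fusion: each list contributes 1 / ($k + rank) per document,
 * where rank starts at 1. Documents ranked well by several lists rise to the top.
 *
 * @param array $rankedLists each inner array holds document IDs, best first
 * @return array fused scores keyed by document ID, best first
 */
function reciprocalRankFusion(array $rankedLists, int $k = 60): array
{
    $scores = [];
    foreach ($rankedLists as $rankedIds) {
        foreach (array_values($rankedIds) as $position => $id) {
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $position + 1);
        }
    }
    arsort($scores); // highest fused score first
    return $scores;
}

// Hypothetical results for one query from the two retrievers.
$vectorResults = ['password-reset', 'account-recovery', 'sku-4471-manual'];
$bm25Results   = ['sku-4471-manual', 'password-reset', 'sku-4471-changelog'];

$fused = reciprocalRankFusion([$vectorResults, $bm25Results]);
print_r(array_keys($fused));
```

RRF uses only rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A reranker model can then reorder the fused top results if more precision is needed.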

Why It Matters

Semantic search transforms search from a string-matching problem into a meaning-matching problem. Users who type 'forgot my login' find password reset documentation even if the word 'forgot' never appears in the docs. For PHP applications serving content-heavy sites, knowledge bases, or e-commerce catalogues, semantic search can substantially reduce zero-result searches and improve relevance without requiring users to guess exact keywords.

Common Mistakes

  • Using a general-purpose embedding model for domain-specific search: a model fine-tuned for the domain (for example, on code search for programming queries) typically produces better results than a general text embedding model.
  • Not chunking documents before embedding: embedding a 10,000-word document produces a single vector that averages across all its topics, reducing precision, and the text may exceed the model's input limit.
  • Ignoring metadata filtering: most searches combine semantic similarity with structured filters (category, date, author), and vector databases support these efficiently.
  • Evaluating only by user satisfaction: measure retrieval quality offline with recall@K and MRR (mean reciprocal rank) before optimising anything downstream of retrieval, such as generation quality.
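The two retrieval metrics named in the last item can be computed in a few lines. This is a minimal sketch: the function names and the tiny evaluation set are hypothetical, and a real evaluation would use labelled queries from your own domain.

```php
<?php
declare(strict_types=1);

/** Fraction of the relevant documents that appear in the top $k results. */
function recallAtK(array $rankedIds, array $relevantIds, int $k): float
{
    if ($relevantIds === []) {
        return 0.0;
    }
    $topK = array_slice($rankedIds, 0, $k);
    $hits = count(array_intersect($relevantIds, $topK));
    return $hits / count($relevantIds);
}

/** 1 / rank of the first relevant result, or 0.0 if none was returned. */
function reciprocalRank(array $rankedIds, array $relevantIds): float
{
    foreach (array_values($rankedIds) as $position => $id) {
        if (in_array($id, $relevantIds, true)) {
            return 1.0 / ($position + 1);
        }
    }
    return 0.0;
}

// Hypothetical evaluation set: search output vs. human-labelled relevant documents.
$evaluationSet = [
    ['ranked' => ['a', 'b', 'c', 'd'], 'relevant' => ['b', 'd']],
    ['ranked' => ['e', 'f', 'g', 'h'], 'relevant' => ['e']],
    ['ranked' => ['i', 'j', 'k', 'l'], 'relevant' => ['z']],
];

$recalls = [];
$reciprocalRanks = [];
foreach ($evaluationSet as $case) {
    $recalls[]         = recallAtK($case['ranked'], $case['relevant'], 3);
    $reciprocalRanks[] = reciprocalRank($case['ranked'], $case['relevant']);
}

// Average over all queries; MRR is the mean of the per-query reciprocal ranks.
$meanRecallAt3 = array_sum($recalls) / count($recalls);
$mrr           = array_sum($reciprocalRanks) / count($reciprocalRanks);

printf("recall@3 = %.3f, MRR = %.3f\n", $meanRecallAt3, $mrr);
```

Tracking these numbers on a fixed query set makes it possible to compare embedding models, chunk sizes, or hybrid settings before any change reaches users.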

Code Examples

✗ Vulnerable
// ❌ Keyword matching instead of semantic search — misses synonyms/intent
function search(string $query, PDO $db): array {
    $stmt = $db->prepare(
        "SELECT * FROM articles WHERE content LIKE :q"
    );
    $stmt->execute([':q' => "%$query%"]);
    return $stmt->fetchAll();
    // "prevent data breach" won't match "SQL injection", "XSS", "auth bypass"
}
✓ Fixed
// ✅ Semantic search with pgvector — matches by meaning
// 1. At ingest: embed and store
$embedding = $embedder->embed($article['content']); // e.g. OpenAI, Cohere, Voyage
$pdo->prepare("
    INSERT INTO articles (title, content, embedding)
    VALUES (:title, :content, :embedding)
")->execute([
    ':title'     => $article['title'],
    ':content'   => $article['content'],
    ':embedding' => json_encode($embedding), // '[0.1,0.2,...]': a JSON array of numbers matches pgvector's text format
]);

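// 2. At search: embed the query with the same model and find the nearest vectors
$queryEmbedding = $embedder->embed($userQuery);
$queryVector    = json_encode($queryEmbedding);
// <=> is pgvector's cosine distance, so 1 - distance gives cosine similarity
$stmt = $pdo->prepare("
    SELECT title, content,
           1 - (embedding <=> CAST(:q_select AS vector)) AS similarity
    FROM articles
    ORDER BY embedding <=> CAST(:q_order AS vector)
    LIMIT 10
");
// Two placeholder names: the PDO manual only permits reusing a named parameter when emulated prepares are on
$stmt->execute([':q_select' => $queryVector, ':q_order' => $queryVector]);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC); // most similar articles first
// Now 'car' can match 'automobile', 'vehicle', 'motor transport'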
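
The ingest step above embeds each article's full content as one vector. For long articles, the chunking advice under Common Mistakes could be applied first. Below is a rough sketch that splits on words with a small overlap between chunks; `chunkByWords` and its default sizes are assumptions for illustration, and a real pipeline would more likely split on tokens, sentences, or headings and store one row per chunk.

```php
<?php
declare(strict_types=1);

/**
 * Split text into chunks of at most $chunkSize words, where each chunk
 * repeats the last $overlap words of the previous one.
 *
 * @return array list of chunk strings, in document order
 */
function chunkByWords(string $text, int $chunkSize = 200, int $overlap = 40): array
{
    if ($chunkSize <= 0 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Require chunkSize > 0 and 0 <= overlap < chunkSize.');
    }
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $chunks = [];
    $step = $chunkSize - $overlap; // how far the window advances each time
    for ($start = 0; $start < count($words); $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $chunkSize));
        if ($start + $chunkSize >= count($words)) {
            break; // this chunk already reached the end of the text
        }
    }
    return $chunks;
}

// Small sizes so the overlap is easy to see.
$chunks = chunkByWords('one two three four five six seven eight nine ten', 4, 1);
print_r($chunks);
```

Each chunk would then be embedded and inserted separately, so a query matches the passage that covers its topic instead of an average over the whole article.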

Added 23 Mar 2026
DEV INTEL Tools & Severity
🔵 Info ⚙ Fix effort: High
⚡ Quick Fix
Embed queries and documents with the same model, store in pgvector, query with SELECT ... ORDER BY embedding <=> $query_vector LIMIT 10
