Semantic Search

ai_ml Intermediate

Also Known As

vector search, neural search, dense retrieval, embedding search

TL;DR

Search that matches by meaning and intent rather than exact keywords — a query for 'how to prevent database attacks' finds SQL injection documentation even if those exact words never appear.

Explanation

Semantic search converts both queries and documents into embeddings: dense vectors that encode meaning. Documents whose meaning is close to the query's land near it in vector space, and closeness is usually measured with cosine similarity or a related distance. Keyword search (BM25, TF-IDF), by contrast, depends on lexical overlap, so it tends to miss synonyms, paraphrases, and conceptual matches. In practice, production search often combines both, using vector similarity for semantic recall and BM25 for lexical precision; this pattern is called hybrid search. Building semantic search in PHP requires an embedding model (called through an API or run locally), a vector store, and a query pipeline: the query is embedded at search time with the same model used for the documents, and the nearest stored vectors are returned.
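
To make "nearest in vector space" concrete, here is a minimal sketch of cosine-similarity ranking in plain PHP. The three-dimensional vectors and document ids are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions, and a vector store would normally do this comparison for you.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two equal-length numeric vectors.
 * Returns a value in [-1, 1]; higher means closer in direction.
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length.');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // a zero vector has no direction; treat as no similarity
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Score every document against the query and sort best-first.
 *
 * @param float[]                $queryVector
 * @param array<string, float[]> $docVectors  document id => embedding
 * @return array<string, float>               document id => similarity
 */
function rankBySimilarity(array $queryVector, array $docVectors): array
{
    $scores = [];
    foreach ($docVectors as $id => $vector) {
        $scores[$id] = cosineSimilarity($queryVector, $vector);
    }
    arsort($scores); // descending by score, keys preserved
    return $scores;
}

// Toy "embeddings" (hypothetical values, not produced by any model).
$docs = [
    'sql-injection'  => [0.9, 0.1, 0.0],
    'password-reset' => [0.1, 0.9, 0.1],
    'css-grid'       => [0.0, 0.1, 0.9],
];
$query = [0.8, 0.2, 0.1]; // stands in for "how to prevent database attacks"

$ranked = rankBySimilarity($query, $docs);
```

With these toy values the query vector points in almost the same direction as the `sql-injection` vector, so that document ranks first even though no keywords were compared.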

Common Misconception

✗ Semantic search makes keyword search obsolete. In practice, hybrid search, which combines vector similarity with BM25 keyword matching, often outperforms either approach alone. Semantic search excels at conceptual queries and handles synonyms well; keyword search excels at exact product codes, names, and rare terms. Production systems typically run both and merge the two result lists with a fusion step (for example weighted scores or reciprocal rank fusion), optionally followed by a reranker.
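
One simple way to merge a vector result list with a keyword result list is reciprocal rank fusion (RRF). The sketch below is only an illustration of that technique, not the method this entry prescribes; the document ids are hypothetical and `$k = 60` is a commonly used default rather than a tuned value.

```php
<?php
declare(strict_types=1);

/**
 * Reciprocal rank fusion: merge several ranked lists of document ids.
 * Each list adds 1 / ($k + rank) to a document's score, with rank starting at 1,
 * so documents that rank well in more than one list rise to the top.
 *
 * @param array<int, string[]> $rankedLists each inner list is ordered best-first
 * @return array<string, float>             document id => fused score, best first
 */
function reciprocalRankFusion(array $rankedLists, int $k = 60): array
{
    $scores = [];
    foreach ($rankedLists as $list) {
        foreach (array_values($list) as $index => $id) {
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $index + 1);
        }
    }
    arsort($scores); // descending by fused score
    return $scores;
}

// Hypothetical results from the two retrievers, best match first.
$vectorHits  = ['doc-auth', 'doc-sqli', 'doc-xss'];
$keywordHits = ['doc-sqli', 'doc-sku-4711', 'doc-auth'];

$fused = reciprocalRankFusion([$vectorHits, $keywordHits]);
```

RRF uses only rank positions, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. A weighted sum of normalised scores is an alternative when you want to favour one retriever.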

Why It Matters

Semantic search transforms search from a string-matching problem into a meaning-matching problem. Users who type 'forgot my login' can find password reset documentation even if the word 'forgot' never appears in the docs. For PHP applications serving content-heavy sites, knowledge bases, or e-commerce catalogues, semantic search can substantially reduce zero-result searches and improve relevance without requiring users to guess exact keywords.

Common Mistakes

Using a general-purpose embedding model for domain-specific search: a model fine-tuned for the domain (for example, on code search for programming queries) typically retrieves better than a general text embedding model.
Not chunking documents before embedding: embedding a 10,000-word document produces a single vector that averages across all its topics, which reduces precision, and many embedding models truncate or reject input beyond their token limit.
Ignoring metadata filtering: many searches combine semantic similarity with structured filters (category, date, author), and most vector databases support such filters.
Evaluating only by user satisfaction: measure retrieval quality directly with recall@K and MRR (mean reciprocal rank) before optimising downstream steps such as answer generation.
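
The two retrieval metrics named in the last item can be computed with a few lines of PHP. This is a minimal sketch assuming you have a small labelled set of queries, each with the ids your search returned and the ids a human judged relevant; the function names and sample data are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * recall@K: fraction of the relevant documents that appear in the top K results.
 */
function recallAtK(array $retrieved, array $relevant, int $k): float
{
    $relevant = array_unique($relevant);
    if ($relevant === []) {
        return 0.0; // nothing to find; defined as 0 here by convention
    }
    $topK = array_slice(array_values($retrieved), 0, $k);
    $hits = count(array_intersect($relevant, $topK));
    return $hits / count($relevant);
}

/**
 * Reciprocal rank: 1 / position of the first relevant result, or 0 if none.
 */
function reciprocalRank(array $retrieved, array $relevant): float
{
    foreach (array_values($retrieved) as $index => $id) {
        if (in_array($id, $relevant, true)) {
            return 1.0 / ($index + 1);
        }
    }
    return 0.0;
}

/**
 * MRR: mean of the reciprocal ranks over all evaluated queries.
 *
 * @param array<int, array{retrieved: string[], relevant: string[]}> $runs
 */
function meanReciprocalRank(array $runs): float
{
    if ($runs === []) {
        return 0.0;
    }
    $sum = 0.0;
    foreach ($runs as $run) {
        $sum += reciprocalRank($run['retrieved'], $run['relevant']);
    }
    return $sum / count($runs);
}

// Hypothetical evaluation set: three queries with labelled relevant ids.
$runs = [
    ['retrieved' => ['a', 'b', 'c'], 'relevant' => ['b']],      // first hit at rank 2
    ['retrieved' => ['d', 'e', 'f'], 'relevant' => ['d', 'f']], // first hit at rank 1
    ['retrieved' => ['g', 'h', 'i'], 'relevant' => ['z']],      // no hit
];

$mrr    = meanReciprocalRank($runs);
$recall = recallAtK($runs[1]['retrieved'], $runs[1]['relevant'], 2);
```

Tracking these numbers on a fixed query set makes it possible to tell whether a change to the embedding model, chunking, or filters actually improved retrieval.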

Code Examples

✗ Vulnerable

// ❌ Keyword matching instead of semantic search — misses synonyms/intent
function search(string $query, PDO $db): array {
    $stmt = $db->prepare(
        "SELECT * FROM articles WHERE content LIKE :q"
    );
    $stmt->execute([':q' => "%$query%"]);
    return $stmt->fetchAll();
    // "prevent data breach" won't match "SQL injection", "XSS", "auth bypass"
}

✓ Fixed

// ✅ Semantic search with pgvector — matches by meaning
// 1. At ingest: embed and store
$embedding = $embedder->embed($article['content']); // e.g. OpenAI, Cohere, Voyage
$pdo->prepare("
    INSERT INTO articles (title, content, embedding)
    VALUES (:title, :content, :embedding)
")->execute([
    ':title'     => $article['title'],
    ':content'   => $article['content'],
    ':embedding' => json_encode($embedding), // yields "[0.1,0.2,...]", the text form pgvector parses
]);

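// 2. At search: embed the query and find nearest vectors
$queryEmbedding = $embedder->embed($userQuery);
$queryVector = json_encode($queryEmbedding);
// <=> is pgvector's cosine distance, so 1 - distance is cosine similarity.
// Two placeholders are bound because PDO does not allow reusing a named
// placeholder in one statement unless emulated prepares are enabled.
$stmt = $pdo->prepare("
    SELECT title, content,
           1 - (embedding <=> CAST(:q_select AS vector)) AS similarity
    FROM articles
    ORDER BY embedding <=> CAST(:q_order AS vector)
    LIMIT 10
");
$stmt->execute([':q_select' => $queryVector, ':q_order' => $queryVector]);
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);
// 'car' can now match 'automobile', 'vehicle', 'motor transport'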
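
The ingest step above embeds each article as a single vector. For long articles, the chunking advice under Common Mistakes could be applied first; the following is a rough sketch of word-based chunking with overlap. The function name, chunk size, and overlap are assumptions for illustration, and real pipelines often split on sentences, headings, or model tokens instead of whitespace.

```php
<?php
declare(strict_types=1);

/**
 * Split text into chunks of at most $size words, where each chunk repeats
 * the last $overlap words of the previous one so context is not cut mid-thought.
 *
 * @return string[]
 */
function chunkByWords(string $text, int $size = 200, int $overlap = 40): array
{
    if ($size < 1 || $overlap < 0 || $overlap >= $size) {
        throw new InvalidArgumentException('Require size >= 1 and 0 <= overlap < size.');
    }
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $total = count($words);
    $step  = $size - $overlap;

    $chunks = [];
    for ($start = 0; $start < $total; $start += $step) {
        $chunks[] = implode(' ', array_slice($words, $start, $size));
        if ($start + $size >= $total) {
            break; // this chunk already reached the end of the text
        }
    }
    return $chunks;
}

// Tiny demonstration with ten placeholder words.
$text   = implode(' ', array_map(fn (int $i): string => "w$i", range(1, 10)));
$chunks = chunkByWords($text, 4, 1);
```

Each chunk would then be embedded and stored as its own row (with a reference back to the parent article), so a query can match the specific passage rather than an average of the whole document.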

References

↗ https://www.pinecone.io/learn/semantic-search/