{
    "slug": "semantic_search",
    "term": "Semantic Search",
    "category": "ai_ml",
    "difficulty": "intermediate",
    "short": "Search that matches by meaning and intent rather than exact keywords — a query for 'how to prevent database attacks' finds SQL injection documentation even if those exact words never appear.",
    "long": "Semantic search works by converting both queries and documents into embeddings — dense vector representations that encode meaning. Documents semantically similar to the query cluster nearby in vector space. This contrasts with keyword search (BM25, TF-IDF) which requires lexical overlap and fails on synonyms, paraphrases, and conceptual matches. In practice, production search often combines both: vector similarity for semantic recall and BM25 for lexical precision, a pattern called hybrid search. Building semantic search in PHP requires an embedding model (via API or local), a vector store, and a query pipeline — the query gets embedded at search time and the nearest vectors are returned.",
    "aliases": [
        "vector search",
        "neural search",
        "dense retrieval",
        "embedding search"
    ],
    "tags": [
        "semantic-search",
        "embeddings",
        "vector-database",
        "search",
        "nlp"
    ],
    "misconception": "Semantic search makes keyword search obsolete. Hybrid search — combining vector similarity with BM25 keyword matching — consistently outperforms either approach alone. Semantic search excels at conceptual queries and handles synonyms well; keyword search excels at exact product codes, names, and rare terms. Production systems use both with a reranker to combine scores.",
    "why_it_matters": "Semantic search transforms search from a string-matching problem into a meaning-matching problem. Users who type 'forgot my login' find password reset documentation even if the word 'forgot' never appears in the docs. For PHP applications serving content-heavy sites, knowledge bases, or e-commerce catalogues, semantic search dramatically reduces zero-result searches and improves relevance without requiring users to guess exact keywords.",
    "common_mistakes": [
        "Using a general-purpose embedding model for domain-specific search — a model fine-tuned on code search produces significantly better results for programming queries than a general text embedding model.",
        "Not chunking documents before embedding — embedding a 10,000-word document produces a single vector that averages across all its topics, reducing precision.",
        "Ignoring metadata filtering — most searches combine semantic similarity with structured filters (category, date, author) and vector databases support these efficiently.",
        "Evaluating only by user satisfaction — measure retrieval quality with recall@K and MRR before optimising generation quality."
    ],
    "when_to_use": [],
    "avoid_when": [],
    "related": [
        "vector_database",
        "embeddings",
        "rag_retrieval",
        "inverted_index",
        "bm25"
    ],
    "prerequisites": [],
    "refs": [
        "https://www.pinecone.io/learn/semantic-search/"
    ],
    "bad_code": "// ❌ Keyword matching instead of semantic search — misses synonyms/intent\nfunction search(string $query, PDO $db): array {\n    $stmt = $db->prepare(\n        \"SELECT * FROM articles WHERE content LIKE :q\"\n    );\n    $stmt->execute([':q' => \"%$query%\"]);\n    return $stmt->fetchAll();\n    // \"prevent data breach\" won't match \"SQL injection\", \"XSS\", \"auth bypass\"\n}",
    "good_code": "// ✅ Semantic search with pgvector — matches by meaning\n// 1. At ingest: embed and store\n$embedding = $embedder->embed($article['content']); // e.g. OpenAI, Cohere, Voyage\n$pdo->prepare(\"\n    INSERT INTO articles (title, content, embedding)\n    VALUES (:title, :content, :embedding)\n\")->execute([\n    ':title'     => $article['title'],\n    ':content'   => $article['content'],\n    ':embedding' => json_encode($embedding), // pgvector accepts JSON array\n]);\n\n// 2. At search: embed the query and find nearest vectors\n$queryEmbedding = $embedder->embed($userQuery);\n$stmt = $pdo->prepare(\"\n    SELECT title, content,\n           1 - (embedding <=> :q::vector) AS similarity\n    FROM articles\n    ORDER BY embedding <=> :q::vector\n    LIMIT 10\n\");\n$stmt->execute([':q' => json_encode($queryEmbedding)]);\n// Now 'car' matches 'automobile', 'vehicle', 'motor transport'",
    "quick_fix": "Embed queries and documents with the same model, store in pgvector, query with SELECT ... ORDER BY embedding <=> $query_vector LIMIT 10",
    "severity": "info",
    "effort": "high",
    "created": "2026-03-23",
    "updated": "2026-03-23",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/semantic_search",
        "html_url": "https://codeclaritylab.com/glossary/semantic_search",
        "json_url": "https://codeclaritylab.com/glossary/semantic_search.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Semantic Search](https://codeclaritylab.com/glossary/semantic_search) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/semantic_search"
            }
        }
    }
}