{
    "slug": "vector_databases",
    "term": "Vector Databases",
    "category": "ai_ml",
    "difficulty": "intermediate",
    "short": "Databases specialised for storing and querying high-dimensional vectors — enabling fast approximate nearest-neighbour search across millions of embeddings.",
    "long": "Traditional databases cannot efficiently query 'find the 10 most similar vectors to this query vector' across millions of rows. Vector databases use specialised index structures (HNSW, IVF) for approximate nearest-neighbour (ANN) search. Options: Pinecone (managed), Weaviate (self-hosted or managed), Qdrant (Rust, self-hosted), pgvector (PostgreSQL extension — good starting point). For PHP applications, pgvector requires no new infrastructure and supports hybrid search (vector + SQL filters).",
    "aliases": [
        "pgvector",
        "Pinecone",
        "Weaviate",
        "Qdrant",
        "ANN search"
    ],
    "tags": [
        "ai",
        "vector-database",
        "search",
        "rag"
    ],
    "misconception": "You need a dedicated vector database to use embeddings — pgvector adds vector search to PostgreSQL; for most applications it is sufficient without adding infrastructure complexity.",
    "why_it_matters": "Vector databases are the storage layer for RAG systems — without one, semantic search requires computing similarity against every stored embedding on every query, which is O(n) and unusable at scale.",
    "common_mistakes": [
        "Not creating an index on the vector column — without HNSW or IVF index, queries are O(n) exact search.",
        "Storing vectors as JSON strings — use the native vector type (pgvector's vector type) for efficient indexing.",
        "Not filtering by metadata before vector search — combining SQL WHERE clauses with vector search (hybrid search) dramatically reduces the search space.",
        "Choosing a hosted vector DB before trying pgvector — pgvector handles millions of vectors adequately and eliminates an extra service."
    ],
    "when_to_use": [],
    "avoid_when": [],
    "related": [
        "embeddings",
        "retrieval_augmented_generation",
        "full_text_search",
        "elasticsearch_basics"
    ],
    "prerequisites": [
        "embeddings",
        "retrieval_augmented_generation",
        "database_indexing"
    ],
    "refs": [
        "https://github.com/pgvector/pgvector",
        "https://qdrant.tech/documentation/"
    ],
    "bad_code": "// Exact search over all vectors — O(n), unusable at scale:\nSELECT id, content,\n       embedding <=> $1 AS distance\nFROM documents\nORDER BY distance\nLIMIT 10;\n-- No index: scans all rows, 1M docs = seconds per query",
    "good_code": "-- pgvector with HNSW index — O(log n) approximate search:\nCREATE EXTENSION IF NOT EXISTS vector;\nALTER TABLE documents ADD COLUMN embedding vector(1536);\nCREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);\n\n-- Hybrid search: filter then vector search:\nSELECT id, content, embedding <=> $1 AS distance\nFROM documents\nWHERE category = 'security'   -- SQL filter reduces search space\nORDER BY distance\nLIMIT 10;",
    "quick_fix": "Start with pgvector (PostgreSQL extension) if you already run Postgres — it avoids an extra service; move to a dedicated vector DB (Pinecone, Qdrant) only when you need ANN at scale",
    "severity": "medium",
    "effort": "medium",
    "created": "2026-03-15",
    "updated": "2026-03-22",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/vector_databases",
        "html_url": "https://codeclaritylab.com/glossary/vector_databases",
        "json_url": "https://codeclaritylab.com/glossary/vector_databases.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Vector Databases](https://codeclaritylab.com/glossary/vector_databases) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/vector_databases"
            }
        }
    }
}