{
    "slug": "search_query_parsing",
    "term": "Search Query Parsing",
    "category": "search",
    "difficulty": "intermediate",
    "short": "Transforming raw user search input into structured components - tokens, fields, operators, and phrases - before query execution.",
    "long": "Search query parsing is the front-end stage of a search pipeline that turns a raw input string into a structured representation the search engine can execute. A user types something like `title:laptop \"gaming rig\" -refurbished price:<500`, and before any matching happens, the parser must split that string into meaningful units: free-text tokens, quoted phrases, field-scoped terms, boolean and negation operators, range expressions, and modifiers like fuzziness or boost. This is distinct from index-time tokenisation (stemming, lowercasing, stop-word removal) - query parsing decides the shape of the query, not how documents were analysed, though the two must agree on analysis rules to match correctly. A typical parser works in stages. First, lexing splits the input into tokens while respecting quoting and escaping so that a quoted phrase stays intact and a literal colon inside a value is not mistaken for a field separator. Next, operator recognition identifies syntax such as AND/OR/NOT, leading minus for exclusion, field:value pairs, and comparison operators for ranges. Finally, the parser builds an abstract structure - often a tree of clauses - that maps cleanly onto the engine's query DSL (Elasticsearch bool/match queries, Meilisearch filters, or SQL conditions). The hard parts are ambiguity and safety. Users mistype operators, leave quotes unbalanced, or inject control characters, so a robust parser degrades gracefully: an unmatched quote becomes literal text rather than a crash. Field extraction must validate against an allowlist of searchable fields, both to avoid surprising results and to prevent users from probing unindexed or sensitive fields. Many teams expose only a simplified query language to end users and reserve the full DSL for internal callers, because raw DSL passthrough is a common source of denial-of-service and information-disclosure issues. Good parsing also normalises input - trimming whitespace, collapsing repeated operators, and applying an explicit default operator (often AND between terms, although Lucene's classic query parser and Elasticsearch's query-string queries default to OR) - so the executed query reflects clear intent. Done well, query parsing is invisible; done badly, it produces zero-result pages, confusing rankings, or security holes.",
    "aliases": [
        "query parser",
        "query syntax parsing",
        "search query tokenization"
    ],
    "tags": [
        "search",
        "query-parsing",
        "tokenization",
        "operators",
        "field-extraction"
    ],
    "misconception": "Query parsing is the same as index-time tokenisation - in reality parsing structures the user's intent (fields, operators, phrases) while index tokenisation analyses document text; they are separate stages that only need to agree on analysis rules.",
    "why_it_matters": "A naive parser that passes raw input straight to the engine produces zero-result pages on a stray quote or operator, and can leak unindexed fields or trigger expensive queries, so robust parsing directly affects both usability and security.",
    "common_mistakes": [
        "Splitting on whitespace without respecting quotes, so a phrase search like \"gaming rig\" is shredded into separate terms.",
        "Passing raw query DSL from users to the engine, exposing it to denial-of-service and field-probing attacks.",
        "Not validating extracted field names against an allowlist, letting users query unindexed or sensitive fields.",
        "Crashing on malformed input (unbalanced quotes, dangling operators) instead of degrading to a literal text search.",
        "Forgetting to apply a default operator, so multi-term queries match unexpectedly broadly or narrowly."
    ],
    "when_to_use": [],
    "avoid_when": [],
    "related": [
        "search_relevance",
        "inverted_index",
        "faceted_search",
        "fuzzy_search"
    ],
    "prerequisites": [
        "inverted_index"
    ],
    "refs": [
        "https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html",
        "https://lucene.apache.org/core/current/queryparser/org/apache/lucene/queryparser/classic/package-summary.html",
        "https://www.meilisearch.com/docs/learn/filtering_and_sorting/filter_expression_reference"
    ],
    "bad_code": "<?php\n// Naive parser: split on spaces, pass straight to engine.\nfunction parseQuery(string $input): array {\n    $tokens = explode(' ', trim($input));\n    // \"gaming rig\" becomes two tokens; title:laptop is one opaque token\n    // -refurbished and quotes are ignored; no field validation\n    return ['must' => $tokens];\n}\n\n// Worse: raw DSL passthrough from user input\n$query = json_decode($_GET['q'], true);\n$results = $client->search('products', $query); // DoS + field probing",
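    "good_code": "<?php\n// Minimal sketch, not a full grammar. ALLOWED_FIELDS and the clause shape\n// are illustrative assumptions; map the result onto the engine DSL server-side.\nconst ALLOWED_FIELDS = ['title', 'brand', 'price'];\n\nfunction parseQuery(string $input): array {\n    $input = trim($input);\n    // Unbalanced quotes: strip them and fall back to literal text.\n    if (substr_count($input, '\"') % 2 !== 0) {\n        $input = str_replace('\"', '', $input);\n    }\n    $query = ['must' => [], 'must_not' => []];\n    // Token = optional leading minus, then a quoted phrase or a bare term.\n    preg_match_all('/(-?)(?:\"([^\"]*)\"|([^\\s\"]+))/', $input, $tokens, PREG_SET_ORDER);\n    foreach ($tokens as $t) {\n        $bucket = $t[1] === '-' ? 'must_not' : 'must';\n        $term = $t[3] ?? '';\n        if ($term === '') {\n            // Quoted phrase stays intact; skip empty quotes.\n            if (trim($t[2]) !== '') {\n                $query[$bucket][] = ['phrase' => $t[2]];\n            }\n        } elseif (preg_match('/^(\\w+):(.+)$/', $term, $fv)\n            && in_array(strtolower($fv[1]), ALLOWED_FIELDS, true)) {\n            $query[$bucket][] = ['field' => strtolower($fv[1]), 'value' => $fv[2]];\n        } else {\n            // Plain word or non-allowlisted field: treat as literal text.\n            $query[$bucket][] = ['text' => $term];\n        }\n    }\n    return $query; // 'must' clauses are combined with AND by the caller\n}",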
    "quick_fix": "Tokenise with a quote- and operator-aware lexer, validate extracted field names against an allowlist, and never pass raw query DSL from end users to the engine.",
    "severity": "medium",
    "effort": "medium",
    "created": "2026-06-09",
    "updated": "2026-06-09",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/search_query_parsing",
        "html_url": "https://codeclaritylab.com/glossary/search_query_parsing",
        "json_url": "https://codeclaritylab.com/glossary/search_query_parsing.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Search Query Parsing](https://codeclaritylab.com/glossary/search_query_parsing) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/search_query_parsing"
            }
        }
    }
}