Search Query Parsing

search PHP 8.0+ Intermediate
debt(d7/e5/b5/t7)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints mark automated:no; phpstan/semgrep can flag the crude code_pattern (explode(' ', $query) feeding a search client, or json_decode of GET passed raw) but the semantic problems — missing field allowlist, no default operator, crash-on-malformed — are mostly invisible to static analysis and surface only under careful review or runtime testing of edge-case queries.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix requires building a quote- and operator-aware lexer, adding field allowlist validation, and routing all query construction through it — more than a parameterised swap; it's a focused refactor of the query layer touching the parser, validation, and engine-handoff code.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). applies_to spans web/api/library contexts and the parser sits between users and the search engine; every new searchable field, operator, or feature must flow through this parsing layer, so the design choice exerts ongoing reach across search-related work streams without quite defining the whole system shape.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The misconception — that query parsing is the same as index-time tokenisation — leads competent developers to conflate two separate stages, and common_mistakes show the 'obvious' whitespace-split approach shreds phrase searches and the naive pass-through exposes field-probing/DoS; the behaviour contradicts the intuitive 'just split and search' model.

Also Known As

query parser, query syntax parsing, search query tokenization

TL;DR

Transforming raw user search input into structured components - tokens, fields, operators, and phrases - before query execution.

Explanation

Search query parsing is the front-end stage of a search pipeline that turns a raw input string into a structured representation the search engine can execute. A user types something like `title:laptop "gaming rig" -refurbished price:<500`, and before any matching happens, the parser must split that string into meaningful units: free-text tokens, quoted phrases, field-scoped terms, boolean and negation operators, range expressions, and modifiers like fuzziness or boost. This is distinct from index-time tokenisation (stemming, lowercasing, stop-word removal): query parsing decides the shape of the query, not how documents were analysed, though the two must agree on analysis rules for terms to match.

A typical parser works in three stages. First, lexing splits the input into tokens while respecting quoting and escaping, so that a quoted phrase stays intact and a literal colon inside a value is not mistaken for a field separator. Next, operator recognition identifies syntax such as AND/OR/NOT, a leading minus for exclusion, field:value pairs, and comparison operators for ranges. Finally, the parser builds an abstract structure, often a tree of clauses, that maps cleanly onto the engine's query DSL (Elasticsearch bool/match queries, Meilisearch filters, or SQL conditions).

The hard parts are ambiguity and safety. Users mistype operators, leave quotes unbalanced, or inject control characters, so a robust parser degrades gracefully: an unmatched quote becomes literal text rather than a crash. Field extraction must validate against an allowlist of searchable fields, both to avoid surprising results and to prevent users from probing unindexed or sensitive fields. Many teams expose only a simplified query language to end users and reserve the full DSL for internal callers, because raw DSL passthrough is a common source of denial-of-service and information-disclosure issues.

Good parsing also normalises the input: trimming whitespace, collapsing repeated operators, and applying a default operator between terms (often AND, though some engines default to OR), so the executed query reflects clear intent. Done well, query parsing is invisible; done badly, it produces zero-result pages, confusing rankings, or security holes.
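The lexing stage described above can be sketched as a small character scanner. This is a minimal illustration, not a reference implementation: the function name `lexQuery` and the token shape (`type` and `value` keys) are assumptions made for this example, and only double quotes and backslash escapes are handled.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative quote- and escape-aware lexer (hypothetical helper).
 * Returns a flat list of ['type' => 'term'|'phrase', 'value' => string].
 */
function lexQuery(string $input): array
{
    $tokens = [];
    $buffer = '';
    $inQuotes = false;
    $length = strlen($input);

    // A byte-wise scan is safe for UTF-8 here because every delimiter is ASCII.
    for ($i = 0; $i < $length; $i++) {
        $ch = $input[$i];

        if ($ch === '\\' && $i + 1 < $length) {
            // A backslash makes the next character literal (e.g. \" or \:).
            $buffer .= $input[++$i];
        } elseif ($ch === '"') {
            // A quote closes the current token and toggles phrase mode.
            if ($buffer !== '') {
                $tokens[] = ['type' => $inQuotes ? 'phrase' : 'term', 'value' => $buffer];
                $buffer = '';
            }
            $inQuotes = !$inQuotes;
        } elseif (!$inQuotes && str_contains(" \t\r\n", $ch)) {
            // Whitespace ends a term only outside quotes.
            if ($buffer !== '') {
                $tokens[] = ['type' => 'term', 'value' => $buffer];
                $buffer = '';
            }
        } else {
            $buffer .= $ch;
        }
    }

    if ($buffer !== '') {
        if ($inQuotes) {
            // Unbalanced quote: degrade to plain terms instead of failing.
            foreach (preg_split('/\s+/', $buffer, -1, PREG_SPLIT_NO_EMPTY) as $word) {
                $tokens[] = ['type' => 'term', 'value' => $word];
            }
        } else {
            $tokens[] = ['type' => 'term', 'value' => $buffer];
        }
    }

    return $tokens;
}
```

For `title:laptop "gaming rig" -refurbished` this yields the term `title:laptop`, the phrase `gaming rig`, and the term `-refurbished`; recognising the field prefix and the leading minus is left to the operator-recognition stage. One known limitation of this sketch is that a field-scoped phrase such as `title:"gaming rig"` is split into two tokens.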

Common Misconception

Query parsing is the same as index-time tokenisation - in reality parsing structures the user's intent (fields, operators, phrases) while index tokenisation analyses document text; they are separate stages that only need to agree on analysis rules.

Why It Matters

A naive parser that passes raw input straight to the engine produces zero-result pages on a stray quote or operator, and can leak unindexed fields or trigger expensive queries, so robust parsing directly affects both usability and security.

Common Mistakes

  • Splitting on whitespace without respecting quotes, so a phrase search like "gaming rig" is shredded into separate terms.
  • Passing raw query DSL from users to the engine, exposing it to denial-of-service and field-probing attacks.
  • Not validating extracted field names against an allowlist, letting users query unindexed or sensitive fields.
  • Crashing on malformed input (unbalanced quotes, dangling operators) instead of degrading to a literal text search.
  • Forgetting to apply a default operator, so multi-term queries match unexpectedly broadly or narrowly.
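Three of the mistakes above (no field allowlist, crashing on dangling operators, no default operator) can be avoided in the step that turns tokens into clauses. The sketch below assumes tokens have already been lexed into plain strings; `SEARCHABLE_FIELDS`, `buildClauses`, and the returned array shape are hypothetical names chosen for illustration.

```php
<?php
declare(strict_types=1);

// Hypothetical allowlist; list only the fields your index actually exposes.
const SEARCHABLE_FIELDS = ['title', 'brand', 'price'];

/**
 * Turn already-lexed token strings into a clause structure with an
 * explicit default operator (illustrative helper, not a library API).
 */
function buildClauses(array $tokens, string $defaultOperator = 'AND'): array
{
    $query = ['operator' => $defaultOperator, 'must' => [], 'must_not' => []];

    foreach ($tokens as $token) {
        // A dangling operator carries no intent, so skip it rather than fail.
        if ($token === '' || $token === '-') {
            continue;
        }

        // A leading minus marks an exclusion.
        $bucket = 'must';
        if ($token[0] === '-') {
            $bucket = 'must_not';
            $token = substr($token, 1);
        }

        // Default: treat the whole token as free text.
        $clause = ['field' => null, 'value' => $token];

        // Accept field:value only when the field is on the allowlist.
        $parts = explode(':', $token, 2);
        if (count($parts) === 2 && $parts[1] !== ''
            && in_array($parts[0], SEARCHABLE_FIELDS, true)) {
            $clause = ['field' => $parts[0], 'value' => $parts[1]];
        }

        $query[$bucket][] = $clause;
    }

    return $query;
}
```

With this approach a token such as `password_hash:abc` is not rejected; it simply falls back to a free-text clause, so the user cannot target a field outside the allowlist. Whether to fall back or to return a validation error is a design choice that depends on the product.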

Code Examples

✗ Vulnerable
<?php
// Naive parser: split on spaces, pass straight to engine.
function parseQuery(string $input): array {
    $tokens = explode(' ', trim($input));
    // "gaming rig" becomes two tokens; title:laptop is one opaque token
    // -refurbished and quotes are ignored; no field validation
    return ['must' => $tokens];
}

// Worse: raw DSL passthrough from user input
$query = json_decode($_GET['q'], true);
$results = $client->search('products', $query); // DoS + field probing
✓ Fixed
Handle unbalanced quotes before tokenising so that malformed input degrades to a literal text search, as the explanation describes. A preliminary check such as `if (substr_count($input, '"') % 2 !== 0) { $input = str_replace('"', '', $input); }` strips the quotes when their count is odd.
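A possible fixed counterpart to the vulnerable `parseQuery` is sketched below. It is one way to apply the quick fix, not the only one: the `ALLOWED_FIELDS` list, the clause shape, and the commented-out `buildEngineQuery` call are assumptions for this example, and range expressions such as `price:<500` are not interpreted.

```php
<?php
declare(strict_types=1);

const ALLOWED_FIELDS = ['title', 'brand']; // hypothetical allowlist

function parseQuery(string $input): array
{
    $query = ['must' => [], 'must_not' => []];
    $input = trim($input);

    // Unbalanced quotes: drop them so the input degrades to literal text.
    if (substr_count($input, '"') % 2 !== 0) {
        $input = str_replace('"', '', $input);
    }

    // Optional "-", optional field prefix, then a quoted phrase or a bare word.
    $pattern = '/(-?)(?:(\w+):)?(?:"([^"]*)"|(\S+))/u';
    if (!preg_match_all($pattern, $input, $matches, PREG_SET_ORDER)) {
        return $query; // nothing usable (empty input or invalid UTF-8)
    }

    foreach ($matches as $m) {
        $word = $m[4] ?? '';            // group 4 is absent when a phrase matched
        $isPhrase = $word === '';
        $value = $isPhrase ? $m[3] : $word;
        if ($value === '') {
            continue;                   // skip empty phrases such as ""
        }

        $field = $m[2] !== '' ? $m[2] : null;
        if ($field !== null && !in_array($field, ALLOWED_FIELDS, true)) {
            // Unknown field: fold the prefix back into free text.
            $value = $field . ':' . $value;
            $field = null;
        }

        $bucket = $m[1] === '-' ? 'must_not' : 'must';
        $query[$bucket][] = ['field' => $field, 'value' => $value, 'phrase' => $isPhrase];
    }

    return $query;
}

// Treat the request parameter as plain text, never as JSON query DSL.
$raw = $_GET['q'] ?? '';
$parsed = parseQuery(is_string($raw) ? $raw : '');
// The engine request is then built server-side from $parsed, for example:
// $results = $client->search('products', buildEngineQuery($parsed)); // hypothetical helper
```

The key difference from the vulnerable version is that the user only ever supplies a string; the structure sent to the engine is assembled by application code from validated clauses.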

Added 9 Jun 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: Medium
⚡ Quick Fix
Tokenise with a quote- and operator-aware lexer, validate extracted field names against an allowlist, and never pass raw query DSL from end users to the engine.
📦 Applies To
PHP 8.0+ web api library
🔗 Prerequisites
🔍 Detection Hints
explode(' ', $query) or preg_split on whitespace feeding search; json_decode($_GET[...]) passed directly to search client; no field allowlist before query execution
Auto-detectable: ✗ No phpstan semgrep
⚠ Related Problems
🤖 AI Agent
Confidence: Medium · False Positives: Medium · ✗ Manual fix · Fix: Medium · Context: Function · Tests: Update
CWE-200 CWE-400
