
Lexing & Parsing

Compiler PHP 7.0+ Advanced
debt(d7/e5/b3/t7)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints indicate automated detection is 'no' and the code_pattern is 'regex used to parse PHP source code' — a competent reviewer must spot the regex-based parsing pattern. Tools like php-parser, phpstan, and rector can surface AST-related issues but won't automatically flag 'you should be using a real parser instead of regex' in an automated pipeline.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix says to use nikic/php-parser, but migrating from regex-based PHP parsing to a full AST-based approach typically requires rewriting the analysis/transformation logic to traverse AST nodes, update parent references, and handle byte offsets — this is more than a one-line swap and likely touches multiple files, though it is contained to the tooling component rather than being cross-cutting.

b3 Burden Structural debt — long-term weight of choosing wrong

Closest to 'localised tax' (b3). The applies_to scope is cli contexts only (tooling/static analysis work), meaning this choice does not affect runtime application code broadly. The burden is felt primarily within the code analysis or transformation component, leaving the rest of the codebase unaffected.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap (contradicts how a similar concept works elsewhere)' (t7). The misconception field directly states the canonical wrong belief: developers assume PHP source is executed directly, when in reality it is lexed → parsed → compiled → executed. This contradicts the mental model many developers carry from scripting or interpreted language contexts, and the common mistake of using regex to parse PHP (which cannot handle recursive structures) confirms this is a serious, systematic misunderstanding.


Also Known As

tokeniser, lexer, parser, AST, token_get_all

TL;DR

Two stages of language processing — the lexer converts source text to tokens, the parser converts tokens to an Abstract Syntax Tree representing the program's structure.

Explanation

Lexing (tokenisation) scans the source characters and groups them into tokens such as T_FUNCTION, T_STRING and T_WHITESPACE. PHP's token_get_all() returns whitespace and comments as tokens too (T_WHITESPACE, T_COMMENT, T_DOC_COMMENT); they are skipped before parsing, so they never become AST nodes. Parsing takes the token stream and builds an AST according to the language grammar, which is typically expressed as a context-free grammar. PHP's token_get_all() exposes the lexer output, and nikic/PHP-Parser builds a full AST in userland. Applications: PHPStan traverses the AST for type checking, Rector modifies the AST for code transformations, and PHP-CS-Fixer works on the token stream for style checking.
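To make the lexer output concrete, here is a minimal sketch that dumps the tokens for a one-line script. The input string and the variable names ($code, $names) are illustrative only. Note that token_get_all() itself keeps whitespace and comments as tokens; they are dropped only when the token stream is parsed.

```php
<?php
// Illustrative input; any PHP snippet works here.
$code = '<?php echo 1 + 2; // sum';

$names = [];
foreach (token_get_all($code) as $token) {
    if (is_array($token)) {
        // Array tokens are [token id, source text, line number].
        [$id, $text, $line] = $token;
        $names[] = token_name($id);
        printf("%-16s %s\n", token_name($id), json_encode($text));
    } else {
        // Single-character tokens such as ";" or "+" are plain strings.
        $names[] = $token;
        printf("%-16s %s\n", '(char)', json_encode($token));
    }
}
```

The listing should start with T_OPEN_TAG and include T_WHITESPACE and T_COMMENT entries, which is the information a style checker needs and a parser ignores.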

Common Misconception

Misconception: PHP source code is executed directly. In reality, PHP first lexes the source into tokens, parses the tokens into an AST, compiles the AST to opcodes, and only then executes the opcodes. The raw source text never runs directly.
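One way to observe that the whole source is processed before anything runs is to hand PHP a string whose first statement is valid and whose second is not. This is a sketch for demonstration only (eval() is used purely to keep it in one file, and $source, $outcome and $output are illustrative names): if PHP executed text line by line, the echo would run before the error was reached.

```php
<?php
// Illustrative source: the first statement is valid, the second is not.
$source = 'echo "first statement ran\n"; this is not valid PHP;';

ob_start();
try {
    eval($source);
    $outcome = 'executed';
} catch (ParseError $e) {
    // The parser rejected the whole string; no opcodes were produced.
    $outcome = 'rejected before execution';
}
$output = ob_get_clean();

var_dump($outcome, $output);
```

The captured output stays empty because the parse step fails before any statement is compiled, let alone executed.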

Why It Matters

Understanding lexing and parsing explains how PHPStan finds type errors before running code, why Rector can safely refactor thousands of files, and how you can build custom static analysis tools.

Common Mistakes

  • Parsing PHP with regex: regular expressions cannot handle recursive structures such as nested expressions.
  • Assuming the AST records formatting: it represents the structure of the program, so reformatting the source does not change the shape of the tree.
  • Treating token positions as character positions: they are byte offsets, which matters for PHP source containing multibyte characters.
  • Modifying AST nodes without updating their parent references, which leaves the tree inconsistent.
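The byte-offset point in the list above can be checked with a short sketch. It assumes PHP 8.0+ (for the PhpToken class) and UTF-8 source; the sample string and variable names are illustrative. The `\u{e9}` escape produces a two-byte "é", so every token after it sits one byte further along than its character position.

```php
<?php
// Assumption: PHP 8.0+ (PhpToken) and UTF-8 encoded source text.
$code = "<?php \$s = 'h\u{e9}llo'; echo \$s;";

$byteOffset = null;
$charOffset = null;
foreach (PhpToken::tokenize($code) as $token) {
    if ($token->is(T_ECHO)) {
        // PhpToken::$pos is the byte offset of the token in $code.
        $byteOffset = $token->pos;
        // Count UTF-8 characters before the token (the /u flag makes "." match one code point).
        $charOffset = preg_match_all('/./su', substr($code, 0, $byteOffset));
    }
}

printf("echo starts at byte %d, character %d\n", $byteOffset, $charOffset);
```

Using the byte offset with substr() is correct; passing it to a character-based function such as mb_substr() would point one character too far in this example.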

Code Examples

✗ Vulnerable
// Regex parsing of PHP: brittle and wrong.
preg_match_all('/function\s+(\w+)\s*\(/', $source, $matches);
$functions = $matches[1];
// Misses: closures and arrow functions (no name to capture), by-reference functions
// Cannot tell methods from plain functions
// False matches: 'function' inside comments, strings and heredocs
✓ Fixed
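// PHP-Parser: AST-based analysis (factory method shown is the PHP-Parser 5.x API).
use PhpParser\Error;
use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

$parser = (new ParserFactory())->createForNewestSupportedVersion();

try {
    // $sourceCode holds the PHP source as a string; parse() throws on syntax errors.
    $ast = $parser->parse($sourceCode);
} catch (Error $e) {
    fwrite(STDERR, 'Parse error: ' . $e->getMessage() . PHP_EOL);
    exit(1);
}

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node): void {
        // Named functions only. Methods are Node\Stmt\ClassMethod;
        // closures are Node\Expr\Closure and Node\Expr\ArrowFunction.
        if ($node instanceof Node\Stmt\Function_) {
            echo 'Found function: ' . $node->name->toString() . PHP_EOL;
        }
    }
});
$traverser->traverse($ast);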
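To see the regex failure mode concretely without installing anything, the sketch below runs the same pattern over a small hypothetical source string and compares it with what PHP's own lexer reports. The sample source and variable names are made up for illustration, and counting T_FUNCTION tokens is only a sanity check here, not a substitute for AST-based analysis.

```php
<?php
// Hypothetical input containing a comment, a string, a function and a closure.
$source = <<<'PHP'
<?php
// function legacy() was removed
$doc = 'function fake() {}';
function real(int $x) { return $x; }
$f = function ($y) { return $y; };
PHP;

// The regex also matches inside the comment and the string, and skips the closure.
preg_match_all('/function\s+(\w+)\s*\(/', $source, $matches);
$regexNames = $matches[1];

// The lexer classifies the comment and the string as single tokens,
// so only the two real "function" keywords are T_FUNCTION.
$functionKeywords = 0;
foreach (token_get_all($source) as $token) {
    if (is_array($token) && $token[0] === T_FUNCTION) {
        $functionKeywords++;
    }
}

print_r($regexNames);
echo $functionKeywords, PHP_EOL;
```

The regex reports three names, two of which do not exist as functions, while the lexer sees exactly two function keywords: the named function and the closure.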

Added 16 Mar 2026
Edited 5 Apr 2026
DEV INTEL: Tools & Severity
Severity: Info · Fix effort: High
Quick Fix
Use nikic/php-parser (which uses a real PHP lexer and parser) for any PHP code analysis or transformation. Never parse PHP with regex.
Applies To
PHP 7.0+ · any · cli
Detection Hints
Regex used to parse PHP source code; custom token parsing instead of the PHP-Parser AST
Auto-detectable: No · Tools: php-parser, phpstan, rector
AI Agent
Confidence: Low · False Positives: High · Manual fix · Fix: High · Context: File

