
Lexing & Parsing

Compiler PHP 7.0+ Advanced

debt(d7/e5/b3/t7)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints mark automated detection as 'no' and the code_pattern is 'regex used to parse PHP source code', so a competent reviewer has to spot the regex-based parsing by eye. PHP-Parser, PHPStan, and Rector all work on the AST and can surface AST-related issues, but none of them flags 'you should be using a real parser instead of regex' as part of a pipeline.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix says to use nikic/php-parser, but migrating from regex-based PHP parsing to a full AST-based approach typically requires rewriting the analysis/transformation logic to traverse AST nodes, update parent references, and handle byte offsets — this is more than a one-line swap and likely touches multiple files, though it is contained to the tooling component rather than being cross-cutting.

b3 Burden Structural debt — long-term weight of choosing wrong

Closest to 'localised tax' (b3). The applies_to scope is cli contexts only (tooling/static analysis work), meaning this choice does not affect runtime application code broadly. The burden is felt primarily within the code analysis or transformation component, leaving the rest of the codebase unaffected.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap (contradicts how a similar concept works elsewhere)' (t7). The misconception field directly states the canonical wrong belief: developers assume PHP source is executed directly, when in reality it is lexed → parsed → compiled → executed. This contradicts the mental model many developers carry from scripting or interpreted language contexts, and the common mistake of using regex to parse PHP (which cannot handle recursive structures) confirms this is a serious, systematic misunderstanding.

About DEBT scoring → scored by claude-sonnet-4-6 · 2026-05-10 · reviewed by human

Also Known As

tokeniser lexer parser AST token_get_all

TL;DR

Two stages of language processing — the lexer converts source text to tokens, the parser converts tokens to an Abstract Syntax Tree representing the program's structure.

Explanation

Lexing (tokenisation): scans characters and groups them into meaningful tokens such as T_FUNCTION, T_STRING, and T_WHITESPACE. token_get_all() reports whitespace and comments as tokens of their own (T_WHITESPACE, T_COMMENT); they are skipped before the parser builds the tree, so they never become AST nodes (PHP-Parser keeps comments only as node attributes). Parsing: takes the token stream and builds an AST according to the language grammar (typically expressed as a context-free grammar). PHP's token_get_all() exposes the lexer output. nikic's PHP-Parser builds the full AST. Applications: PHPStan traverses the AST for type checking, Rector modifies the AST for code transformations, and PHP-CS-Fixer analyses the token stream for style checking.
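
To make the lexer stage concrete, here is a minimal sketch that dumps the token stream for a one-line snippet using only the built-in tokenizer extension. The input string and the `$names` array are illustrative choices, not part of any tool mentioned above.

```php
<?php
// Hypothetical input: one function declaration followed by a trailing comment.
$source = '<?php function add($a, $b) { return $a + $b; } // sum';

$names = [];
foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        // Array tokens have the shape [id, text, line]; token_name() maps the id to its T_* name.
        [$id, $text, $line] = $token;
        $names[] = token_name($id);
        printf("%-14s %s\n", token_name($id), json_encode($text));
    } else {
        // Single-character tokens such as ( ) { } ; + are returned as plain strings.
        $names[] = $token;
        printf("%-14s %s\n", 'char', json_encode($token));
    }
}
```

The dump includes T_WHITESPACE and T_COMMENT entries alongside T_FUNCTION and T_STRING, which is why token-level tools can reason about formatting while AST-level tools cannot.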

Common Misconception

✗ PHP source code is executed directly — PHP first lexes to tokens, parses to AST, compiles to opcodes, then executes — the raw source text never runs directly.

Why It Matters

Understanding lexing and parsing explains how PHPStan finds type errors before running code, why Rector can safely refactor thousands of files, and how you can build custom static analysis tools.

Common Mistakes

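Parsing PHP with regex: a regex has no notion of lexical context, so it matches inside strings, comments, and heredocs and breaks on nested structures.
Assuming the AST records formatting: it represents syntactic structure, so reformatting whitespace does not change the AST.
Treating token positions as character positions: they are byte offsets, which matters for multibyte PHP source.
Modifying AST nodes without updating stored parent references (in PHP-Parser these are attributes added by ParentConnectingVisitor and are not maintained automatically), which leaves the tree inconsistent.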
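
The byte-offset point in the list above can be checked with a small sketch. It assumes PHP 8.0+ (for the built-in `PhpToken` class) and a UTF-8 encoded source string; the sample string and variable names are hypothetical.

```php
<?php
// Sketch, assuming PHP 8.0+ and UTF-8 source: PhpToken::$pos is a byte offset.
$source = "<?php \$s = 'żółw'; echo \$s;";

$echoPos = null;
foreach (PhpToken::tokenize($source) as $token) {
    if ($token->is(T_ECHO)) {
        $echoPos = $token->pos; // byte offset of the 'echo' keyword
    }
}

// Count the characters (not bytes) that precede 'echo'; the /u flag makes '.' match one code point.
$charPos = preg_match_all('/./su', substr($source, 0, $echoPos));

printf("byte offset: %d, character offset: %d\n", $echoPos, $charPos);
```

The string literal contains three two-byte characters before `echo`, so the byte offset is three larger than the character offset. Slicing the source with `substr()` works with the token position directly, while a character-based function such as `mb_substr()` would need the converted value.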

Code Examples

✗ Vulnerable

// Regex parsing of PHP: brittle and wrong
preg_match_all('/function\s+(\w+)\s*\(/', $source, $matches);
$functions = $matches[1];
// Misses: closures, arrow functions, by-reference functions (function &name);
//         cannot tell methods from plain functions
// False matches: 'function name(' inside comments, strings, or heredocs

✓ Fixed

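// PHP-Parser: correct AST-based analysis
// (assumes nikic/php-parser 5.x installed via Composer and autoloaded)
use PhpParser\Error;
use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

$parser = (new ParserFactory())->createForNewestSupportedVersion();

try {
    $ast = $parser->parse($sourceCode);
} catch (Error $e) {
    // Syntax errors in the analysed source surface as PhpParser\Error
    exit('Parse error: ' . $e->getMessage() . PHP_EOL);
}

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node) {
        // Function_ covers named functions; methods are Node\Stmt\ClassMethod
        if ($node instanceof Node\Stmt\Function_) {
            echo 'Found function: ' . $node->name->toString() . PHP_EOL;
        }
        return null; // null leaves the node unchanged
    }
});
$traverser->traverse($ast);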
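
For comparison without any Composer dependency, the following sketch runs the regex from the first example and a token-based scan over the same input. The helper `findNamedFunctions()` and the sample source are hypothetical and only illustrate why lexical context matters; this is not a substitute for PHP-Parser.

```php
<?php
// Hypothetical input: one real function, plus "function" inside a comment and a string.
$source = <<<'PHP'
<?php
// function fakeInComment() {}
$text = 'function fakeInString() {}';
function realOne(int $x): int { return $x; }
$closure = function () { return 1; };
PHP;

// Regex approach: has no lexical context, so it also matches the comment and the string.
preg_match_all('/function\s+(\w+)\s*\(/', $source, $matches);
$regexNames = $matches[1];

// Token approach: a T_FUNCTION token followed by a T_STRING name,
// ignoring whitespace, comments, and a by-reference '&' in between.
function findNamedFunctions(string $source): array
{
    $tokens = token_get_all($source);
    $count  = count($tokens);
    $names  = [];
    for ($i = 0; $i < $count; $i++) {
        if (!is_array($tokens[$i]) || $tokens[$i][0] !== T_FUNCTION) {
            continue;
        }
        for ($j = $i + 1; $j < $count; $j++) {
            $t    = $tokens[$j];
            $text = is_array($t) ? $t[1] : $t;
            $skip = is_array($t) && in_array($t[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true);
            if ($skip || $text === '&') {
                continue;
            }
            if (is_array($t) && $t[0] === T_STRING) {
                $names[] = $t[1]; // named function or method
            }
            break; // anything else, e.g. '(', means a closure
        }
    }
    return $names;
}

print_r($regexNames);
print_r(findNamedFunctions($source));
```

The token scan is still only a sketch: it does not distinguish methods from functions, and it would report the imported name in a `use function` statement. Those cases are exactly what a full parser resolves.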

Tags

compiler php tooling

Added 16 Mar 2026

Edited 5 Apr 2026

Curated in Warsaw under one editorial standard. 1,506 terms, single voice. About this reference →


⚡ DEV INTEL Tools & Severity

🔵 Info ⚙ Fix effort: High

⚡ Quick Fix

Use nikic/php-parser (which uses a real PHP lexer+parser) for any PHP code analysis or transformation — never parse PHP with regex

📦 Applies To

PHP 7.0+ any cli

🔗 Prerequisites

Abstract Syntax Tree (AST) PHP Compilation Pipeline Bytecode VMs

🔍 Detection Hints

Regex used to parse PHP source code; custom token parsing instead of PHP-Parser AST

Auto-detectable: ✗ No php-parser phpstan rector

⚠ Related Problems

Abstract Syntax Tree (AST) PHP Compilation Pipeline Static Analysis

🤖 AI Agent

Confidence: Low False Positives: High ✗ Manual fix Fix: High Context: File

References

https://github.com/nikic/PHP-Parser