Lexing & Parsing
debt(d7/e5/b3/t7)
Closest to 'only careful code review or runtime testing' (d7). The detection_hints mark automated detection as 'no' and give the code_pattern as 'regex used to parse PHP source code', so a competent reviewer has to spot the regex-based parsing by eye. Tools such as PHP-Parser, PHPStan, and Rector can surface AST-related issues, but none of them will flag 'you should be using a real parser instead of regex'.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix says to use nikic/php-parser, but migrating from regex-based PHP parsing to a full AST-based approach typically requires rewriting the analysis/transformation logic to traverse AST nodes, update parent references, and handle byte offsets — this is more than a one-line swap and likely touches multiple files, though it is contained to the tooling component rather than being cross-cutting.
Closest to 'localised tax' (b3). The applies_to scope is cli contexts only (tooling/static analysis work), meaning this choice does not affect runtime application code broadly. The burden is felt primarily within the code analysis or transformation component, leaving the rest of the codebase unaffected.
Closest to 'serious trap (contradicts how a similar concept works elsewhere)' (t7). The misconception field directly states the canonical wrong belief: developers assume PHP source is executed directly, when in reality it is lexed → parsed → compiled → executed. This contradicts the mental model many developers carry from scripting or interpreted language contexts, and the common mistake of using regex to parse PHP (which cannot handle recursive structures) confirms this is a serious, systematic misunderstanding.
Also Known As
TL;DR
Explanation
Lexing (tokenisation) scans the source characters and groups them into tokens such as T_FUNCTION, T_STRING, and T_WHITESPACE. PHP's lexer emits whitespace and comments as tokens too; the parser skips them, so they do not appear as AST nodes. Parsing takes the token stream and builds an abstract syntax tree (AST) according to the language grammar, which is typically expressed as a context-free grammar. PHP's token_get_all() exposes the lexer output, and nikic/PHP-Parser builds the full AST. Applications: PHPStan traverses the AST for type checking, Rector modifies the AST for code transformations, and PHP-CS-Fixer analyses the token stream for style checking.
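To make the lexer output concrete, here is a minimal sketch that runs token_get_all() on a one-line snippet and collects the token names. The snippet and the $names variable are illustrative only; the point is that whitespace and comments show up as tokens at this stage.

```php
<?php
// Tokenize a tiny snippet and list what the lexer produces.
$source = '<?php function add($a, $b) { return $a + $b; } // sum';

$names = [];
foreach (token_get_all($source) as $token) {
    // Multi-character tokens arrive as [id, text, line];
    // single-character tokens such as "(" or ";" arrive as plain strings.
    $names[] = is_array($token) ? token_name($token[0]) : $token;
}

echo implode(' ', $names), PHP_EOL;
```

The printed list contains T_WHITESPACE and T_COMMENT entries alongside T_FUNCTION and T_STRING, which is why token-level tools can reason about formatting while AST-level tools generally do not.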
Common Misconception
Why It Matters
Common Mistakes
- Parsing PHP with regex: regular expressions cannot reliably handle recursive structures such as nested expressions.
- Not understanding that the AST represents code structure, not formatting: reformatting whitespace does not change the tree's nodes.
- Treating token positions as character positions: they are byte offsets, which matters for multibyte PHP source.
- Modifying AST nodes without updating their parent references: the tree becomes inconsistent.
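The byte-offset point from the list above can be sketched with PHP's built-in PhpToken class (PHP 8.0+), whose pos property is a byte offset into the source. The sample source and the variable names are assumptions for illustration; the character count uses a UTF-8 regex so the sketch does not depend on the mbstring extension.

```php
<?php
// A two-byte UTF-8 character ("é", written as an escape) precedes the token we locate.
$source = "<?php \$s = '\u{00E9}'; foo();";

$bytePos = null;
$charPos = null;
foreach (PhpToken::tokenize($source) as $token) {
    if ($token->is(T_STRING) && $token->text === 'foo') {
        // PhpToken::$pos counts bytes from the start of the source.
        $bytePos = $token->pos;
        // Count UTF-8 characters in the prefix to get the character position.
        $charPos = preg_match_all('/./su', substr($source, 0, $bytePos));
    }
}

echo "byte offset: $bytePos, character offset: $charPos", PHP_EOL;
```

The two numbers differ by one here because of the single two-byte character. Byte-oriented functions such as substr() line up with token positions, while character-oriented ones would be off by the number of extra bytes before the token.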
Code Examples
// Regex parsing of PHP: brittle and wrong
preg_match_all('/function\s+(\w+)\s*\(/', $source, $matches);
$functions = $matches[1];
// Misses: closures, arrow functions, by-reference functions (function &name)
// False positives: 'function name(' inside comments, strings, and heredocs
// Cannot tell methods from top-level functions or follow nested structures
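// PHP-Parser: correct AST-based analysis (requires nikic/php-parser via Composer)
use PhpParser\Error;
use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

$parser = (new ParserFactory())->createForNewestSupportedVersion(); // PHP-Parser 5.x API
try {
    // parse() throws PhpParser\Error on invalid syntax and may return null
    $ast = $parser->parse($source) ?? [];
} catch (Error $e) {
    exit('Parse error: ' . $e->getMessage() . PHP_EOL);
}

$traverser = new NodeTraverser();
$traverser->addVisitor(new class extends NodeVisitorAbstract {
    public function enterNode(Node $node)
    {
        // Function_ matches named function declarations, not methods or closures
        if ($node instanceof Node\Stmt\Function_) {
            echo 'Found function: ' . $node->name->toString() . PHP_EOL;
        }
        return null; // leave the node unchanged
    }
});
$traverser->traverse($ast);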
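Where pulling in PHP-Parser is not an option, the built-in tokenizer already avoids the worst regex failures. The following is a rough sketch, not a substitute for an AST: findNamedFunctions() is a hypothetical helper name, it relies on PhpToken (PHP 8.0+), and it still cannot tell methods from top-level functions.

```php
<?php
/** Hypothetical helper: names that directly follow the `function` keyword. */
function findNamedFunctions(string $source): array
{
    $tokens = PhpToken::tokenize($source);
    $names = [];
    foreach ($tokens as $i => $token) {
        if (!$token->is(T_FUNCTION)) {
            continue;
        }
        // Skip whitespace, comments, and a by-reference '&' after the keyword.
        for ($j = $i + 1; isset($tokens[$j]); $j++) {
            $next = $tokens[$j];
            if ($next->isIgnorable() || $next->text === '&') {
                continue;
            }
            if ($next->is(T_STRING)) {
                $names[] = $next->text; // a named declaration
            }
            break; // anything else (e.g. "(") means a closure: no name
        }
    }
    return $names;
}

$source = <<<'PHP'
<?php
// function inComment() {}
$text = 'function inString() {}';
function real() {}
function &byRef() {}
$closure = function () {};
PHP;

print_r(findNamedFunctions($source));
```

On this sample the helper returns real and byRef, whereas the regex from the first example would report inComment, inString, and real. The lexer has already classified comments and string literals as single tokens, so the word "function" inside them is never seen as a keyword.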