DOMDocument & XPath in PHP
debt(d7/e3/b3/t5)
Closest to 'only careful code review or runtime testing' (d7). The detection_hints.tools field is not specified. The common mistakes — flooding logs with warnings, getElementById failing silently, XPath 1-indexing returning wrong nodes, namespace misuse — all produce silent wrong results or runtime surprises rather than compile-time or linter-caught errors. No standard PHP linter catches these patterns automatically; they surface only during testing or in production logs.
Closest to 'simple parameterised fix' (e3). The quick_fix is a two-call suppression pattern (libxml_use_internal_errors(true) + libxml_clear_errors()), and other fixes like switching to XPath from getElementById or adding namespace prefixes are small targeted changes within one component. No cross-file refactoring is required for most of these corrections.
Closest to 'localised tax' (b3). DOMDocument usage is typically confined to the parsing/scraping layer of an application — one component pays the ergonomic cost (verbose API, namespace quirks, memory for large docs) while the rest of the codebase remains unaffected. The applies_to scope covers web and cli contexts but the structural weight stays within the component using it.
Closest to 'notable trap (a documented gotcha most devs eventually learn)' (t5). The misconception is that DOMDocument is XML-only and unsuitable for HTML, which leads developers to reach for regex instead. In addition, XPath positions start at 1, not 0, which contradicts PHP's 0-based arrays and most developers' intuition, and the flood of warnings on malformed HTML surprises most first-time users. These are documented gotchas, but they are not catastrophic in the sense of always producing wrong output silently.
Also Known As
TL;DR
Explanation
DOMDocument implements the W3C DOM specification, the same tree model browsers use. loadHTML() parses HTML and tolerates a good deal of malformed markup; loadXML() requires well-formed XML. Once a document is loaded, you traverse the tree with getElementById(), getElementsByTagName(), and the childNodes property. DOMXPath wraps the document with an XPath 1.0 engine: $xpath->query('//div[@class="product"]/span[@class="price"]') returns a DOMNodeList of all matching nodes. Note that @class="product" is an exact string comparison, so it does not match an element with class="product featured". For web scraping, extracting configuration from XML, or processing HTML email templates, DOMDocument plus XPath is the right tool. For very large XML files, XMLReader streams the document instead of loading it all into memory.
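As a minimal sketch of the load, traverse, and query flow described above: the markup, class names, and the "note" id below are hypothetical stand-ins for a fetched page, not taken from any real site.

```php
<?php
// Hypothetical markup standing in for a fetched product page.
$html = <<<'HTML'
<!DOCTYPE html>
<html><head><title>Shop</title></head>
<body>
  <div class="product"><span class="name">Mug</span><span class="price">4.50</span></div>
  <div class="product"><span class="name">Pen</span><span class="price">1.20</span></div>
  <p id="note">Prices in EUR</p>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();              // discard whatever the parser collected

// Tree traversal with the plain DOM API.
$note     = $dom->getElementById('note');
$divCount = $dom->getElementsByTagName('div')->length;

// The same query shape as in the text, run through DOMXPath.
$xpath  = new DOMXPath($dom);
$prices = [];
foreach ($xpath->query('//div[@class="product"]/span[@class="price"]') as $span) {
    $prices[] = trim($span->textContent);
}
```

Here $prices ends up holding the two price strings in document order, and $note is the paragraph element, since id attributes are recognised in documents loaded with loadHTML().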
Common Misconception
Why It Matters
Common Mistakes
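- Not calling libxml_use_internal_errors(true) before loadHTML(): DOMDocument emits a PHP warning for every HTML quirk in real-world pages, flooding your logs. Pair it with libxml_clear_errors() once parsing is done.
- Relying on getElementById() in a document loaded with loadXML(): unless a DTD declares the attribute as type ID (or you mark it with setIdAttribute()), a plain id attribute is not recognised and the call returns null. Use an XPath query such as //*[@id="main"] or getElementsByTagName() instead.
- Loading very large HTML/XML files into DOMDocument: the entire document is held in memory. Use XMLReader to stream large XML files.
- Forgetting that XPath is 1-indexed: node positions start at 1, not 0, so //li[0] matches nothing. Also note that //li[1] selects every li that is the first li child of its parent; use (//li)[1] for the first li in the whole document.
- Skipping DOMXPath::registerNamespace() for documents with a default XML namespace: unprefixed elements in a namespaced document still need an explicit prefix in your XPath query, so register a prefix for the namespace URI and use it in every step.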
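Three of the mistakes above (ID lookup in XML, default namespaces, and 1-based positions) can be reproduced in a few lines. This is a sketch under stated assumptions: the Atom-like feed and the prefix "a" are made up for illustration, and any prefix bound to the same namespace URI would work.

```php
<?php
// Hypothetical Atom-like feed: every element lives in a default namespace.
$xml = <<<'XML'
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry id="a"><title>First</title></entry>
  <entry id="b"><title>Second</title></entry>
</feed>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// No DTD declares "id" as an ID attribute, so the lookup finds nothing.
$byId = $doc->getElementById('b');                      // null

// Unprefixed names in XPath 1.0 mean "no namespace", so this matches nothing.
$unprefixed = $xpath->query('//entry')->length;         // 0

// Bind an arbitrary prefix to the default namespace URI, then use it in every step.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$prefixed = $xpath->query('//a:entry')->length;         // 2

// Attribute tests need no prefix: unprefixed attributes are in no namespace.
$entryB = $xpath->query('//a:entry[@id="b"]')->item(0);

// Positions are 1-based: [1] is the first entry, [0] never matches.
$first = $xpath->query('/a:feed/a:entry[1]/a:title')->item(0)->textContent;  // "First"
$zero  = $xpath->query('/a:feed/a:entry[0]')->length;                        // 0
```

The attribute query on $entryB is the XPath replacement for the failed getElementById() call.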
Code Examples
<?php
// ❌ Regex on HTML — breaks on nested tags, attributes, variations
$html = file_get_contents('https://example.com/products');
preg_match_all('/<span class="price">(.*?)<\/span>/s', $html, $matches);
// Breaks if: class has extra whitespace, tag has other attributes, nested spans exist
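<?php
// ✅ DOMDocument + XPath: correct HTML parsing
$html = file_get_contents('https://example.com/products');

// Collect libxml errors internally instead of emitting a PHP warning per HTML quirk
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
libxml_clear_errors();                  // Free any collected parse errors
libxml_use_internal_errors($previous);  // Restore the caller's setting

$xpath = new DOMXPath($dom);

// Token match on the class list: matches class="price" and class="price sale",
// but not class="price-old", which a bare contains(@class, "price") would also hit
$prices = $xpath->query('//span[contains(concat(" ", normalize-space(@class), " "), " price ")]');
foreach ($prices as $node) {
    echo trim($node->textContent) . PHP_EOL;
}

// Modifying HTML
$title = $dom->getElementsByTagName('title')->item(0);
if ($title) {
    $title->textContent = 'New Title';
}
echo $dom->saveHTML();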
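For XML too large to hold in memory, the XMLReader alternative mentioned earlier might look roughly like the sketch below. The catalog structure, the sku attribute, and the inline string are assumptions for illustration; with a real file you would call $reader->open() with its path instead of XML().

```php
<?php
// Hypothetical catalogue; a real large file would be opened with $reader->open($path).
$xml = '<catalog>'
     . '<item sku="A1"><price>4.50</price></item>'
     . '<item sku="B2"><price>1.20</price></item>'
     . '</catalog>';

$reader = new XMLReader();
$reader->XML($xml);

$prices = [];
// read() advances one node at a time, so only the current node is held in memory.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        $sku = $reader->getAttribute('sku');
        // expand() builds a DOM subtree for this single element only.
        $node = $reader->expand();
        $prices[$sku] = trim($node->textContent);
    }
}
$reader->close();
```

The trade-off is that the reader only moves forward: there is no whole-document XPath, so each record is handled as it streams past.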