
DOMDocument & XPath in PHP

PHP PHP 5.0+ Intermediate
debt(d7/e3/b3/t5)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The detection_hints.tools field is not specified. The common mistakes — flooding logs with warnings, getElementById failing silently, XPath 1-indexing returning wrong nodes, namespace misuse — all produce silent wrong results or runtime surprises rather than compile-time or linter-caught errors. No standard PHP linter catches these patterns automatically; they surface only during testing or in production logs.

e3 Effort Remediation debt — work required to fix once spotted

Closest to 'simple parameterised fix' (e3). The quick_fix is a two-call suppression pattern (libxml_use_internal_errors(true) + libxml_clear_errors()), and other fixes like switching to XPath from getElementById or adding namespace prefixes are small targeted changes within one component. No cross-file refactoring is required for most of these corrections.

b3 Burden Structural debt — long-term weight of choosing wrong

Closest to 'localised tax' (b3). DOMDocument usage is typically confined to the parsing/scraping layer of an application — one component pays the ergonomic cost (verbose API, namespace quirks, memory for large docs) while the rest of the codebase remains unaffected. The applies_to scope covers web and cli contexts but the structural weight stays within the component using it.

t5 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'notable trap (a documented gotcha most devs eventually learn)' (t5). The misconception is that DOMDocument is XML-only and unsuitable for HTML — causing developers to reach for regex instead. Additionally, XPath 1-indexing (positions start at 1, not 0) directly contradicts PHP array and most developer intuitions, and the warning-flooding behavior on malformed HTML surprises most first-time users. These are documented gotchas but not catastrophic in the sense of always producing wrong output silently.


Also Known As

DOMDocument, DOMXPath, XPath PHP, PHP HTML parser, PHP XML parser

TL;DR

PHP's DOMDocument class (part of the DOM extension) parses HTML and XML into a traversable tree. Combined with DOMXPath, it lets you query the document with XPath expressions, which is far more reliable than regex for extracting data from HTML.

Explanation

DOMDocument implements the W3C DOM specification, the same tree model browsers expose. loadHTML() parses HTML (including malformed HTML, within limits); loadXML() parses strict XML and fails on documents that are not well-formed. Once loaded, you traverse the tree via getElementById(), getElementsByTagName(), and childNodes. DOMXPath wraps the document with an XPath 1.0 engine: $xpath->query('//div[@class="product"]/span[@class="price"]') returns a DOMNodeList of all matching nodes. For web scraping, extracting configuration from XML, or processing HTML email templates, DOMDocument plus XPath is an appropriate tool. For very large XML files, XMLReader streams the document instead of loading it all into memory.
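The two access styles described above can be sketched side by side. This is a minimal illustration, not a reference implementation: the markup is a made-up sample, and the class names `product`, `name` and `price` are assumptions chosen to match the XPath expression in the text.

```php
<?php
// Hypothetical sample markup, not taken from any real page.
$html = '<html><body>'
      . '<div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>'
      . '<div class="product"><span class="name">Gadget</span><span class="price">19.50</span></div>'
      . '</body></html>';

// Collect parser messages internally, then restore the previous setting.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Style 1: manual traversal with getElementsByTagName() and childNodes.
$names = [];
foreach ($dom->getElementsByTagName('div') as $div) {
    foreach ($div->childNodes as $child) {
        // childNodes can also contain text nodes, so filter for elements.
        if ($child instanceof DOMElement && $child->getAttribute('class') === 'name') {
            $names[] = $child->textContent;
        }
    }
}

// Style 2: one XPath 1.0 expression through DOMXPath.
$xpath = new DOMXPath($dom);
$prices = [];
foreach ($xpath->query('//div[@class="product"]/span[@class="price"]') as $node) {
    $prices[] = $node->textContent;
}

print_r($names);
print_r($prices);
```

The traversal version needs an explicit element check and attribute comparison per node, while the XPath version states the same selection in a single expression.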

Common Misconception

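Misconception: DOMDocument is only for XML. In fact, loadHTML() parses HTML documents, including malformed ones, using libxml2's error-tolerant HTML parser. That parser does not implement the full HTML5 parsing algorithm, so its recovery can differ from a browser's, but it is well suited to web scraping and HTML manipulation tasks.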
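As a rough illustration of that tolerance, the sketch below feeds loadHTML() a deliberately broken fragment. The markup is invented for the example, and the exact recovery behaviour is an assumption that depends on the installed libxml2 version.

```php
<?php
// Deliberately malformed, made-up markup: upper-case tag, unclosed <li> and <p>.
$html = '<UL><li>One<li>Two<li>Three</UL><p>Unclosed paragraph';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();   // whatever the parser complained about, if anything
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser is expected to close each <li> when the next one opens,
// so the items come out as separate elements.
$items = [];
foreach ($dom->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}

print_r($items);
echo count($errors) . " parser message(s) collected" . PHP_EOL;
```

Tag names are normalised to lower case during parsing, which is why the lookup uses 'li' even though the source wrote `<UL>`.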

Why It Matters

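Parsing HTML with regex is a well-known anti-pattern: HTML is nested and loosely written, so a pattern that works on one page tends to break on the next. DOMDocument handles the messy reality of real-world HTML, including unclosed tags, mixed case, and malformed attributes. XPath queries are concise and readable, and because they operate on the parsed tree rather than on raw text, they stay correct when attributes are reordered or elements are nested.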

Common Mistakes

  • Not calling libxml_use_internal_errors(true) — DOMDocument emits PHP warnings for every HTML quirk in real-world pages, flooding your logs.
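  • Relying on getElementById() when no attribute is declared as an ID: for XML loaded with loadXML(), an id attribute only counts as an ID if a DTD declares it and the document is validated, or if it is marked with setIdAttribute(). Otherwise the call returns null. Use an XPath query such as //*[@id="main"] or getElementsByTagName() instead.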
  • Loading very large HTML/XML files into DOMDocument — the entire document is held in memory; use XMLReader for streaming large files.
  • Forgetting that XPath is 1-indexed — XPath node positions start at 1, not 0: //li[1] is the first item, not //li[0].
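  • Forgetting to register a prefix for a default XML namespace: unprefixed elements in a namespaced document still need an explicit prefix in your XPath query, bound with DOMXPath::registerNamespace(). (registerNodeNS is only a boolean parameter of query(), not a way to bind prefixes.)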
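The last two mistakes in the list above are easy to reproduce. The sketch below assumes a made-up Atom-style feed; the prefix `a` is an arbitrary choice for the example and has no special meaning.

```php
<?php
// Made-up feed that uses a default (unprefixed) namespace.
$xml = <<<XML
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// An unprefixed name in XPath 1.0 means "no namespace", so this matches nothing.
$unprefixed = $xpath->query('//entry')->length;

// Bind an arbitrary prefix to the namespace URI, then use it in every step.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$prefixed = $xpath->query('//a:entry')->length;

// Positions start at 1: [1] is the first entry, [0] never matches.
$first  = $xpath->query('//a:entry[1]/a:title')->item(0)->textContent;
$zeroth = $xpath->query('//a:entry[0]')->length;

var_dump($unprefixed, $prefixed, $first, $zeroth);
```

Note the contrast with DOMNodeList::item(), which is 0-indexed like a PHP array. Also keep in mind that //li[1] selects every li that is the first li child of its own parent; to get only the first match in the whole document, wrap the path in parentheses, as in (//li)[1].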

Code Examples

✗ Vulnerable
<?php
// ❌ Regex on HTML — breaks on nested tags, attributes, variations
$html = file_get_contents('https://example.com/products');
preg_match_all('/<span class="price">(.*?)<\/span>/s', $html, $matches);
// Breaks if: class has extra whitespace, tag has other attributes, nested spans exist
✓ Fixed
<?php
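// ✅ DOMDocument + XPath: parse the markup into a tree, then query it
$html = file_get_contents('https://example.com/products');

libxml_use_internal_errors(true); // Collect malformed-HTML warnings instead of emitting them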

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);

$xpath = new DOMXPath($dom);

// XPath: all span elements with class containing 'price'
$prices = $xpath->query('//span[contains(@class, "price")]');

foreach ($prices as $node) {
    echo trim($node->textContent) . PHP_EOL;
}

libxml_clear_errors();

// Modifying HTML
$title = $dom->getElementsByTagName('title')->item(0);
if ($title) {
    $title->textContent = 'New Title';
}
echo $dom->saveHTML();
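
For the large-file case mentioned under Common Mistakes, a streaming read with XMLReader might look like the sketch below. The catalogue string and the `item` and `id` names are invented for illustration. A real script would more likely call open() on a file path, whereas XML() is used here only to keep the example self-contained.

```php
<?php
// Made-up catalogue standing in for a file too large to load as a tree.
$xml = '<items><item id="1">Alpha</item><item id="2">Beta</item><item id="3">Gamma</item></items>';

$reader = new XMLReader();
$reader->XML($xml);

// read() advances one node at a time; only the current node is materialised.
$seen = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        $seen[$reader->getAttribute('id')] = $reader->readString();
    }
}
$reader->close();

print_r($seen);
```

The trade-off is that XMLReader moves forward only and has no XPath support, so it suits sequential extraction rather than ad hoc queries.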

Added 23 Mar 2026
Edited 18 Apr 2026
DEV INTEL Tools & Severity
⚙ Fix effort: Medium
⚡ Quick Fix
Suppress DOMDocument's warnings about malformed HTML with libxml_use_internal_errors(true) + libxml_clear_errors() — they are usually ignorable for real-world HTML scraping.
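One way to package the quick fix is a small wrapper that also restores the previous libxml error setting, so the suppression does not leak into unrelated code. `loadHtmlQuietly` is a hypothetical helper name, not part of PHP, and the sketch assumes PHP 7.1 or later for the return type and array destructuring.

```php
<?php
/**
 * Hypothetical helper: load HTML without emitting warnings.
 * Returns the document plus any LibXMLError objects the parser collected.
 */
function loadHtmlQuietly(string $html): array
{
    // Switch to internal error collection and remember the old setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // Grab the collected errors before clearing the buffer.
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Put the global setting back the way we found it.
    libxml_use_internal_errors($previous);

    return [$dom, $errors];
}

// Made-up fragment with unclosed <p> and <span> tags.
[$dom, $errors] = loadHtmlQuietly('<div><p>Unclosed<span>text</div>');
$text = $dom->getElementsByTagName('span')->item(0)->textContent;

echo $text . PHP_EOL;
```

Returning the errors instead of discarding them lets callers log them at a low level when a scrape produces unexpected results.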
📦 Applies To
PHP 5.0+ web cli