DOMDocument & XPath in PHP
Also Known As
DOMDocument
DOMXPath
XPath PHP
PHP HTML parser
PHP XML parser
TL;DR
PHP's DOM extension parses HTML and XML into a traversable tree through the DOMDocument class. Combined with DOMXPath, you can query that tree with XPath expressions, which is far more reliable than regex for extracting data from HTML.
Explanation
DOMDocument implements the W3C DOM API and builds the same kind of node tree that browsers use. loadHTML() parses HTML and tolerates much malformed markup; loadXML() parses XML and fails on documents that are not well-formed. Once loaded, you traverse the tree via getElementById(), getElementsByTagName(), and childNodes. DOMXPath wraps the document with an XPath 1.0 engine: $xpath->query('//div[@class="product"]/span[@class="price"]') returns a DOMNodeList of all matching nodes. For web scraping, extracting configuration from XML, or processing HTML email templates, DOMDocument with XPath is usually the right tool. For very large XML files, XMLReader streams the document instead of loading it all into memory.
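As a rough illustration of the two access styles described above, the sketch below loads a small document, walks childNodes with the plain DOM API, and then runs the XPath expression from the text. The markup, the id "main", and the variable names are made up for the example.

```php
<?php
// Hypothetical sample markup; ids and class names are illustrative only.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head><title>Shop</title></head>
<body>
  <div id="main">
    <div class="product"><span class="name">Mug</span><span class="price">4.50</span></div>
    <div class="product"><span class="name">Pen</span><span class="price">1.20</span></div>
  </div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // keep any parser complaints out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Tree traversal with the plain DOM API.
$main = $dom->getElementById('main');
$tagNames = [];
foreach ($main->childNodes as $child) {
    // childNodes can include whitespace text nodes, so keep elements only.
    if ($child instanceof DOMElement) {
        $tagNames[] = $child->tagName;
    }
}

// The same document queried through the XPath 1.0 engine.
$xpath = new DOMXPath($dom);
$prices = [];
foreach ($xpath->query('//div[@class="product"]/span[@class="price"]') as $node) {
    $prices[] = trim($node->textContent);
}

print_r($tagNames);
print_r($prices);
```

The XPath version states the whole selection in one expression, whereas the traversal version needs explicit filtering for node types.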
Common Misconception
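✗ DOMDocument is only for XML. loadHTML() parses HTML documents, including malformed ones, using libxml2's error-tolerant HTML parser. That parser follows HTML 4 rules rather than the HTML5 parsing algorithm, so its recovery can differ from a browser's; PHP 8.4 adds Dom\HTMLDocument for spec-compliant HTML5 parsing. For most web scraping and HTML manipulation tasks, loadHTML() is still well suited.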
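A small sketch of that tolerance, assuming a deliberately broken fragment invented for this example: the tags are upper-case, the attribute is unquoted, and neither the list items nor the paragraph are closed.

```php
<?php
libxml_use_internal_errors(true); // collect parser complaints instead of emitting warnings

// Hypothetical broken markup: unclosed <LI> and <p>, upper-case tags, unquoted attribute.
$html = '<UL class=menu><LI>Home<LI>About</UL><p>Unclosed paragraph';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Tag names are normalised to lower case, and each <LI> becomes its own element.
$items = [];
foreach ($dom->getElementsByTagName('li') as $li) {
    $items[] = trim($li->textContent);
}
$paragraphs = $dom->getElementsByTagName('p')->length;

libxml_clear_errors();

print_r($items);
echo $dom->saveHTML(); // the repaired tree, serialised back to HTML
```

The serialised output shows how the parser chose to repair the fragment, which is worth checking when a page is badly broken, since the recovery may not match what a browser would do.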
Why It Matters
Parsing HTML with regex is a well-known anti-pattern: patterns that work on one sample break as soon as nesting, attribute order, or whitespace varies. DOMDocument handles the messy reality of real-world HTML, including unclosed tags, mixed case, and malformed attributes. XPath queries are concise, readable, and correct in ways regex cannot be.
Common Mistakes
- Not calling libxml_use_internal_errors(true): without it, DOMDocument emits a PHP warning for every quirk in real-world HTML, flooding your logs.
- Relying on getElementById() in documents loaded with loadXML(): an id attribute only counts as an ID when a DTD declares it as one and the document is validated against it, or when you mark it with setIdAttribute(). Otherwise the lookup returns null; query //*[@id="..."] with XPath instead.
- Loading very large HTML/XML files into DOMDocument: the entire document is held in memory; use XMLReader to stream large XML files.
- Forgetting that XPath positions are 1-indexed: //li[1] selects the first li child of each parent, and //li[0] matches nothing. Use (//li)[1] for the first li in the whole document. DOMNodeList::item(), by contrast, is 0-indexed.
- Ignoring default XML namespaces: unprefixed names in an XPath 1.0 query match only elements in no namespace, so elements under a default xmlns still need a prefix that you bind with DOMXPath::registerNamespace().
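Three of the mistakes above (ID lookup in XML, 1-based positions, and default namespaces) can be reproduced in a few lines. This is a minimal sketch; the namespace URI, the element names, and the prefix "c" are invented for the example.

```php
<?php
// Hypothetical namespaced XML; the namespace URI and element names are made up.
$xml = <<<'XML'
<catalog xmlns="http://example.com/ns/catalog">
  <item id="a1">First</item>
  <item id="a2">Second</item>
</catalog>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

// No DTD declares "id" as an ID attribute, so this lookup finds nothing.
$byId = $dom->getElementById('a1');

$xpath = new DOMXPath($dom);

// Unprefixed names in XPath 1.0 mean "no namespace", so this matches nothing.
$unprefixed = $xpath->query('//item')->length;

// Bind a prefix of our choosing ("c") to the document's default namespace.
$xpath->registerNamespace('c', 'http://example.com/ns/catalog');

// Attribute match instead of getElementById().
$viaXPath = $xpath->query('//c:item[@id="a1"]')->item(0)->textContent;

// XPath positions start at 1, while DOMNodeList::item() starts at 0.
$firstByPosition = $xpath->query('(//c:item)[1]')->item(0)->textContent;
$zeroPosition = $xpath->query('(//c:item)[0]')->length;

var_dump($byId, $unprefixed, $viaXPath, $firstByPosition, $zeroPosition);
```

The prefix used in the query does not have to match anything in the document; only the namespace URI it is bound to matters.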
Code Examples
✗ Vulnerable
<?php
// ❌ Regex on HTML — breaks on nested tags, attributes, variations
$html = file_get_contents('https://example.com/products');
preg_match_all('/<span class="price">(.*?)<\/span>/s', $html, $matches);
// Breaks if: class has extra whitespace, tag has other attributes, nested spans exist
✓ Fixed
<?php
// ✅ DOMDocument + XPath: correct HTML parsing
$html = file_get_contents('https://example.com/products');
if ($html === false) {
    exit('Could not fetch the page' . PHP_EOL);
}

libxml_use_internal_errors(true); // Collect malformed-HTML errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOWARNING | LIBXML_NOERROR);
$xpath = new DOMXPath($dom);

// XPath: all span elements whose class attribute contains the substring 'price'
$prices = $xpath->query('//span[contains(@class, "price")]');
foreach ($prices as $node) {
    echo trim($node->textContent) . PHP_EOL;
}
libxml_clear_errors(); // Discard whatever the parser collected

// Modifying HTML
$title = $dom->getElementsByTagName('title')->item(0);
if ($title) {
    $title->textContent = 'New Title';
}
echo $dom->saveHTML();
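One caveat with contains(@class, "price") is that it is a substring test, so it would also match a class such as "price-old". A common XPath 1.0 idiom for matching a whole class token is sketched below; the markup and class names are hypothetical.

```php
<?php
libxml_use_internal_errors(true);

// Hypothetical markup: only the first and third spans carry the "price" class token.
$html = '<div>'
      . '<span class="price">10</span>'
      . '<span class="price-old">12</span>'
      . '<span class="sale  price">8</span>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring test: also matches "price-old".
$loose = $xpath->query('//span[contains(@class, "price")]')->length;

// Token test: pad the normalised class list with spaces and look for " price ".
$strict = $xpath->query(
    '//span[contains(concat(" ", normalize-space(@class), " "), " price ")]'
)->length;

libxml_clear_errors();

echo "loose: $loose, strict: $strict" . PHP_EOL;
```

The token form is more verbose, but it behaves like a CSS class selector: extra whitespace and additional classes do not affect the match.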
Added
23 Mar 2026
Edited
18 Apr 2026
⚡
DEV INTEL
Tools & Severity
⚙ Fix effort: Medium
⚡ Quick Fix
Suppress DOMDocument's warnings about malformed HTML with libxml_use_internal_errors(true) + libxml_clear_errors() — they are usually ignorable for real-world HTML scraping.
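If the collected errors should be inspected rather than silently discarded, one possible pattern is sketched here. It also restores the previous error-handling setting afterwards, which matters when other code in the same process relies on it. The markup is hypothetical, and which messages the parser reports depends on the installed libxml2 version.

```php
<?php
// Remember the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Hypothetical markup with a non-standard tag and a stray closing tag.
$dom->loadHTML('<body><widget>Hello</widget></span></body>');

$messages = [];
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError with level, code, line, and message properties.
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

libxml_clear_errors();                 // free the collected errors
libxml_use_internal_errors($previous); // put the old setting back

$text = trim($dom->textContent);
print_r($messages);
```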
📦 Applies To
PHP 5.0+
web
cli