
Unicode Normalisation Attack

security CWE-176 OWASP A3:2021 CVSS 5.3 PHP 5.3+ Advanced

Also Known As

Unicode normalisation homograph attack Unicode bypass

TL;DR

Exploiting differences between Unicode normalisation forms to bypass input filters: two strings that look identical on screen can differ at the byte level, so a filter that matches one form misses the other.

Explanation

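Unicode defines four normalisation forms: NFC and NFD (canonical composition and decomposition) and NFKC and NFKD (their compatibility counterparts, which additionally fold variants such as fullwidth or ligature characters into their plain equivalents). A character like 'é' can be represented as the single code point U+00E9 or as 'e' followed by the combining accent U+0301. A blocklist filter that checks raw bytes may match one representation and miss the other, and if a later stage (storage, search, or rendering) normalises the string, the blocked character reappears after the check has already passed. PHP's intl extension provides Normalizer::normalize() to canonicalise input before validation. Normalise user input to NFC before validation and storage, consider NFKC for identifiers such as usernames where compatibility variants should collapse, and never rely solely on byte-level comparisons for security-sensitive checks.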
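The composed and decomposed forms of 'é' described above can be inspected directly. This is a minimal sketch that assumes the intl extension is loaded; the variable names are illustrative only.

```php
<?php
// Requires ext-intl for the Normalizer class.
// The "\u{...}" escape syntax needs PHP 7.0 or later.

$composed   = "caf\u{00E9}";   // 'é' as one code point, U+00E9
$decomposed = "cafe\u{0301}";  // 'e' followed by U+0301 COMBINING ACUTE ACCENT

// The two strings render the same but are different byte sequences.
var_dump($composed === $decomposed);   // bool(false)
var_dump(bin2hex($composed));          // string(10) "636166c3a9"
var_dump(bin2hex($decomposed));        // string(12) "63616665cc81"

// After normalising both to NFC they compare equal.
$a = Normalizer::normalize($composed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b);                   // bool(true)
```

Any check that runs on the raw strings sees two different values; only the normalised copies are safe to compare.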

Common Misconception

It is tempting to assume that Unicode characters are handled consistently across PHP string functions and the database. In practice, PHP's string functions compare raw bytes, and different normalisation forms (NFC, NFD, NFKC) give the same visual string different byte representations, so filters that operate on raw bytes rather than normalised text can be bypassed.

Why It Matters

The same visual string can have multiple Unicode representations — filters that match one representation miss others, enabling bypass of username uniqueness checks, XSS filters, or path restrictions.

Common Mistakes

  • Comparing usernames or email addresses for uniqueness without normalising to NFC or NFKC first.
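  • XSS filters that match ASCII angle brackets but miss compatibility variants such as U+FF1C FULLWIDTH LESS-THAN SIGN, which a later NFKC normalisation step folds back into '<'.
  • File path security checks that pass on the raw string while the filesystem resolves it to a different path through its own Unicode canonicalisation (HFS+ on macOS, for example, stores names in decomposed form).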
  • Not using Normalizer::normalize() before storing or comparing user-supplied strings containing Unicode.
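The uniqueness mistake in the list above can be avoided by comparing on a derived key rather than on the raw string. The sketch below is one possible approach, not a prescribed one: `usernameKey()` is a hypothetical helper name, the choice of NFKC plus lower-casing is an assumption about what "the same username" should mean, and it needs both the intl and mbstring extensions.

```php
<?php
// Requires ext-intl and ext-mbstring. usernameKey() is a hypothetical helper.

/**
 * Build a comparison key for uniqueness checks.
 * Returns null when the input cannot be normalised (e.g. invalid UTF-8).
 */
function usernameKey(string $raw): ?string
{
    // NFKC folds compatibility variants (fullwidth letters, ligatures)
    // and composes combining sequences into a single canonical form.
    $normalised = Normalizer::normalize($raw, Normalizer::FORM_KC);
    if ($normalised === false) {
        return null;
    }

    // Lower-case so that 'Admin' and 'admin' collide as well.
    return mb_strtolower($normalised, 'UTF-8');
}

// Composed vs decomposed 'é' produce the same key.
var_dump(usernameKey("Jos\u{00E9}") === usernameKey("jose\u{0301}")); // bool(true)

// Fullwidth 'Ａ' (U+FF21) collapses to plain 'a' after NFKC + lower-casing.
var_dump(usernameKey("\u{FF21}dmin")); // string(5) "admin"
```

Storing this key in its own uniquely indexed column, next to the display name the user typed, keeps the uniqueness rule in one place.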

Avoid When

  • Avoid skipping normalisation for filenames: different Unicode representations of the same filename can bypass path traversal filters.
  • Avoid comparing usernames or emails for uniqueness without normalising first. Note that normalisation alone does not stop homograph attacks, which use visually identical characters from different scripts and need a separate script restriction.

When To Use

  • Normalise all user-supplied strings to NFC form before comparison, storage, or security checks.
  • Apply Unicode normalisation before checking string equality: composed and decomposed 'café' are different byte sequences but represent the same text.
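One way to apply the two rules above is to funnel every comparison through a small helper that normalises first and fails closed on bad input. This is a sketch under assumptions: `toNfc()` and `sameText()` are hypothetical names, and throwing an exception is just one possible way to reject input that cannot be normalised.

```php
<?php
// Requires ext-intl. toNfc() and sameText() are hypothetical helper names.

function toNfc(string $raw): string
{
    $normalised = Normalizer::normalize($raw, Normalizer::FORM_C);
    if ($normalised === false) {
        // normalize() returns false when the input cannot be processed,
        // for example when it is not valid UTF-8. Fail closed.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $normalised;
}

function sameText(string $a, string $b): bool
{
    // Compare the canonical forms, never the raw bytes.
    return toNfc($a) === toNfc($b);
}

var_dump(sameText("caf\u{00E9}", "cafe\u{0301}")); // bool(true)
var_dump(sameText("cafe", "caf\u{00E9}"));         // bool(false)
```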

Code Examples

✗ Vulnerable
// A blocklist that only knows the ASCII forms of dangerous characters
// Filter runs on the raw input; normalisation happens later (or never)
$input = $_POST['username'] ?? '';
if (preg_match('/[<>"\']/u', $input)) abort(400);
// Attacker sends U+FF1C FULLWIDTH LESS-THAN SIGN instead of '<': it passes the filter,
// and a later NFKC normalisation step folds it to '<' → XSS in output
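To see why a blocklist like the one above can be bypassed, the sketch below feeds it a payload written with fullwidth angle brackets. The payload and the later NFKC pass are assumptions chosen for illustration; whether such a pass exists depends on the rest of the application.

```php
<?php
// Requires ext-intl. The payload and the later NFKC step are illustrative.

$blocklist = '/[<>"\']/u';

// U+FF1C and U+FF1E are FULLWIDTH LESS-THAN and GREATER-THAN SIGN.
$payload = "\u{FF1C}script\u{FF1E}";

// The blocklist only knows the ASCII characters, so the payload passes.
$blocked = preg_match($blocklist, $payload) === 1;
var_dump($blocked); // bool(false)

// A later NFKC pass (for display, search indexing, and so on)
// folds the fullwidth forms to their ASCII equivalents.
$later = Normalizer::normalize($payload, Normalizer::FORM_KC);
var_dump($later);   // string(8) "<script>"

// Running the same filter after normalisation catches it.
var_dump(preg_match($blocklist, $later) === 1); // bool(true)
```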
✓ Fixed
// Normalise to NFC BEFORE any validation
$input = Normalizer::normalize($_POST['username'] ?? '', Normalizer::FORM_C);
if ($input === false) abort(400); // normalize() returns false on invalid UTF-8

// Now validate the canonical representation against an allow-list
if (!preg_match('/^[\p{L}\p{N}._@-]+$/u', $input)) abort(400);

// Normalisation does not stop homograph attacks (look-alike chars from different scripts):
// 'a' (U+0061 Latin) vs 'а' (U+0430 Cyrillic), so also restrict to the expected script
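The script restriction mentioned in the last comment could look like the sketch below. It assumes usernames are expected to be Latin-script only, which is a policy choice and not something the original example states; `isLatinOnlyUsername()` is a hypothetical helper name.

```php
<?php
// isLatinOnlyUsername() is a hypothetical helper; the Latin-only policy is an assumption.

function isLatinOnlyUsername(string $name): bool
{
    // \p{Latin} matches letters whose Unicode script is Latin, so Cyrillic or
    // Greek look-alikes are rejected. \z anchors at the true end of the string.
    // preg_match() returns false on invalid UTF-8 with /u, which also fails here.
    return preg_match('/^[\p{Latin}0-9._-]+\z/u', $name) === 1;
}

var_dump(isLatinOnlyUsername("paypal"));         // bool(true)
var_dump(isLatinOnlyUsername("p\u{0430}ypal"));  // bool(false), contains Cyrillic U+0430
```

Fullwidth Latin letters also belong to the Latin script, so this check complements normalisation and does not replace it.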

Added 15 Mar 2026
Edited 31 Mar 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: Low
⚡ Quick Fix
Normalise user input to NFC before validating, comparing, or storing it. The same visual character can have multiple Unicode representations (é as a single code point vs e + combining accent), so unnormalised strings fail equality checks and can slip past filters.
📦 Applies To
PHP 5.3+ web cli
🔗 Prerequisites
🔍 Detection Hints
Username duplicate check failing despite visually identical names; emoji comparison issues; string comparison failing on accented characters with different normalisation
Auto-detectable: ✗ No phpstan intl-extension
⚠ Related Problems
🤖 AI Agent
Confidence: High False Positives: Medium ✓ Auto-fixable Fix: Low Context: Function Tests: Update
CWE-116 CWE-20
