
Unicode Normalisation Attack

security CWE-176 OWASP A3:2021 CVSS 5.3 PHP 5.3+ Advanced

Also Known As

Unicode normalisation, homograph attack, Unicode bypass

TL;DR

Exploiting differences in Unicode normalisation forms to bypass input filters — two visually identical strings that differ at the byte level.

Explanation

Unicode defines multiple normalisation forms (NFC, NFD, NFKC, NFKD). A character like 'é' can be represented as a single codepoint U+00E9 or as 'e' plus a combining accent U+0301. A blocklist filter checking raw bytes may miss the decomposed form, which later normalises to the blocked character after storage or rendering. PHP's intl extension provides Normalizer::normalize() to canonicalise input before validation. Always normalise user input to NFC before validation and storage; never rely solely on byte-level comparisons for security-sensitive checks.
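
The two representations of 'é' described above can be compared directly. This is a minimal sketch, assuming PHP 7.0+ (for the `\u{...}` escape syntax) with the intl extension loaded; the variable names are illustrative only.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.

$composed   = "caf\u{00E9}";   // 'é' as a single code point, U+00E9
$decomposed = "cafe\u{0301}";  // 'e' followed by combining acute accent, U+0301

// A byte-level comparison treats them as different strings.
var_dump($composed === $decomposed);               // bool(false)
var_dump(strlen($composed), strlen($decomposed));  // int(5) int(6)

// After normalising both to NFC they compare equal.
$a = Normalizer::normalize($composed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b);                               // bool(true)
```

Any check that runs before this step sees whichever form the client chose to send.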

Common Misconception

✗ Unicode characters are handled consistently across PHP string functions and the database. Different normalisation forms (NFC, NFD, NFKC) can cause the same visual string to have different byte representations, bypassing filters that operate on raw bytes rather than normalised forms.

Why It Matters

The same visual string can have multiple Unicode representations — filters that match one representation miss others, enabling bypass of username uniqueness checks, XSS filters, or path restrictions.

Common Mistakes

Comparing usernames or email addresses for uniqueness without normalising to NFC or NFKC first.
XSS filters that match ASCII angle brackets but miss compatibility characters such as fullwidth '＜' (U+FF1C), which become ASCII '<' if a later step applies NFKC or NFKD.
File path security checks that pass even though the filesystem later resolves the name to a different path through its own Unicode canonicalisation.
Not using Normalizer::normalize() before storing or comparing user-supplied strings containing Unicode.
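
For the uniqueness mistake in particular, one possible approach is to derive a comparison key from every username and enforce uniqueness on that key. The sketch below assumes the intl and mbstring extensions and PHP 7.1+; `usernameKey()` is a hypothetical helper, not part of any library.

```php
<?php
// Hypothetical helper: builds the key used for uniqueness comparisons.
function usernameKey(string $raw): ?string
{
    // NFKC folds compatibility forms (such as fullwidth letters) as well as
    // composed/decomposed variants into a single representation.
    $normalised = Normalizer::normalize($raw, Normalizer::FORM_KC);
    if ($normalised === false) {
        return null; // normalisation failed, e.g. on invalid UTF-8
    }
    // Lower-case so that 'Alice' and 'alice' map to the same key.
    return mb_strtolower($normalised, 'UTF-8');
}

var_dump(usernameKey("Jos\u{00E9}") === usernameKey("Jose\u{0301}")); // bool(true)
var_dump(usernameKey("\u{FF21}lice") === usernameKey("alice"));       // bool(true)
```

Storing the key in its own uniquely indexed column keeps the check in the database while the display name can be kept as the user typed it.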

Avoid When

Do not skip normalisation for filenames — different Unicode representations of the same filename can bypass path traversal filters.
Do not compare usernames or emails for uniqueness without first normalising — homograph attacks exploit visually identical but byte-different characters.
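
The filename case could be handled along these lines. This is a sketch under stated assumptions: it normalises with NFKC so that compatibility characters are folded before the separator check, and `safeFilename()` is a hypothetical helper. Whether a given filesystem or framework applies similar folding later is platform-specific.

```php
<?php
// Hypothetical helper: returns a normalised single-component filename,
// or null if the name should be rejected.
function safeFilename(string $raw): ?string
{
    // Normalise first, so the checks below run on the folded form.
    $name = Normalizer::normalize($raw, Normalizer::FORM_KC);
    if ($name === false || $name === '' || $name === '.' || $name === '..') {
        return null;
    }
    // Reject path separators and ASCII control characters.
    if (preg_match('#[/\\\\\x00-\x1F]#u', $name) === 1) {
        return null;
    }
    return $name;
}

// U+FF0F (fullwidth solidus) folds to '/' under NFKC, so this is rejected.
var_dump(safeFilename("..\u{FF0F}etc"));                                           // NULL
// Decomposed accents are composed into a single canonical name.
var_dump(safeFilename("re\u{0301}sume\u{0301}.pdf") === "r\u{00E9}sum\u{00E9}.pdf"); // bool(true)
```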

When To Use

Normalise all user-supplied strings to NFC form before comparison, storage, or security checks.
Apply Unicode normalisation before checking string equality: composed and decomposed 'café' are different byte sequences but the same text.

Code Examples

✗ Vulnerable

// The filter runs on the raw input; nothing is normalised first
$input = $_POST['username'];
if (preg_match('/[<>"\']/u', $input)) abort(400);
// Attacker sends fullwidth '＜' (U+FF1C): it passes the filter, and any later
// NFKC/NFKD normalisation turns it into ASCII '<' → XSS in output

✓ Fixed

// Normalise to NFC BEFORE any validation
$input = Normalizer::normalize($_POST['username'] ?? '', Normalizer::FORM_C);

// Now validate on canonical representation
if (!preg_match('/^[\p{L}\p{N}._@-]+$/u', $input)) abort(400);

// Prevent homograph attacks (visually identical chars from different scripts):
// 'a' (U+0061 Latin) vs 'а' (U+0430 Cyrillic) — restrict to expected script
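
The script restriction mentioned in the last comment might look like the sketch below. It assumes a policy of Latin letters, ASCII digits and a few punctuation marks, and that the input has already been normalised to NFC (a decomposed combining accent would not match `\p{Latin}`). `isLatinUsername()` is a hypothetical name.

```php
<?php
// Hypothetical policy check: Latin-script letters, ASCII digits and . _ -
function isLatinUsername(string $name): bool
{
    // \p{Latin} is PCRE's Unicode script property; the /u flag enables UTF-8 mode.
    return preg_match('/^[\p{Latin}0-9._-]+$/u', $name) === 1;
}

var_dump(isLatinUsername("paypal"));         // bool(true)
var_dump(isLatinUsername("p\u{0430}ypal"));  // bool(false): U+0430 is Cyrillic 'а'
var_dump(isLatinUsername("jos\u{00E9}"));    // bool(true): 'é' is a Latin-script letter
```

Normalisation alone does not catch this case, because the Latin and Cyrillic letters are distinct characters in every normalisation form.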

References

↗ https://www.php.net/manual/en/class.normalizer.php


Added 15 Mar 2026

Edited 31 Mar 2026

Curated in Warsaw under one editorial standard. 1,445 terms, single voice.


⚡ DEV INTEL Tools & Severity

🟡 Medium ⚙ Fix effort: Low

⚡ Quick Fix

Normalise user input to NFC before validating, comparing, or storing it. The same visual character can have multiple Unicode representations (é as a single codepoint vs e + combining accent), which makes byte-level comparisons fail.

📦 Applies To

PHP 5.3+ web cli

🔗 Prerequisites

Unicode Fundamentals, Character Encoding, PHP Intl Extension — Unicode

🔍 Detection Hints

Username duplicate check failing despite visually identical names; emoji comparison issues; string comparison failing on accented characters with different normalisation

Auto-detectable: ✗ No phpstan intl-extension
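
Although static analysis does not flag the problem, existing data can be audited at runtime. The following is a rough sketch, assuming the intl extension; `findNonNfc()` is a hypothetical helper that would be fed values read from storage.

```php
<?php
// Hypothetical audit helper: returns the values that are not already in NFC.
function findNonNfc(array $values): array
{
    return array_values(array_filter($values, function (string $v): bool {
        return !Normalizer::isNormalized($v, Normalizer::FORM_C);
    }));
}

$stored = ["alice", "caf\u{00E9}", "cafe\u{0301}"];
var_dump(count(findNonNfc($stored)));  // int(1): only the decomposed 'café'
```

Rows reported by such a check are candidates for re-normalisation and for a second look at any uniqueness constraint they were meant to satisfy.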

⚠ Related Problems