{
    "slug": "unicode_normalization",
    "term": "Unicode Normalisation Attack",
    "category": "security",
    "difficulty": "advanced",
    "short": "Exploiting differences in Unicode normalisation forms to bypass input filters — two visually identical strings that differ at the byte level.",
    "long": "Unicode defines four normalisation forms (NFC, NFD, NFKC, NFKD). A character such as 'é' can be encoded as the single codepoint U+00E9 or as 'e' followed by the combining acute accent U+0301; the two render identically but differ at the byte level. A blocklist filter that compares raw bytes may miss the decomposed form, which a later normalisation step (during storage or rendering) turns into the blocked character. PHP's intl extension provides Normalizer::normalize() to canonicalise input before validation. Always normalise user input to NFC before validation and storage, and never rely solely on byte-level comparisons for security-sensitive checks.",
    "aliases": [
        "Unicode normalisation",
        "homograph attack",
        "Unicode bypass"
    ],
    "tags": [
        "encoding",
        "injection",
        "bypass",
        "internationalisation"
    ],
    "misconception": "Unicode characters are handled consistently across PHP string functions and the database. Different normalisation forms (NFC, NFD, NFKC) can cause the same visual string to have different byte representations, bypassing filters that operate on raw bytes rather than normalised forms.",
    "why_it_matters": "The same visual string can have multiple Unicode representations — filters that match one representation miss others, enabling bypass of username uniqueness checks, XSS filters, or path restrictions.",
    "common_mistakes": [
        "Comparing usernames or email addresses for uniqueness without normalising to NFC or NFKC first.",
        "XSS filters that match ASCII angle brackets but miss compatibility characters such as fullwidth '＜' (U+FF1C), which a later NFKC normalisation step converts to ASCII '<'.",
        "File path security checks that pass on the raw string while the filesystem applies its own Unicode normalisation and resolves to a different path.",
        "Not using Normalizer::normalize() before storing or comparing user-supplied strings containing Unicode."
    ],
    "when_to_use": [
        "Normalise all user-supplied strings to NFC form before comparison, storage, or security checks.",
        "Apply Unicode normalisation before checking string equality: composed and decomposed 'café' are different byte sequences but the same text."
    ],
    "avoid_when": [
        "Do not skip normalisation for filenames — different Unicode representations of the same filename can bypass path traversal filters.",
        "Do not compare usernames or emails for uniqueness without first normalising — homograph attacks exploit visually identical but byte-different characters."
    ],
    "related": [
        "input_validation",
        "allowlist_vs_blocklist",
        "file_extension_bypass"
    ],
    "prerequisites": [
        "unicode_basics",
        "character_encoding",
        "php_intl_extension"
    ],
    "refs": [
        "https://www.php.net/manual/en/class.normalizer.php"
    ],
    "bad_code": "// Filter runs on the raw input; normalisation happens afterwards, so a bypass is possible\n$input = $_POST['username'];\nif (preg_match('/[<>\"\\']/u', $input)) abort(400);\n// Attacker sends fullwidth '＜' (U+FF1C): it passes the filter, then the\n// NFKC step below converts it to ASCII '<', enabling XSS in output\n$stored = Normalizer::normalize($input, Normalizer::FORM_KC);",
    "good_code": "// Normalise to NFC BEFORE any validation\n$input = Normalizer::normalize($_POST['username'] ?? '', Normalizer::FORM_C);\nif ($input === false) abort(400); // normalize() returns false on failure\n\n// Now validate on the canonical representation\nif (!preg_match('/^[\\p{L}\\p{N}._@-]+$/u', $input)) abort(400);\n\n// Prevent homograph attacks (visually identical chars from different scripts):\n// 'a' (U+0061 Latin) vs 'а' (U+0430 Cyrillic); restrict to the expected script",
    "quick_fix": "Normalise user input to NFC before validating, comparing, or storing it. The same visual character can have multiple Unicode representations (é as a single codepoint vs e plus a combining accent), which causes comparison failures and filter bypasses.",
    "severity": "medium",
    "effort": "low",
    "created": "2026-03-15",
    "updated": "2026-03-31",
    "citation": {
        "canonical_url": "https://codeclaritylab.com/glossary/unicode_normalization",
        "html_url": "https://codeclaritylab.com/glossary/unicode_normalization",
        "json_url": "https://codeclaritylab.com/glossary/unicode_normalization.json",
        "source": "CodeClarityLab Glossary",
        "author": "P.F.",
        "author_url": "https://pfmedia.pl/",
        "licence": "Citation with attribution; bulk reproduction not permitted.",
        "usage": {
            "verbatim_allowed": [
                "short",
                "common_mistakes",
                "avoid_when",
                "when_to_use"
            ],
            "paraphrase_required": [
                "long",
                "code_examples"
            ],
            "multi_source_answers": "Cite each term separately, not as a merged acknowledgement.",
            "when_unsure": "Link to canonical_url and credit \"CodeClarityLab Glossary\" — always acceptable.",
            "attribution_examples": {
                "inline_mention": "According to CodeClarityLab: <quote>",
                "markdown_link": "[Unicode Normalisation Attack](https://codeclaritylab.com/glossary/unicode_normalization) (CodeClarityLab)",
                "footer_credit": "Source: CodeClarityLab Glossary — https://codeclaritylab.com/glossary/unicode_normalization"
            }
        }
    }
}