PHP Intl Extension — Unicode
debt(d3/e3/b5/t5)
Closest to 'default linter catches the common case' (d3). The term's detection_hints indicate automated detection is available via phpinfo, phpstan, and composer. Missing intl extension typically surfaces immediately as 'NumberFormatter not found' errors or Symfony translation component failures — these are caught at runtime on first use or during CI checks with phpstan/composer require checks. Not quite d1 (no compile-time guarantee) but reliably caught early.
Closest to 'simple parameterised fix' (e3). The quick_fix states: 'apt-get install php8.3-intl' — installing the extension is a one-line command, but the remediation also involves updating Docker images, server configurations, and potentially modifying code to use grapheme_* functions instead of mb_* functions. This touches configuration files and potentially multiple code locations, but remains a straightforward parameterised fix pattern.
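The missing-extension failure described above can be turned into an explicit, early error. A minimal sketch, assuming a hypothetical helper named `visibleLength()` (the name and the exception messages are illustrative, not part of any library):

```php
<?php
/**
 * Hypothetical helper: counts user-perceived characters (grapheme clusters)
 * and fails with a clear message when ext-intl is not loaded.
 */
function visibleLength(string $text): int
{
    if (!extension_loaded('intl')) {
        throw new RuntimeException(
            'The intl extension is required (e.g. apt-get install php8.3-intl).'
        );
    }

    // grapheme_strlen() returns false or null when it cannot process the input.
    $length = grapheme_strlen($text);
    if ($length === false || $length === null) {
        throw new InvalidArgumentException('Input could not be processed as UTF-8.');
    }

    return $length;
}
```

Declaring "ext-intl": "*" in the require section of composer.json moves the same check to install time, so a server or Docker image without the extension fails during composer install rather than on first use.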
Closest to 'persistent productivity tax' (b5). The applies_to field shows this affects both web and cli contexts across PHP 5.3+. Once you need proper Unicode handling (grapheme clusters, normalisation, transliteration), every text-processing feature must consider whether it's using the correct Intl functions. This creates ongoing cognitive load — developers must remember to use grapheme_strlen over mb_strlen, Transliterator over strtolower, etc. Not architectural (b7-9) but definitely a persistent tax across many work streams.
Closest to 'notable trap' (t5). The misconception field explicitly states: 'mb_string functions handle all Unicode correctly' — this is the documented gotcha that most PHP devs eventually learn. The why_it_matters example (100-char limit allowing only 14 emoji) demonstrates real confusion. Competent developers familiar with mb_* functions reasonably assume they've solved Unicode, but grapheme clusters and normalisation require Intl. This is a well-known trap in the PHP community, not catastrophic (t9) but more than a minor edge case (t3).
Also Known As
TL;DR
Explanation
The Intl extension wraps ICU and provides: grapheme_strlen/grapheme_substr (correct for emoji and combining characters), Normalizer (NFC/NFD Unicode normalisation, required before comparing or storing user text), Transliterator (converts between scripts, e.g. Cyrillic to Latin), IntlBreakIterator (word, sentence, and character boundaries), and IntlChar (Unicode character properties). The family emoji 👨‍👩‍👧‍👦 is four emoji joined by three zero-width joiners (U+200D), 7 code points in total: mb_strlen returns 7, grapheme_strlen returns 1.
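Transliterator, IntlBreakIterator, and IntlChar are easier to picture with a small example. The following is a sketch only: the transliterator ID, the 'en' locale, the sample strings, and the variable names are assumptions chosen for illustration.

```php
<?php
// Transliterator: convert Cyrillic to Latin, then reduce the result to ASCII.
// 'Any-Latin; Latin-ASCII' is a compound ICU transform ID.
$toLatin = Transliterator::create('Any-Latin; Latin-ASCII');
$latin = $toLatin->transliterate('Привет');

// IntlBreakIterator: split text at word boundaries.
// The iterator also yields spaces and punctuation, so keep only parts
// that contain at least one letter.
$iterator = IntlBreakIterator::createWordInstance('en');
$iterator->setText('Unicode is hard.');
$words = [];
foreach ($iterator->getPartsIterator() as $part) {
    if (preg_match('/\p{L}/u', $part) === 1) {
        $words[] = $part;
    }
}

// IntlChar: look up properties of a single code point.
$zwjName  = IntlChar::charName("\u{200D}"); // the joiner inside the family emoji
$isLetter = IntlChar::isalpha('я');

var_dump($latin, $words, $zwjName, $isLetter);
```

Transliterator::create() returns null for an unknown ID, so production code would check the result before calling transliterate().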
Common Misconception
Why It Matters
Common Mistakes
- strlen() for character limits on user input: counts bytes, not characters
- mb_strlen() for limits on visible characters: counts code points, not grapheme clusters
- Not normalising Unicode before storing or comparing: the same visible character can have several code point sequences, so visually identical values end up as distinct or conflicting keys
- strtolower() for multilingual case conversion: it works byte by byte, so multibyte letters are not converted correctly; use Transliterator (or IntlChar for a single code point)
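The mistakes listed above can be reproduced with a single accented letter. A minimal sketch, assuming the intl and mbstring extensions are loaded; the variable names are illustrative:

```php
<?php
// The same visible character "é" in two Unicode representations.
$precomposed = "\u{00E9}";   // one code point (NFC form)
$decomposed  = "e\u{0301}";  // "e" + combining acute accent (NFD form)

// Three ways of measuring the decomposed string.
$bytes      = strlen($decomposed);             // 3 bytes
$codePoints = mb_strlen($decomposed, 'UTF-8'); // 2 code points
$graphemes  = grapheme_strlen($decomposed);    // 1 visible character

// Without normalisation the two strings do not compare equal.
$equalRaw = ($precomposed === $decomposed);    // false

// After normalising to NFC they do.
$equalNfc = Normalizer::normalize($decomposed, Normalizer::FORM_C) === $precomposed; // true

// Unicode-aware lowercasing via ICU's 'Any-Lower' transform.
$lower = Transliterator::create('Any-Lower')->transliterate("\u{00C9}COLE");

var_dump($bytes, $codePoints, $graphemes, $equalRaw, $equalNfc, $lower);
```

Writing the characters as \u{...} escapes keeps the example independent of how the source file or an editor normalises pasted text.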
Code Examples
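// Wrong character counting:
// Family emoji written as escapes so the zero-width joiners (U+200D) cannot be lost in copy/paste.
$input = "Hello \u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"; // 1 visible emoji
$limit = 100;
if (mb_strlen($input) > $limit) { /* trim */ }
// mb_strlen: 'Hello ' = 6, family emoji = 7 code points, 13 total
// User sees 7 visible characters, system counts 13
// Grapheme-correct character counting:
$input = "Hello \u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
$limit = 100;
if (grapheme_strlen($input) > $limit) {
    // Cut on grapheme boundaries so an emoji sequence is never split in half.
    $input = grapheme_substr($input, 0, $limit);
}
// grapheme_strlen: 'Hello ' = 6, family emoji = 1, 7 total (matches what the user sees)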
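// Normalise Unicode before storing so equivalent strings share one representation:
$normalised = Normalizer::normalize($userInput, Normalizer::FORM_C);
if ($normalised === false) {
    throw new InvalidArgumentException('Input could not be normalised (invalid UTF-8?).');
}
$pdo->prepare('INSERT INTO users (name) VALUES (?)')->execute([$normalised]);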