mb_string — Multibyte String Functions
debt(d7/e3/b5/t7)
Closest to 'only careful code review or runtime testing' (d7). No detection_hints.tools specified for this term. Using native strlen() on UTF-8 strings won't trigger any compiler or linter error — the code runs fine until you see garbled output with international characters. Static analysis tools like PHPStan or Psalm don't flag strlen() vs mb_strlen() misuse by default. Only manual code review or runtime testing with multibyte input reveals the problem.
Closest to 'simple parameterised fix' (e3). The quick_fix describes setting mb_internal_encoding('UTF-8') once and doing a project-wide search-replace of strlen→mb_strlen, substr→mb_substr, etc. This is a mechanical find-and-replace pattern across multiple files but follows a predictable transformation — not architectural, but more than a one-line patch.
Closest to 'persistent productivity tax' (b5). The term applies_to all PHP contexts (web, cli) and affects any string operation on user input. Every developer working on the codebase must remember to use mb_* functions instead of native ones for all new code. This creates a persistent cognitive tax across the team — not quite architectural (you can adopt it incrementally), but it shapes how all string handling must be done.
Closest to 'serious trap' (t7). The misconception explicitly states that developers believe 'mb_internal_encoding("UTF-8") makes all string functions Unicode-safe'. In fact the setting only supplies the default encoding for the mb_* functions; native functions such as strlen() and substr() remain byte-oriented regardless of it. That runs against the reasonable expectations of developers coming from Python 3, Ruby, or other environments where strings are character-aware by default.
Also Known As
TL;DR
Explanation
A UTF-8 encoded character occupies 1 to 4 bytes. PHP's strlen() counts bytes: 'héllo' is 6 bytes (é is 2 bytes in UTF-8), not 5 characters. substr(), strtolower(), strtoupper(), and most other native string functions are byte-oriented in the same way. The mb_string extension (named mbstring in php.ini and the PHP manual; it is not part of a default source build, but most distribution packages and hosting setups ship it) provides mb_strlen(), mb_substr(), mb_strtolower(), mb_strtoupper(), mb_strpos(), mb_convert_encoding(), and many other character-aware functions. mb_internal_encoding() sets the default encoding used by the mb_* functions, so you don't need to pass it on every call; it has no effect on the native functions. The Intl extension's Normalizer and Collator classes handle Unicode normalisation and locale-aware comparison beyond what mb_string covers.
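As a quick check of the byte-versus-character distinction described above, here is a minimal sketch. The variable names are illustrative, and the sample string is written with a `\u{...}` escape (PHP 7.0+) so the result does not depend on how the source file is saved.

```php
<?php
// "héllo": é is U+00E9, which UTF-8 encodes as two bytes.
$word = "h\u{00E9}llo";

// Sets the default encoding for mb_* functions only.
mb_internal_encoding('UTF-8');

$bytes    = strlen($word);             // native function: counts bytes
$chars    = mb_strlen($word);          // mb_* function: counts characters
$explicit = mb_strlen($word, 'UTF-8'); // same result with the encoding passed explicitly

printf("%d bytes, %d characters\n", $bytes, $chars); // 6 bytes, 5 characters
```

strlen() still reports bytes after the mb_internal_encoding() call, which is the behaviour the misconception overlooks.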
Common Misconception
Why It Matters
Common Mistakes
- Using str_split() on UTF-8 strings — it splits by byte, producing broken multibyte sequences; use mb_str_split() (PHP 7.4+).
- Passing encoding parameter inconsistently — mb_substr($s, 0, 10, 'UTF-8') when mb_internal_encoding is already set adds noise; rely on the default after setting it once.
- Forgetting the /u modifier when calling preg_match() on UTF-8 input: without it the pattern is matched byte by byte; add /u for Unicode-aware matching, e.g. preg_match('/\p{L}+/u', $text).
- Relying on mb_strtolower() for locale-sensitive comparison — Turkish has dotted/dotless i rules that require the Intl Collator, not mb_string.
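The first and third mistakes can be reproduced with a short sketch like the one below. The sample text and variable names are hypothetical, and mb_str_split() assumes PHP 7.4 or later. The Collator case is left out because it needs the Intl extension.

```php
<?php
mb_internal_encoding('UTF-8');

// "naïve café": 10 characters, 12 bytes (ï and é take 2 bytes each).
$text = "na\u{00EF}ve caf\u{00E9}";

// str_split() cuts between bytes, so ï is torn into two fragments.
$byteChunks = str_split($text);
// mb_str_split() returns one array element per character.
$charChunks = mb_str_split($text);

// The third byte chunk is the lone lead byte of ï, which is not valid UTF-8.
$fragmentIsValid = mb_check_encoding($byteChunks[2], 'UTF-8');

// With /u the subject is decoded as UTF-8, so \p{L}+ matches the whole word.
preg_match('/\p{L}+/u', $text, $match);

// '.' consumes one byte without /u and one character with /u.
$dotsWithoutU = preg_match_all('/./', $text);
$dotsWithU    = preg_match_all('/./u', $text);

printf("%d byte chunks, %d character chunks\n", count($byteChunks), count($charChunks));
```

Comparing count($byteChunks) with count($charChunks) is a cheap way to spot whether a given string contains multibyte characters at all.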
Code Examples
<?php
// ❌ Byte-based functions on UTF-8 strings
$name = 'Ångström'; // 8 characters, 10 bytes (Å and ö take 2 bytes each)
echo strlen($name); // 10, not 8
echo strtoupper($name); // ÅNGSTRöM — ö not uppercased
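echo substr($name, 0, 3); // Ån: 3 bytes cover only 2 characters, because Å takes 2 bytes
echo substr($name, 0, 1); // "\xC3": the first byte of Å alone, an invalid UTF-8 sequence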
// Truncating a user bio at 100 chars
$bio = $user['bio'];
$preview = substr($bio, 0, 100); // May split a multibyte character at byte 100
<?php
// ✅ mb_string functions — character-aware
mb_internal_encoding('UTF-8');
$name = 'Ångström';
echo mb_strlen($name); // 8
echo mb_strtoupper($name); // ÅNGSTRÖM
echo mb_substr($name, 0, 3); // Ång — correct
// Safe truncation ($bio as defined in the previous example)
$preview = mb_strlen($bio) > 100
    ? mb_substr($bio, 0, 100) . '…'
    : $bio;