
mb_string — Multibyte String Functions

php PHP 4.0+ Intermediate
debt(d7/e3/b5/t7)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). No detection tools are listed for this term. Calling native strlen() on UTF-8 strings triggers no compiler or linter error; the code runs fine until international characters produce garbled output. Static analysis tools such as PHPStan and Psalm don't flag strlen() vs mb_strlen() misuse by default, so only manual code review or runtime testing with multibyte input reveals the problem.

e3 Effort Remediation debt — work required to fix once spotted

Closest to 'simple parameterised fix' (e3). The quick_fix describes setting mb_internal_encoding('UTF-8') once and doing a project-wide search-replace of strlen→mb_strlen, substr→mb_substr, etc. This is a mechanical find-and-replace pattern across multiple files but follows a predictable transformation — not architectural, but more than a one-line patch.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). The term applies_to all PHP contexts (web, cli) and affects any string operation on user input. Every developer working on the codebase must remember to use mb_* functions instead of native ones for all new code. This creates a persistent cognitive tax across the team — not quite architectural (you can adopt it incrementally), but it shapes how all string handling must be done.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The misconception explicitly states that developers believe 'mb_internal_encoding("UTF-8") makes all string functions Unicode-safe' — this contradicts how similar global encoding settings work in other languages where they affect all string operations. PHP's design where native functions remain byte-oriented regardless of encoding settings contradicts reasonable expectations from developers coming from Python 3, Ruby, or other Unicode-native environments.


Also Known As

mb_string, multibyte string, mb_strlen, mb_substr, mbstring

TL;DR

PHP's native string functions operate on bytes, not characters. The mb_string extension provides mb_strlen(), mb_substr(), mb_strtolower() and dozens of other functions that correctly handle multibyte encodings like UTF-8.

Explanation

A UTF-8 encoded character occupies 1 to 4 bytes. PHP's strlen() counts bytes: 'héllo' is 6 bytes (é is 2 bytes in UTF-8), not 5 characters. substr(), strtolower(), strtoupper(), and most other native string functions are byte-oriented in the same way. The mb_string extension provides mb_strlen(), mb_substr(), mb_strtolower(), mb_strtoupper(), mb_strpos(), mb_convert_encoding(), and many others that are character-aware. It is not compiled in by default (it needs --enable-mbstring at build time), although most hosting environments and distribution packages make it available; check with extension_loaded('mbstring') if in doubt. mb_internal_encoding() sets the default encoding so you don't need to pass it to every function; since PHP 5.6 that default follows default_charset, which is UTF-8 unless overridden. The Intl extension's Normalizer and Collator classes handle Unicode normalisation and locale-aware comparison beyond what mb_string covers.
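The byte-versus-character difference described above can be checked directly. This is a minimal sketch, assuming PHP 7.0+ (for the "\u{...}" escape) with the mbstring extension loaded; the variable names are illustrative only.

```php
<?php
// "héllo", written with an escape so the result does not depend on
// how this source file itself is encoded.
$word = "h\u{E9}llo";

$byteCount = strlen($word);             // native function: counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // mbstring: counts characters

echo $byteCount, PHP_EOL; // 6
echo $charCount, PHP_EOL; // 5

// The two bytes that encode é in UTF-8, shown as hex.
$hexOfE = bin2hex("\u{E9}");
echo $hexOfE, PHP_EOL;    // c3a9
```

Passing 'UTF-8' explicitly keeps the sketch independent of whatever internal encoding the surrounding application has configured.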

Common Misconception

Setting mb_internal_encoding('UTF-8') makes all string functions Unicode-safe. It only affects mb_* functions, not the native strlen(), substr(), etc. You must explicitly use the mb_ variants.
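A short sketch of the misconception, assuming PHP 7.0+ with mbstring loaded: setting the internal encoding changes what the mb_* functions assume, while the native functions keep counting bytes.

```php
<?php
mb_internal_encoding('UTF-8');

$city = "Z\u{FC}rich"; // "Zürich": 6 characters, 7 bytes in UTF-8

$nativeLength = strlen($city);    // unaffected by mb_internal_encoding()
$mbLength     = mb_strlen($city); // uses the internal encoding set above

echo $nativeLength, PHP_EOL; // 7
echo $mbLength, PHP_EOL;     // 6
```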

Why It Matters

Every PHP application that handles text beyond ASCII, which is most applications globally, needs mb_string. Truncating a UTF-8 string with substr() can cut a multibyte character in half and produce corrupted output. Comparing strings with strcasecmp() is a byte-wise comparison that ignores locale rules. These are not edge cases; they affect any application with international users.
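To make the corruption concrete, the sketch below cuts the same string by bytes and by characters, then asks mb_check_encoding() whether each result is still valid UTF-8. It assumes PHP 7.0+ with mbstring loaded; the sample string is made up for illustration.

```php
<?php
$title = "na\u{EF}ve caf\u{E9}"; // "naïve café"

// Byte cut: "na" plus only the first of the two bytes that encode ï.
$byteCut = substr($title, 0, 3);

// Character cut: the first three characters, "naï".
$charCut = mb_substr($title, 0, 3, 'UTF-8');

$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8');

var_dump($byteCutIsValid); // bool(false)
var_dump($charCutIsValid); // bool(true)
```

A string that fails this check is the kind that typically shows up as a replacement character in browsers or is rejected by strict JSON and database layers.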

Common Mistakes

  • Using str_split() on UTF-8 strings: it splits by byte, producing broken multibyte sequences; use mb_str_split() (PHP 7.4+).
  • Passing the encoding parameter inconsistently: mb_substr($s, 0, 10, 'UTF-8') when mb_internal_encoding is already set adds noise; rely on the default after setting it once.
  • Forgetting the /u modifier with preg_match() on UTF-8 input: without it the pattern is matched byte by byte; use preg_match('/\p{L}+/u', $text) for Unicode-aware matching.
  • Relying on mb_strtolower() for locale-sensitive comparison: it is not locale-aware, and Turkish has dotted/dotless i rules that require the Intl Collator, not mb_string.
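Two of the mistakes above (str_split() and the missing /u modifier) can be sketched as follows. This assumes PHP 7.4+ with mbstring loaded; the variable names are illustrative.

```php
<?php
$word = "\u{DC}ber"; // "Über": 4 characters, 5 bytes in UTF-8

// str_split() yields one element per byte; mb_str_split() one per character.
$byteParts = str_split($word);
$charParts = mb_str_split($word, 1, 'UTF-8');

echo count($byteParts), PHP_EOL; // 5
echo count($charParts), PHP_EOL; // 4

// Without /u, "." matches a single byte; with /u it matches a whole character.
preg_match('/^./', $word, $byteMatch);
preg_match('/^./u', $word, $charMatch);

echo strlen($byteMatch[0]), PHP_EOL; // 1 (half of Ü)
echo strlen($charMatch[0]), PHP_EOL; // 2 (all of Ü)
```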

Code Examples

✗ Vulnerable
<?php
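// ❌ Byte-based functions on UTF-8 strings
$name = 'Ångström'; // 8 characters, 10 bytes (Å and ö take 2 bytes each)

echo strlen($name);          // 10, not 8
echo strtoupper($name);      // ÅNGSTRöM (ö not uppercased; PHP 8+ only converts ASCII letters)
echo substr($name, 0, 3);    // Ån (3 bytes, but only 2 characters)
echo substr($name, 0, 8);    // Ångstr\xC3 (corrupted: cuts ö in half)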

// Truncating a user bio at 100 chars
$bio = $user['bio'];
$preview = substr($bio, 0, 100); // May split a multibyte character at byte 100
✓ Fixed
<?php
// ✅ mb_string functions — character-aware
mb_internal_encoding('UTF-8');

$name = 'Ångström';

echo mb_strlen($name);          // 8
echo mb_strtoupper($name);      // ÅNGSTRÖM
echo mb_substr($name, 0, 3);    // Ång — correct

// Safe truncation (same $user['bio'] input as in the vulnerable example)
$bio = $user['bio'];
$preview = mb_strlen($bio) > 100
    ? mb_substr($bio, 0, 100) . '…'
    : $bio;

Added 23 Mar 2026
DEV INTEL Tools & Severity
⚙ Fix effort: Low
⚡ Quick Fix
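Add mb_internal_encoding('UTF-8'); to your bootstrap. Then do a project-wide search for strlen, substr, strpos, strtolower, strtoupper and str_split, and wherever the call handles potentially multibyte text and is meant to count characters rather than bytes, switch it to the mb_ equivalent (mb_strlen, mb_substr, mb_strpos, mb_strtolower, mb_strtoupper, mb_str_split).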
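One thing to watch during the search-and-replace: some call sites genuinely need a byte count, and those should keep the native function. The sketch below is an illustration under that assumption, using a made-up $body string and PHP 7.0+ with mbstring loaded.

```php
<?php
mb_internal_encoding('UTF-8');

$body = "Gr\u{FC}\u{DF}e"; // "Grüße": 5 characters, 7 bytes in UTF-8

// Character count: what a "max 5 characters" validation rule cares about.
$displayLength = mb_strlen($body);

// Byte count: what a byte-sized limit or an HTTP Content-Length value needs.
$byteLength = strlen($body);

echo $displayLength, PHP_EOL; // 5
echo $byteLength, PHP_EOL;    // 7
```

Reviewing each replacement rather than applying it blindly keeps these byte-oriented call sites correct.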
📦 Applies To
PHP 4.0+ web cli
