
mb_string — Multibyte String Functions

php PHP 4.0+ Intermediate

debt(d7/e3/b5/t7)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). No detection_hints.tools specified for this term. Using native strlen() on UTF-8 strings won't trigger any compiler or linter error — the code runs fine until you see garbled output with international characters. Static analysis tools like PHPStan or Psalm don't flag strlen() vs mb_strlen() misuse by default. Only manual code review or runtime testing with multibyte input reveals the problem.

e3 Effort Remediation debt — work required to fix once spotted

Closest to 'simple parameterised fix' (e3). The quick_fix describes setting mb_internal_encoding('UTF-8') once and doing a project-wide search-replace of strlen→mb_strlen, substr→mb_substr, etc. This is a mechanical find-and-replace pattern across multiple files but follows a predictable transformation — not architectural, but more than a one-line patch.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). The term applies_to all PHP contexts (web, cli) and affects any string operation on user input. Every developer working on the codebase must remember to use mb_* functions instead of native ones for all new code. This creates a persistent cognitive tax across the team — not quite architectural (you can adopt it incrementally), but it shapes how all string handling must be done.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap' (t7). The misconception is that 'mb_internal_encoding("UTF-8") makes all string functions Unicode-safe'. In fact PHP's native functions remain byte-oriented regardless of encoding settings, which contradicts the reasonable expectations of developers coming from Python 3, Ruby, or other Unicode-native environments.

About DEBT scoring → scored by claude-opus-4-5-20251101 · 2026-05-11 · reviewed by human

Also Known As

mb_string, multibyte string, mb_strlen, mb_substr, mbstring

TL;DR

PHP's native string functions operate on bytes, not characters. The mb_string extension provides mb_strlen(), mb_substr(), mb_strtolower() and dozens of other functions that correctly handle multibyte encodings like UTF-8.

Explanation

A UTF-8 encoded character occupies 1 to 4 bytes. PHP's strlen() counts bytes: 'héllo' is 6 bytes (é takes 2 bytes in UTF-8), not 5 characters. substr(), strtolower(), strtoupper(), and most other native string functions are byte-oriented in the same way. The mb_string extension (extension name: mbstring; not enabled by default in a source build, but shipped with most distribution packages and hosting stacks) provides mb_strlen(), mb_substr(), mb_strtolower(), mb_strtoupper(), mb_strpos(), mb_convert_encoding(), and many others that are character-aware. mb_internal_encoding() sets the default encoding so you don't need to pass it to every function. The Intl extension's Normalizer and Collator classes handle Unicode normalisation and locale-aware comparison, which mb_string does not cover.
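The byte-versus-character difference described above can be seen in a minimal sketch. It assumes the mbstring extension is loaded and writes the é as an explicit "\u{E9}" escape (PHP 7.0+) so that the byte content does not depend on how the source file is saved.

```php
<?php
// Assumes the mbstring extension is loaded.
$word = "h\u{E9}llo"; // "héllo": é is U+00E9, encoded as bytes C3 A9 in UTF-8

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters (code points)

echo $bytes, "\n"; // 6
echo $chars, "\n"; // 5

// mb_substr() slices by character; substr() slices by byte offset.
echo mb_substr($word, 1, 1, 'UTF-8'), "\n"; // é
echo bin2hex(substr($word, 1, 1)), "\n";    // c3 (only the first byte of é)
```

The last line prints hex rather than the raw slice because a lone C3 byte is not valid UTF-8 and would render as a replacement glyph in most terminals.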

Common Misconception

✗ Setting mb_internal_encoding('UTF-8') makes all string functions Unicode-safe. It only affects mb_* functions, not the native strlen(), substr(), etc. You must explicitly use the mb_ variants.
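A short sketch of the point above, assuming the mbstring extension is loaded: the internal encoding is set first, yet strlen() still reports bytes. The sample string is only an illustration.

```php
<?php
// Assumes the mbstring extension is loaded.
mb_internal_encoding('UTF-8');

$city = "Z\u{FC}rich"; // "Zürich": ü is U+00FC, 2 bytes in UTF-8

$native    = strlen($city);    // native function: unaffected by the setting above
$multibyte = mb_strlen($city); // mb_* function: uses the internal encoding

echo $native, "\n";    // 7 (bytes)
echo $multibyte, "\n"; // 6 (characters)
```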

Why It Matters

Every PHP application that handles text beyond ASCII (which, globally, is most applications) needs mb_string. Truncating a UTF-8 string with substr() at a byte offset that falls mid-character produces an invalid byte sequence and corrupted output. Sorting or comparing strings with strcasecmp() ignores locale rules. These are not edge cases: they affect any application with international users.
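To make the truncation problem concrete, the sketch below cuts the same string by bytes and by characters, then checks each result with mb_check_encoding(). It assumes the mbstring and json extensions are available. The json_encode() call is included only to illustrate one downstream effect of an invalid sequence.

```php
<?php
// Assumes the mbstring and json extensions are loaded.
$text = "na\u{EF}ve caf\u{E9}"; // "naïve café": 10 characters, 12 bytes

$byteCut = substr($text, 0, 3);             // "na" plus the first byte of ï
$charCut = mb_substr($text, 0, 3, 'UTF-8'); // "naï"

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false): invalid UTF-8
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)

// With default flags, json_encode() refuses malformed UTF-8 and returns false.
var_dump(json_encode($byteCut)); // bool(false)
```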

Common Mistakes

Using str_split() on UTF-8 strings — it splits by byte, producing broken multibyte sequences; use mb_str_split() (PHP 7.4+).
Passing encoding parameter inconsistently — mb_substr($s, 0, 10, 'UTF-8') when mb_internal_encoding is already set adds noise; rely on the default after setting it once.
Forgetting preg_match() with UTF-8 — use the /u modifier for Unicode-aware regex matching: preg_match('/\p{L}+/u', $text).
Relying on mb_strtolower() for locale-sensitive comparison — Turkish has dotted/dotless i rules that require the Intl Collator, not mb_string.
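The first and third mistakes above can be sketched as follows, assuming PHP 7.4+ with mbstring loaded. The pattern '/./' is used here instead of \p{L} purely to make the byte-versus-character difference countable. The Collator case is left out because it needs the Intl extension.

```php
<?php
// Assumes PHP 7.4+ with the mbstring extension loaded.
$word = "caf\u{E9}"; // "café": 4 characters, 5 bytes

$byteParts = str_split($word);                // one element per byte
$charParts = mb_str_split($word, 1, 'UTF-8'); // one element per character

echo count($byteParts), "\n"; // 5 (the last two elements are halves of é)
echo count($charParts), "\n"; // 4

// Without /u, "." matches a single byte; with /u it matches a whole code point.
preg_match_all('/./', $word, $byBytes);
preg_match_all('/./u', $word, $byChars);

echo count($byBytes[0]), "\n"; // 5
echo count($byChars[0]), "\n"; // 4
```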

Code Examples

✗ Vulnerable

<?php
// ❌ Byte-based functions on UTF-8 strings
$name = 'Ångström'; // 8 characters, 10 bytes (Å and ö take 2 bytes each)

echo strlen($name);          // 10, not 8
echo strtoupper($name);      // ÅNGSTRöM: ö is not uppercased
echo substr($name, 0, 3);    // Ån: 3 bytes, but only 2 characters
echo substr($name, 0, 1);    // "\xC3": half of Å, an invalid UTF-8 sequence

// Truncating a user bio at 100 chars
$bio = $user['bio'];
$preview = substr($bio, 0, 100); // May split a multibyte character at byte 100

✓ Fixed

<?php
// ✅ mb_string functions — character-aware
mb_internal_encoding('UTF-8');

$name = 'Ångström';

echo mb_strlen($name);          // 8
echo mb_strtoupper($name);      // ÅNGSTRÖM
echo mb_substr($name, 0, 3);    // Ång — correct

// Safe truncation ($user is assumed to be the same record as in the example above)
$bio = $user['bio'];
$preview = mb_strlen($bio) > 100
    ? mb_substr($bio, 0, 100) . '…'
    : $bio;
