PHP 6 — The Version That Never Shipped
debt(d7/e3/b3/t7)
Closest to 'only careful code review or runtime testing' (d7). The misconception (using native string functions like strlen() or strtolower() on UTF-8 data) produces silent byte-level results rather than compile or lint errors. Static analysers like PHPStan or Psalm can sometimes flag mbstring misuse, but only with custom rules, so detection generally requires careful review or runtime testing with non-ASCII input.
Closest to 'simple parameterised fix' (e3). The quick_fix states: replace native string functions with their mbstring equivalents (mb_strlen, mb_substr, mb_strtolower). This is a targeted pattern replacement rather than a one-line swap because it may touch multiple call sites across a component, but it does not require cross-cutting architectural changes.
Closest to 'localised tax' (b3). The burden applies wherever string handling occurs, but it is a contained, well-understood problem: once a developer knows to use the mb_* functions, the fix is systematic. It does not reshape the entire codebase architecture, though it is a persistent reminder in any code dealing with user-facing text or multibyte input.
Closest to 'serious trap' (t7). The misconception field explicitly states developers expect PHP strings to behave like Unicode objects (as in Python 3 or Java), but PHP strings are byte strings. The common_mistakes reinforce this: strlen() returning byte counts, strtolower() ignoring accents, mixing mb_ and native functions — all contradict the mental model of modern language string handling. This actively contradicts how similar concepts work in other languages developers know.
Also Known As
TL;DR
Explanation
PHP 6 development began in 2005 with one primary goal: native Unicode support throughout the language and standard library. Every string operation would understand multibyte characters natively, ending years of mbstring workarounds. The branch lingered for five years. The core problem was that making every string operation Unicode-aware required changes across thousands of internal functions, and the performance impact was severe: benchmarks showed 20–50% slowdowns for code that didn't even use Unicode. In 2010 the branch was abandoned. The valuable non-Unicode features developed along the way (namespaces, late static binding, closures, and goto) had already been backported to PHP 5.3, released in 2009. The version number 6 was later skipped entirely to avoid confusion with the abandoned branch and the books already published about it. PHP 7 arrived in 2015.
Common Misconception
Why It Matters
Common Mistakes
- Using strlen() on UTF-8 strings and getting byte counts instead of character counts — leads to truncation bugs with multibyte characters.
- Assuming strtolower() / strtoupper() handle accented characters: they don't. Use mb_strtolower() / mb_strtoupper(), which take an encoding argument (for example 'UTF-8'), not a locale.
- Mixing mbstring and native string functions on the same variable: substr() after mb_substr() can corrupt multibyte sequences, because one counts bytes and the other counts characters.
- Expecting PHP to behave like Python 3 or Java where strings are Unicode objects by default — PHP strings are byte strings.
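The mixing mistake is easiest to see with offsets. The following is a minimal sketch, assuming the mbstring extension is loaded and the source file is saved as UTF-8; the sample string and variable names are illustrative only.

```php
<?php
// Byte offsets and character offsets are not interchangeable.
// Assumption: mbstring is loaded and this file is UTF-8 encoded.
$str = 'naïve café';

// strpos() counts bytes; 'ï' occupies two bytes, so 'café' starts at byte 7.
$bytePos = strpos($str, 'café');

// mb_strpos() counts characters, so 'café' starts at character 6.
$charPos = mb_strpos($str, 'café', 0, 'UTF-8');

// Mixed: a byte offset passed to a character-based function overshoots.
$mixed = mb_substr($str, $bytePos, null, 'UTF-8');

// Consistent: a character offset passed to a character-based function.
$consistent = mb_substr($str, $charPos, null, 'UTF-8');

var_dump($bytePos, $charPos, $mixed, $consistent);
// int(7), int(6), string "afé", string "café"
```

The mixed call does not raise an error; it quietly returns the wrong substring, which is why this class of bug tends to surface only when tests include non-ASCII input.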
Code Examples
// ❌ Assuming native string functions are Unicode-safe — they aren't
$str = 'héllo';
echo strlen($str); // 6, not 5 — counts bytes not characters
echo strtoupper($str); // HéLLO — fails on non-ASCII
echo substr($str, 0, 2); // h\xC3 (splits the two-byte é, leaving invalid UTF-8)
// ✅ Use mbstring for Unicode-safe string operations
$str = 'héllo';
echo mb_strlen($str); // 5 — character count
echo mb_strtoupper($str); // HÉLLO — correct
echo mb_substr($str, 0, 3); // hél — safe
// Also set the default encoding once at bootstrap so mb_* calls without an explicit encoding use UTF-8
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
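The bootstrap calls above rely on global state. An alternative, sketched here under the assumption that all input is UTF-8, is to pass the encoding explicitly on every mb_* call. The helper name truncateChars() is hypothetical and not part of PHP or of the original text.

```php
<?php
// Hypothetical helper: truncate to a maximum number of characters
// without splitting a multibyte sequence.
// Assumption: mbstring is loaded and $text is valid UTF-8 by default.
function truncateChars(string $text, int $maxChars, string $encoding = 'UTF-8'): string
{
    // Compare character counts, not byte counts.
    if (mb_strlen($text, $encoding) <= $maxChars) {
        return $text;
    }
    return mb_substr($text, 0, $maxChars, $encoding);
}

$name = 'Zoë Müller';

// Character-based truncation keeps 'ë' intact.
$short = truncateChars($name, 5);

// Byte-based truncation at 3 bytes cuts 'ë' in half.
$broken = substr($name, 0, 3);

var_dump($short);                               // string "Zoë M"
var_dump(mb_check_encoding($short, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8'));  // bool(false)
```

Passing the encoding per call keeps the helper independent of whatever mb_internal_encoding() happens to be set to elsewhere in the application.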