Character Encoding
debt(d7/e7/b7/t7)
Closest to 'only careful code review or runtime testing' (d7). The term's detection_hints list phpstan and mysql-charset-check as tools, but the canonical failure — MySQL utf8 silently dropping emoji — is not caught by standard linting. phpstan may flag mb_* misuse in some configurations, but the MySQL charset mismatch and PDO DSN issues are invisible until data is written and inspected. The silent data loss described in why_it_matters makes this a d7: it surfaces only through careful review or runtime testing with emoji-containing data.
Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix summary says 'UTF-8 everywhere': database charset=utf8mb4, PHP mb_* functions, HTML meta tag, php.ini setting, and PDO DSN. This is not a single-line patch — it requires touching database migrations/ALTER TABLE statements, PHP string-handling code throughout the codebase, configuration files, and connection setup. The common_mistakes list four distinct categories of misuse, each requiring separate remediation across multiple layers of the stack.
Closest to 'strong gravitational pull' (b7). applies_to covers both web and cli contexts (broad scope). Every string-handling operation, database interaction, and HTTP response is subject to encoding correctness. The choice of charset permeates database schema, PDO configuration, PHP string functions, and HTTP headers. Any new feature touching user input or stored strings must be encoding-aware, making this a persistent, cross-cutting structural burden that shapes how every future developer must write string-handling code.
Closest to 'serious trap — contradicts how a similar concept works elsewhere' (t7). The misconception field directly states that developers conflate UTF-8 and Unicode as the same thing. More operationally dangerous is the MySQL utf8 vs utf8mb4 trap: MySQL's charset named 'utf8' is NOT full UTF-8 — it silently drops 4-byte characters (emoji). A competent developer who knows UTF-8 and trusts MySQL's 'utf8' charset label will be confidently wrong. This contradicts the reasonable expectation that a charset named 'utf8' implements the UTF-8 standard, making it a t7.
Also Known As
TL;DR
Explanation
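ASCII encodes 128 characters in 7 bits. Latin-1 extends this to 256 characters using 8 bits. UTF-8 encodes every Unicode code point (a code space of about 1.1 million) in 1 to 4 bytes: ASCII characters use 1 byte (so ASCII text is already valid UTF-8), most European accented characters use 2, most CJK characters use 3, and emoji and other supplementary-plane characters use 4. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for supplementary planes. PHP's core string functions are byte-oriented, so strlen('café') returns 5, not 4, when the string is UTF-8. Use mb_strlen() and the other mb_* functions for character-aware operations. MySQL's utf8 charset is an alias for utf8mb3 and stores at most 3 bytes per character; use utf8mb4 for full Unicode, including emoji.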
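The byte widths described above can be checked directly. The following is a minimal sketch, assuming PHP 7+ with the mbstring extension loaded; the function name utf8_byte_widths and the sample labels are illustrative, not part of any library.

```php
<?php
// Hypothetical helper: reports byte length vs. character count per sample.
function utf8_byte_widths(array $samples): array
{
    $widths = [];
    foreach ($samples as $label => $char) {
        $widths[$label] = [
            'bytes' => strlen($char),              // counts bytes
            'chars' => mb_strlen($char, 'UTF-8'),  // counts code points
        ];
    }
    return $widths;
}

// "\u{...}" escapes keep the example independent of the file's own encoding.
$widths = utf8_byte_widths([
    'ascii' => "A",          // U+0041
    'latin' => "\u{00E9}",   // U+00E9, e with acute accent
    'cjk'   => "\u{4E2D}",   // U+4E2D, CJK ideograph
    'emoji' => "\u{1F44B}",  // U+1F44B, outside the Basic Multilingual Plane
]);

foreach ($widths as $label => $w) {
    printf("%-5s bytes=%d chars=%d\n", $label, $w['bytes'], $w['chars']);
}
```

Each sample is a single character, so the 'chars' value is 1 throughout while 'bytes' grows from 1 to 4.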
Common Misconception
Why It Matters
Common Mistakes
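- MySQL utf8 instead of utf8mb4: utf8 stores at most 3 bytes per character. In non-strict SQL mode the value is truncated at the first 4-byte character (such as an emoji) with only a warning; in strict mode the INSERT fails with an "Incorrect string value" error.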
- strlen() instead of mb_strlen() for user-facing strings — wrong character count for multibyte strings.
- substr() instead of mb_substr() — can split multibyte sequences, corrupting the string.
- Not setting charset=utf8mb4 in PDO DSN — connection charset defaults may cause mojibake.
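The mistakes above can be guarded against at the application boundary. The sketch below is one possible approach, assuming PHP 7+ with mbstring and PCRE Unicode support; the function names contains_four_byte_utf8 and safe_truncate are hypothetical helpers, not library functions.

```php
<?php
// Hypothetical guard: true if the string holds any code point that needs
// 4 bytes in UTF-8 (U+10000 and above), which a legacy utf8 column cannot store.
function contains_four_byte_utf8(string $s): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which also yields false here.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

// Hypothetical helper: truncates by characters, never mid-sequence.
function safe_truncate(string $s, int $maxChars): string
{
    return mb_substr($s, 0, $maxChars, 'UTF-8');
}

$input = "Alice \u{1F44B}";

// Reject malformed input before it reaches the database.
if (!mb_check_encoding($input, 'UTF-8')) {
    throw new InvalidArgumentException('Input is not valid UTF-8');
}

var_dump(contains_four_byte_utf8($input));           // emoji present
var_dump(mb_strlen(safe_truncate($input, 5), 'UTF-8'));
```

A check like contains_four_byte_utf8() is only a stopgap for schemas that cannot yet be migrated; converting the columns and the connection to utf8mb4 removes the need for it.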
Code Examples
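-- MySQL utf8 (alias for utf8mb3): emoji cannot be stored
CREATE TABLE users (name VARCHAR(100) CHARSET utf8);
INSERT INTO users (name) VALUES ('Alice 👋'); -- strict SQL mode: fails with "Incorrect string value"
SELECT name FROM users; -- non-strict mode: returns 'Alice ' (truncated at the emoji, warning only)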
// PHP byte-length instead of character-length:
$name = 'café';
strlen($name); // Returns 5 (bytes), not 4 (characters)
substr($name, 0, 4); // Returns "caf\xC3": cuts 'é' (2 bytes) in half, leaving invalid UTF-8
-- MySQL utf8mb4: full Unicode support, including emoji
CREATE TABLE users (name VARCHAR(100) CHARSET utf8mb4);
// PDO with correct charset:
$pdo = new PDO('mysql:host=db;dbname=app;charset=utf8mb4', $user, $pass);
// PHP multibyte string functions:
$name = 'café';
mb_strlen($name); // 4 (characters)
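mb_substr($name, 0, 4); // 'café' (counts characters, never splits a multibyte sequence)
mb_strtoupper($name); // 'CAFÉ' (Unicode-aware case mapping; strtoupper() would leave 'é' unchanged)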
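The remaining layers of the "UTF-8 everywhere" fix (the php.ini setting and the response charset) can be set in one place at startup. This is a minimal sketch, assuming PHP 7+ with mbstring; the function name bootstrap_utf8 is hypothetical, and a real application may prefer to set default_charset in php.ini rather than at runtime.

```php
<?php
// Hypothetical bootstrap helper: call once, before any output is sent.
function bootstrap_utf8(): void
{
    // Runtime equivalent of default_charset = "UTF-8" in php.ini.
    ini_set('default_charset', 'UTF-8');

    // Default encoding for mb_* functions when none is passed explicitly.
    mb_internal_encoding('UTF-8');

    // Declare the charset to browsers; skipped on the CLI, where there are no headers.
    if (PHP_SAPI !== 'cli' && !headers_sent()) {
        header('Content-Type: text/html; charset=UTF-8');
    }
}

bootstrap_utf8();
```

In HTML templates, the matching declaration is a meta charset tag with the value utf-8 placed early in the head element.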