
Character Encoding

i18n PHP 5.0+ Intermediate
debt(d7/e7/b7/t7)
d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The term's detection_hints field lists phpstan and mysql-charset-check as tools, but the canonical failure, MySQL utf8 losing emoji, is not caught by standard linting. phpstan may flag mb_* misuse in some configurations, but the MySQL charset mismatch and PDO DSN issues are invisible until data is written and inspected. The silent data loss described in why_it_matters makes this a d7: it surfaces only through careful review or runtime testing with emoji-containing data.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix summary says 'UTF-8 everywhere': database charset=utf8mb4, PHP mb_* functions, HTML meta tag, php.ini setting, and PDO DSN. This is not a single-line patch: it requires touching database migrations and ALTER TABLE statements, PHP string-handling code throughout the codebase, configuration files, and connection setup. The common_mistakes field lists four distinct categories of misuse, each requiring separate remediation across multiple layers of the stack.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). applies_to covers both web and cli contexts (broad scope). Every string-handling operation, database interaction, and HTTP response is subject to encoding correctness. The choice of charset permeates database schema, PDO configuration, PHP string functions, and HTTP headers. Any new feature touching user input or stored strings must be encoding-aware, making this a persistent, cross-cutting structural burden that shapes how every future developer must write string-handling code.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap: contradicts how a similar concept works elsewhere' (t7). The misconception field directly states that developers conflate UTF-8 and Unicode. More operationally dangerous is the MySQL utf8 vs utf8mb4 trap: MySQL's charset named 'utf8' is NOT full UTF-8. It cannot store 4-byte characters such as emoji, and with strict SQL mode disabled it discards them with only a warning. A competent developer who knows UTF-8 and trusts MySQL's 'utf8' charset label will be confidently wrong. This contradicts the reasonable expectation that a charset named 'utf8' implements the UTF-8 standard, making it a t7.


Also Known As

UTF-8 Unicode ASCII encoding utf8mb4

TL;DR

How text is stored as bytes. ASCII (128 characters), Latin-1 (256 characters), UTF-8 (1-4 bytes per character, backwards compatible with ASCII), and UTF-16 are the key encodings developers encounter.

Explanation

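ASCII encodes 128 characters in 7 bits. Latin-1 (ISO-8859-1) extends this to 256 characters using 8 bits. UTF-8 encodes every Unicode code point (about 1.1 million possible values) in 1-4 bytes: ASCII characters take 1 byte (so ASCII text is already valid UTF-8), most accented European letters take 2, most CJK characters take 3, and emoji and other supplementary-plane characters take 4. UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for supplementary planes. PHP's core string functions are byte-oriented: strlen('café') returns 5, not 4, when the string is UTF-8, because 'é' occupies 2 bytes. Use mb_strlen() for character-aware operations. MySQL's utf8 charset is an alias for utf8mb3 and stores at most 3 bytes per character; use utf8mb4 for full Unicode, including emoji.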
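The byte widths described above can be checked directly. The following is a minimal sketch, assuming PHP 7.0+ (for the "\u{...}" escape syntax) and the mbstring extension; describeUtf8() is a hypothetical helper name introduced only for this illustration.

```php
<?php
// Sketch: compare byte length and character length for UTF-8 strings.
// Assumes PHP 7.0+ ("\u{...}" escapes) and the mbstring extension.

/** Hypothetical helper: byte count, character count and raw bytes of a UTF-8 string. */
function describeUtf8(string $text): array
{
    return [
        'bytes' => strlen($text),             // counts bytes
        'chars' => mb_strlen($text, 'UTF-8'), // counts code points
        'hex'   => bin2hex($text),            // raw UTF-8 bytes as hex
    ];
}

$samples = [
    'letter A'    => 'A',          // U+0041, ASCII
    'e-acute'     => "\u{E9}",     // U+00E9, accented European letter
    'CJK zhong'   => "\u{4E2D}",   // U+4E2D, CJK ideograph
    'waving hand' => "\u{1F44B}",  // U+1F44B, emoji (supplementary plane)
];

foreach ($samples as $label => $text) {
    $info = describeUtf8($text);
    printf("%-12s bytes=%d chars=%d hex=%s\n", $label, $info['bytes'], $info['chars'], $info['hex']);
}
```

Each sample is a single character, so mb_strlen() reports 1 every time while strlen() reports 1, 2, 3 and 4 bytes respectively.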

Common Misconception

UTF-8 and Unicode are the same thing. They are not: Unicode is the character set (it assigns a code point to each character); UTF-8 is one encoding of those code points into bytes. UTF-16 and UTF-32 are other Unicode encodings.
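To make the distinction concrete, the sketch below takes one code point and prints its bytes in three Unicode encodings. It assumes PHP 7.2+ (for mb_ord()) and the mbstring extension; encodeAs() is a hypothetical helper name.

```php
<?php
// Sketch: one Unicode code point, three different byte representations.
// Assumes PHP 7.2+ (mb_ord) and the mbstring extension.

/** Hypothetical helper: hex bytes of a UTF-8 string after converting to $encoding. */
function encodeAs(string $utf8, string $encoding): string
{
    return bin2hex(mb_convert_encoding($utf8, $encoding, 'UTF-8'));
}

$char = "\u{20AC}"; // EURO SIGN

// The code point is a property of the character, independent of any encoding.
printf("code point: U+%04X\n", mb_ord($char, 'UTF-8'));

// The bytes depend on which encoding is chosen.
foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
    printf("%-8s %s\n", $encoding, encodeAs($char, $encoding));
}
```

The code point stays U+20AC throughout; only the byte sequence changes (e282ac, 20ac and 000020ac).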

Why It Matters

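MySQL's utf8 charset cannot store emoji (4-byte UTF-8). With strict SQL mode disabled, MySQL truncates the value at the first 4-byte character and raises only a warning, so a user's name containing an emoji is stored without it: data loss that is invisible to PHP code. With strict SQL mode enabled (the default since MySQL 5.7), the INSERT fails with an "Incorrect string value" error instead.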
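One way to surface this before data reaches a legacy utf8 column is to check input for characters that need 4 bytes in UTF-8. This is only a sketch of a stopgap, not a substitute for migrating to utf8mb4; hasSupplementaryChars() is a hypothetical helper name.

```php
<?php
// Sketch: detect characters that a 3-byte MySQL utf8 (utf8mb3) column cannot store.

/** Hypothetical helper: true if the UTF-8 string contains any 4-byte character. */
function hasSupplementaryChars(string $utf8): bool
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // U+10000..U+10FFFF are exactly the code points encoded with 4 bytes in UTF-8.
    // preg_match() returns false on malformed UTF-8, which this treats as "no match".
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $utf8) === 1;
}

var_dump(hasSupplementaryChars("Alice \u{1F44B}")); // contains an emoji
var_dump(hasSupplementaryChars("caf\u{E9}"));       // 2-byte character only
```

An application could log or reject such input while the schema is still on utf8, which makes the loss visible instead of silent.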

Common Mistakes

  • MySQL utf8 instead of utf8mb4: utf8 stores at most 3 bytes per character, so emoji are rejected (strict SQL mode) or cut off with only a warning (non-strict mode).
  • strlen() instead of mb_strlen() for user-facing strings: wrong character count for multibyte strings.
  • substr() instead of mb_substr(): can split multibyte sequences, corrupting the string.
  • Not setting charset=utf8mb4 in the PDO DSN: connection charset defaults may cause mojibake.
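The substr() mistake in the list above is easy to miss because the result often looks fine on screen. A small sketch, assuming PHP 7.0+ and the mbstring extension, uses mb_check_encoding() to show that a byte-based cut can leave invalid UTF-8 behind:

```php
<?php
// Sketch: a byte-based cut can leave an incomplete UTF-8 sequence.
// Assumes PHP 7.0+ ("\u{...}" escapes) and the mbstring extension.

$name = "caf\u{E9}"; // "café": 4 characters, 5 bytes in UTF-8

$byteCut = substr($name, 0, 4);             // 4 bytes: "caf" plus the first byte of "é"
$charCut = mb_substr($name, 0, 4, 'UTF-8'); // 4 characters: the whole string

var_dump(bin2hex($byteCut));                    // "636166c3" (trailing lone lead byte)
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // false: no longer valid UTF-8
var_dump(mb_check_encoding($charCut, 'UTF-8')); // true
```

Running mb_check_encoding() on strings before they are stored or rendered is one possible way to catch this class of bug in tests.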

Code Examples

✗ Vulnerable
-- MySQL utf8 (utf8mb3): 4-byte characters cannot be stored
CREATE TABLE users (name VARCHAR(100) CHARSET utf8);
INSERT INTO users (name) VALUES ('Alice 👋');
-- Strict SQL mode: the INSERT fails with "Incorrect string value".
-- Non-strict mode: only a warning, and the value is cut at the emoji.
SELECT name FROM users; -- Returns 'Alice ' in non-strict mode (emoji lost!)

// PHP byte-length instead of character-length:
$name = 'café';
strlen($name);       // Returns 5 (bytes), not 4 (characters)
substr($name, 0, 4); // Returns "caf\xC3": cuts 'é' in half, leaving invalid UTF-8
✓ Fixed
-- MySQL utf8mb4: full Unicode support
CREATE TABLE users (name VARCHAR(100) CHARSET utf8mb4);

// PDO with correct charset:
$pdo = new PDO('mysql:host=db;dbname=app;charset=utf8mb4', $user, $pass);

// PHP multibyte string functions:
$name = 'café';
mb_strlen($name);        // 4 (characters)
mb_substr($name, 0, 4);  // 'café' (cuts on character boundaries)
mb_strtoupper($name);    // 'CAFÉ' (Unicode-aware case mapping)
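
Existing tables also need converting, not just new ones. The sketch below builds the MySQL statement for that; buildUtf8mb4Migration() is a hypothetical helper, the utf8mb4_unicode_ci collation is an assumed default, and the statement should be tried on a copy of the data first, since conversion rewrites the table.

```php
<?php
// Sketch: build an ALTER TABLE statement that converts a table to utf8mb4.
// The helper name and the default collation are assumptions for illustration.

function buildUtf8mb4Migration(string $table, string $collation = 'utf8mb4_unicode_ci'): string
{
    // Identifiers cannot be bound as query parameters, so accept only a
    // conservative set of characters instead of interpolating arbitrary input.
    if (!preg_match('/^[A-Za-z0-9_]+$/', $table)
        || !preg_match('/^utf8mb4_[A-Za-z0-9_]+$/', $collation)) {
        throw new InvalidArgumentException('Unexpected table or collation name');
    }

    return sprintf(
        'ALTER TABLE `%s` CONVERT TO CHARACTER SET utf8mb4 COLLATE %s',
        $table,
        $collation
    );
}

echo buildUtf8mb4Migration('users'), ";\n";
```

The resulting string could be passed to PDO::exec() inside a migration script, once per table.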

Added 16 Mar 2026
Edited 22 Mar 2026
DEV INTEL Tools & Severity
🟠 High ⚙ Fix effort: Medium
⚡ Quick Fix
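Use UTF-8 everywhere: database charset=utf8mb4, charset=utf8mb4 in the PDO DSN, PHP mb_* functions, HTML <meta charset='UTF-8'>, and default_charset=UTF-8 in php.ini so that mb_* functions default to UTF-8 (mbstring.internal_encoding has been deprecated since PHP 5.6 in favour of default_charset)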
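A short runtime check can confirm what PHP is actually using. This is a sketch assuming the mbstring extension is loaded; encodingReport() is a hypothetical helper name. Since PHP 5.6 the default_charset setting is the value mbstring falls back to when no encoding is set explicitly.

```php
<?php
// Sketch: set and report the encodings PHP uses at runtime.
// Assumes the mbstring extension is loaded.

/** Hypothetical helper: the encoding-related settings currently in effect. */
function encodingReport(): array
{
    return [
        'default_charset'      => ini_get('default_charset'),
        'mb_internal_encoding' => mb_internal_encoding(),
    ];
}

// Setting these at bootstrap guards against a misconfigured php.ini.
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

print_r(encodingReport());
```

Calling something like this from a health check or a test makes a misconfigured server visible early.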
📦 Applies To
PHP 5.0+ web cli
🔗 Prerequisites
🔍 Detection Hints
MySQL charset=utf8 (not utf8mb4 — cannot store emoji); strlen() on multibyte string instead of mb_strlen(); iconv() conversion without error handling
Auto-detectable: ✓ Yes phpstan mysql-charset-check
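For the iconv() hint above: iconv() returns false and raises a notice when it meets input it cannot convert, so an unchecked call can pass false on as if it were a string. A minimal sketch of a checked wrapper follows, assuming the iconv extension; convertToUtf8() is a hypothetical helper name.

```php
<?php
// Sketch: iconv() conversion with explicit error handling.
// Assumes the iconv extension is available.

/** Hypothetical helper: convert $bytes from $fromEncoding to UTF-8 or throw. */
function convertToUtf8(string $bytes, string $fromEncoding): string
{
    // iconv() returns false on failure; @ suppresses the accompanying notice
    // because the failure is reported through the exception below instead.
    $converted = @iconv($fromEncoding, 'UTF-8', $bytes);

    if ($converted === false) {
        throw new RuntimeException("Could not convert input from {$fromEncoding} to UTF-8");
    }

    return $converted;
}

// "caf\xE9" is "café" in Latin-1 (ISO-8859-1).
echo bin2hex(convertToUtf8("caf\xE9", 'ISO-8859-1')), "\n";
```

Throwing on failure forces the caller to decide what to do with unconvertible input rather than storing a corrupted or empty value.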
⚠ Related Problems
🤖 AI Agent
Confidence: High False Positives: Medium ✗ Manual fix Fix: Medium Context: File Tests: Update
CWE-116 CWE-838
