How do I fix Character Encoding?

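Use UTF-8 everywhere: utf8mb4 as the database and connection charset, PHP mb_* functions for string handling, and <meta charset='UTF-8'> in HTML. Set default_charset=UTF-8 in php.ini so that mb_* functions default to UTF-8; the older mbstring.internal_encoding directive is deprecated since PHP 5.6.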

Character Encoding

i18n PHP 5.0+ Intermediate

debt(d7/e7/b7/t7)

d7 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'only careful code review or runtime testing' (d7). The term's detection_hints list phpstan and mysql-charset-check as tools, but the canonical failure — MySQL utf8 silently dropping emoji — is not caught by standard linting. phpstan may flag mb_* misuse in some configurations, but the MySQL charset mismatch and PDO DSN issues are invisible until data is written and inspected. The silent data loss described in why_it_matters makes this a d7: it surfaces only through careful review or runtime testing with emoji-containing data.

e7 Effort Remediation debt — work required to fix once spotted

Closest to 'cross-cutting refactor across the codebase' (e7). The quick_fix summary says 'UTF-8 everywhere': database charset=utf8mb4, PHP mb_* functions, HTML meta tag, php.ini setting, and PDO DSN. This is not a single-line patch — it requires touching database migrations/ALTER TABLE statements, PHP string-handling code throughout the codebase, configuration files, and connection setup. The common_mistakes list four distinct categories of misuse, each requiring separate remediation across multiple layers of the stack.

b7 Burden Structural debt — long-term weight of choosing wrong

Closest to 'strong gravitational pull' (b7). applies_to covers both web and cli contexts (broad scope). Every string-handling operation, database interaction, and HTTP response is subject to encoding correctness. The choice of charset permeates database schema, PDO configuration, PHP string functions, and HTTP headers. Any new feature touching user input or stored strings must be encoding-aware, making this a persistent, cross-cutting structural burden that shapes how every future developer must write string-handling code.

t7 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'serious trap: contradicts how a similar concept works elsewhere' (t7). The misconception field directly states that developers conflate UTF-8 and Unicode as the same thing. More operationally dangerous is the MySQL utf8 vs utf8mb4 trap: MySQL's charset named 'utf8' is NOT full UTF-8. It stores at most 3 bytes per character, so 4-byte characters (emoji) are lost with only a warning in non-strict SQL mode, or rejected with an error in strict mode. A competent developer who knows UTF-8 and trusts MySQL's 'utf8' charset label will be confidently wrong. This contradicts the reasonable expectation that a charset named 'utf8' implements the UTF-8 standard, making it a t7.

Scored by claude-sonnet-4-6 · 2026-05-06 · reviewed by human

Also Known As

UTF-8 Unicode ASCII encoding utf8mb4

TL;DR

How text is stored as bytes. ASCII (128 characters), Latin-1 (256 characters), UTF-8 (1-4 bytes per character, backwards compatible with ASCII), and UTF-16 are the key encodings developers encounter.

Explanation

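ASCII encodes 128 characters in 7 bits. Latin-1 (ISO-8859-1) extends this to 256 characters using 8 bits. UTF-8 encodes every Unicode code point (about 1.1 million are available) in 1-4 bytes: ASCII characters take 1 byte, which makes UTF-8 backwards compatible with ASCII; most other European characters take 2; most CJK characters take 3; most emoji and other supplementary-plane characters take 4. UTF-16 uses 2 bytes per character in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for supplementary planes. PHP's core string functions are byte-oriented: strlen('café') returns 5, not 4, when the string is UTF-8. Use mb_strlen() and the other mb_* functions for character-aware operations. MySQL's utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character; use utf8mb4 for full Unicode, including emoji.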
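
The byte counts above can be checked directly. The following is a minimal sketch, assuming the source file is saved as UTF-8 and the mbstring extension is loaded; the helper name byteAndCharLength is made up for this illustration.

```php
<?php
// Hypothetical helper (not part of PHP): returns [byte length, character count]
// for a UTF-8 string. strlen() counts bytes; mb_strlen() counts code points.
function byteAndCharLength($text)
{
    return [strlen($text), mb_strlen($text, 'UTF-8')];
}

// One sample per UTF-8 width (ASCII, accented Latin, CJK, emoji), plus a word.
foreach (['a', 'é', '中', '👋', 'café'] as $sample) {
    list($bytes, $chars) = byteAndCharLength($sample);
    printf("%s: %d byte(s), %d character(s)\n", $sample, $bytes, $chars);
}
```

For 'café' this reports 5 bytes and 4 characters, matching the strlen() figure in the explanation.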

Common Misconception

✗ UTF-8 and Unicode are the same thing — Unicode is the character set (defining code points); UTF-8 is one encoding of Unicode (the byte representation). UTF-16 and UTF-32 are other Unicode encodings.
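
To make the distinction concrete, the sketch below serializes a single code point with three Unicode encodings. It assumes PHP 7.2 or later (for the "\u{...}" escape and mb_ord()) and the mbstring extension; encodedHex is a hypothetical helper name.

```php
<?php
// Hypothetical helper: hex dump of $char after converting it from UTF-8
// to the given target encoding.
function encodedHex($char, $encoding)
{
    return strtoupper(bin2hex(mb_convert_encoding($char, $encoding, 'UTF-8')));
}

$char = "\u{00E9}"; // the code point U+00E9 ('é'), which PHP writes out as UTF-8

printf("code point: U+%04X\n", mb_ord($char, 'UTF-8'));
foreach (['UTF-8', 'UTF-16BE', 'UTF-32BE'] as $encoding) {
    printf("%-9s %s\n", $encoding . ':', encodedHex($char, $encoding));
}
```

The code point stays U+00E9 throughout; only the bytes differ: C3A9 in UTF-8, 00E9 in UTF-16BE, and 000000E9 in UTF-32BE.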

Why It Matters

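A MySQL column that uses the utf8 charset cannot store emoji (4-byte UTF-8). In non-strict SQL mode, a user's name containing an emoji is stored without it and MySQL raises only a warning, which PHP code rarely checks, so the data loss goes unnoticed. In strict mode the INSERT fails with an error instead.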

Common Mistakes

MySQL utf8 instead of utf8mb4: utf8 stores at most 3 bytes per character, so emoji are lost (non-strict SQL mode) or the write is rejected (strict mode).
strlen() instead of mb_strlen() for user-facing strings: wrong character count for multibyte strings.
substr() instead of mb_substr(): can split multibyte sequences, corrupting the string.
Not setting charset=utf8mb4 in the PDO DSN: the default connection charset may cause mojibake.
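
While a legacy utf8 column is still in use, one defensive option is to detect characters it cannot hold before writing. This is only a sketch and not a substitute for migrating to utf8mb4; containsFourByteChars is a hypothetical helper, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: true when $text contains a code point above U+FFFF,
// i.e. one that needs 4 bytes in UTF-8 and does not fit MySQL's 3-byte utf8.
function containsFourByteChars($text)
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false for malformed UTF-8, which maps to false here.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(containsFourByteChars('Alice 👋')); // emoji is outside the 3-byte range
var_dump(containsFourByteChars('café 中'));  // every character fits in 3 bytes
```

A caller could use the result to reject the input or log it while the schema migration is pending.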

Code Examples

✗ Vulnerable

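-- MySQL utf8 (3 bytes per character): emoji cannot be stored
CREATE TABLE users (name VARCHAR(100) CHARSET utf8);
INSERT INTO users (name) VALUES ('Alice 👋');
-- Strict SQL mode: the INSERT fails with "Incorrect string value".
-- Non-strict SQL mode: only a warning is raised and the emoji is lost.
SELECT name FROM users; -- non-strict mode: e.g. 'Alice ' or 'Alice ?'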

// PHP byte length instead of character length:
$name = 'café';
strlen($name);       // Returns 5 (bytes), not 4 (characters)
substr($name, 0, 4); // Returns "caf\xC3": cuts 'é' in half, leaving invalid UTF-8

✓ Fixed

-- MySQL utf8mb4: full Unicode support
CREATE TABLE users (name VARCHAR(100) CHARSET utf8mb4);

// PDO with correct charset:
$pdo = new PDO('mysql:host=db;dbname=app;charset=utf8mb4', $user, $pass);

// PHP multibyte string functions:
$name = 'café';
mb_strlen($name);        // 4 (characters)
mb_substr($name, 0, 4);  // 'café' (cut on a character boundary)
mb_strtoupper($name);    // 'CAFÉ' (Unicode-aware case mapping, independent of locale)
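
Whether a slice is still well-formed UTF-8 can be checked with mb_check_encoding(). A small sketch, assuming the mbstring extension and a UTF-8 source file; the variable names are illustrative only.

```php
<?php
$name = 'café'; // 5 bytes in UTF-8: 'é' occupies bytes 4 and 5

$byteSlice = substr($name, 0, 4);             // cuts by bytes, through the middle of 'é'
$charSlice = mb_substr($name, 0, 4, 'UTF-8'); // cuts by characters

// mb_check_encoding() reports whether a string is well-formed in the given encoding.
var_dump(mb_check_encoding($byteSlice, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charSlice, 'UTF-8')); // bool(true)
```

The byte-wise slice ends in a dangling lead byte, which is the kind of value that later shows up as a replacement character or triggers a database error.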

References

↗ https://www.php.net/manual/en/book.mbstring.php

Added 16 Mar 2026

Edited 22 Mar 2026

Curated in Warsaw under one editorial standard. 1,445 terms, single voice.

⚡ DEV INTEL Tools & Severity

🟠 High ⚙ Fix effort: Medium

⚡ Quick Fix

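Use UTF-8 everywhere: utf8mb4 as the database and connection charset, PHP mb_* functions for string handling, and <meta charset='UTF-8'> in HTML. Set default_charset=UTF-8 in php.ini so that mb_* functions default to UTF-8; the older mbstring.internal_encoding directive is deprecated since PHP 5.6.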

📦 Applies To

PHP 5.0+ web cli

🔗 Prerequisites

String Interpolation & Heredoc/Nowdoc · PHP Data Types · Security Misconfiguration

🔍 Detection Hints

MySQL charset=utf8 (not utf8mb4 — cannot store emoji); strlen() on multibyte string instead of mb_strlen(); iconv() conversion without error handling

Auto-detectable: ✓ Yes phpstan mysql-charset-check
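
On the iconv() hint: iconv() returns false, and raises only a notice, when the input contains bytes that are illegal in the source encoding. One way to make such failures explicit is sketched below. toUtf8 is a hypothetical wrapper, the iconv extension is assumed to be enabled, and the exact behaviour on malformed input depends on the platform's iconv implementation.

```php
<?php
// Hypothetical wrapper: converts $text to UTF-8 or throws, instead of
// silently passing a false return value on to the caller.
function toUtf8($text, $fromEncoding)
{
    // @ suppresses the notice; the false return value is handled explicitly.
    $converted = @iconv($fromEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert input from {$fromEncoding} to UTF-8");
    }
    return $converted;
}

$latin1 = "caf\xE9"; // 'café' encoded as ISO-8859-1 (one byte for 'é')
echo bin2hex(toUtf8($latin1, 'ISO-8859-1')), "\n"; // 636166c3a9
```

Throwing keeps a failed conversion from being stored or echoed as an empty value; returning null or logging would be equally reasonable choices.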

⚠ Related Problems

String Interpolation & Heredoc/Nowdoc · Security Misconfiguration · data corruption

🤖 AI Agent

Confidence: High False Positives: Medium ✗ Manual fix Fix: Medium Context: File Tests: Update

CWE-116 CWE-838