
MySQL charset=utf8mb4

PHP PHP 5.1+ Beginner
debt(d5/e5/b5/t9)
d5 Detectability Operational debt — how invisible misuse is to your safety net

Closest to 'specialist tool catches it' (d5). The detection_hints list semgrep as the tool, with a specific pattern for charset=utf8 in DSN or SET NAMES utf8 without mb4. This is not caught by the compiler or default linters, but a configured semgrep rule can find it. Scores exactly d5.

e5 Effort Remediation debt — work required to fix once spotted

Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix says 'Use charset=utf8mb4 in the DSN and ALTER TABLE columns to utf8mb4_unicode_ci collation' — the DSN fix is a one-liner but the ALTER TABLE on existing columns (potentially many tables/columns) plus ensuring the connection-level setting is consistent across the codebase makes this a multi-file, multi-step effort. Scores e5.

b5 Burden Structural debt — long-term weight of choosing wrong

Closest to 'persistent productivity tax' (b5). The charset applies to all web and cli contexts (per applies_to). Mixing utf8 and utf8mb4 columns causes ongoing collation errors in joins and comparisons, and any new table or column added must be consciously set to utf8mb4. This imposes a persistent tax across many work streams but does not define the entire system shape. Scores b5.

t9 Trap Cognitive debt — how counter-intuitive correct behaviour is

Closest to 'catastrophic trap' (t9). The misconception is explicit: 'MySQL's utf8 charset is the same as UTF-8. It is not.' The name 'utf8' directly contradicts its actual behavior — a competent developer reading charset=utf8 would assume they have full UTF-8 support. The trap causes silent data loss (truncation of emoji and 4-byte chars in non-strict mode) with no warning, exactly matching the 'obvious way is always wrong' anchor. Scores t9.


Also Known As

utf8mb4 MySQL · UTF-8 emoji MySQL · 4-byte unicode MySQL

TL;DR

The correct MySQL character set for full Unicode support — including emoji and supplementary characters that the older utf8 charset cannot store.

Explanation

MySQL's 'utf8' charset (an alias for 'utf8mb3') stores at most 3 bytes per character, so it cannot hold Unicode code points above U+FFFF, which UTF-8 encodes in 4 bytes (emoji, some CJK characters, some mathematical symbols). 'utf8mb4' is the correct implementation of UTF-8 and supports the full Unicode range. Inserting 4-byte characters into 'utf8' data causes silent truncation when strict SQL mode is off and an error when it is on. The DSN should specify charset=utf8mb4, and the column, table, and database collation should be utf8mb4_unicode_ci or utf8mb4_0900_ai_ci (MySQL 8+).
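To make the 3-byte versus 4-byte distinction concrete, here is a minimal PHP sketch that needs no database. The helper name `needsUtf8mb4` is hypothetical, and the sketch assumes the source file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper: returns true if $text contains a code point above
// U+FFFF, i.e. one that UTF-8 encodes in 4 bytes and that MySQL's utf8
// charset cannot store.
function needsUtf8mb4(string $text): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

echo strlen('€'), "\n";   // 3 (bytes): fits in MySQL utf8
echo strlen('😀'), "\n";  // 4 (bytes): needs utf8mb4

var_dump(needsUtf8mb4('Hello €'));   // bool(false)
var_dump(needsUtf8mb4('Hello 😀'));  // bool(true)
```

A check like this could be run on user input to log values that a legacy utf8 column would not store intact, though converting the column is the actual fix.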

Watch Out

MySQL's 'utf8' charset is NOT real UTF-8 — it is a 3-byte subset. Only 'utf8mb4' is full UTF-8.

Common Misconception

MySQL's utf8 charset is the same as UTF-8. It is not — MySQL utf8 is a 3-byte subset. Only utf8mb4 is true UTF-8.

Why It Matters

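Storing emoji, multilingual content, or any other 4-byte Unicode character in a utf8 column fails in one of two ways. With strict SQL mode off, the string is truncated at the first 4-byte character and the insert still succeeds, so data is lost without any error reaching the application. With strict SQL mode on, the insert is rejected with an error.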
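The truncation described above can be illustrated without a database. The following is only a simulation under the assumption that the stored value ends at the first 4-byte character; the function name `simulateUtf8Truncation` is hypothetical and nothing here talks to MySQL.

```php
<?php
// Hypothetical simulation (no database involved): keep only the text that
// precedes the first 4-byte character, mimicking what a utf8 column is
// assumed to hold after a non-strict insert.
function simulateUtf8Truncation(string $text): string
{
    // Split once at the first code point above U+FFFF.
    $parts = preg_split('/[\x{10000}-\x{10FFFF}]/u', $text, 2);

    // preg_split returns false on invalid UTF-8; leave such input untouched.
    return $parts === false ? $text : $parts[0];
}

var_dump(simulateUtf8Truncation('Hello 😀 world'));  // string(6) "Hello "
var_dump(simulateUtf8Truncation('Hello world'));     // string(11) "Hello world"
```

Note that in this model the text after the emoji is lost as well, not just the emoji itself.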

Common Mistakes

  • Specifying charset=utf8 in the DSN: emoji and supplementary characters are silently truncated.
  • Mixing utf8 and utf8mb4 columns in the same schema: comparisons and joins between them can fail with collation errors or force implicit charset conversions.
  • Forgetting to set utf8mb4 at the connection level even when the table columns are utf8mb4: 4-byte characters still pass through the narrower connection charset and are lost before they reach the column.
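One way to catch the connection-level mistake is to inspect the session's character set variables. The sketch below is an assumption about how such a check might look: `nonUtf8mb4Variables` is a hypothetical helper operating on a plain array, and the commented-out query shows where its input could come from on a live connection (`$pdo` is assumed to exist).

```php
<?php
// Hypothetical check: given the session's character_set_* variables,
// return the names of those that are not set to utf8mb4.
function nonUtf8mb4Variables(array $vars): array
{
    $required = [
        'character_set_client',
        'character_set_connection',
        'character_set_results',
    ];

    $bad = [];
    foreach ($required as $name) {
        if (($vars[$name] ?? null) !== 'utf8mb4') {
            $bad[] = $name;
        }
    }
    return $bad;
}

// With a live connection, the input could be fetched like this:
// $vars = $pdo->query("SHOW SESSION VARIABLES LIKE 'character_set_%'")
//             ->fetchAll(PDO::FETCH_KEY_PAIR);

print_r(nonUtf8mb4Variables([
    'character_set_client'     => 'utf8mb4',
    'character_set_connection' => 'utf8',
    'character_set_results'    => 'utf8mb4',
]));
```

An empty result means all three variables report utf8mb4. The `??` operator requires PHP 7 or later.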

Avoid When

  • Do not use utf8 — it silently truncates or errors on emoji and supplementary Unicode characters.

When To Use

  • Always use utf8mb4 for any table that may store user-generated content, names, or multilingual text.
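  • Set charset=utf8mb4 in the DSN rather than relying on SET NAMES alone, so that the client library also knows the connection charset (escaping via PDO::quote() depends on it). The charset DSN option is honored from PHP 5.3.6; earlier versions ignore it.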

Code Examples

✗ Vulnerable
// Wrong: utf8 truncates emoji silently
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', $user, $pass);
// With strict SQL mode off: INSERT 'Hello 😀' → stored as 'Hello ' (truncated at the emoji, no error)
✓ Fixed
// Correct: utf8mb4 set in the DSN (no separate SET NAMES needed)
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', $user, $pass);

-- SQL: table with correct charset
CREATE TABLE posts (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Added 31 Mar 2026
DEV INTEL Tools & Severity
🟡 Medium ⚙ Fix effort: Low
⚡ Quick Fix
Use charset=utf8mb4 in the DSN and ALTER TABLE columns to utf8mb4_unicode_ci collation
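For the ALTER TABLE half of the quick fix, a small generator can produce one conversion statement per table. This is a sketch under assumptions: `utf8mb4ConversionSql` is a hypothetical helper, `comments` is an illustrative table name, and `CONVERT TO CHARACTER SET` is used because it converts every character column of a table in one statement.

```php
<?php
// Hypothetical helper: build one ALTER TABLE statement per table name.
// It only generates SQL strings; nothing is executed here.
function utf8mb4ConversionSql(array $tables): array
{
    $statements = [];
    foreach ($tables as $table) {
        // Backtick-quote the identifier, doubling any embedded backticks.
        $quoted = '`' . str_replace('`', '``', $table) . '`';
        $statements[] = "ALTER TABLE {$quoted} "
            . 'CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci';
    }
    return $statements;
}

foreach (utf8mb4ConversionSql(['posts', 'comments']) as $sql) {
    echo $sql, ";\n";
}
```

Converting rebuilds the table, and indexed string columns can run into index length limits once each character may need 4 bytes, so it is worth trying the generated statements on a copy of the data first.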
📦 Applies To
PHP 5.1+ web cli
🔍 Detection Hints
charset=utf8 in DSN or SET NAMES utf8 without mb4
Auto-detectable: ✓ Yes semgrep
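The detection hint can be approximated with a plain regular expression. The sketch below is not a semgrep rule; it is a hypothetical PHP helper (`findLegacyUtf8Charset`) that reports the 1-based line numbers where `charset=utf8` or `SET NAMES utf8` appears without the `mb4` suffix.

```php
<?php
// Hypothetical scanner: returns 1-based line numbers in $source that set
// the legacy utf8 (or utf8mb3) charset via a DSN or a SET NAMES statement.
function findLegacyUtf8Charset(string $source): array
{
    // Match "charset=utf8" or "SET NAMES utf8", optionally followed by "mb3",
    // but not when further letters or digits follow (which excludes utf8mb4).
    $pattern = '/(?:charset\s*=\s*|SET\s+NAMES\s+[\'"]?)utf8(?:mb3)?(?![a-z0-9])/i';

    $hits = [];
    foreach (explode("\n", $source) as $index => $line) {
        if (preg_match($pattern, $line) === 1) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}

$sample = implode("\n", [
    'mysql:host=localhost;dbname=app;charset=utf8mb4',
    'mysql:host=localhost;dbname=app;charset=utf8',
    'SET NAMES utf8',
    "SET NAMES 'utf8mb4'",
]);

print_r(findLegacyUtf8Charset($sample)); // lines 2 and 3
```

A line-based regex like this will also flag matches inside comments, so its output is a list of candidates to review rather than confirmed bugs.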
🤖 AI Agent
Confidence: High False Positives: Low ✓ Auto-fixable Fix: Low Context: Line

