MySQL charset=utf8mb4
debt(d5/e5/b5/t9)
Closest to 'specialist tool catches it' (d5). The detection_hints list semgrep as the tool, with a specific pattern for charset=utf8 in DSN or SET NAMES utf8 without mb4. This is not caught by the compiler or default linters, but a configured semgrep rule can find it. Scores exactly d5.
Closest to 'touches multiple files / significant refactor in one component' (e5). The quick_fix says 'Use charset=utf8mb4 in the DSN and ALTER TABLE columns to utf8mb4_unicode_ci collation' — the DSN fix is a one-liner but the ALTER TABLE on existing columns (potentially many tables/columns) plus ensuring the connection-level setting is consistent across the codebase makes this a multi-file, multi-step effort. Scores e5.
Closest to 'persistent productivity tax' (b5). The charset applies to all web and cli contexts (per applies_to). Mixing utf8 and utf8mb4 columns causes ongoing collation errors in joins and comparisons, and any new table or column added must be consciously set to utf8mb4. This imposes a persistent tax across many work streams but does not define the entire system shape. Scores b5.
Closest to 'catastrophic trap' (t9). The misconception is explicit: 'MySQL's utf8 charset is the same as UTF-8. It is not.' The name 'utf8' directly contradicts its actual behavior: a competent developer reading charset=utf8 would assume full UTF-8 support. The trap causes silent data loss (in non-strict mode, values are truncated at the first emoji or other 4-byte character and no error is raised), exactly matching the 'obvious way is always wrong' anchor. Scores t9.
Also Known As
TL;DR
Explanation
MySQL's 'utf8' charset (an alias for utf8mb3) stores at most 3 bytes per character, so it cannot hold 4-byte Unicode code points (emoji, some CJK characters, some mathematical symbols). 'utf8mb4' is MySQL's full UTF-8 implementation and covers the entire Unicode range. Inserting a 4-byte character through 'utf8' either truncates the value without raising an error (non-strict SQL mode) or fails with an error (strict SQL mode). The DSN should specify charset=utf8mb4, and the database, table, and column collation should be utf8mb4_unicode_ci or, on MySQL 8+, utf8mb4_0900_ai_ci.
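To make the 3-byte limit concrete, the following minimal Python sketch flags strings that contain at least one character needing 4 bytes in UTF-8, that is, any code point above U+FFFF. The function names and sample strings are illustrative assumptions rather than part of any MySQL or PHP API. The check only approximates what the server would reject under 'utf8'.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies above U+FFFF (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)


def first_unstorable_index(text: str) -> int:
    """Index of the first character a 3-byte charset cannot hold, or -1."""
    for i, ch in enumerate(text):
        if len(ch.encode("utf-8")) > 3:
            return i
    return -1


# Hypothetical sample values: ASCII, BMP-only CJK, an emoji (U+1F600),
# and a supplementary-plane CJK character (U+20BB7).
samples = ["Hello", "\u65e5\u672c\u8a9e", "Hello \U0001F600", "\U00020BB7\u91ce\u5bb6"]

for s in samples:
    print(repr(s), needs_utf8mb4(s), first_unstorable_index(s))
```

A check like this could run in application code or a data audit before migrating, to estimate how much existing input would have been affected.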
Watch Out
Common Misconception
MySQL's utf8 charset is the same as UTF-8. It is not.
Why It Matters
Common Mistakes
- Specifying charset=utf8 in the DSN — silent truncation of emoji and supplementary characters.
- Mixing utf8 and utf8mb4 columns in the same table or across joined tables: comparisons and joins then depend on implicit charset conversion and can hit unexpected collation errors.
- Forgetting to set utf8mb4 at the connection level even when the table columns are utf8mb4.
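The first mistake above can be found mechanically. As a rough stand-in for the semgrep rule mentioned in the scoring notes, here is a small Python sketch that scans source text for charset=utf8 or SET NAMES utf8 without the mb4 suffix. The regular expression, the function name, and the sample lines are assumptions made for illustration; a real rule would likely need tuning for each codebase.

```python
import re

# Matches "charset=utf8" or "SET NAMES utf8" (optionally quoted), including
# the explicit utf8mb3 spelling, but not utf8mb4: the trailing \b cannot
# match between "utf8" and "mb4".
LEGACY_UTF8 = re.compile(
    r"""(?:charset\s*=\s*|SET\s+NAMES\s+['"]?)utf8(?:mb3)?\b""",
    re.IGNORECASE,
)


def find_legacy_utf8(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line using legacy utf8."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if LEGACY_UTF8.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical PHP lines to scan.
sample = """\
$a = new PDO('mysql:host=localhost;dbname=app;charset=utf8', $u, $p);
$b = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', $u, $p);
$c->exec("SET NAMES utf8");
$d->exec("SET NAMES utf8mb4");
"""

for lineno, line in find_legacy_utf8(sample):
    print(f"line {lineno}: {line}")
```

On the sample, only the first and third lines are reported; the utf8mb4 variants pass.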
Avoid When
- Do not use utf8 — it silently truncates or errors on emoji and supplementary Unicode characters.
When To Use
- Always use utf8mb4 for any table that may store user-generated content, names, or multilingual text.
- Set charset=utf8mb4 in the DSN — not via SET NAMES — to ensure it applies at the protocol level.
Code Examples
// Wrong: MySQL's utf8 (utf8mb3) cannot store 4-byte characters such as emoji
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8', $user, $pass);
// INSERT 'Hello 😀' → non-strict mode stores 'Hello ' (truncated at the emoji);
// strict mode rejects the statement with an "Incorrect string value" error
// Correct: utf8mb4 in the DSN (no separate SET NAMES needed)
$pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', $user, $pass);
-- SQL: table with correct charset
CREATE TABLE posts (
    id INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
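For existing tables, the quick fix calls for ALTER TABLE as well as the DSN change. The sketch below generates one ALTER TABLE ... CONVERT TO CHARACTER SET statement per table name. The helper name, the identifier check, and the table list are hypothetical. It only produces SQL text and does not connect to a database. CONVERT TO may change column types (for example, widening a TEXT column) and can run into index length limits, so the output is best reviewed and tried on a copy of the data first.

```python
import re

# Conservative identifier check (an assumption for this sketch): letters,
# digits, underscore, and dollar sign only.
IDENT = re.compile(r"^[A-Za-z0-9_$]+$")


def convert_statements(tables, collation="utf8mb4_unicode_ci"):
    """Build ALTER TABLE statements that convert each table to utf8mb4."""
    if not IDENT.match(collation):
        raise ValueError(f"unexpected collation name: {collation!r}")
    statements = []
    for table in tables:
        if not IDENT.match(table):
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(
            f"ALTER TABLE `{table}` "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical table names.
for stmt in convert_statements(["posts", "comments"]):
    print(stmt)
```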