Collation & Locale-Aware Sorting
debt(d9/e3/b5/t7)
Closest to 'silent in production until users hit it' (d9). Byte-order sorting produces valid output that passes all automated tests — the code runs without errors, arrays sort without exceptions, queries execute successfully. The bug is only visible when a human user who speaks the relevant language notices that ä appears after z or ñ is in the wrong position. No linter, static analyzer, or automated test catches this unless you write explicit locale-aware test cases with known correct orderings.
Closest to 'simple parameterised fix' (e3). The quick_fix shows this is a straightforward swap: replace sort($array) with (new Collator($locale))->sort($array), or change collation from utf8mb4_general_ci to utf8mb4_unicode_ci. However, the fix may touch multiple files if sorting is spread across the codebase, and database collation changes require ALTER TABLE statements and potential reindexing, pushing slightly beyond a pure one-liner.
Closest to 'persistent productivity tax' (b5). Collation is a cross-cutting concern affecting any code that sorts or compares strings for display — user lists, product catalogs, search results, autocomplete. Once the wrong approach is established, every new sorting feature inherits the bug. The fix requires awareness at both PHP and database layers, and developers must remember to use Collator instead of sort() throughout the codebase. Not architectural-level, but a persistent tax on string handling.
Closest to 'serious trap' (t7). The misconception field states that developers expect sort() and ORDER BY to produce correct alphabetical order. That expectation comes from English-only development, where byte order and linguistic order largely coincide for same-case ASCII text. A developer coming from English contexts will confidently use strcmp() and be surprised when German or Spanish users complain. The 'obvious' approach works for English and fails silently for other languages.
Also Known As
TL;DR
Explanation
Collation defines the comparison order for strings. Byte-wise comparison (strcmp(), sort(), or ORDER BY under a binary collation such as utf8mb4_bin) sorts uppercase before lowercase, places accented characters after z, and has no concept of locale-specific sort rules. Linguistic order depends on the locale: for a German user, ä should sort near a; for a Swedish user, ä sorts after z. The Unicode Collation Algorithm (UCA) provides a standardised method for locale-aware sorting, with locale-specific tailorings applied on top of a default order. PHP's intl extension exposes it through Collator: $collator = new Collator('de_DE'); $collator->sort($array). MySQL and PostgreSQL support collation at the column or query level. In MySQL, utf8mb4_unicode_ci is a case-insensitive, accent-insensitive collation based on the UCA default order; it is not tailored to any one language. Language-specific collations, for example utf8mb4_german2_ci in MySQL or de-x-icu on ICU-enabled PostgreSQL builds, apply that language's rules. Database-level collation affects ORDER BY, GROUP BY, and index usage; PHP-level sorting is used for in-memory arrays.
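The German versus Swedish difference described above can be seen with two Collator instances. This is a minimal sketch, assuming the intl extension is installed and the source file is saved as UTF-8; the word list is made up for illustration.

```php
<?php
// Sketch only: requires the intl extension (ICU). The sample words are invented.
$words = ['zebra', 'äpple', 'apple'];

// German rules: ä collates with a, differing only at the accent level.
$german = $words;
(new Collator('de_DE'))->sort($german);

// Swedish rules: ä is a separate letter that comes after z.
$swedish = $words;
(new Collator('sv_SE'))->sort($swedish);

echo implode(', ', $german), PHP_EOL;   // apple, äpple, zebra
echo implode(', ', $swedish), PHP_EOL;  // apple, zebra, äpple
```

The same input produces two different, equally correct orderings, which is why the locale has to come from the user or the request rather than being hard-coded.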
Common Misconception
Why It Matters
Common Mistakes
- Using strcmp() or sort() for locale-sensitive string comparison — both use byte order, not linguistic order.
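- Setting database collation to utf8mb4_general_ci instead of utf8mb4_unicode_ci: general_ci has faster comparisons but less accurate Unicode handling. It compares one character at a time and cannot handle expansions, so ß compares equal to s rather than to ss.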
- Not specifying collation in ORDER BY for multilingual tables — default database collation may be wrong for specific query contexts.
- Sorting in PHP after fetching from a correctly-collated database — the database should handle sorting; PHP re-sorting discards the correct database collation.
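For data that only exists in memory, Collator::compare() can stand in for strcmp(), including as a usort() callback. The following is a sketch under assumptions: it requires the intl extension and PHP 7.4 or later (for the arrow function), and the $customers rows are hypothetical sample data.

```php
<?php
// Sketch only: requires the intl extension. $customers is made-up sample data.
$collator = new Collator('es_ES');

// Both functions return negative / zero / positive, but they disagree here.
$byteOrder  = strcmp('ñu', 'oso');              // > 0: the bytes of ñ are greater than o
$linguistic = $collator->compare('ñu', 'oso');  // < 0: Spanish places ñ between n and o

$customers = [
    ['name' => 'Núñez'],
    ['name' => 'Ñandú'],
    ['name' => 'Ortega'],
    ['name' => 'Nadal'],
];

// Sort records by a string field using the collator instead of strcmp().
usort(
    $customers,
    fn (array $a, array $b): int => $collator->compare($a['name'], $b['name'])
);

$sortedNames = array_column($customers, 'name');
echo implode(', ', $sortedNames), PHP_EOL;  // Nadal, Núñez, Ñandú, Ortega
```

If the rows come from a database query, sorting with ORDER BY under the right collation is usually the better place to do this, since it lets the database use its indexes.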
Code Examples
// Byte-order sort — wrong for any non-ASCII alphabet
$names = ['Müller', 'Maier', 'Ärger', 'Bauer'];
sort($names);
// Result: ['Bauer', 'Maier', 'Müller', 'Ärger'] — Ärger last, wrong
// Locale-aware sort — correct German alphabetical order
$names = ['Müller', 'Maier', 'Ärger', 'Bauer'];
$collator = new Collator('de_DE');
$collator->sort($names);
// Result: ['Ärger', 'Bauer', 'Maier', 'Müller'] — correct
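// MySQL: Unicode-aware ORDER BY (UCA default order, not tailored to one language)
// Set the column collation once; this may rebuild indexes on the column:
// ALTER TABLE products MODIFY name VARCHAR(255)
// CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
// Or override the collation for a single query:
// SELECT * FROM products ORDER BY name COLLATE utf8mb4_unicode_ci;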
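Because byte-order sorting runs without errors, one way to catch it is a regression check against a known-correct ordering. The sketch below assumes the intl extension; matchesLocaleOrder() is a hypothetical helper written for this example, not part of PHPUnit or any library.

```php
<?php
// Sketch only: requires the intl extension. matchesLocaleOrder() is a
// hypothetical helper, not an existing library function.
function matchesLocaleOrder(string $locale, array $expected): bool
{
    // Start from a deliberately wrong order, sort with the locale's rules,
    // and check that the known-correct order comes back.
    $actual = array_reverse($expected);
    (new Collator($locale))->sort($actual);

    return $actual === $expected;
}

$german = ['Ärger', 'Bauer', 'Maier', 'Müller'];

// The same check applied to plain sort() shows the byte-order bug.
$byteSorted = $german;
sort($byteSorted);

var_dump(matchesLocaleOrder('de_DE', $german));  // bool(true)
var_dump($byteSorted === $german);               // bool(false)
```

In a test suite the boolean would typically be wrapped in an assertion, with one expected ordering per supported locale.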