Suffix Array
debt(d8/e6/b4/t6)
Closest to 'silent in production until users hit it' (d9), reduced to d8 because detection_hints.automated is 'no' and only a regex code_pattern exists. A naive O(n^2 log n) build works correctly on small inputs but silently times out on large texts in production; no linter catches the performance choice.
Closest to 'touches multiple files / significant refactor in one component' (e5), nudged to e6 because the quick_fix requires swapping the naive sort for prefix-doubling or SA-IS, adding LCP array computation, and introducing a caching layer for the static text — a meaningful multi-part change within the string-search component rather than a one-line patch.
Closest to 'localised tax' (b3), raised to b4 because applies_to spans cli/library/web and the suffix array plus its LCP array become a shared index that downstream query code depends on, imposing a modest persistent tax beyond a single isolated component.
Closest to 'serious trap' (t7), softened to t6 grounded in the misconception that the array stores actual suffix strings or all rotations when it only stores sorted integer positions, plus common_mistakes around confusing suffix sorting with Burrows-Wheeler rotation ordering and off-by-one binary search boundaries — the obvious mental model is wrong but learnable.
Also Known As
TL;DR
Explanation
A suffix array is the sorted list of starting indices of every suffix of a string. For a string S of length n, it holds a permutation of 0..n-1 such that the suffixes beginning at those positions are in lexicographic order. Once built, you can locate any pattern P of length m by binary searching the array in O(m log n) time, because all occurrences of P appear as a contiguous block of suffixes that share P as a prefix.
The naive construction sorts all n suffixes directly, which is O(n^2 log n) because each comparison can scan up to n characters. Better algorithms - the prefix-doubling approach (O(n log n) or O(n log^2 n) depending on the sort), DC3/skew (O(n)), or SA-IS (O(n)) - make the structure practical for large texts and genomic data.
Suffix arrays are usually paired with an LCP (longest common prefix) array, which records the length of the shared prefix between each pair of adjacent suffixes. The LCP array turns the suffix array into a tool for counting distinct substrings, finding the longest repeated substring, and speeding up pattern matching to O(m + log n).
Compared with a suffix tree, a suffix array stores the same searching power in a flat integer array. It uses roughly 4n bytes plus the text instead of the much larger node-and-pointer layout of a tree, has better cache behaviour, and is simpler to serialise. The trade-off is that some operations that are constant time on a suffix tree need the auxiliary LCP array and a little extra logic on a suffix array.
Note the naming overlap: the user described storing rotations in sorted order, which is the Burrows-Wheeler / suffix-automaton family. A classic suffix array sorts suffixes, not full rotations, though the two ideas are closely related and the BWT is derived directly from the suffix array. Reach for a suffix array when you must run many substring queries against a large, mostly static text such as a document index, log corpus, or DNA sequence.
Diagram
flowchart TD
TEXT[Text: banana$] --> SUF[Enumerate suffixes]
SUF --> SORT[Sort suffixes lexicographically]
SORT --> SA[Suffix array of start indices]
SA --> LCP[Optional LCP array]
SA --> SEARCH[Binary search pattern in O m log n]
LCP --> FAST[Speed up to O m plus log n]
style SA fill:#238636,color:#fff
style SEARCH fill:#1f6feb,color:#fff
Common Misconception
Why It Matters
Common Mistakes
- Building with the naive O(n^2 log n) sort on large inputs, causing timeouts where prefix-doubling or SA-IS would finish quickly.
- Forgetting to compute the LCP array, then re-scanning characters and losing the O(m + log n) search speed.
- Confusing suffix sorting with full-rotation (Burrows-Wheeler) sorting, which gives a different ordering for the wraparound suffixes.
- Rebuilding the array on every query instead of caching it for a static text - the whole point is to amortise construction over many lookups.
- Off-by-one errors at the binary search boundaries that miss the first or last occurrence of a pattern.
Avoid When
- The text changes frequently, since rebuilding the array on every mutation negates the preprocessing payoff.
- You only need a single substring search, where a direct scan or strpos is simpler and fast enough.
- Memory for the index plus LCP array exceeds what the environment allows for the input size.
When To Use
- A large, mostly static text receives many substring or pattern queries.
- You need full-text search, longest repeated substring, or distinct substring counts.
- You want suffix-tree query power with lower memory and better cache locality.
- Building a Burrows-Wheeler transform or FM-index for compression or genomic search.
Code Examples
// Naive O(n^2 log n): sorts whole suffix strings each comparison.
function buildSuffixArray(string $s): array {
$n = strlen($s);
$suffixes = [];
for ($i = 0; $i < $n; $i++) {
$suffixes[$i] = substr($s, $i); // copies up to n chars per suffix
}
// sort compares full strings: each compare is O(n)
asort($suffixes);
return array_keys($suffixes);
}
// For a 1M-char log this allocates O(n^2) bytes and crawls.
// Prefix-doubling: O(n log^2 n) time, integer ranks, no string copies.
function buildSuffixArray(string $s): array {
$n = strlen($s);
$sa = range(0, $n - 1);
// initial rank = char code, then compacted to dense ranks each round
$rank = array_map('ord', str_split($s));
$tmp = array_fill(0, $n, 0);
for ($k = 1; ; $k <<= 1) {
$cmp = function (int $a, int $b) use (&$rank, $k, $n): int {
if ($rank[$a] !== $rank[$b]) return $rank[$a] <=> $rank[$b];
$ra = ($a + $k < $n) ? $rank[$a + $k] : -1;
$rb = ($b + $k < $n) ? $rank[$b + $k] : -1;
return $ra <=> $rb;
};
usort($sa, $cmp);
$tmp[$sa[0]] = 0;
for ($i = 1; $i < $n; $i++) {
$tmp[$sa[$i]] = $tmp[$sa[$i - 1]] + ($cmp($sa[$i - 1], $sa[$i]) ? 1 : 0);
}
$rank = $tmp;
if ($rank[$sa[$n - 1]] === $n - 1 || $k >= $n) break; // all ranks distinct
}
return $sa;
}